Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

We examine, analyze, and compare four representative creativity measures, namely perplexity, LLM-as-a-Judge, the Creativity Index (CI, which measures n-gram overlap with web corpora), and syntactic templates (which detect repetition of common part-of-speech patterns), across three diverse creative domains: creative writing, unconventional problem-solving, and research ideation. For each domain, we compile datasets with human-aligned creative and uncreative examples and evaluate each metric's ability to discriminate between the two sets. Our analyses reveal limited consistency both across domains and across metrics: metrics that distinguish creativity in one domain fail in others (e.g., CI discriminates correctly in creative writing but fails in problem-solving), and different metrics often disagree on the same data points (e.g., CI suggests one set is more creative while perplexity indicates the other). We highlight key limitations: perplexity reflects fluency rather than novelty; LLM-as-a-Judge produces inconsistent judgments under minor prompt variations and exhibits bias toward particular labels; CI primarily measures lexical diversity and is highly sensitive to implementation choices; and syntactic templates are ineffective in settings dominated by formulaic language. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.


💡 Research Summary

This paper conducts a systematic, cross‑domain evaluation of four widely‑used automatic creativity metrics—Perplexity, LLM‑as‑a‑Judge, the Creativity Index (CI), and Syntactic Templates—by comparing their ability to discriminate between human‑labeled “creative” and “uncreative” examples in three distinct domains: creative writing, unconventional problem‑solving, and research ideation.

Dataset Construction
For each domain the authors curate balanced datasets (≈150–513 pairs per class) that reflect human judgments of creativity. Creative-writing data pair human-written movie synopses (creative) with LLM-generated narratives (uncreative). Problem-solving data adapt the MacGyver functional-fixedness set, pairing unconventional solutions (creative) with conventional ones (uncreative). Research-ideation data consist of accepted vs. rejected ICLR 2024 papers, controlling for soundness and presentation scores so that novelty is the primary differentiator. Human annotators performed pairwise preference tasks, achieving 73% agreement and providing a reliable ground truth.

Metrics Evaluated

  1. Perplexity (PPL) – computed with a pretrained GPT‑2 model; lower probability → higher perplexity, interpreted as “surprise”.
  2. Creativity Index (CI) – measures n‑gram overlap with a massive web corpus using the DJ Search/Infini-gram algorithm; the proportion of words appearing in novel n‑gram contexts (L‑uniqueness) is averaged over L = 5–11.
  3. Syntactic Templates – extracts POS‑tag sequences, reporting three sub‑metrics: compression ratio of POS sequences (CR‑POS), template rate (fraction of texts containing a common template), and templates‑per‑token (TPT).
  4. LLM‑as‑a‑Judge – employs state‑of‑the‑art LLMs (GPT‑4, Claude‑2) with chain‑of‑thought prompting and domain‑specific rubrics (e.g., originality, logic, presentation) to produce a 1‑5 overall creativity score.
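Of these metrics, perplexity has the simplest definition: the exponentiated negative mean of the per-token log-probabilities assigned by a language model. A minimal, model-agnostic sketch (taking precomputed natural-log token probabilities as input, rather than running GPT‑2 itself):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log):
    PPL = exp(-mean(log p(token))). Higher values mean the model
    found the text more 'surprising'."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model assigning p = 0.25 to each of four tokens
# has perplexity ≈ 4.
print(perplexity([math.log(0.25)] * 4))
```

In practice the log-probabilities would come from a pretrained model such as GPT‑2; the formula above is what turns them into the single "surprise" score the paper critiques.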

Experimental Findings

  • Perplexity shows substantial overlap between creative and uncreative score distributions across all domains, yielding low discriminative power (AUC ≈ 0.55). The metric captures fluency rather than novelty, confirming the authors’ hypothesis that token‑level surprise is insufficient for creativity assessment.

  • Creativity Index performs moderately well in the creative‑writing domain (AUC ≈ 0.78) but fails in problem‑solving and research ideation (AUC ≈ 0.52). Moreover, CI is highly sensitive to implementation choices: varying L from 5‑7 to 5‑11 can double the score, and exact‑match vs. near‑match handling changes results dramatically. This reveals that CI primarily measures lexical overlap, which aligns with artistic prose but not with conceptual innovation.

  • Syntactic Templates detect structural repetition effectively when texts exhibit diverse syntactic patterns, yet both problem‑solving and research‑ideation datasets consist largely of formulaic language, leading to negligible discriminative ability (AUC ≈ 0.50). Even in creative writing, the metric captures only surface‑level syntactic novelty, missing deeper semantic creativity.

  • LLM‑as‑a‑Judge yields the highest average correlation with human labels in creative writing (Spearman ≈ 0.62) but suffers from severe prompt‑sensitivity and model bias. Re‑evaluating the same instance three times results in only 40% consistency. Certain prompts bias the model toward “creative” labels, and different LLM families produce divergent scores. Potential data contamination (training data containing evaluation examples) further clouds interpretability.
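The CI sensitivity described above is easy to reproduce on a toy scale. The sketch below (an illustrative simplification, not the paper's DJ Search implementation) marks token positions covered by exact L‑gram matches against a reference corpus and averages the uncovered fraction over a range of L; note how texts shorter than L score as fully "novel", which is one source of the implementation sensitivity:

```python
def l_coverage(tokens, ref_ngrams, L):
    """Set of token positions covered by some length-L n-gram
    that also appears in the reference corpus."""
    covered = set()
    for i in range(len(tokens) - L + 1):
        if tuple(tokens[i:i + L]) in ref_ngrams:
            covered.update(range(i, i + L))
    return covered

def creativity_index(tokens, ref_tokens, lengths=range(5, 12)):
    """Toy CI: for each L, the fraction of tokens NOT covered by an
    L-gram found in the reference, averaged over the chosen lengths.
    Changing `lengths` (e.g. 5-7 vs. 5-11) changes the score,
    mirroring the sensitivity reported in the paper."""
    scores = []
    for L in lengths:
        ref_ngrams = {tuple(ref_tokens[i:i + L])
                      for i in range(len(ref_tokens) - L + 1)}
        covered = l_coverage(tokens, ref_ngrams, L)
        scores.append(1 - len(covered) / len(tokens))
    return sum(scores) / len(scores)
```

Because only exact lexical matches count, paraphrased or conceptually derivative text scores as novel, which is why CI tracks lexical diversity rather than conceptual innovation.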

Cross‑Metric Consistency
The four metrics often disagree on the same instance; for example, CI may rank a text as more creative while perplexity ranks the opposite. No metric consistently outperforms the others across all three domains, underscoring the multidimensional nature of creativity (novelty + usefulness) that cannot be captured by a single linguistic proxy.
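The AUC figures quoted throughout have a direct probabilistic reading: the chance that a randomly chosen creative example receives a higher metric score than a randomly chosen uncreative one (ties counting half). A minimal pairwise computation, assuming higher score = more creative:

```python
def auc(creative_scores, uncreative_scores):
    """Pairwise AUC: P(creative > uncreative) + 0.5 * P(tie).
    0.5 means no discriminative power; 1.0 means perfect separation."""
    wins = ties = 0
    for c in creative_scores:
        for u in uncreative_scores:
            if c > u:
                wins += 1
            elif c == u:
                ties += 1
    total = len(creative_scores) * len(uncreative_scores)
    return (wins + 0.5 * ties) / total
```

Under this reading, the reported AUC ≈ 0.50–0.55 values mean the metric orders a creative/uncreative pair correctly barely more often than a coin flip.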

Limitations and Future Directions
The authors argue that current automatic metrics lack a unified representation of conceptual novelty. They propose three research avenues:

  1. Semantic‑level novelty measures – embedding‑based distance, graph‑based concept networks, or knowledge‑graph divergence to capture idea‑level originality.
  2. Hybrid evaluation frameworks – combining human judgments with model‑based scores, possibly via calibrated ensemble methods, to improve reliability and reduce bias.
  3. Domain‑specific metric standardization – releasing benchmark datasets, open‑source implementations, and clear evaluation protocols to foster reproducibility and facilitate comparative studies.
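One concrete form the first avenue could take (a sketch under the assumption that ideas are already embedded as vectors; the embedding model itself is left out): score an idea's novelty by its cosine distance to the nearest idea in a reference set.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantic_novelty(idea_vec, reference_vecs):
    """Novelty as distance to the nearest prior idea:
    1 - max cosine similarity against a reference set.
    0 = duplicate of a known idea; 1 = orthogonal to all of them."""
    return 1 - max(cosine(idea_vec, r) for r in reference_vecs)
```

Unlike n-gram overlap, such an embedding-based score is unaffected by paraphrase, which is exactly the gap the authors identify in CI.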

Additionally, they suggest systematic studies of LLM‑as‑a‑Judge prompting strategies, temperature settings, and model selection to enhance consistency.
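A consistency statistic in the spirit of the 40% figure reported earlier (exact agreement across repeated judgments of the same instance) is straightforward to compute; this is an illustrative sketch, not the paper's own evaluation code:

```python
def exact_agreement_rate(repeated_labels):
    """Fraction of instances whose repeated judgments all agree.
    `repeated_labels` is a list of per-instance label lists,
    e.g. the same prompt judged three times."""
    consistent = sum(1 for labels in repeated_labels
                     if len(set(labels)) == 1)
    return consistent / len(repeated_labels)
```

Tracking this rate while varying prompt wording, temperature, and model family is one way to run the systematic study the authors call for.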

Conclusion
The paper provides a thorough, data‑driven critique of four popular automatic creativity metrics, revealing that each captures only a narrow facet of creativity and that their performance is highly domain‑dependent. The findings highlight the urgent need for more robust, generalizable evaluation frameworks that align with human perceptions of novelty and usefulness. By exposing the strengths and weaknesses of existing tools, this work lays a solid foundation for the next generation of creativity assessment methods in the era of large language models.

