Large Language Model and Formal Concept Analysis: a comparative study for Topic Modeling
Topic modeling is a research field with a growing range of applications, historically spanning document retrieval, sentiment analysis, and text summarization. Large Language Models (LLMs) are currently a major trend in text processing, but few works study their usefulness for this task. Formal Concept Analysis (FCA) has recently been presented as a candidate for topic modeling, but no real applied case study has been conducted. In this work, we compare LLMs and FCA to better understand their strengths and weaknesses in the topic modeling field. FCA is evaluated through the CREA pipeline used in past experiments on topic modeling and visualization, whereas GPT-5 is used as the LLM. A strategy based on three prompts is applied with GPT-5 in a zero-shot setup: topic generation from document batches, merging of batch results into final topics, and topic labeling. A first experiment reuses the teaching materials previously used to evaluate CREA, while a second experiment analyzes 40 research articles in information systems to compare the extracted topics with the underlying subfields.
💡 Research Summary
This paper presents a systematic comparative study of two fundamentally different approaches to topic modeling: a state‑of‑the‑art large language model (LLM), specifically OpenAI’s GPT‑5, and a formal concept analysis (FCA) pipeline known as CREA. The authors aim to understand the relative strengths, weaknesses, and practical applicability of each method by applying them to two domain‑specific corpora: (1) a collection of eight PHP programming courses originally used to evaluate CREA, and (2) a set of forty peer‑reviewed information‑systems research articles (both abstracts and full texts).
The GPT‑5 workflow follows a three‑prompt zero‑shot protocol. First, documents are split into manageable batches (to respect the model’s context‑window limits) and a “topic generation” prompt is issued for each batch. Second, a “merge” prompt consolidates the batch‑level topics, removes duplicates, and produces a unified list. Third, a “labeling” prompt asks the model to assign concise, human‑readable names to each topic. Minimal preprocessing (removal of non‑printable characters) is performed to avoid contaminating the model’s output, under the assumption that GPT‑5 can handle stop‑words and basic tokenization internally.
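The three-prompt protocol can be sketched as a small pipeline. In the sketch below, `ask` is a placeholder callable standing in for a GPT‑5 API call, and the prompt wording and `batch_size` parameter are illustrative assumptions, not the paper's exact prompts:

```python
from typing import Callable, List

def batch_documents(docs: List[str], batch_size: int) -> List[List[str]]:
    """Split the corpus into batches that fit the model's context window."""
    return [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]

def three_prompt_topics(docs: List[str], batch_size: int,
                        ask: Callable[[str], str]) -> str:
    # Prompt 1: zero-shot topic generation, one call per batch.
    batch_topics = [
        ask("List the main topics covered by these documents:\n"
            + "\n---\n".join(batch))
        for batch in batch_documents(docs, batch_size)
    ]
    # Prompt 2: merge batch-level topic lists into one deduplicated list.
    merged = ask("Merge these topic lists into one list without duplicates:\n"
                 + "\n".join(batch_topics))
    # Prompt 3: assign a concise, human-readable label to each final topic.
    return ask("Give each of these topics a short descriptive label:\n" + merged)
```

Injecting `ask` as a parameter keeps the batching logic testable without a live model, and makes the run-to-run variability the authors observe attributable to the model rather than the orchestration code.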
The FCA side relies on the CREA pipeline, which consists of four main phases: (i) semantic preprocessing (lemmatization and part‑of‑speech filtering using TreeTagger), (ii) semantic disambiguation via BabelFy (exact‑matching mode, top‑scored candidate selection), (iii) formal concept analysis where a binary object‑attribute matrix is built and a closure operator extracts maximal object‑attribute pairs, and (iv) hierarchical agglomerative clustering (HAC) that groups the resulting concepts into a user‑defined number of clusters (k). The authors explore four binarization strategies (Direct, Low, Medium, High) controlled by a frequency threshold β. Experiments reveal that a Medium strategy (0.75 ≤ β ≤ 1.00) works best for the short, high‑lexical‑diversity abstracts, while a High strategy (β ≈ 1.5) is optimal for the longer full‑paper texts.
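Phase (iii) can be illustrated with a minimal sketch. The binarization rule below (keep a term in a document when its frequency reaches β times the term's corpus-wide mean) is an assumption chosen for illustration; the paper's Direct/Low/Medium/High strategies may define the threshold differently. The concept extraction is the standard FCA construction: intents are the closure of the full attribute set under intersection with each object's attributes.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

def binarize(freqs: Dict[str, Dict[str, int]],
             beta: float) -> Dict[str, FrozenSet[str]]:
    """Illustrative binarization: a document keeps a term when its frequency
    is at least beta times that term's mean frequency over the corpus."""
    terms = {t for doc in freqs.values() for t in doc}
    mean = {t: sum(doc.get(t, 0) for doc in freqs.values()) / len(freqs)
            for t in terms}
    return {d: frozenset(t for t, f in doc.items() if f >= beta * mean[t])
            for d, doc in freqs.items()}

def formal_concepts(
        context: Dict[str, FrozenSet[str]]
) -> List[Tuple[Set[str], FrozenSet[str]]]:
    """All (extent, intent) pairs of a binary object-attribute context."""
    attrs = frozenset().union(*context.values())
    intents: Set[FrozenSet[str]] = {attrs}
    changed = True
    while changed:  # close the intent set under pairwise intersection
        changed = False
        for obj_attrs in context.values():
            new = {i & obj_attrs for i in intents} - intents
            if new:
                intents |= new
                changed = True
    # The extent of an intent is every object carrying all its attributes.
    return [({o for o, a in context.items() if i <= a}, i)
            for i in sorted(intents, key=lambda i: (len(i), sorted(i)))]
```

Raising β shrinks each document's attribute set, which yields fewer, coarser concepts, matching the observation that longer full-paper texts tolerate a High strategy while short abstracts need a Medium one.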
Evaluation combines quantitative metrics (topic coherence CV, diversity, clustering quality such as silhouette score and Davies‑Bouldin index) with qualitative human assessment by two domain experts. For GPT‑5, the authors report high lexical richness and natural‑sounding labels, but also notable variability across runs (different numbers of topics, occasional hallucinated topics unrelated to the source material) and sensitivity to batch size. For FCA, the results are highly reproducible; the same β and k settings consistently yield identical concept lattices and topic clusters. However, the binary reduction can discard rare but potentially meaningful terms, leading to coarser topics, and the pipeline requires an extra labeling step because the raw concepts are expressed as sets of attributes rather than concise titles.
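Of the quantitative metrics, topic diversity is simple enough to state inline: the fraction of unique words among the top words of all topics. A minimal sketch (the CV coherence and clustering indices rely on external tooling and are omitted here):

```python
from typing import List

def topic_diversity(topics: List[List[str]]) -> float:
    """Fraction of unique words among the top-N words of every topic.
    1.0 means no word is shared between topics; values near 0 mean
    the topics are largely redundant."""
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words)
```

For example, two topics `["php", "loop"]` and `["loop", "array"]` share one of four word slots, giving a diversity of 0.75.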
The comparative analysis highlights complementary trade‑offs. LLMs excel at rapid prototyping, flexible granularity control via prompt engineering, and generating human‑friendly topic names without additional post‑processing. Their drawbacks are limited context windows, stochastic output, and occasional generation of spurious topics. FCA, by contrast, offers deterministic, transparent, and mathematically grounded topic structures that are easy to audit and reproduce, making it suitable for domains where interpretability and exactness are paramount (e.g., educational material organization). Yet FCA demands careful preprocessing, parameter tuning (β, k), and may struggle with very large vocabularies or nuanced semantic relations that are not captured in a binary matrix.
The authors conclude that the choice between LLM and FCA should be driven by the specific use case: for short, well‑structured corpora where consistency and auditability are essential, FCA is preferable; for longer, heterogeneous texts where rich, readable labels are desired, LLMs provide a more user‑centric solution. They also propose a hybrid direction: using FCA‑derived concepts as seed inputs or constraints in LLM prompts could combine the interpretability of FCA with the linguistic fluency of LLMs, potentially mitigating hallucinations while preserving expressive labeling.
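The proposed hybrid direction, seeding an LLM prompt with FCA-derived concepts, could be sketched as follows; the prompt wording and concept format are assumptions for illustration, not taken from the paper:

```python
from typing import Iterable, List

def seeded_labeling_prompt(concept_intents: List[Iterable[str]]) -> str:
    """Build a labeling prompt constrained to FCA-derived attribute sets,
    so the LLM names topics that already exist in the concept lattice
    instead of inventing new ones (one way to curb hallucinated topics)."""
    lines = [
        "Each line below is a set of co-occurring terms extracted by "
        "Formal Concept Analysis. Give each set a concise topic label; "
        "do not introduce topics beyond these sets."
    ]
    for i, intent in enumerate(concept_intents, start=1):
        lines.append(f"{i}. " + ", ".join(sorted(intent)))
    return "\n".join(lines)
```

The constraint keeps the deterministic FCA structure as ground truth while delegating only the naming step, where LLM fluency is the advantage, to the model.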
Overall, the paper contributes a thorough empirical benchmark, clarifies methodological considerations for researchers and practitioners, and opens avenues for future work on integrating symbolic and neural approaches to topic modeling.