Topic Modelling Black Box Optimization

Reading time: 5 minutes

📝 Original Info

  • Title: Topic Modelling Black Box Optimization
  • ArXiv ID: 2512.16445
  • Date: 2025-12-18
  • Authors: Roman Akramov, Artem Khamatullin, Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko

📝 Abstract

Choosing the number of topics $T$ in Latent Dirichlet Allocation (LDA) is a key design decision that strongly affects both the statistical fit and interpretability of topic models. In this work, we formulate the selection of $T$ as a discrete black-box optimization problem, where each function evaluation corresponds to training an LDA model and measuring its validation perplexity. Under a fixed evaluation budget, we compare four families of optimizers: two hand-designed evolutionary methods - Genetic Algorithm (GA) and Evolution Strategy (ES) - and two learned, amortized approaches, Preferential Amortized Black-Box Optimization (PABBO) and Sharpness-Aware Black-Box Optimization (SABBO). Our experiments show that, while GA, ES, PABBO, and SABBO eventually reach a similar band of final perplexity, the amortized optimizers are substantially more sample- and time-efficient. SABBO typically identifies a near-optimal topic number after essentially a single evaluation, and PABBO finds competitive configurations within a few evaluations, whereas GA and ES require almost the full budget to approach the same region.

📄 Full Content

Topic modeling refers to a class of methods that automatically discover latent thematic structure in large text collections. The core idea is that each document can be expressed as a mixture of several topics, while each topic corresponds to a probability distribution over words. Such models are widely used for document clustering, exploratory text analysis, information retrieval, and interpretability in large corpora. Among probabilistic topic models, Latent Dirichlet Allocation (LDA) remains one of the most established approaches, and the quality of its results strongly depends on the choice of the number of topics T.

This work addresses the optimization of the key hyperparameter T (the number of topics) in the Latent Dirichlet Allocation (LDA) topic modeling approach [1], which is widely used for factorizing a “document × word” matrix as follows:

$$p(w \mid d) \;=\; \sum_{t=1}^{T} p(w \mid t)\, p(t \mid d) \;=\; \sum_{t=1}^{T} \phi_{wt}\, \theta_{td},$$

where φ_wt is the probability of word w in topic t and θ_td is the probability of topic t in document d.

The quality of the topic model is evaluated using the standard metric of perplexity, which directly depends on the choice of the number of topics T and the hyperparameters of the prior distributions α, β. In this work, the hyperparameters are fixed as α = β = 1/T, which significantly simplifies the search procedure: the target function takes the form f(T), where f is the procedure for building and validating LDA for the selected T.
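For completeness, a standard held-out perplexity formulation consistent with this setup (the paper's exact validation split and estimator are not reproduced here) is:

$$\mathrm{PPL}(T) \;=\; \exp\!\left(-\,\frac{\sum_{d \in D_{\mathrm{val}}} \sum_{w \in d} n_{dw} \ln p(w \mid d)}{\sum_{d \in D_{\mathrm{val}}} \sum_{w \in d} n_{dw}}\right),$$

where n_dw is the count of word w in document d of the validation set D_val and p(w | d) follows the factorization above. Lower perplexity indicates a better fit, so the optimizers minimize f(T) = PPL(T).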

Since the analytical form and gradients of the function f(T) are unavailable, the problem reduces to black-box optimization over the variable T. Hyperparameter tuning via black-box optimization has been widely studied in the context of machine learning, for example, using Gaussian-process-based Bayesian optimization [14]. Motivated by this black-box setting, we consider the following four optimization strategies and conduct a systematic comparison of their effectiveness:

• Evolution Strategy (ES): iterative improvement of solutions is performed through mutations (random changes in the number of topics) and the selection of the best individuals from a combined population of parents and offspring, following standard practice in evolutionary computation [2]; a minimal sketch of such a mutation-and-selection loop is given after this list.

• Genetic Algorithm (GA): selection for the next generation uses tournament selection, in which the best individuals from the current generation compete with the best individuals from the previous generation.

For generating new candidates, binary crossover is applied: a procedure in which the binary representations of the topic count T are combined, enabling offspring to inherit properties from both parents.

• Preferential Amortized Black-Box Optimization (PABBO): optimization is performed using only pairwise preference-based feedback, where the optimizer receives responses such as “point x is better than point x′” rather than numerical evaluations. A neural surrogate model learns to estimate the probability that one candidate is better than another, and reinforcement learning (RL) is used to train a policy for selecting new candidates [18]. The approach allows for rapid adaptation to new tasks and effectively searches for optima when only comparative judgments are available.

• Sharpness-Aware Black-Box Optimization (SABBO): optimization is performed using a sharpness-aware minimization strategy in black-box settings [16]. At each iteration, SABBO adapts the search distribution parameters by minimizing the worst-case expected objective. The algorithm applies stochastic gradient approximations using only function queries. SABBO provides theoretical guarantees of convergence and generalization, and is scalable to high-dimensional optimization tasks.
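To make the black-box setting concrete, the sketch below builds the objective f(T) by training an LDA model with α = β = 1/T and scoring it by validation perplexity, then runs a simple mutation-and-selection loop in the spirit of the ES baseline. The gensim backend, corpus preprocessing, population size, mutation step, and budget are illustrative assumptions, not the authors' implementation.

```python
import random

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel


def make_objective(train_texts, valid_texts):
    """Build f(T): train an LDA model with T topics and return validation perplexity."""
    dictionary = Dictionary(train_texts)
    train_bow = [dictionary.doc2bow(doc) for doc in train_texts]
    valid_bow = [dictionary.doc2bow(doc) for doc in valid_texts]

    def f(T):
        # Priors fixed as alpha = beta = 1/T, matching the setup described above.
        lda = LdaModel(corpus=train_bow, id2word=dictionary, num_topics=T,
                       alpha=1.0 / T, eta=1.0 / T, passes=5, random_state=0)
        # log_perplexity returns the per-word likelihood bound (natural log);
        # exp(-bound) converts it to perplexity, where lower is better.
        return float(np.exp(-lda.log_perplexity(valid_bow)))

    return f


def evolution_strategy(f, t_min=2, t_max=200, pop_size=4, budget=20, seed=0):
    """Greedy mutation-and-selection loop over the discrete topic count T."""
    rng = random.Random(seed)
    cache = {}

    def evaluate(T):
        # Cache results so repeated T values do not consume extra LDA trainings.
        if T not in cache:
            cache[T] = f(T)
        return cache[T]

    for T in [rng.randint(t_min, t_max) for _ in range(pop_size)]:
        evaluate(T)                                                # initial population
    for _ in range(budget - len(cache)):
        parent = min(cache, key=cache.get)                         # current best T
        child = min(t_max, max(t_min, parent + rng.randint(-10, 10)))  # mutation
        evaluate(child)
    best_T = min(cache, key=cache.get)
    return best_T, cache[best_T]


# Usage (with tokenized documents as lists of words):
# f = make_objective(train_texts, valid_texts)
# best_T, best_ppl = evolution_strategy(f, budget=20)
```

A GA variant of this loop would draw parents by tournament selection and recombine them with the binary crossover sketched further below.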

The study provides a detailed analysis of selection algorithms, mutation mechanisms, and crossover operations, and compares the effectiveness of different black-box optimization methods for tuning the number of topics in LDA with respect to the quality of topic modeling.
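As one possible concretization of the binary crossover mentioned above, the following sketch mixes the bit patterns of two parent topic counts uniformly; the per-bit mixing rule and the clamping range are assumptions for illustration rather than the paper's exact operator.

```python
import random


def binary_crossover(t1, t2, t_min=2, t_max=200, rng=None):
    """Uniform bitwise crossover of two topic counts t1 and t2."""
    rng = rng or random.Random()
    n_bits = max(t1.bit_length(), t2.bit_length())
    child = 0
    for i in range(n_bits):
        # Inherit each bit position from a randomly chosen parent.
        parent = t1 if rng.random() < 0.5 else t2
        child |= ((parent >> i) & 1) << i
    # Clamp the offspring to the admissible range of topic counts.
    return min(t_max, max(t_min, child))


# Example: an offspring of T=48 (0b0110000) and T=97 (0b1100001) mixes their bits.
print(binary_crossover(48, 97, rng=random.Random(1)))
```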

Probabilistic topic models provide a latent, low-dimensional representation of large text corpora by modeling documents as mixtures of topics and topics as distributions over words. The most widely used baseline is Latent Dirichlet Allocation (LDA), introduced by Blei, Ng, and Jordan as a generative Bayesian model for collections of discrete data such as text corpora [1]. In LDA, each document is represented by a multinomial distribution over topics, and each topic by a multinomial distribution over words, both regularized by Dirichlet priors.
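Written compactly, the standard LDA generative process described here is:

$$\theta_d \sim \mathrm{Dir}(\alpha), \qquad \phi_t \sim \mathrm{Dir}(\beta), \qquad z_{dn} \sim \mathrm{Mult}(\theta_d), \qquad w_{dn} \sim \mathrm{Mult}(\phi_{z_{dn}}),$$

where θ_d is the topic mixture of document d, φ_t the word distribution of topic t, and z_dn, w_dn the topic assignment and observed word at position n of document d; in this work the priors are symmetric with α = β = 1/T.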

A substantial line of work has emphasized the importance of properly choosing and tuning these Dirichlet priors. Wallach et al. showed that asymmetric priors over document-topic distributions can significantly improve perplexity and robustness, and that automatic hyperparameter optimization reduces the sensitivity of LDA to the number of topics while avoiding the complexity of fully nonparametric models [15]. This motivates treating LDA configuration as an explicit hyperparameter optimization problem rather than fixing priors heuristically.

Choosing the number of topics T is another central challenge. A common strategy…

Reference

This content is AI-processed based on open access ArXiv data.
