Cliqueformer: Model-Based Optimization with Structured Transformers


Large neural networks excel at prediction tasks, but applying them to design problems, such as protein engineering or materials discovery, requires solving offline model-based optimization (MBO) problems. Because predictive accuracy alone does not translate directly into effective design, recent MBO algorithms incorporate reinforcement learning and generative modeling techniques. Meanwhile, theoretical work suggests that exploiting the target function's structure can enhance MBO performance. We present Cliqueformer, a transformer-based architecture that learns the black-box function's structure through functional graphical models (FGM), addressing distribution shift without relying on explicit conservative approaches. Across various domains, including chemical and genetic design tasks, Cliqueformer demonstrates superior performance compared to existing methods.


💡 Research Summary

Cliqueformer addresses a fundamental challenge in offline model‑based optimization (MBO): the distribution shift that occurs when the surrogate model is pushed into regions of the design space that are under‑represented in the training data. Traditional MBO pipelines first learn a regression model fθ(x) from a fixed dataset D = {(xi, yi)} and then maximize this model to propose new designs. Because the data typically cover only a small fraction of the high‑dimensional design space, the optimization step often generates candidates that lie far from the data manifold, leading to over‑optimistic predictions and poor real‑world performance. Existing remedies—offline reinforcement‑learning constraints, conservative regularizers such as COMs, or generative‑model‑based sampling—mitigate the problem but at the cost of severely restricting exploration.
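The failure mode above can be reproduced in a few lines. The following is a hypothetical sketch (not from the paper): a misspecified linear surrogate is fit to noisy samples of a concave objective near the origin, and gradient ascent on the surrogate then walks arbitrarily far from the data, with the surrogate's prediction growing ever more optimistic while the true value collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown objective (concave); the learner only sees noisy samples near 0.
true_f = lambda x: -np.sum((x - 1.0) ** 2, axis=-1)
X = rng.normal(0.0, 0.3, size=(200, 4))        # data covers a small region
y = true_f(X) + 0.1 * rng.normal(size=200)

# Misspecified linear surrogate f_theta(x) = w·x + b, fit by ridge regression.
A = np.concatenate([X, np.ones((200, 1))], axis=1)
w = np.linalg.solve(A.T @ A + 1e-2 * np.eye(5), A.T @ y)

# Gradient ascent on the surrogate: the gradient w[:4] never vanishes, so the
# candidate leaves the data region; the surrogate keeps promising improvement
# while the true objective deteriorates.
x = X[y.argmax()].copy()
for _ in range(1000):
    x += 0.1 * w[:4]

pred = np.concatenate([x, [1.0]]) @ w
print(f"surrogate predicts {pred:.1f}, true value {true_f(x):.1f}")
```

The surrogate's over-optimism here is exactly the distribution shift that conservative regularizers try to penalize, and that Cliqueformer instead addresses through structure.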

Cliqueformer’s key insight is to exploit the functional graphical model (FGM) of the black‑box objective. An FGM is a graph over the input variables where an absent edge (i, j) guarantees that the contributions of xi and xj to the objective are additive and independent. Consequently, the objective can be decomposed into a sum over maximal cliques C of the graph: f(x) = ΣC fC(xC). Theoretical work (Grudzien et al., 2024) shows that, under this decomposition, the regret bound for MBO depends only on the coverage of each clique’s marginal distribution, not on the coverage of the full joint space. In practice, this means that if the dataset sufficiently spans each low‑dimensional clique, the surrogate can be accurate enough to guide optimization even when the full space is sparsely sampled.
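A toy FGM makes the decomposition concrete. In this hypothetical example (the clique functions are made up for illustration), five variables form two maximal cliques {0, 1, 2} and {2, 3, 4}; since edge (0, 3) is absent, x0 and x3 interact only through separate additive terms, and each clique-local function depends on just a three-dimensional marginal of the data.

```python
import numpy as np

# Maximal cliques of a hypothetical FGM over x = (x0, ..., x4).
cliques = [(0, 1, 2), (2, 3, 4)]

def f_c0(xc):
    # Clique-local term on (x0, x1, x2) -- an arbitrary example.
    return xc[0] * xc[1] - xc[2] ** 2

def f_c1(xc):
    # Clique-local term on (x2, x3, x4) -- an arbitrary example.
    return np.sin(xc[0]) + xc[1] * xc[2]

clique_fns = [f_c0, f_c1]

def f(x):
    # FGM decomposition: f(x) = sum over maximal cliques C of f_C(x_C).
    return sum(fn(x[list(c)]) for c, fn in zip(cliques, clique_fns))

x = np.array([1.0, 2.0, 0.5, -1.0, 3.0])
print(f(x))
```

Estimating each f_C only requires data that covers its low-dimensional clique marginal, which is the source of the improved regret bound cited above.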

To operationalize this theory, Cliqueformer introduces two design desiderata.

  1. Pre‑defined clique decomposition – The model is forced to split the latent representation z of an input into Nclique overlapping sub‑vectors (cliques) of dimension dclique, with a knot size dknot controlling overlap between consecutive cliques. Each clique is processed independently by a small MLP (or transformer block) equipped with a sinusoidal embedding that uniquely identifies the clique. The final prediction is the arithmetic mean of the clique‑wise outputs, which is equivalent to a sum up to a constant factor but better scaled numerically as the number of cliques grows. By fixing the clique structure a priori, the architecture sidesteps the intractable problem of learning an arbitrary sparse graph from data.
  2. Broad coverage of each clique’s latent distribution – To prevent any single clique from collapsing to a narrow region (which would re‑introduce distribution shift), Cliqueformer places a variational information bottleneck (VIB) on each clique. Unlike the classic VIB that regularizes the joint latent vector, Cliqueformer samples a single clique uniformly at each training step and minimizes the KL divergence between its conditional encoder distribution eθ(zC|x) and a standard normal prior. Because cliques overlap, the knot dimensions receive regularization more frequently, encouraging a well‑spread latent space across all dimensions.
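The latent-splitting scheme of desideratum 1 can be sketched as follows. The sizes are assumed for illustration, the clique-wise networks are reduced to linear maps, and the sinusoidal clique embeddings are omitted; this is a sketch of the slicing arithmetic, not the paper's implementation.

```python
import numpy as np

# Assumed sizes: N_clique overlapping sub-vectors of width d_clique,
# with consecutive cliques sharing d_knot "knot" dimensions.
n_clique, d_clique, d_knot = 4, 8, 2
stride = d_clique - d_knot
d_latent = stride * (n_clique - 1) + d_clique   # total latent width

rng = np.random.default_rng(0)
# Stand-in for the per-clique MLPs: one linear readout per clique.
W = rng.normal(size=(n_clique, d_clique)) / np.sqrt(d_clique)

def predict(z):
    # Each clique reads its own overlapping slice of the latent vector.
    slices = [z[i * stride : i * stride + d_clique] for i in range(n_clique)]
    clique_outputs = [w @ s for w, s in zip(W, slices)]
    # Arithmetic mean of clique-wise outputs (a sum up to a 1/N factor).
    return np.mean(clique_outputs)

z = rng.normal(size=d_latent)
print(predict(z))
```

Because the slices overlap by d_knot dimensions, the knot coordinates participate in two cliques each, which is also why they receive the per-clique VIB regularization of desideratum 2 more frequently.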

The training objective combines a standard regression loss (e.g., mean‑squared error) with the per‑clique VIB term: Lclique(θ) = Lreg(θ) + β·E_{C∼Unif(cliques)}[ KL( eθ(zC|x) ‖ N(0, I) ) ], where β controls the strength of the bottleneck and the expectation is estimated by sampling a single clique per training step.
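Under the common assumption of a diagonal Gaussian encoder eθ(zC|x) = N(μC, diag(σC²)), the per-clique KL term has a closed form, and the loss above reduces to a few lines. This is a sketch of that computation with made-up inputs, not the paper's training code.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def clique_loss(y_pred, y_true, mu_cliques, log_var_cliques, beta, rng):
    reg = np.mean((y_pred - y_true) ** 2)      # L_reg: mean-squared error
    c = rng.integers(len(mu_cliques))          # C ~ Uniform(cliques)
    vib = kl_to_standard_normal(mu_cliques[c], log_var_cliques[c])
    return reg + beta * vib                    # L_clique = L_reg + beta * KL

# Made-up encoder outputs for 4 cliques of latent dimension 8.
rng = np.random.default_rng(0)
mu = [rng.normal(size=8) for _ in range(4)]
log_var = [rng.normal(scale=0.1, size=8) for _ in range(4)]
loss = clique_loss(np.array([1.0, 2.0]), np.array([1.1, 1.9]),
                   mu, log_var, beta=0.01, rng=rng)
print(loss)
```

Sampling one clique per step gives an unbiased one-sample estimate of the expectation over cliques while keeping the per-step cost independent of Nclique.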

