Learning to Reason in LLMs by Expectation Maximization


Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive a reward-based filtered expectation-maximization (FEM) objective for learning to reason. This view connects EM to modern reward-based optimization and shows that the main challenge lies in designing a sampling distribution over rationales that justify correct answers. We instantiate and compare three sampling schemes: rejection sampling with a budget, the self-taught reasoner (STaR), and prompt posterior sampling (PPS), which keeps only the rationalization stage of STaR that conditions on the correct answer in the prompt. We experiment with LLM-as-a-judge calibration and summarization-from-feedback tasks, where conditioning on the correct answer provides strong guidance for generating rationales. Our experiments show the efficacy of PPS over the other sampling schemes, and that the choice of sampling scheme can have a significant impact on performance.


💡 Research Summary

The paper “Learning to Reason in LLMs by Expectation Maximization” frames the common practice of generating a rationale (or chain‑of‑thought) before producing a final answer as a latent‑variable model (LVM). In this view, a question x, an unobserved reasoning trace z, and the correct answer y★ form a three‑node graphical model x → z → y★. The authors show that learning such a model can be cast in the classic Expectation‑Maximization (EM) framework: the E‑step computes the posterior distribution of the latent rationale given the observed pair (x, y★), while the M‑step maximizes the expected complete‑data log‑likelihood with respect to the model parameters θ.
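The EM view above can be written compactly. The following is a sketch using the summary's own notation (question x, latent rationale z, correct answer y★), not the paper's exact equations:

```latex
% Marginal likelihood of the observed answer under the LVM x -> z -> y*:
\log p_\theta(y^\star \mid x)
  = \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y^\star \mid x, z)

% E-step: posterior over the latent rationale given the observed pair
q(z) = p_\theta(z \mid x, y^\star)

% M-step: maximize the expected complete-data log-likelihood
\theta \leftarrow \arg\max_\theta \;
  \mathbb{E}_{q(z)}\!\left[\log p_\theta(z, y^\star \mid x)\right]
```

The difficulty is the E-step: for an LLM, the exact posterior p_θ(z | x, y★) cannot be sampled directly, which motivates the approximation that follows.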

Because exact posterior sampling is infeasible for large language models, the authors propose a practical approximation called Filtered EM (FEM). FEM replaces the true posterior with a proposal distribution q(z, y | x, y★; θ) that is implemented by prompting the current model. A single Monte‑Carlo sample (ẑ, ŷ) is drawn; if the sampled answer matches the ground truth (ŷ = y★) the pair is kept, otherwise it is discarded. The binary reward r(ŷ, y★) = 1{ŷ = y★} thus acts as a filter: only the retained pairs enter the M‑step update.
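The keep/discard step described above can be sketched as a filtered E-step. The following toy Python sketch is illustrative only: `TOY_MODEL` and `sample_rationale_and_answer` are hypothetical stand-ins for prompting the current LLM, not part of the paper.

```python
import random

# Toy stand-in for the proposal q(z, y | x): maps a question to candidate
# (rationale, answer) pairs. In practice this is the current LLM prompted
# with x; all names here are illustrative assumptions.
TOY_MODEL = {
    "2+2?": [("add the operands", "4"), ("guess", "5")],
    "3*3?": [("multiply the operands", "9"), ("guess", "6")],
}

def sample_rationale_and_answer(x):
    """Draw one Monte-Carlo sample (z_hat, y_hat) from the proposal."""
    return random.choice(TOY_MODEL[x])

def filtered_e_step(dataset):
    """Keep a sampled pair only when its answer matches the ground truth:
    the binary reward r(y_hat, y_star) = 1{y_hat == y_star} is the filter."""
    kept = []
    for x, y_star in dataset:
        z_hat, y_hat = sample_rationale_and_answer(x)
        if y_hat == y_star:
            kept.append((x, z_hat, y_star))
    return kept

data = [("2+2?", "4"), ("3*3?", "9")]
accepted = filtered_e_step(data)
```

The accepted (x, ẑ, y★) triples would then serve as supervised fine-tuning targets in the M-step; the three sampling schemes compared in the paper differ only in how the proposal draws (ẑ, ŷ).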

