Socratic Students: Teaching Language Models to Learn by Asking Questions

Reading time: 6 minutes

📝 Original Info

  • Title: Socratic Students: Teaching Language Models to Learn by Asking Questions
  • ArXiv ID: 2512.13102
  • Date: 2025-12-15
  • Authors: Rajeev Bhatt Ambati, Tianyi Niu, Aashu Singh, Shlok Mishra, Snigdha Chaturvedi, Shashank Srivastava

📝 Abstract

Large language models (LLMs) are usually used to answer questions, but many high-stakes applications (e.g., tutoring, clinical support) require the complementary skill of asking questions: detecting missing information, requesting clarifications, and using them to solve tasks. We study this skill in reasoning-heavy domains where progress depends on inquiry rather than factual recall. We define an interactive protocol where a student model engages a stronger teacher under a small turn budget. After each teacher reply, we evaluate the student on the original task with Pass@k. We propose the Outcome-Driven Question optimization Strategy (ODQS), a training framework that learns a questioning policy from downstream task outcomes. At each turn, we sample multiple candidate questions, query the teacher with each, and score the student's resulting performance. Using these scores, we train the student via supervised fine-tuning followed by Direct Preference Optimization (DPO), without any human labels. On GSM8K, HumanEval, and OpenCoder, ODQS produces large gains over interactive baselines, boosting Pass@5 by up to 54.7% (absolute) on math and 22.9% (absolute) on coding, and matching baseline performance in three fewer turns. Thus, question asking can be explicitly trained from task outcomes, improving both accuracy and efficiency in interactive reasoning.
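The abstract compresses the whole training recipe into a few sentences, so a small sketch may help make the outcome-driven signal concrete. The snippet below is a minimal illustration, not the authors' implementation: the objects `student` and `teacher` and the helper `is_correct` are assumed interfaces, and the candidate count and k value are placeholder choices. It shows one ODQS turn (sample candidate questions, get a teacher reply for each, score each by downstream Pass@k) and how those scores could be converted into SFT and DPO training examples without human labels.

```python
# Minimal sketch of the outcome-driven scoring loop described in the abstract.
# student.ask / teacher.reply / student.answer / is_correct are hypothetical
# stand-ins, not the paper's code; the scoring recipe follows the abstract:
# candidate question -> teacher reply -> Pass@k on the original task.

def pass_at_k(student, problem, dialogue, k, is_correct):
    """Sample k answer-only attempts and return the fraction that are correct."""
    attempts = [student.answer(problem, dialogue) for _ in range(k)]
    return sum(is_correct(problem, a) for a in attempts) / k

def score_candidates(student, teacher, problem, dialogue, is_correct,
                     n_candidates=4, k=5):
    """One ODQS turn: sample candidate questions, score each by downstream Pass@k."""
    scored = []
    for _ in range(n_candidates):
        question = student.ask(problem, dialogue)            # candidate clarifying question
        reply = teacher.reply(problem, dialogue, question)   # guidance, never the final answer
        trial_dialogue = dialogue + [(question, reply)]
        utility = pass_at_k(student, problem, trial_dialogue, k, is_correct)
        scored.append((utility, question, reply))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

def build_training_data(scored, problem, dialogue):
    """Turn outcome scores into SFT and DPO examples (no human labels needed)."""
    best, worst = scored[0], scored[-1]
    sft_example = {"context": (problem, dialogue), "target_question": best[1]}
    dpo_pair = {"context": (problem, dialogue), "chosen": best[1], "rejected": worst[1]}
    return sft_example, dpo_pair
```

In this reading, the only supervision is the downstream Pass@k signal itself: the highest-utility question becomes an SFT target, and best-versus-worst candidates form a DPO preference pair.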

💡 Deep Analysis

Figure 1

📄 Full Content

Socratic Students: Training Language Models to Ask Better Questions for Reasoning

Rajeev Bhatt Ambati¹*, Tianyi Niu¹*, Aashu Singh², Shlok Mishra², Snigdha Chaturvedi¹, Shashank Srivastava¹
¹UNC Chapel Hill, ²Meta
*Equal contribution. Correspondence: ambati@cs.unc.edu

Abstract

Large language models (LLMs) are usually used to answer questions, but many high-stakes applications (e.g., tutoring, clinical support) require the complementary skill of asking questions: detecting missing information, requesting clarifications, and using them to solve tasks. We study this skill in reasoning-heavy domains where progress depends on inquiry rather than factual recall. We define an interactive protocol where a student model engages a stronger teacher under a small turn budget. After each teacher reply, we evaluate the student on the original task with Pass@k. We propose the Outcome-Driven Question optimization Strategy (ODQS), a training framework that learns a questioning policy from downstream task outcomes. At each turn, we sample multiple candidate questions, query the teacher with each, and score the student's resulting performance. Using these scores, we train the student via supervised fine-tuning followed by Direct Preference Optimization (DPO), without any human labels. On GSM8K, HumanEval, and OpenCoder, ODQS produces large gains over interactive baselines, boosting Pass@5 by up to 54.7% (absolute) on math and 22.9% (absolute) on coding, and matching baseline performance in three fewer turns. Thus, question asking can be explicitly trained from task outcomes, improving both accuracy and efficiency in interactive reasoning.

1 Introduction

The dominant paradigm for language models is reactive: present a prompt and receive a response. This works beautifully when the model has the information it needs. However, many real-world applications, such as educational tutoring (Hu et al., 2023; Pan et al., 2024; Team et al., 2025; Kim et al., 2024) and medical assistance (Li et al., 2024, 2025), require models to identify uncertainties, ask questions, and adapt to new information. For example, a diagnostic assistant must ask targeted questions before recommending treatment, and a tutor must ask probing questions to identify a student's misconceptions. In these settings, knowing what to ask is the central bottleneck. In such dynamic interactions, models fail not because they cannot generate answers, but because they ask the wrong questions, or none at all.

Recent work has explored interactive settings, including agents that ask clarifying questions (Aliannejadi et al., 2019; Press et al., 2023; Yao et al., 2023) and student–teacher setups where a stronger model guides a weaker one (Kendapadi et al., 2025). These approaches show that interaction helps, but a key gap remains: we lack a training signal that teaches a model which questions to ask. Most methods rely on heuristics, scaffolds, or human judgments of question quality (Aliannejadi et al., 2019; Yao et al., 2023). We argue that question quality should be judged not by style or surface semantics, but by utility: does the question improve the model's ability to solve the task? Some work applies reward-based refinement for clarifying questions (Andukuri et al., 2024; Srivastava et al., 2019), but previous work has not explored training questioning policies or interaction in reasoning tasks.
We focus on reasoning-intensive domains (math and code) and formalize the interaction via a student–teacher protocol: a student model S attempts a problem and is allowed to query a stronger teacher T, which provides guidance but never the final answer. The student operates under a budget of questioning turns. After each teacher response, we evaluate whether S can now solve the original problem by sampling answer-only attempts and computing Pass@k. This yields a clean operational definition of utility: a question is good if and only if it increases downstream Pass@k. Building such agents presents three core challenges. The first challenge is search, since the […]

[Figure 1: Example student–teacher exchange on a GSM8K word problem. The student asks whether to express the total chocolate bars sold in terms of the number of boxes, sets up 3.5x + 4.5x = 64, simplifies to 8x = 64, and asks how to solve for x; the teacher confirms each step and offers guidance without revealing the final answer.]
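To make the protocol concrete, here is a hedged sketch of the interaction loop described above under the stated turn budget. The interfaces (`student.ask`, `teacher.reply`, `student.answer`, `is_correct`) are assumptions for illustration rather than the paper's implementation; the loop simply records Pass@k after every teacher reply, which is the quantity the utility definition is stated in terms of.

```python
# Hedged sketch of the student-teacher protocol: the student S queries the
# teacher T for at most `budget` turns; after every teacher reply we re-evaluate
# S on the original problem with Pass@k (answer-only attempts, no further hints).
def interact(student, teacher, problem, is_correct, budget=5, k=5):
    dialogue = []
    history = []  # Pass@k recorded after each questioning turn
    for turn in range(budget):
        question = student.ask(problem, dialogue)            # student decides what to ask
        reply = teacher.reply(problem, dialogue, question)   # guidance only, never the answer
        dialogue.append((question, reply))
        # Answer-only evaluation: k independent attempts at the original problem.
        attempts = [student.answer(problem, dialogue) for _ in range(k)]
        history.append(sum(is_correct(problem, a) for a in attempts) / k)
    return history
```

Under this sketch, the utility of the question asked at turn t is the change history[t] - history[t-1]: a question counts as good exactly when it raises downstream Pass@k.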


Reference

This content is AI-processed based on open access ArXiv data.
