DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Large language models (LLMs) perform strongly on many language tasks but still struggle with complex multi-step reasoning across disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, as well as guiding principles for question synthesis. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents to generate multidisciplinary questions. The central insight is the notion of Design Logic, a form of reusable meta-knowledge that encapsulates the structured process human experts use to transform knowledge into complex exam questions, enabling LLMs to generate new questions with the same complex reasoning patterns from entirely different source texts, with explicit control over difficulty, diversity, and question types. We use LLMs to reverse-engineer and abstract over 120,000 Design Logics from existing questions across various disciplines. By designing a two-stage retrieve-and-generate mechanism to match these Design Logics with the raw corpora, we synthesized two large-scale reasoning datasets that span 75 disciplines: DLR-Book (3.04 million questions from the book corpus) and DLR-Web (1.66 million questions from the web corpus). Data analysis indicates that the questions synthesized by our method exhibit greater difficulty and diversity than those in the baseline datasets. Supervised fine-tuning (SFT) on Qwen3 and Llama3 with our data substantially improves multidisciplinary reasoning and outperforms baseline datasets. Notably, by applying SFT to the base versions of these models using only our data, we even surpass their official final models that have undergone full post-training.


💡 Research Summary

The paper introduces DESIGNER, a novel pipeline that synthesizes large‑scale, high‑quality, multidisciplinary reasoning questions for large language models (LLMs) by abstracting the human exam‑question design process into reusable “Design Logics.” The authors first collect a massive question bank (≈150 M items) and, using Qwen‑3‑30B, annotate each question with discipline, difficulty, and type. They then sample a balanced subset of 132 k questions and employ DeepSeek‑R1‑0528 to reverse‑engineer each question into a structured Design Logic, which captures the step‑by‑step reasoning blueprint used by educators (knowledge point identification, scenario construction, reasoning path design, answer and distractor creation, validation). After embedding these logics with Qwen‑3‑Embedding‑4B, they deduplicate them via a similarity‑threshold graph (τ = 0.85), yielding 125 k unique Design Logics across 75 disciplines.
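The deduplication step described above (connect embeddings whose similarity exceeds a threshold, then keep one representative per connected component) can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the vectors, function names, and union-find approach are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dedup_by_similarity(embeddings, tau=0.85):
    """Union-find over the similarity graph: edge iff cosine >= tau.
    Returns the indices of one representative per connected component."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # O(n^2) pairwise pass -- fine for a sketch; at ~132k items the real
    # pipeline would need approximate nearest-neighbor search instead.
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= tau:
                parent[find(j)] = find(i)

    # Keep the first-seen member of each component.
    seen, keep = set(), []
    for i in range(len(embeddings)):
        root = find(i)
        if root not in seen:
            seen.add(root)
            keep.append(i)
    return keep

# Toy example: the first two vectors are near-duplicates.
vecs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(dedup_by_similarity(vecs))  # -> [0, 2]
```

The threshold τ = 0.85 trades recall for precision: a lower value merges more logics into one cluster, shrinking the final set below the reported 125 k.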

Next, the pipeline processes two raw corpora: a proprietary book corpus (≈3 M high‑quality, discipline‑labeled passages) and a web corpus (≈1.6 M passages filtered for reasoning relevance). For each passage, the system retrieves the top‑5 most similar Design Logics using cosine similarity of embeddings (coarse retrieval). A second LLM pass (DeepSeek‑R1‑0528) then selects the best‑matching logic and generates a graduate‑level exam question that strictly follows the logic’s prescribed steps. This two‑stage retrieval‑augmented generation ensures semantic alignment between source text and reasoning structure while providing explicit control over difficulty and diversity. Generated questions undergo MinHash deduplication and 13‑gram decontamination against all evaluation benchmarks. The final datasets, DLR‑Book (3.04 M questions) and DLR‑Web (1.66 M questions), together comprise 4.7 M questions, each paired with a concise reference answer.
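The coarse-retrieval stage above amounts to ranking all Design-Logic embeddings by cosine similarity against a passage embedding and keeping the top k (k = 5 in the paper). A minimal sketch under those assumptions, with toy vectors and hypothetical names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_top_k(passage_vec, logic_vecs, k=5):
    """Rank Design-Logic embeddings against a passage embedding and
    return the indices of the k most similar ones. These candidates
    would then go to the second-stage LLM pass for final selection."""
    ranked = sorted(
        range(len(logic_vecs)),
        key=lambda i: cosine(passage_vec, logic_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy example: four logic embeddings, one passage embedding.
logics = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
print(retrieve_top_k([0.9, 0.1], logics, k=2))  # -> [0, 1]
```

In practice the 125 k logic embeddings would be indexed once (e.g. with an approximate-nearest-neighbor library) rather than scanned per passage, but the ranking criterion is the same.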

Quality analysis shows that the synthesized questions are markedly harder and more diverse than those in existing benchmarks (GSM8K, MMLU, GPQA, etc.). Using a Qwen‑3‑30B‑Instruct model to label difficulty, the proportion of “Very Hard” items in DLR‑Book and DLR‑Web exceeds that of the baseline datasets by a factor of two to three, while “Easy” items are virtually absent (<1%). Diversity is measured via five embedding‑space metrics (mean cosine distance, mean L2 distance, 1‑NN distance, cluster inertia, radius) on 300 k sampled questions; all metrics indicate substantially higher semantic spread than the comparison datasets, especially the 1‑NN distance, which is roughly double.
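Two of the five diversity metrics listed above can be sketched on a toy sample of question embeddings. The metric definitions below follow common usage (mean pairwise cosine distance; mean distance from each point to its nearest neighbor); the paper's exact formulas may differ.

```python
import math

def cosine_dist(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def diversity_metrics(embs):
    """Return (mean pairwise cosine distance, mean 1-NN distance).
    Larger values indicate a semantically more spread-out question set."""
    n = len(embs)
    dists = [[cosine_dist(embs[i], embs[j]) for j in range(n)] for i in range(n)]
    mean_pairwise = sum(
        dists[i][j] for i in range(n) for j in range(i + 1, n)
    ) / (n * (n - 1) / 2)
    # 1-NN distance: for each question, distance to its closest other question.
    mean_1nn = sum(
        min(d for j, d in enumerate(row) if j != i)
        for i, row in enumerate(dists)
    ) / n
    return mean_pairwise, mean_1nn

# A tight cluster of embeddings vs. a spread-out set: both metrics
# should come out larger for the spread-out set.
tight = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.2]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(diversity_metrics(tight))
print(diversity_metrics(spread))
```

A near-doubled 1-NN distance, as reported for the DESIGNER datasets, means each question's closest semantic neighbor is roughly twice as far away as in the comparison datasets.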

To assess downstream impact, the authors generate long chain‑of‑thought (CoT) responses for each question using Qwen‑3‑235B‑A22B‑Thinking‑2507‑FP8 and fine‑tune (SFT) Qwen‑3 and Llama‑3 models on the resulting question‑answer pairs. Across a suite of multidisciplinary reasoning benchmarks, models fine‑tuned on DESIGNER data achieve average improvements of 4.2 percentage points over models fine‑tuned on prior synthetic datasets (e.g., EvolInstruct, Self‑Instruct). Notably, when only the base versions of Qwen‑3 and Llama‑3 are fine‑tuned with DESIGNER data, they surpass the performance of the official “final” releases that have undergone full post‑training, demonstrating the potency of the synthesized data.

The paper’s contributions are threefold: (1) introducing Design Logic as a meta‑knowledge abstraction of human question design, (2) a two‑stage retrieval‑augmented generation framework that yields controllable, high‑difficulty, diverse questions, and (3) releasing two massive, 75‑discipline datasets that set new standards for multidisciplinary reasoning data. Limitations include potential bias transfer from the LLMs used for logic extraction, computational cost of large‑scale similarity retrieval, and the need for human verification of answer correctness. Future work is suggested on automated logic quality metrics, multimodal extensions (tables, graphs, images), and human‑LLM collaborative validation pipelines.

In summary, DESIGNER demonstrates that by formalizing and reusing the expert design process, it is possible to automatically generate vast quantities of challenging, diverse, cross‑disciplinary reasoning problems, substantially advancing the reasoning capabilities of contemporary LLMs.

