FMBench: Adaptive Large Language Model Output Formatting

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: https://github.com/FudanCVL/FMBench.


💡 Research Summary

The paper addresses a practical yet under‑explored problem in large language model (LLM) deployment: generating outputs that are not only semantically correct but also conform to strict Markdown formatting requirements. While existing alignment work (instruction tuning, RLHF) has markedly improved the ability of LLMs to follow natural‑language instructions, real‑world systems often demand that responses be parsable by downstream tools and readable by humans. Small formatting slips—broken nested lists, malformed tables, inconsistent heading levels, or unbalanced fenced code blocks—can cause parsing failures, degrade user experience, and undermine the reliability of automated pipelines.

To fill this gap, the authors introduce FMBench, a dedicated benchmark for adaptive Markdown output formatting. The dataset construction pipeline consists of four stages: (1) crawling raw documents from eight domains (academic, official, technical, legal, business, education, etc.), (2) cleaning and normalizing these texts, (3) applying three predefined formatting rules at three difficulty levels to automatically generate target Markdown structures, and (4) human expert verification to correct residual errors. The pipeline preserves the original content order, enforces deterministic sampling of structural specifications, and records extensive metadata (difficulty level, validator ID, etc.) for reproducibility. The final corpus contains 1,100 high‑quality Markdown documents (800 for training, 300 for testing). Statistical analysis shows that section counts and block‑quote numbers are kept moderate, while nested‑list depth and list‑item counts provide the primary source of structural complexity, mirroring realistic long‑form formatting scenarios.
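Stage (3)'s deterministic sampling of structural specifications, together with the metadata recording described above, might look like the following minimal sketch. The field names, value ranges, and `validator_id` scheme here are illustrative assumptions, not the paper's actual specification; the key idea is that seeding the RNG from the document ID makes each sampled spec reproducible.

```python
import hashlib
import random

def sample_spec(doc_id: str, difficulty: int) -> dict:
    """Deterministically sample a structural spec for one document.

    Hypothetical sketch: the seed is derived from the document ID,
    so re-running the pipeline yields the same spec and metadata.
    """
    seed = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return {
        "doc_id": doc_id,
        "difficulty": difficulty,                 # 1..3, per the paper
        "max_heading_level": rng.randint(2, 4),   # illustrative range
        "list_nesting_depth": rng.randint(1, 2 + difficulty),
        "require_table": rng.random() < 0.3 * difficulty,
        "validator_id": f"v{rng.randint(1, 5)}",  # recorded as metadata
    }

# The same document ID always yields the same specification:
assert sample_spec("doc-001", 2) == sample_spec("doc-001", 2)
```

Note how difficulty only widens the nesting-depth range and table probability here, mirroring the paper's observation that nested-list depth is the primary source of structural complexity.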

Evaluation uses two complementary metrics: (a) Semantic Score, measured by BERTScore‑F1 between generated and reference texts, capturing content preservation; (b) Structure Score, a rule‑based reward that penalizes violations such as heading‑level mismatches, list‑nesting errors, malformed tables, and unbalanced code fences. Both scores are averaged over the test set.
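A toy version of such a rule-based Structure Score, covering only three of the violation classes named above with an assumed flat penalty of 0.25 per violation (the paper's actual validator rules and weights are not reproduced here), could look like:

```python
import re

def structure_score(markdown: str) -> float:
    """Toy rule-based structure check (not the paper's exact validator).

    Deducts a fixed penalty per violation: unbalanced code fences,
    heading levels that skip (e.g. # followed directly by ###), and
    table rows whose cell counts disagree with the first row.
    """
    penalties = 0

    # 1. Code fences must come in pairs.
    fences = len(re.findall(r"^```", markdown, flags=re.MULTILINE))
    if fences % 2 != 0:
        penalties += 1

    # 2. Heading levels should increase by at most one step.
    levels = [len(m.group(1))
              for m in re.finditer(r"^(#{1,6})\s", markdown, flags=re.MULTILINE)]
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            penalties += 1

    # 3. Table rows should match the first row's column count.
    rows = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if rows:
        ncols = rows[0].count("|")
        penalties += sum(1 for r in rows[1:] if r.count("|") != ncols)

    return max(0.0, 1.0 - 0.25 * penalties)
```

A real validator would also track list-nesting depth and fence languages, but even this sketch shows why such scores are cheap to compute and easy to average over a test set.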

The core methodological contribution is a lightweight alignment pipeline that combines supervised fine‑tuning (SFT) with reinforcement learning fine‑tuning (RLFT). First, an SFT stage trains the model on instruction–response pairs, dramatically improving semantic fidelity. Next, RLFT refines the SFT policy using a composite reward that linearly combines the semantic and structural components, with tunable weights to balance the two objectives. This design avoids hard decoding constraints at inference time, thereby keeping latency low while still encouraging the model to internalize formatting behavior.
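The composite reward can be sketched as a simple linear mixture of the two scores; the single weight `alpha` and its default value are assumptions for illustration, since the paper only states that the weights are tunable:

```python
def composite_reward(semantic: float, structure: float,
                     alpha: float = 0.5) -> float:
    """Linearly combine semantic and structural rewards.

    `alpha` near 1.0 prioritizes content preservation; near 0.0 it
    prioritizes format compliance. The 0.5 default is an assumption.
    """
    assert 0.0 <= alpha <= 1.0
    return alpha * semantic + (1.0 - alpha) * structure
```

The trade-off reported in the experiments falls directly out of this form: pushing `alpha` toward 0 rewards structure at the expense of semantics, which is why the authors stress careful reward design.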

Experiments are conducted on two distinct model families—OpenPangu and Qwen. Results show that SFT alone yields substantial gains in semantic scores (≈4–6 percentage points absolute) but only modest improvements in structural compliance. Adding RLFT leads to an additional 8–12 point increase in Structure Score, especially on high‑difficulty samples that involve deep list nesting and complex tables. However, overly aggressive weighting of the structural term can degrade semantic quality, highlighting a clear trade‑off that must be managed via careful reward design. Notably, Qwen’s stronger baseline SFT performance results in a smaller marginal gain from RLFT, suggesting that the effectiveness of the RL stage depends on the quality of the initial supervised policy.

The authors also compare their approach to alternative strategies such as prompt engineering, constrained decoding, and pure post‑training format‑aware fine‑tuning. Prompt‑based methods prove brittle to wording changes; constrained decoding guarantees syntactic correctness but incurs significant computational overhead and can interfere with natural language flow; pure format‑aware fine‑tuning often overfits to superficial patterns and fails to generalize to unseen structural variations. In contrast, the proposed SFT + RL pipeline achieves a favorable balance: it reduces formatting errors without sacrificing inference speed and demonstrates better generalization across diverse Markdown specifications.

In the discussion, the paper points out several avenues for future work. Automated reward engineering—e.g., learning reward weights or employing meta‑learning to adapt to new format schemas—could further alleviate the manual tuning burden. Extending the benchmark and training pipeline to other hybrid markup languages such as reStructuredText or LaTeX would test the generality of the approach. Finally, integrating human‑in‑the‑loop feedback (e.g., preference data on formatting quality) could complement the rule‑based rewards and improve alignment with subjective human notions of “readable” Markdown.

In summary, FMBench provides a rigorously constructed, reproducible benchmark for Markdown‑constrained generation, and the proposed SFT‑then‑RL alignment pipeline demonstrates that modest additional reinforcement learning can substantially boost structural compliance while preserving semantic fidelity. This work bridges a critical gap between LLM instruction following and the practical formatting demands of real‑world applications, offering a scalable path toward more reliable, user‑friendly AI assistants and tool‑augmented workflows.

