MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large language models (LLMs) have proven effective in artificial intelligence, where the multi-agent system (MAS) holds considerable promise for healthcare development by achieving the collaboration of LLMs. However, the absence of a systematic pipeline for agent construction and the rigidity of static collaboration patterns render current MAS-based models vulnerable to collaboration failures, resulting in substantial performance degradation in medical decision-making scenarios. To this end, we propose a novel Masked Agent Collaboration (MAC) framework that harnesses Pareto-optimal agent construction and cross-consistency maximization mechanisms to achieve adaptive progressive propagation of collaborative information, boosting the medical decision-making capacity. Specifically, we first conduct a Pareto-frontier factors analysis towards the LLMs pool to consider their key factors, including the model size, inference time, diversity score, and throughput ratio, where we calculate the similarity between pairwise outputs within an LLM to derive its diversity score. Beyond this analysis, we enable the identification of Pareto-optimal models that balance efficiency and capability, which are subsequently selected as collaborative agents to consider the fundamental trade-offs inherent in practical LLM deployment. Afterward, we measure the pairwise similarity between the outputs from collaborative agents to determine their cross-consistency values, subsequently masking out the agent with the lowest cross-consistency value to eliminate the output that is likely semantically inconsistent. Finally, we conduct collaboration of agents by achieving adaptive progressive propagation, where each agent aggregates the outputs of unmasked agents from the previous layer as its input to generate the corresponding output via prompt engineering.

💡 Research Summary

The paper introduces Masked Agent Collaboration (MAC), a novel framework for improving medical decision‑making with large language models (LLMs) organized as a multi‑agent system (MAS). The authors identify two major shortcomings of existing MAS‑based approaches: (1) the lack of a systematic pipeline for selecting a set of collaborative agents from a heterogeneous pool of LLMs, and (2) the reliance on static collaboration patterns that cannot adapt to failures or inconsistencies among agents. To address these issues, MAC combines Pareto‑optimal agent construction with a cross‑consistency maximization mechanism that dynamically masks unreliable agents during inference.

Pareto‑optimal agent construction
The authors first define four quantitative criteria for each candidate LLM: model size (parameter count), inference time, diversity score, and throughput (tokens per second). The diversity score is computed by generating multiple outputs for the same prompt from a single model and measuring semantic similarity among those outputs; a lower similarity indicates higher diversity. Using these four dimensions, they formulate a multi‑objective optimization problem and identify the Pareto frontier of models that simultaneously balance efficiency (size, time) and capability (diversity, throughput). This systematic selection replaces ad‑hoc choices based solely on accuracy and ensures that the resulting agent pool is both performant and resource‑conscious, a crucial consideration for real‑world clinical deployments.

Cross‑consistency maximization and masking
Once a Pareto‑optimal set of agents is assembled, MAC evaluates pairwise semantic similarity between the agents’ outputs for a given query, defining this similarity as the cross‑consistency (CC) value. The agent with the lowest CC—i.e., the most inconsistent with the rest—is masked out for the current layer. This masking is performed iteratively at each layer, preventing an agent that is likely to produce erroneous or hallucinated information from influencing subsequent reasoning steps. By removing the least consistent contributor early, the framework mitigates the propagation of misinformation, a critical risk in medical contexts where over‑confidence can lead to harmful recommendations.

Adaptive progressive propagation
The remaining (unmasked) agents from the previous layer are then fed as contextual prompts to each agent in the current layer. Each agent generates a new response that incorporates the collective wisdom of its peers, akin to a round of expert discussion. This process repeats across multiple layers, progressively refining the answer while continuously pruning inconsistent voices. The architecture thus mimics human expert panels that iteratively converge on a consensus, enhancing both accuracy and interpretability.

Experimental evaluation
MAC is evaluated on three specialized medical QA datasets, including the Obstetrics and Gynecology subset of NEJM‑QA, MedQA, and an additional clinical question set. Baselines include a large‑scale MAS composed of 70B‑141B open‑access LLMs (the “Debate/MoA” style models) and the proprietary GPT‑4 system. Results show that MAC outperforms the MAS baseline by an average of 16.55 percentage points and surpasses GPT‑4 by 9.35 points on the NEJM‑QA task. Moreover, because the Pareto selection incorporates smaller, faster models, MAC achieves comparable or lower memory consumption and inference latency than the baselines, demonstrating a favorable trade‑off between performance and efficiency. Ablation studies confirm that both the Pareto‑optimal selection and the cross‑consistency masking contribute significantly to the overall gains.

Critical analysis and limitations
The paper’s contributions are clear and well‑motivated. The use of Pareto optimization to balance multiple deployment constraints is a practical advance, and the dynamic masking based on cross‑consistency offers a principled way to handle agent disagreement. However, several aspects merit further scrutiny:

Diversity and throughput metrics – The diversity score depends on the sampling strategy (temperature, top‑k) and the number of generated samples per model. Different settings could inflate or deflate diversity, potentially affecting the Pareto frontier. A sensitivity analysis would strengthen confidence in the robustness of the selection process.
Cross‑consistency computation – The authors employ a fixed semantic embedding model to measure similarity. If this embedding is not fine‑tuned on medical terminology, it may misjudge consistency for domain‑specific phrasing, leading to suboptimal masking decisions. Incorporating a medically specialized encoder or an ensemble of similarity measures could improve reliability.
Masking policy – Currently only the single lowest‑CC agent is masked per layer. In scenarios where multiple agents produce divergent yet plausible answers, a more nuanced approach (e.g., weighted aggregation based on CC scores or masking a set of low‑CC agents) might yield better robustness.
Evaluation scope – The benchmarks focus on multiple‑choice QA, which, while informative, do not fully capture the generative demands of real clinical workflows such as summarizing patient records, drafting treatment plans, or handling ambiguous symptom descriptions. Extending evaluation to these tasks would demonstrate broader applicability.
Reproducibility – The paper provides high‑level algorithmic descriptions but lacks detailed hyper‑parameter settings for the Pareto optimization (e.g., population size, number of generations) and for the similarity thresholds used in masking. Open‑sourcing the code and data splits would facilitate independent verification.

Future directions
Potential extensions include: (a) integrating uncertainty estimation (e.g., Monte Carlo dropout) into the CC calculation to better capture confidence; (b) exploring hierarchical agent groups where specialized sub‑agents (e.g., imaging, lab results) collaborate under a chief coordinator; (c) applying reinforcement learning to adaptively learn masking policies based on downstream clinical outcomes; and (d) testing MAC in a live clinical decision‑support setting to assess real‑time performance and user acceptance.

Conclusion
Masked Agent Collaboration (MAC) presents a compelling approach to harness the collective power of heterogeneous LLMs for medical decision‑making. By systematically selecting Pareto‑optimal agents and dynamically pruning inconsistent contributors through cross‑consistency masking, MAC achieves higher accuracy than both large‑scale MAS ensembles and state‑of‑the‑art single models while maintaining computational efficiency. The framework addresses a critical gap in current MAS research—namely, the need for adaptable, resource‑aware collaboration mechanisms in high‑stakes domains such as healthcare. With further refinement of similarity metrics, masking strategies, and broader clinical evaluations, MAC could become a foundational component of trustworthy AI‑augmented medical practice.

MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making

💡 Research Summary

Comments & Academic Discussion

Leave a Comment