JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks
Evaluating agentic AI on open-ended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate diverse valid response strategies, while LLM-as-a-judge approaches adapt to individual responses yet suffer from instability and bias. Human experts address this dilemma by combining domain-grounded principles with dynamic, claim-level assessment. Inspired by this process, we propose JADE, a two-layer evaluation framework. Layer 1 encodes expert knowledge as a predefined set of evaluation skills, providing stable evaluation criteria. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies, with evidence-dependency gating to invalidate conclusions built on refuted claims. Experiments on BizBench show that JADE improves evaluation stability and reveals critical agent failure modes missed by holistic LLM-based evaluators. We further demonstrate strong alignment with expert-authored rubrics and effective transfer to a medical-domain benchmark, validating JADE across professional domains. Our code is publicly available at https://github.com/smiling-world/JADE.
💡 Research Summary
The paper “JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks” introduces a novel two-layer framework designed to address the fundamental stability-adaptivity dilemma in evaluating AI agents on open-ended professional tasks. Traditional static rubrics offer rigor but lack flexibility for diverse valid solutions, while LLM-as-a-judge methods provide adaptability but suffer from instability and bias. JADE (Judge Agents with Dynamic Evaluation) resolves this by mimicking the human expert evaluation process, which decouples domain-general principles from case-specific evidence assessment.
The JADE framework operates through two distinct layers. Layer 1: Query-Specific Checklist Generation encodes expert knowledge into a predefined library of evaluation skills (e.g., regulatory compliance, market analysis). For a given user query, a deterministic process activates the relevant subset of these skills and composes them into a query-specific checklist. Because the same query always yields the same foundational criteria, this layer guarantees evaluation stability. Layer 2: Report-Specific Checklist and Claim Extraction dynamically analyzes the report the agent produces. It extracts verifiable factual claims (e.g., “Supplier X is FDA-certified”) and reasoning claims (e.g., conclusions, strategic insights), then generates a report-specific checklist tailored to that report’s content. This enables adaptive assessment of diverse reasoning strategies.
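The Layer-1 mechanism can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the skill library, trigger keywords, and function names below are all hypothetical stand-ins for JADE's expert-authored skills and its deterministic activation step.

```python
# Hypothetical skill library; JADE's actual skills are expert-authored.
SKILL_LIBRARY = {
    "regulatory_compliance": "Does the report verify certifications and legal requirements?",
    "market_analysis": "Does the report ground supplier comparisons in current market data?",
    "cost_modeling": "Does the report quantify total cost of ownership?",
}

# Simple keyword triggers stand in for the deterministic activation process.
SKILL_TRIGGERS = {
    "regulatory_compliance": ["fda", "certified", "compliance", "regulation"],
    "market_analysis": ["supplier", "market", "competitor"],
    "cost_modeling": ["cost", "price", "budget"],
}

def query_specific_checklist(query: str) -> list[str]:
    """Layer 1 (sketch): deterministically activate skills relevant to the query.

    Because activation depends only on the query, the same query always
    yields the same foundational criteria, giving evaluation stability.
    """
    q = query.lower()
    active = sorted(
        skill for skill, triggers in SKILL_TRIGGERS.items()
        if any(t in q for t in triggers)
    )
    return [SKILL_LIBRARY[s] for s in active]

checklist = query_specific_checklist(
    "Find an FDA-certified supplier of surgical gloves within budget"
)
```

The key design point is that Layer 1 contains no LLM call at all: stability comes from keeping the query-to-criteria mapping deterministic, while all adaptivity is deferred to Layer 2.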
The evaluation pipeline employs a dual-track verification system. A Verification Agent performs real-time web searches to validate factual claims, assigning a reliability score. An LLM Judge evaluates the response against the criteria in both the query-specific and report-specific checklists. A critical innovation is Dependency Gating: if a reasoning checklist item depends on a factual claim that fails verification (below a confidence threshold), that reasoning item receives a score of zero. This prevents well-reasoned but factually unsupported conclusions from inflating the overall score. The final evaluation score is the product of the normalized, weighted reasoning score and the aggregated evidence reliability score, ensuring that both sound reasoning and factual correctness are required for a high score.
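The gating and aggregation logic described above can be sketched as follows. This is a simplified sketch under stated assumptions: the threshold value, the dataclass names, and the use of a simple mean for evidence reliability are all illustrative choices, not values taken from the paper.

```python
from dataclasses import dataclass, field

# Illustrative confidence threshold; the paper's actual value is not assumed here.
CONFIDENCE_THRESHOLD = 0.5

@dataclass
class FactualClaim:
    text: str
    reliability: float  # assigned by the Verification Agent's web search

@dataclass
class ReasoningItem:
    text: str
    weight: float
    judge_score: float                           # LLM Judge score in [0, 1]
    depends_on: list = field(default_factory=list)  # indices of supporting claims

def jade_score(claims: list[FactualClaim], items: list[ReasoningItem]) -> float:
    """Dependency gating plus score aggregation (sketch).

    A reasoning item that depends on a refuted claim is scored zero;
    the final score is the product of the normalized weighted
    reasoning score and the aggregated evidence reliability.
    """
    gated = []
    for item in items:
        supported = all(
            claims[i].reliability >= CONFIDENCE_THRESHOLD for i in item.depends_on
        )
        gated.append(item.judge_score if supported else 0.0)

    total_weight = sum(it.weight for it in items)
    reasoning = sum(it.weight * s for it, s in zip(items, gated)) / total_weight
    evidence = sum(c.reliability for c in claims) / len(claims)
    return reasoning * evidence

claims = [
    FactualClaim("Supplier X is FDA-certified", reliability=0.9),
    FactualClaim("Supplier Y holds 40% market share", reliability=0.2),  # refuted
]
items = [
    ReasoningItem("Recommend Supplier X", 0.6, judge_score=0.8, depends_on=[0]),
    ReasoningItem("Prefer Supplier Y for scale", 0.4, judge_score=0.9, depends_on=[1]),
]
score = jade_score(claims, items)
```

In this toy run, the second reasoning item is well argued (judge score 0.9) but rests on a refuted claim, so gating zeroes it out; the multiplicative aggregation then ensures neither fluent reasoning nor reliable evidence alone can produce a high score.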
To empirically validate JADE, the authors constructed BizBench, a benchmark of 150 open-ended strategic sourcing queries reflecting authentic, temporally dynamic professional workflows. Experiments on BizBench demonstrate that JADE significantly improves evaluation stability compared to holistic LLM judges. More importantly, it uncovers critical agent failure modes—such as citation hallucination, shallow reasoning, and methodology-masked non-completion—that are often missed by conventional evaluators. The framework also shows strong alignment with expert-authored rubrics and exhibits effective transfer to a medical-domain benchmark, confirming its generalizability across professional domains.
In summary, JADE provides a systematic, transparent, and robust method for evaluating agentic AI in complex, open-ended scenarios. By grounding evaluation in expert-defined skills while allowing dynamic, claim-level assessment, it bridges the gap between rigorous standardization and the necessary flexibility to judge diverse, valid professional responses.