Label Curation Using Agentic AI

Data annotation is essential for supervised learning, yet producing accurate, unbiased, and scalable labels remains challenging as datasets grow in size and modality. Traditional human-centric pipelines are costly, slow, and prone to annotator variability, motivating reliability-aware automated annotation. We present AURA (Agentic AI for Unified Reliability Modeling and Annotation Aggregation), an agentic AI framework for large-scale, multi-modal data annotation. AURA coordinates multiple AI agents to generate and validate labels without requiring ground truth. At its core, AURA adapts a classical probabilistic model that jointly infers latent true labels and annotator reliability via confusion matrices, using Expectation-Maximization to reconcile conflicting annotations and aggregate noisy predictions. Across the four benchmark datasets evaluated, AURA achieves accuracy improvements of up to 5.8% over baseline. In more challenging settings with poor-quality annotators, the improvement is up to 50% over baseline. AURA also accurately estimates the reliability of annotators, allowing assessment of annotator quality even without any pre-validation steps.


💡 Research Summary

The paper introduces AURA (Agentic AI for Unified Reliability Modeling and Annotation Aggregation), a framework that automates large‑scale, multimodal data labeling by orchestrating multiple off‑the‑shelf AI agents (large language models, vision models, and multimodal systems) without any fine‑tuning or in‑context learning. The core technical contribution is the AEML algorithm (Agentic Expectation‑Maximization for Labeling), which adapts the classic Dawid‑Skene crowd‑sourcing model to the modern setting where each “annotator” is an autonomous AI agent with its own, unknown confusion matrix.

AEML operates in an EM loop. In the E‑step, given current estimates of each agent’s confusion matrix Θ⁽ᵃ⁾ and class priors p_y, it computes posterior probabilities w_{i,y} = Pr(y_i = y | observed labels from all agents). This step effectively weights each agent’s prediction by its estimated reliability. In the M‑step, the algorithm updates each Θ⁽ᵃ⁾ by expectation‑weighted counts of observed labels, and also refines the class priors. The process repeats until the change in log‑likelihood falls below a threshold γ, guaranteeing monotonic improvement and convergence to a stationary point. Initialization assumes uniform class priors and a diagonal dominance parameter λ (e.g., 70–90% correct) for each agent’s confusion matrix; λ is a hyper‑parameter that controls the assumed baseline reliability.
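The EM loop above can be sketched in NumPy as a standard Dawid–Skene-style aggregator. This is a minimal illustration under the paper's stated initialization (uniform priors, λ-diagonal confusion matrices), not the authors' actual AEML implementation; all function and variable names are ours.

```python
import numpy as np

def aeml_em(labels, n_classes, lam=0.8, gamma=1e-6, max_iter=50):
    """Dawid-Skene-style EM over AI-agent labels (illustrative sketch).

    labels : (n_items, n_agents) int array; labels[i, a] is the class
             agent a assigned to item i.
    lam    : assumed diagonal dominance for initializing each agent's
             confusion matrix (the lambda hyper-parameter).
    gamma  : stop once the log-likelihood gain falls below this threshold.
    """
    n, m = labels.shape
    K = n_classes
    priors = np.full(K, 1.0 / K)                       # uniform class priors
    theta = np.full((m, K, K), (1.0 - lam) / (K - 1))  # off-diagonal mass
    theta[:, np.arange(K), np.arange(K)] = lam         # lambda on the diagonal

    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: log Pr(y_i = y, observed labels) for every item and class.
        log_joint = np.tile(np.log(priors), (n, 1))
        for a in range(m):
            log_joint += np.log(theta[a][:, labels[:, a]]).T
        mx = log_joint.max(axis=1, keepdims=True)
        log_evidence = mx[:, 0] + np.log(np.exp(log_joint - mx).sum(axis=1))
        w = np.exp(log_joint - log_evidence[:, None])  # posteriors w_{i,y}

        # M-step: refresh priors and confusion matrices from
        # expectation-weighted counts of the observed labels.
        priors = w.mean(axis=0)
        for a in range(m):
            counts = np.zeros((K, K))
            for k in range(K):
                counts[:, k] = w[labels[:, a] == k].sum(axis=0)
            counts += 1e-9                             # smoothing, avoids log(0)
            theta[a] = counts / counts.sum(axis=1, keepdims=True)

        ll = log_evidence.sum()
        if ll - prev_ll < gamma:                       # convergence check
            break
        prev_ll = ll
    return w.argmax(axis=1), w, theta, priors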

Complexity analysis shows the E‑step costs O(n·M·|Y|) and the M‑step O(n·M·|Y|²), yielding overall O(e·n·M·|Y|²) for e iterations. In practice, with datasets of a few thousand instances, a handful of agents (7‑12), and up to a few hundred classes, convergence occurs within 10‑15 iterations, making the method feasible on a standard CPU.
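Plugging the largest reported configuration into these bounds gives a back-of-envelope sense of scale (operation counts only, not measured runtimes):

```python
# Rough operation counts for the largest reported setting
# (n = 4271 instances, M = 12 agents, |Y| = 50 classes, e = 15 iterations).
n, M, Y, e = 4271, 12, 50, 15
e_step = n * M * Y       # O(n*M*|Y|) per iteration
m_step = n * M * Y ** 2  # O(n*M*|Y|^2) per iteration, the dominant term
total = e * (e_step + m_step)
print(f"~{total:.1e} elementary operations")
```

On the order of 2×10⁹ elementary operations, which is consistent with the claim that the method runs comfortably on a standard CPU.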

The authors evaluate AURA on four publicly available benchmarks: Kinetics‑400 (video, 1 000 clips, 60 classes), ImageNet‑ReaL (image, 4 271 samples, 50 classes), Food‑101 (image, 1 000 samples, 101 classes), and CUB‑200 (image, 1 018 samples, 17 classes). For each dataset, 7‑12 state‑of‑the‑art multimodal models (e.g., Gemini‑2.5‑flash, Qwen‑2.5‑VL, Pegasus‑1.2, GPT‑4o‑mini, GPT‑5‑mini, LLaVA‑13B, MoonDream‑2) are invoked via API calls to generate raw predictions. No model is fine‑tuned; the same prompt template is used across modalities.

Results show that AURA’s aggregated labels consistently outperform any single agent. Accuracy gains range from 3.2% to 5.8% over the best individual model, and in “adverse” scenarios where many agents are low‑quality (e.g., noisy video frame extraction), the improvement can reach 30%–50%. Moreover, the learned confusion matrices correlate strongly (r ≈ 0.87) with the true per‑agent accuracies measured against the ground truth, confirming that AURA estimates agent reliability accurately without any external validation data. The framework also reveals class‑specific biases, such as systematic over‑ or under‑prediction of certain food categories, enabling downstream diagnostics.
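One natural way to turn a fitted confusion matrix into the scalar reliability score being correlated here is the agent's expected accuracy under the class prior, Σ_y p_y · Θ⁽ᵃ⁾[y, y]. The sketch below uses made-up matrices standing in for EM's estimates; the numbers are illustrative, not values from the paper.

```python
import numpy as np

# Hypothetical fitted quantities for 2 agents over 3 classes
# (stand-ins for the class priors p_y and confusion matrices from EM).
priors = np.array([0.5, 0.3, 0.2])
theta = np.array([
    [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],  # strong agent
    [[0.5, 0.3, 0.2],   [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]],  # weak agent
])

# Expected accuracy of each agent: sum_y p_y * Theta^(a)[y, y].
# The repeated subscript 'yy' extracts each matrix's diagonal.
reliability = np.einsum('y,ayy->a', priors, theta)
print(reliability)  # the strong agent scores markedly higher
```

Correlating such scores with empirical per-agent accuracies on held-out ground truth (e.g., via `np.corrcoef`) is one way to obtain the kind of r value reported.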

Key advantages of AURA are: (1) zero labeling cost beyond API usage, (2) principled handling of heterogeneous agent reliability, leading to higher-quality aggregated labels, and (3) provision of quantitative reliability scores that can guide agent selection, weighting, or pruning in future pipelines. Limitations include EM’s sensitivity to initialization (potentially converging to local optima) and linear scaling of computation with the number of agents, which may become burdensome in very large ensembles. The current formulation also assumes single‑label classification; extensions to multilabel, sequence labeling, or hierarchical taxonomies are left for future work.

The authors suggest several directions for further research: incorporating variational Bayesian inference or deep meta‑learners to improve scalability, integrating uncertainty‑aware active learning loops that query human annotators only when posterior confidence is low, and adding interactive feedback mechanisms where agents can request clarification or re‑evaluate ambiguous instances.

In summary, AURA demonstrates that by explicitly modeling the reliability of autonomous AI agents through a probabilistic EM framework, it is possible to achieve cost‑effective, high‑accuracy labeling for large‑scale multimodal datasets. This work paves the way for more robust, automated data curation pipelines essential for the next generation of AI systems.

