ProToken: Token-Level Attribution for Federated Large Language Models
Federated Learning (FL) enables collaborative training of Large Language Models (LLMs) across distributed data sources while preserving privacy. However, when federated LLMs are deployed in critical applications, it remains unclear which client(s) contributed to specific generated responses, hindering debugging, malicious client identification, fair reward allocation, and trust verification. We present ProToken, a novel Provenance methodology for Token-level attribution in federated LLMs that addresses client attribution during autoregressive text generation while maintaining FL privacy constraints. ProToken leverages two key insights to enable provenance at each token: (1) transformer architectures concentrate task-specific signals in later blocks, enabling strategic layer selection for computational tractability, and (2) gradient-based relevance weighting filters out irrelevant neural activations, focusing attribution on neurons that directly influence token generation. We evaluate ProToken across 16 configurations spanning four LLM architectures (Gemma, Llama, Qwen, SmolLM) and four domains (medical, financial, mathematical, coding). ProToken achieves 98% average attribution accuracy in correctly localizing responsible client(s), and maintains high accuracy as the number of clients is scaled, validating its practical viability for real-world deployment settings.
💡 Research Summary
The paper introduces ProToken, a provenance framework that attributes each generated token of a federated large language model (LLM) to the client(s) whose private data most influenced it. In federated learning (FL), multiple organizations collaboratively fine‑tune a shared LLM without exposing raw data, but once the global model generates text it is unclear which client contributed to a particular response. This opacity hampers debugging, malicious client detection, fair reward allocation, and overall trust.
ProToken rests on two technical insights. First, transformer architectures concentrate task‑specific signals in the later blocks—especially the self‑attention output projections and the final feed‑forward layers. By restricting analysis to these layers, the method avoids the prohibitive cost of examining every neuron in a billion‑parameter model. Second, a gradient‑based relevance weighting (similar to Integrated Gradients) is applied to the selected layers, producing a per‑token, per‑client relevance score that automatically suppresses activations unrelated to the token being generated.
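The layer-restriction idea above can be sketched in a few lines. This is an illustrative helper, not the paper's implementation: the cutoff fraction and the layer-name convention are assumptions made here for demonstration.

```python
def select_late_layers(layer_names, fraction=0.25):
    """Keep only the last `fraction` of transformer blocks for attribution.

    Restricting analysis to late blocks (where task-specific signal
    concentrates) avoids examining every neuron in the model. The 25%
    default is illustrative, not a value from the paper.
    """
    k = max(1, int(len(layer_names) * fraction))
    return layer_names[-k:]

# Hypothetical 8-block model: only the last two blocks are analyzed.
blocks = [f"block_{i}" for i in range(8)]
late = select_late_layers(blocks)  # -> ["block_6", "block_7"]
```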
Methodologically, the authors exploit the linearity of common FL aggregation schemes (FedAvg, FedProx): the global model's parameters can be expressed as a weighted sum of the per‑client contributions. During autoregressive generation, for each token t they compute activations h_i^L(t) and gradients ∂ℓ/∂h_i^L(t) at the chosen late layers L for every client i. The product h_i^L(t)·∂ℓ/∂h_i^L(t) yields a client‑specific attribution score s_i(t). Summing s_i(t) over the entire output sequence produces a final attribution vector P∈ℝ^K, whose highest entry indicates the primary responsible client. All computations involve only model updates, activations, and gradients—no raw client data—thus preserving FL privacy guarantees.
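The scoring pipeline described above can be sketched as follows. This is a minimal stand-in, assuming the per-client activations h_i^L(t) and gradients ∂ℓ/∂h_i^L(t) have already been extracted from the selected late layers; how those tensors are obtained from a real transformer is outside the sketch.

```python
def token_attribution(activations, gradients):
    """Per-client relevance for one generated token t:
    s_i(t) = sum_j h_ij * dL/dh_ij over the selected late-layer units.

    `activations[i]` and `gradients[i]` are flat per-client vectors
    (hypothetical inputs standing in for the extracted tensors).
    """
    return [sum(h * g for h, g in zip(activations[i], gradients[i]))
            for i in range(len(activations))]

def sequence_attribution(per_token_scores):
    """Sum token-level scores over the output sequence to get P in R^K;
    the argmax entry marks the primary responsible client."""
    num_clients = len(per_token_scores[0])
    P = [sum(step[i] for step in per_token_scores)
         for i in range(num_clients)]
    return P, max(range(num_clients), key=lambda i: P[i])

# Toy example with K=2 clients and a 2-token output sequence:
s_t1 = token_attribution([[1.0, 2.0], [0.1, 0.1]],
                         [[0.5, 0.5], [0.5, 0.5]])  # [1.5, 0.1]
P, winner = sequence_attribution([s_t1, s_t1])      # P = [3.0, 0.2]
```

The activation-times-gradient product is the same first-order relevance signal used by gradient×input saliency methods; here it is simply accumulated per client instead of per input feature.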
A major contribution of the work is a rigorous evaluation protocol. Since real ground‑truth provenance is unavailable, the authors inject distinct backdoor trigger‑response pairs into the local training data of selected clients. When a prompt containing the trigger is issued, the global model emits the pre‑programmed sentinel response, providing an unambiguous signal that the responsible client is the one that received the backdoor. Using this setup, ProToken is tested across 16 configurations: four LLM families (Gemma, Llama, Qwen, SmolLM) and four domains (medical, financial, mathematical reasoning, coding). The framework achieves an average token‑level attribution accuracy of 98.62%. When scaling the number of participating clients from 5 to 55 (an 11× increase), accuracy remains above 92%, demonstrating robustness to client‑population growth.
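The backdoor-based protocol reduces to a simple accuracy check once triggers are planted. The sketch below assumes a hypothetical `attribute_fn` that wraps the full generate-then-attribute pipeline and returns a predicted client index; it is a scaffold for the protocol, not the authors' harness.

```python
def backdoor_attribution_accuracy(attribute_fn, trials):
    """Fraction of backdoor trials where attribution names the planted client.

    `trials` is a list of (trigger_prompt, planted_client_index) pairs;
    `attribute_fn` maps a prompt to the predicted client index. Both are
    hypothetical stand-ins for the paper's generation + ProToken pipeline.
    """
    correct = sum(1 for prompt, planted in trials
                  if attribute_fn(prompt) == planted)
    return correct / len(trials)

# Toy run: a fake attributor that reads the planted index out of the prompt.
trials = [(f"trigger-{i}", i % 3) for i in range(6)]
fake_attr = lambda prompt: int(prompt.split("-")[1]) % 3
acc = backdoor_attribution_accuracy(fake_attr, trials)  # -> 1.0
```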
The authors also discuss practical implications. ProToken can be employed to (i) pinpoint a malicious or poorly performing client during inference, (ii) allocate monetary or reputation‑based rewards proportionally to actual contribution, and (iii) guide client‑selection strategies for future FL rounds to improve overall model quality. Limitations include reliance on backdoor‑based ground truth, which may not capture all real‑world failure modes, and the residual computational load of processing late‑layer activations in extremely large models. Future work is suggested on more efficient gradient approximations, real‑time attribution dashboards, and extending the approach to multimodal generative models.
In summary, ProToken is the first method to provide fine‑grained, token‑level provenance for federated LLMs while respecting privacy constraints. By leveraging transformer layer characteristics and gradient relevance, it delivers high accuracy and scalability, opening a path toward more transparent, accountable, and trustworthy federated generative AI systems.