Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Prior work on fairness in large language models (LLMs) has primarily focused on access-level behaviors such as refusals and safety filtering. However, equitable access does not ensure equitable interaction quality once a response is provided. In this paper, we conduct a controlled fairness audit examining how LLMs differ in tone, uncertainty, and linguistic framing across demographic identities after access is granted. Using a counterfactual prompt design, we evaluate GPT-4 and LLaMA-3.1-70B on career advice tasks while varying identity attributes along age, gender, and nationality. We assess access fairness through refusal analysis and measure interaction quality using automated linguistic metrics, including sentiment, politeness, and hedging. Identity-conditioned differences are evaluated using paired statistical tests. Both models exhibit zero refusal rates across all identities, indicating uniform access. Nevertheless, we observe systematic, model-specific disparities in interaction quality: GPT-4 expresses significantly higher hedging toward younger male users, while LLaMA exhibits broader sentiment variation across identity groups. These results show that fairness disparities can persist at the interaction level even when access is equal, motivating evaluation beyond refusal-based audits.


💡 Research Summary

The paper “Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness” investigates fairness in large language models (LLMs) beyond the traditional focus on access‑level behavior such as refusals. The authors argue that even when a model grants access uniformly, the quality of the interaction—tone, expressed uncertainty, sentiment, and politeness—may differ across demographic groups, potentially influencing user trust and downstream decisions.

To isolate identity effects, the study adopts a paired, counterfactual prompt design. For each of 30 career‑advice prompts, eight identity descriptors are generated by varying three protected attributes: age (younger vs. older), gender (male vs. female), and nationality (US‑born vs. immigrant). The identity statement is inserted at the beginning of the prompt, while the task description and a strict output contract (fixed structure and length) remain unchanged. This ensures that the only difference between paired inputs is the demographic cue.
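The paired construction above can be sketched in a few lines. The attribute values come from the paper; the identity-statement wording and the example task are illustrative assumptions, not the paper's exact templates.

```python
from itertools import product

# Attribute values from the paper; the descriptor wording is an assumption.
ATTRIBUTES = {
    "age": ["younger", "older"],
    "gender": ["male", "female"],
    "nationality": ["US-born", "immigrant"],
}

def identity_variants():
    """Yield all 2^3 = 8 identity descriptor combinations."""
    keys = list(ATTRIBUTES)
    for values in product(*(ATTRIBUTES[k] for k in keys)):
        yield dict(zip(keys, values))

def build_prompt(task: str, identity: dict) -> str:
    """Prefix the fixed task with an identity statement; only the
    demographic cue differs between paired inputs."""
    descriptor = (f"I am a {identity['age']} {identity['gender']} "
                  f"{identity['nationality']} professional.")
    return f"{descriptor} {task}"

# Hypothetical example task (the paper uses 30 career-advice prompts).
task = "What career steps should I take to move into data science?"
prompts = [build_prompt(task, ident) for ident in identity_variants()]
# -> 8 identity-conditioned variants of the same task prompt
```

Because the task text and output contract are held constant, any per-pair difference in the generated response can be attributed to the demographic cue alone.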

Two widely used LLMs are audited: GPT‑4 (gpt‑4‑0125‑preview) and LLaMA‑3.1‑70B (Meta’s open‑weight model). Both are queried with low temperature (T = 0.2) and nucleus sampling (top‑p = 0.9) to reduce stochastic variation while preserving realistic deployment conditions (no fixed random seed). Each identity‑prompt pair receives a single completion per model.
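A minimal sketch of the query configuration, using the decoding settings reported above. The payload shape follows the common OpenAI-style chat format and the `max_tokens` value is an illustrative assumption; the paper constrains output length through the prompt's output contract rather than a token cap.

```python
def make_request(model: str, prompt: str) -> dict:
    """Build a chat-style request with the paper's decoding settings.

    No random seed is fixed, matching the deployment-like conditions
    described in the study.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low temperature to reduce stochastic variation
        "top_p": 0.9,        # nucleus sampling
        "max_tokens": 512,   # illustrative cap, not from the paper
    }

req = make_request("gpt-4-0125-preview",
                   "I am a younger male US-born professional. ...")
```

Each identity-prompt pair is sent once per model, so the audit reflects a single completion rather than an averaged distribution.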

Fairness is evaluated on three dimensions: (1) access fairness via refusal analysis, (2) interaction quality using automated linguistic metrics—sentiment (lexicon‑based score), politeness (classifier trained on the Stanford Politeness Corpus), and hedging (proportion of uncertainty markers such as “might”, “could”, “perhaps”), and (3) statistical significance of identity‑conditioned differences using paired t‑tests or Wilcoxon signed‑rank tests with Bonferroni correction for multiple comparisons.
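The hedging metric is the simplest of the three to reproduce: the proportion of tokens that are uncertainty markers. A minimal sketch follows; the lexicon here extends the three markers quoted above with a few common hedges and is an assumption, not the paper's full list.

```python
import re

# Markers "might", "could", "perhaps" are from the paper; the rest of
# this lexicon is an illustrative assumption.
HEDGES = {"might", "could", "perhaps", "may", "possibly", "likely"}

def hedging_score(text: str) -> float:
    """Return the proportion of word tokens that are uncertainty markers."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in HEDGES for t in tokens) / len(tokens)

hedging_score("You might consider a bootcamp; it could perhaps help.")
# 3 hedges out of 9 tokens -> 1/3
```

Sentiment and politeness would be scored analogously per response, and each metric is then compared across paired identity variants.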

Results show zero refusals for both models, confirming equal access. However, systematic disparities emerge in interaction quality. GPT‑4 exhibits significantly higher hedging when responding to “younger male” users (average increase of 12 percentage points, p < 0.01). LLaMA‑3.1‑70B displays broader sentiment variation across identity groups, with a notable 0.18 sentiment score gap between immigrant and US‑born users (p < 0.05). Politeness differences are present but less pronounced. These findings demonstrate that model‑specific biases persist at the interaction level even under uniform access.
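The significance thresholds behind these results come from the Bonferroni step: each paired test's p-value is compared against the family-wise alpha divided by the number of tests. A minimal sketch, assuming the per-contrast p-values have already been computed by the paired t-tests or Wilcoxon tests:

```python
def bonferroni_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag each p-value as significant under Bonferroni correction:
    a result passes only if p < alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# e.g. three identity contrasts at alpha = 0.05 -> threshold ~0.0167
bonferroni_significant([0.001, 0.03, 0.2])
# -> [True, False, False]
```

The correction is conservative by design, so the hedging and sentiment gaps reported above survive a deliberately strict multiple-comparison bar.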

The paper’s contributions are threefold: (i) introducing a rigorous counterfactual, paired‑prompt methodology for isolating demographic effects in generative LLM outputs, (ii) providing a lightweight yet statistically sound pipeline that combines automated linguistic metrics with paired hypothesis testing, and (iii) empirically showing that post‑access fairness audits are essential for modern instruction‑tuned models.

Limitations include the focus on a single advisory domain (career advice) and a modest set of prompts, reliance on a single generated response per identity (which may not capture intra‑model variability), dependence on automated metrics that may miss nuanced cultural cues, and a simplified identity space that does not fully represent intersectional realities.

Future work should expand to multiple high‑stakes domains (legal, medical, financial), increase the number of prompts and repetitions per identity, incorporate human evaluations to validate automated scores, explore richer intersectional identity configurations, and investigate “LLM‑in‑the‑loop” evaluation frameworks where another LLM or a meta‑model assesses interaction quality.

Overall, the study highlights a critical blind spot in current LLM fairness assessments: equal access does not guarantee equal experience. By exposing systematic tone and uncertainty differences tied to demographic cues, the work urges developers, policymakers, and auditors to adopt comprehensive post‑access evaluation protocols before deploying LLM‑driven advisory systems.

