DeFrame: Debiasing Large Language Models Against Framing Effects


As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing – differences in how semantically equivalent prompts are expressed (e.g., “A is better than B” vs. “B is worse than A”) – as an underexplored contributor to this gap. We first introduce the concept of “framing disparity” to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.


💡 Research Summary

The paper “DeFrame: Debiasing Large Language Models Against Framing Effects” investigates a previously under‑explored source of hidden bias in large language models (LLMs): the framing effect. While LLMs are known to be sensitive to prompt wording, most fairness evaluations and debiasing methods assume a single, fixed phrasing of a stereotype or demographic question. The authors demonstrate that even semantically equivalent prompts—e.g., “A is better than B” versus “B is worse than A”—can lead to markedly different bias measurements, a phenomenon they term “framing disparity.”

To quantify this phenomenon, the authors define a formal metric, Framing Disparity (FD), as the difference between a model’s bias scores under positive (+) and negative (–) framings of the same set of prompts. The bias score ϕ is taken from the original benchmark (e.g., accuracy‑based bias in BBQ, harmful‑response rate in DoNotAnswer, or discrimination logits in 70Decisions). FD can be positive or negative, indicating which framing yields higher bias; its absolute value |FD| captures the magnitude of framing‑induced variability.
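As a minimal sketch of the metric described above (the paper's exact sign convention and aggregation may differ), the signed disparity is simply the difference of the two framing-conditioned bias scores:

```python
def framing_disparity(phi_pos, phi_neg):
    """Signed framing disparity between the bias score under the
    positive framing (phi_pos) and the negative framing (phi_neg).
    The sign indicates which framing yields higher bias; the absolute
    value gives the magnitude of framing-induced variability."""
    fd = phi_pos - phi_neg
    return fd, abs(fd)

# Illustrative (made-up) bias scores for the two framings of one benchmark
fd, fd_abs = framing_disparity(0.12, 0.25)
```

Here `phi_pos` and `phi_neg` stand in for whatever per-benchmark bias score ϕ is used (BBQ composite score, HRR, or discrimination logits); the numeric values are illustrative only.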

The study augments three widely used fairness benchmarks with systematic framing variations:

  1. BBQ (Bias Benchmark for Question answering) – The original dataset already separates questions into negative and non‑negative forms; these are used as P⁻ and P⁺ respectively. Bias is measured by a composite score that combines accuracy loss and the proportion of stereotypical answers.

  2. DoNotAnswer‑Framed – 95 stereotype‑related prompts are identified, each flipped to the opposite polarity using an LLM, and then paraphrased four times, yielding 520 distinct inputs. Harmful‑response rate (HRR) serves as the bias metric.

  3. 70Decisions‑Framed – Binary decision‑making questions about gender and race are duplicated with opposite polarity, creating 18,900 prompts. A logit‑based discrimination score (difference between target and majority group logits) is used as ϕ.
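For the third benchmark, the logit-based discrimination score can be sketched as a simple difference of group logits; this is a hedged reading of the description above, not the paper's exact implementation:

```python
def discrimination_score(logit_target, logit_majority):
    """Logit-based discrimination score phi for a binary decision prompt:
    the difference between the model's logit for the target demographic
    group and its logit for the majority group. Zero means no measured
    preference; the sign indicates which group is favored."""
    return logit_target - logit_majority
```

Computing this score on the original and polarity-flipped versions of each prompt yields the two ϕ values whose difference gives the framing disparity.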

Across eight state‑of‑the‑art LLMs (including GPT‑4, LLaMA‑2, Qwen2.5, etc.), the authors find that FD is both frequent and substantial. For example, in BBQ the bias under negative framing can be twice as large as under positive framing, and in some categories up to four times larger. Existing prompting‑based debiasing techniques (explicit fairness instructions, chain‑of‑thought reasoning, self‑refinement) improve overall bias averages but often leave |FD| unchanged or even increase it, indicating that they do not address framing‑specific instability.

To remedy this gap, the authors propose DeFrame, a framing‑aware debiasing framework inspired by dual‑process theory (System 1 vs. System 2). DeFrame operates in two stages:

  1. Initial Generation – The model answers the original prompt (e.g., positive framing).
  2. Framing‑Contrast Revision – The same model is given the opposite framing, asked to generate a “fairness guideline,” and then to revise its initial answer accordingly.

By explicitly incorporating the alternative framing, DeFrame forces the model to reason beyond surface cues, effectively simulating a slower, deliberative System 2 correction of the fast, intuition‑driven System 1 response.
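The two-stage procedure can be sketched as follows, assuming a generic `generate` text-completion callable (a hypothetical interface) and prompt templates that paraphrase, rather than reproduce, the paper's actual prompts:

```python
def deframe(generate, prompt, opposite_prompt):
    """Two-stage framing-aware debiasing sketch.

    `generate` is any text-completion callable (hypothetical interface);
    the prompt wording below is illustrative, not the paper's."""
    # Stage 1: initial (fast, System-1-like) answer to the original framing
    initial = generate(prompt)

    # Stage 2a: show the opposite framing and elicit a fairness guideline
    guideline = generate(
        f"The same question can also be framed as: {opposite_prompt}\n"
        "Write a short fairness guideline for answering consistently "
        "regardless of how the question is framed."
    )

    # Stage 2b: revise the initial answer under the guideline
    # (slower, deliberative System-2-like correction)
    revised = generate(
        f"Question: {prompt}\n"
        f"Initial answer: {initial}\n"
        f"Guideline: {guideline}\n"
        "Revise the initial answer so that it follows the guideline."
    )
    return revised
```

The key design choice is that the opposite framing is injected explicitly, so the revision step cannot rely on surface cues of a single phrasing.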

Empirical results show that DeFrame dramatically reduces both overall bias and framing disparity. On BBQ, average FD drops by 92 % and the overall bias score by 93 % relative to the baseline. Similar improvements are observed on DoNotAnswer‑Framed and 70Decisions‑Framed, where DeFrame consistently outperforms prior debiasing methods and, crucially, does not exacerbate framing disparity. Ablation studies reveal a trade‑off: adding more successive prompting stages (i.e., more iterations of the contrast‑revision loop) further stabilizes responses across framings but incurs higher inference cost.

The paper’s contributions are threefold: (1) introduction of the framing disparity metric to expose hidden, framing‑dependent bias; (2) empirical evidence that current LLMs exhibit significant framing disparity and that existing debiasing methods fail to mitigate it; (3) a novel, model‑agnostic debiasing framework (DeFrame) that jointly reduces overall bias and framing disparity.

In the discussion, the authors note several avenues for future work: extending the approach to multi‑valued (non‑binary) framings, applying framing‑aware debiasing to domain‑specific settings (e.g., legal or medical advice), and developing more efficient inference strategies for real‑time deployment. Overall, the study highlights the necessity of incorporating framing considerations into fairness evaluation pipelines and offers a concrete, effective method to make LLMs more robust against framing‑induced bias.

