PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay
While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact their objectivity. Existing benchmarks of LLM social bias primarily evaluate gender and racial stereotypes. When political bias is included, it is typically measured at a coarse level, neglecting the specific values that shape sociopolitical leanings. This study investigates political bias in eight prominent LLMs (Claude, DeepSeek, Gemini, GPT, Grok, Llama, Qwen Base, Qwen Instruction-Tuned) using PoliticsBench: a novel multi-turn roleplay framework adapted from the EQ-Bench-v3 psychometric benchmark. We test whether commercially developed LLMs display a systematic left-leaning bias that becomes more pronounced in later stages of roleplay. Through twenty evolving scenarios, each model reported its stance and determined its course of action. Scoring these responses along ten political values, we explored the values underlying chatbots’ deviations from an unbiased standard. Seven of our eight models leaned left, while Grok leaned right. Each left-leaning LLM strongly exhibited liberal traits and moderately exhibited conservative ones. We found slight variations in alignment scores across stages of roleplay, with no consistent pattern. Though most models used consequence-based reasoning, Grok frequently argued from facts and statistics. Our study presents the first psychometric evaluation of political values in LLMs through multi-stage, free-text interactions.
💡 Research Summary
The paper introduces PoliticsBench, a novel evaluation framework that extends the psychometric EQ‑Bench‑v3 to assess political values in large language models (LLMs) through multi‑turn role‑play scenarios. Existing political bias benchmarks typically rely on single‑question, coarse‑grained metrics that classify models as simply “left” or “right.” Such approaches overlook the nuanced value systems that underlie political reasoning and cannot capture how a model’s stance evolves as a conversation unfolds.
PoliticsBench addresses these gaps by designing 20 realistic policy‑oriented scenarios (e.g., labor union disputes, universal healthcare, sanctuary city debates). Each scenario unfolds in four stages: (1) Initial conflict – the model describes its internal thoughts and feelings; (2) Conflicting loyalties – it weighs competing values; (3) External pressure – a deadline or high‑stakes event forces the model to articulate its non‑negotiables; (4) Resolution and sacrifice – the model reflects on what it gave up and why. At each stage the model must produce (a) a 400‑word internal monologue and (b) a 300‑word action plan; after the final stage it writes an 800‑word out‑of‑character debrief that explains which values guided its choices.
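To make the four‑stage protocol concrete, here is a minimal sketch of how one scenario’s prompt sequence and word budgets might be laid out. The stage names and word counts come from the description above; the data structure, field names, and prompt wording are illustrative assumptions, not the authors’ actual harness.

```python
# Illustrative layout of one PoliticsBench scenario. Stage names and word
# budgets are from the paper; this structure and the prompt phrasing are
# assumptions for exposition only.
SCENARIO_STAGES = [
    {"stage": "Initial conflict",
     "prompt": "Describe your internal thoughts and feelings about the dispute."},
    {"stage": "Conflicting loyalties",
     "prompt": "Weigh the competing values pulling you in different directions."},
    {"stage": "External pressure",
     "prompt": "A hard deadline looms: state your non-negotiables."},
    {"stage": "Resolution and sacrifice",
     "prompt": "Reflect on what you gave up and why."},
]

# Required outputs at every stage, plus one final out-of-character debrief.
PER_STAGE_OUTPUTS = {
    "internal_monologue_words": 400,   # (a) first-person monologue
    "action_plan_words": 300,          # (b) concrete course of action
}
FINAL_DEBRIEF_WORDS = 800              # values-focused debrief after stage 4
```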
Responses are scored on ten political traits that span liberal and conservative dimensions: Progress Orientation, Egalitarianism, Openness to Difference, Collectivist Responsibility, Nuanced Pragmatism, Tradition Orientation, Authority Deference, Risk Aversion, Individual Responsibility, and Moral Certainty. A “judge model” assigns each trait a raw score from 0 to 20, providing chain‑of‑thought reasoning for transparency. Scores are normalized to a –10 to +10 range, multiplied by pre‑specified weights (positive for liberal‑leaning traits, negative for conservative‑leaning traits), and summed to produce an Overall Alignment Score ranging from –100 (maximally right‑leaning) to +100 (maximally left‑leaning).
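As a worked example of this arithmetic, the sketch below assumes unit weights of ±1 per trait, since the summary specifies only the signs; with ten traits normalized to –10..+10, unit weights reproduce the stated –100 to +100 range. The split of the first five traits as liberal‑leaning and the last five as conservative‑leaning follows the list order above and is likewise an assumption.

```python
# Sketch of the PoliticsBench scoring arithmetic as described above.
# Trait names are from the paper; the +/-1 unit weights and the
# liberal/conservative split are assumptions (the paper states only
# that liberal traits get positive weights and conservative ones negative).
TRAIT_WEIGHTS = {
    # assumed liberal-leaning traits: positive weight
    "Progress Orientation": +1.0,
    "Egalitarianism": +1.0,
    "Openness to Difference": +1.0,
    "Collectivist Responsibility": +1.0,
    "Nuanced Pragmatism": +1.0,
    # assumed conservative-leaning traits: negative weight
    "Tradition Orientation": -1.0,
    "Authority Deference": -1.0,
    "Risk Aversion": -1.0,
    "Individual Responsibility": -1.0,
    "Moral Certainty": -1.0,
}

def overall_alignment(raw_scores: dict) -> float:
    """Map raw 0-20 judge scores to an Overall Alignment Score in [-100, +100]."""
    total = 0.0
    for trait, weight in TRAIT_WEIGHTS.items():
        normalized = raw_scores[trait] - 10.0   # shift 0..20 to -10..+10
        total += weight * normalized
    return total

# A uniformly neutral transcript (every raw score = 10) scores exactly 0.
assert overall_alignment({t: 10 for t in TRAIT_WEIGHTS}) == 0.0
```

Under these assumed weights, a model scoring 20 on every liberal trait and 0 on every conservative trait would reach exactly +100, matching the scale’s stated endpoints.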
Eight LLMs were evaluated: commercial instruction‑tuned models (OpenAI GPT‑4o‑mini, Anthropic Claude 3.7 Sonnet, Google Gemini 2.5 Flash‑Lite), open‑source models (Meta Llama, DeepSeek‑v3.2, Alibaba Qwen‑3‑235b Base and its Instruction‑Tuned variant), and the “anti‑woke” model Grok‑4.1. The selection spans a spectrum of development philosophies and alignment strategies, allowing the authors to examine how alignment training (e.g., Qwen Base vs. Instruction‑Tuned) influences political orientation.
Results show that seven of the eight models fall in the modestly liberal range (+19 to +39), supporting the hypothesis that commercially developed LLMs tend toward left‑leaning values. Grok is the outlier with a score of –22.7, indicating a right‑leaning bias. Across the four stages, most models exhibit only slight fluctuations in alignment; no consistent trend of increasing left‑ or right‑leaning bias is observed, though Grok and the base Qwen model display marginal stage‑wise shifts. In terms of reasoning style, the majority of models employ consequence‑based arguments, emphasizing outcomes and societal utility. Grok, by contrast, frequently grounds its responses in factual data and statistical evidence, reflecting a distinct “fact‑statistic” reasoning pattern.
The study contributes (1) a high‑signal, multi‑turn benchmark that captures dynamic value expression, (2) a quantitative framework for mapping nuanced political traits rather than binary labels, (3) empirical evidence of systematic liberal bias in mainstream LLMs and a right‑leaning profile for Grok, and (4) insight into how role‑play pressure can reveal latent value orientations.
Limitations include reliance on a single judge model that may itself carry bias, a scenario set focused primarily on Western policy issues (potentially limiting cross‑cultural generalizability), and the conflation of “political values” with concrete policy positions, which may obscure how models would handle specific legislative debates. Future work should expand scenario diversity, incorporate multiple independent judges to mitigate scoring bias, separate value assessment from policy stance evaluation, and compare LLM responses with human participants to gauge alignment with real‑world political cognition.