Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants’ web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences among the assistants, with Perplexity achieving the highest source credibility, whereas GPT-4o exhibits elevated citation of non-credible sources on sensitive topics. This work provides the first systematic comparison of commonly used chat assistants for fact-checking behavior, offering a foundation for evaluating AI systems in high-stakes information environments.
💡 Research Summary
The paper introduces a novel evaluation framework for web‑search‑enabled chat assistants that jointly measures (i) the credibility of the sources they cite and (ii) the groundedness of their generated responses with respect to those sources. Recognizing that large language models (LLMs) are increasingly equipped with live web search capabilities, the authors argue that while this can improve factuality, it also opens a pathway for the amplification of misinformation when low‑credibility or deliberately manipulated webpages are retrieved and cited.
To operationalize the framework, the authors curated 100 verifiable claims across five misinformation‑prone domains: health, climate change, the Russia‑Ukraine war, U.S. politics, and local issues (20 claims per domain). Each claim was presented to the assistants under two distinct user roles: a “Fact‑Checker” who seeks to verify the claim, and a “Claim Believer” who wants confirmation of the (often false) claim. This dual‑role design captures real‑world framing effects, as prior work shows that false presuppositions can bias LLMs toward accepting misinformation.
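A minimal sketch of how the two user-role framings could be instantiated as prompt templates is shown below. The exact wording of the paper's prompts is not reproduced in the summary, so both templates are hypothetical and only illustrate the contrast between verification-seeking and confirmation-seeking framings.

```python
# Hypothetical prompt templates for the two user roles; wording is illustrative,
# not the paper's actual prompts.
FACT_CHECKER_TEMPLATE = (
    'I am trying to verify the following claim. Is it true? '
    'Please search the web and cite your sources: "{claim}"'
)
CLAIM_BELIEVER_TEMPLATE = (
    'I read that {claim} Can you find sources that confirm this?'
)

def build_prompts(claim: str) -> dict[str, str]:
    """Return one prompt per user role for a single claim."""
    return {
        "fact_checker": FACT_CHECKER_TEMPLATE.format(claim=claim),
        "claim_believer": CLAIM_BELIEVER_TEMPLATE.format(claim=claim),
    }
```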
Four widely used chat assistants were evaluated: GPT‑4o, GPT‑5, Perplexity, and Qwen Chat. Interactions were performed through the public web interfaces and automated with Selenium to mimic genuine user behavior, ensuring that any safety filters or UI‑level citation mechanisms were retained. The authors extracted citations from each response, using highlight‑based mapping for GPT‑4o and HTML‑structure inference for the other systems.
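The sketch below illustrates what such Selenium-driven interaction with a web interface might look like. The CSS selectors and the `ask_assistant` helper are assumptions for illustration (real selectors differ per assistant and change over time); only the Selenium calls themselves are standard API.

```python
# Illustrative sketch of automating a chat assistant's public web UI with Selenium.
# Selectors are placeholders, not the actual ones used in the paper.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def ask_assistant(url: str, prompt: str, input_selector: str, response_selector: str) -> str:
    """Submit a prompt through the assistant's web interface and return the response HTML."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait for the chat input box, type the prompt, and submit it.
        box = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, input_selector))
        )
        box.send_keys(prompt)
        box.send_keys(Keys.ENTER)
        # Wait for the assistant's answer to render, then capture its HTML
        # so citation markup can be parsed afterwards.
        answer = WebDriverWait(driver, 120).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, response_selector))
        )
        return answer.get_attribute("innerHTML")
    finally:
        driver.quit()
```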
Source credibility was assessed by mapping each cited domain to the Media Bias/Fact Check (MBFC) taxonomy and a curated list of fact‑checking organizations. Domains were classified into eight MBFC credibility levels (very high, high, mostly factual, mixed, low, very low, satire, not rated). Two aggregate metrics were computed per assistant: Credibility Rate (CR) – the proportion of cited sources deemed credible – and Non‑Credibility Rate (NCR) – the proportion of low‑credibility sources. Confidence intervals were derived using the Agresti‑Coull method.
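A compact sketch of how CR and NCR with Agresti-Coull intervals could be computed is given below. The grouping of the eight MBFC levels into "credible" and "non-credible" sets is an assumption made here for illustration; the paper's exact grouping may differ.

```python
# Sketch of the Credibility Rate (CR) and Non-Credibility Rate (NCR) metrics with
# Agresti-Coull 95% confidence intervals. Level groupings are assumed.
from statsmodels.stats.proportion import proportion_confint

CREDIBLE_LEVELS = {"very high", "high", "mostly factual"}  # assumed grouping
NON_CREDIBLE_LEVELS = {"low", "very low"}                  # assumed grouping

def credibility_rates(cited_levels: list[str]) -> dict[str, tuple[float, tuple[float, float]]]:
    """Return CR and NCR as (point estimate, (ci_low, ci_high)) over all cited sources."""
    n = len(cited_levels)
    credible = sum(level in CREDIBLE_LEVELS for level in cited_levels)
    non_credible = sum(level in NON_CREDIBLE_LEVELS for level in cited_levels)
    cr_ci = proportion_confint(credible, n, alpha=0.05, method="agresti_coull")
    ncr_ci = proportion_confint(non_credible, n, alpha=0.05, method="agresti_coull")
    return {"CR": (credible / n, cr_ci), "NCR": (non_credible / n, ncr_ci)}
```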
Groundedness evaluation proceeded by decomposing each assistant’s answer into atomic factual units (sentences or clauses) and checking whether each unit was supported by the cited evidence. Crucially, the authors distinguished between “grounded in credible sources” and “grounded in non‑credible sources,” thereby exposing a failure mode where a response may be internally consistent yet rest on unreliable evidence.
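The bookkeeping behind this distinction can be sketched as follows. The atomic-unit support check (`is_supported`) is a placeholder for whatever segmentation and entailment procedure the authors actually used; only the four-way labeling of units mirrors the distinction described above.

```python
# Minimal sketch of groundedness labeling per atomic unit. The support check is a
# placeholder (a real pipeline would use sentence segmentation plus an NLI model).
from dataclasses import dataclass

@dataclass
class Source:
    url: str
    text: str
    credible: bool  # derived from the MBFC mapping above

def is_supported(unit: str, source_text: str) -> bool:
    """Placeholder for an entailment check of one atomic unit against one cited source."""
    return unit.lower() in source_text.lower()

def label_units(units: list[str], sources: list[Source]) -> dict[str, str]:
    """Label each atomic unit by whether its supporting sources are credible, non-credible, or mixed."""
    labels = {}
    for unit in units:
        supporting = [s for s in sources if is_supported(unit, s.text)]
        if not supporting:
            labels[unit] = "ungrounded"
        elif all(s.credible for s in supporting):
            labels[unit] = "grounded_credible"
        elif not any(s.credible for s in supporting):
            labels[unit] = "grounded_non_credible"
        else:
            labels[unit] = "grounded_mixed"
    return labels
```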
Results reveal clear differences among the assistants. Perplexity achieved the highest overall CR (86.30 % ± 1.67) and the lowest NCR (0.69 % ± 0.32), indicating a cautious retrieval strategy that preferentially selects trustworthy domains. GPT‑4o and GPT‑5 displayed broader domain coverage but also higher exposure to low‑credibility sites, especially on the Russia‑Ukraine war topic where GPT‑4o’s NCR reached 4.55 %. Qwen Chat performed at a moderate level (CR ≈ 80 %, NCR ≈ 1 %).
User role analysis showed only minor variations: Fact‑Checker and Claim‑Believer prompts produced similar CR and NCR values, suggesting that framing had limited impact on source selection in this experimental setup, though Claim‑Believer prompts slightly increased NCR for GPT‑4o (2.73 % vs. 1.81 %).
Groundedness analysis uncovered that some assistants generated statements fully supported by non‑credible sources, and in a few cases mixed credible and non‑credible evidence within the same answer, potentially confusing users about the overall reliability. This highlights the importance of not only measuring factuality but also linking it to source quality.
The authors conclude that their dual‑metric framework provides a more nuanced picture of chat‑assistant reliability than traditional fact‑checking benchmarks. They recommend that developers incorporate real‑time credibility filtering into web‑search pipelines and make source trustworthiness transparent to end‑users. Future work could extend the framework to multilingual contexts, integrate dynamic credibility scores, and explore user‑feedback loops for adaptive citation behavior.