PersoBench: Benchmarking Personalized Response Generation in Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

While large language models (LLMs) have exhibited impressive conversational capabilities, their proficiency in delivering personalized responses remains unclear. Although recent benchmarks automatically evaluate persona consistency in role-playing contexts using LLM-based judgment, the evaluation of personalization in response generation remains underexplored. To address this gap, we present an automated benchmarking pipeline, PersoBench, to evaluate the personalization ability of LLMs in persona-aware dialogue generation within a zero-shot setting. Our framework employs a structured pipeline comprising speaker-aware annotation, task-specific and context-driven prompt construction, response post-processing, and automated evaluation across multiple dimensions of generation quality. In particular, the pipeline performs text preprocessing and speaker labeling, constructs structured prompts with task instructions and LLM roles, validates response format, and evaluates valid outputs across fluency, personalization, diversity, and coherence. We assess the performance of four open-source and four closed-source LLMs using well-known datasets and a range of explicit metrics. Our findings reveal that while LLMs excel at generating fluent and diverse responses, they are far from satisfactory in delivering personalized and coherent responses, considering both the conversation context and the provided personas.


💡 Research Summary

The paper introduces PersoBench, an automated benchmarking pipeline designed to evaluate the ability of large language models (LLMs) to generate personalized responses in dialogue settings. While prior benchmarks such as RPBench‑Auto, TIMECHARA, and RoleLLM focus on role‑playing consistency, PersoBench targets the more practical problem of aligning model outputs with user‑provided personas and the immediate conversational context.

The pipeline consists of four main stages. First, speaker‑aware annotation preprocesses raw dialogue logs and persona descriptions, attaching explicit speaker labels to each utterance. Second, task‑specific prompts are constructed that supply the conversation history, the set of persona attributes, and a clear instruction to produce a personalized response. In the Chain‑of‑Thought (CoT) variant, the prompt additionally asks the model to provide a brief reasoning snippet (≤110 words) describing how the persona was incorporated, and it enforces a JSON output format containing both “reasoning” and “response”. Third, response post‑processing parses the JSON, validates format compliance, and extracts the textual answer for downstream scoring. Fourth, automated evaluation applies eight metrics across four dimensions: fluency (e.g., perplexity, grammaticality), diversity (lexical and semantic variance), coherence (contextual relevance and logical flow), and personalization (Persona‑F1, AlignScore, and LLM‑as‑Judge judgments).
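Stages two and three of the pipeline can be illustrated with a short sketch. The prompt template, function names, and JSON-validation logic below are assumptions for illustration, not the paper's exact implementation; only the required "reasoning"/"response" keys and the word limit come from the description above.

```python
import json

def build_cot_prompt(persona, history, word_limit=110):
    """Construct a CoT-style prompt (illustrative template, not the paper's exact one)."""
    persona_block = "\n".join(f"- {p}" for p in persona)
    history_block = "\n".join(f"{turn['speaker']}: {turn['text']}" for turn in history)
    return (
        "You are a dialogue agent with the following persona:\n"
        f"{persona_block}\n\n"
        "Conversation so far:\n"
        f"{history_block}\n\n"
        "Reply to the last message in a way that reflects your persona.\n"
        f"First explain (in at most {word_limit} words) how the persona informs "
        "your reply, then give the reply itself.\n"
        'Answer strictly as JSON: {"reasoning": "...", "response": "..."}'
    )

def parse_model_output(raw):
    """Validate format compliance and extract the response text; None on violation."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or "reasoning" not in obj or "response" not in obj:
        return None
    return obj["response"]
```

Responses that fail this validation step would be excluded from scoring, which is how format-violation rates like those reported below can be measured.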

To test the pipeline, the authors selected three well‑known persona‑aware dialogue datasets—Blended Skill Talk (BST), Follow‑up Customized Conversation (FoCus), and IT‑ConVAI2—covering open‑domain interests, structured trait sets, and task‑specific roles, respectively. The combined corpus contains roughly 3,600 samples. Eight LLMs were evaluated: four open‑source models (LLaMA‑2, Mistral‑7B, Falcon‑180B, OpenChat) and four closed‑source models (GPT‑4, Claude‑2, Gemini‑Pro, among others). All experiments were conducted in a zero‑shot setting to capture the models' inherent capabilities without few‑shot exemplars or extensive prompt engineering. Both vanilla prompts and CoT prompts were tested for each model.
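The first pipeline stage, speaker-aware annotation, prepares raw dialogue samples like these for prompting. A minimal sketch is shown below; the function name and the assumption of strictly alternating two-party turns are illustrative, not taken from the paper.

```python
def annotate_speakers(utterances, names=("User", "Agent")):
    """Attach alternating speaker labels to a raw utterance list.

    Minimal sketch: assumes a two-party dialogue with strictly
    alternating turns, which real preprocessing would need to verify.
    """
    return [
        {"speaker": names[i % 2], "text": u.strip()}
        for i, u in enumerate(utterances)
    ]
```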

Results show that all models achieve high fluency and diversity scores, indicating that current LLMs are proficient at producing grammatically correct and varied language. However, personalization scores are consistently low across the board, especially when multiple persona attributes must be reconciled simultaneously. CoT prompting yields modest improvements in coherence for some models but does not substantially raise personalization metrics. Closed‑source models generally outperform open‑source counterparts in instruction adherence and response latency (average 0.8 s vs. 1.5–2.3 s). Open‑source models exhibit higher rates of format violations and slower generation, suggesting practical deployment challenges.
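The diversity and personalization scores discussed above can be grounded with two common metric formulations: distinct-n for lexical diversity and a token-overlap F1 between response and persona. Both sketches use standard definitions from the dialogue literature; the paper's exact metric implementations may differ.

```python
def distinct_n(responses, n=2):
    """Lexical diversity: unique n-grams divided by total n-grams."""
    ngrams, total = set(), 0
    for r in responses:
        toks = r.lower().split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        ngrams.update(grams)
        total += len(grams)
    return len(ngrams) / total if total else 0.0

def persona_f1(response, persona_sentences):
    """Token-level F1 between a response and the persona text
    (a common Persona-F1 formulation; exact definition may vary)."""
    resp = set(response.lower().split())
    pers = set(" ".join(persona_sentences).lower().split())
    overlap = resp & pers
    if not overlap:
        return 0.0
    precision = len(overlap) / len(resp)
    recall = len(overlap) / len(pers)
    return 2 * precision * recall / (precision + recall)
```

Under metrics like these, a fluent but generic reply can score high on distinct-n while scoring near zero on persona overlap, which matches the pattern of results reported above.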

The authors claim three primary contributions: (1) a dedicated, reproducible benchmark for personalized response generation, (2) an empirical analysis of how CoT reasoning influences coherence and instruction compliance, and (3) a comprehensive set of evaluation dimensions that includes both linguistic quality and persona alignment. Limitations include reliance on automatic metrics that may not fully capture nuanced human judgments, a focus on English‑language datasets, and the absence of multi‑modal persona information.

Future work is suggested in three directions: extending PersoBench to multilingual and multimodal personas, integrating human‑in‑the‑loop validation to calibrate automatic scores, and developing model architectures or training objectives that more effectively fuse persona embeddings with conversational context. By making the code and evaluation scripts publicly available, the paper invites the community to build upon this foundation and advance the state of personalized dialogue systems.

