Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses
Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality. Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores for input responses, using a new dataset of human response scores. We show that the ADEM model’s predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.
💡 Research Summary
The paper addresses the long‑standing challenge of automatically evaluating open‑domain dialogue responses, a task for which traditional word‑overlap metrics such as BLEU, METEOR, and ROUGE perform poorly. These metrics were originally designed for machine translation or summarization, where a limited set of correct outputs exists. In contrast, a conversational context admits a large, diverse set of appropriate replies, making lexical overlap an unreliable proxy for quality. Consequently, researchers still rely on costly human judgments, which hampers rapid prototyping and systematic comparison of dialogue models.
To overcome this limitation, the authors formulate dialogue evaluation as a supervised learning problem. They collect a new dataset of human appropriateness scores for Twitter‑derived conversational turns. Candidate responses are drawn from four sources: a TF‑IDF retrieval baseline, a Dual Encoder model, a Hierarchical Recurrent Encoder‑Decoder (HRED) generator, and human‑written replies. Each (context, model response, reference response) triple is annotated on a 1‑5 Likert scale by Amazon Mechanical Turk workers, yielding 4,104 examples with an inter‑annotator Pearson correlation of 0.63. The dataset is split into 2,872 training, 616 validation, and 616 test instances.
The proposed Automatic Dialogue Evaluation Model (ADEM) encodes the three textual elements—dialogue context c, model response r̂, and reference response r—using a hierarchical recurrent neural network (utterance‑level LSTM feeding a context‑level LSTM). The encoder is first pretrained as part of a Variational HRED (VHRED) dialogue generator on large‑scale unlabelled data, which provides robust semantic embeddings. After pretraining, the final hidden states are optionally reduced via PCA to a manageable dimensionality n.
ADEM predicts a score through a simple bilinear formulation:
score(c, r̂, r) = (cᵀ M r̂ + rᵀ N r̂ − α) / β
where M and N are learnable matrices (initialized as identity) that project the model response into the context and reference spaces, respectively. Scalars α and β shift and scale the raw output into the 1-5 range of the human scores.
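The bilinear scoring function above is straightforward to sketch in code. The following is a minimal illustration, assuming the three inputs are already encoded as fixed-size vectors; the embedding size and the α, β values are placeholders, not the paper's settings.

```python
import numpy as np

def adem_score(c, r_hat, r, M, N, alpha, beta):
    """ADEM's bilinear score: (c^T M r_hat + r^T N r_hat - alpha) / beta.

    c, r_hat, r : embedding vectors for the dialogue context, the model
                  response, and the reference response.
    M, N        : learnable projection matrices (identity at init).
    alpha, beta : scalars shifting/scaling the raw score (illustrative
                  values below, not taken from the paper).
    """
    raw = c @ M @ r_hat + r @ N @ r_hat
    return (raw - alpha) / beta

# Tiny example with 4-dimensional embeddings.
n = 4
rng = np.random.default_rng(0)
c, r_hat, r = rng.normal(size=(3, n))
M, N = np.eye(n), np.eye(n)  # identity initialization
score = adem_score(c, r_hat, r, M, N, alpha=0.0, beta=1.0)
```

Note that at initialization (M = N = I, α = 0, β = 1) the score reduces to a sum of two dot products, c·r̂ + r·r̂, i.e. an unweighted semantic-similarity measure that training then refines.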