Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning


Large language models (LLMs) have transformed sentiment analysis, yet balancing accuracy, efficiency, and explainability remains a critical challenge. This study presents the first comprehensive evaluation of DeepSeek-R1, an open-source reasoning model, against OpenAI's GPT-4o and GPT-4o-mini. We test the full 671B model and its distilled variants, systematically documenting few-shot learning curves. Our experiments show DeepSeek-R1 achieves a 91.39% F1 score on 5-class sentiment and 99.31% accuracy on binary tasks with just 5 shots, an eightfold improvement in few-shot efficiency over GPT-4o. Architecture-specific distillation effects emerge, where a 32B Qwen2.5-based model outperforms the 70B Llama-based variant by 6.69 percentage points. While its reasoning process reduces throughput, DeepSeek-R1 offers superior explainability via transparent, step-by-step traces, establishing it as a powerful, interpretable open-source alternative.


💡 Research Summary

This paper presents a comprehensive evaluation of DeepSeek‑R1, an open‑source reasoning‑oriented large language model, against two proprietary OpenAI models (GPT‑4o and GPT‑4o‑mini) across three sentiment‑analysis benchmarks: 5‑class Amazon Reviews, binary IMDB Movie Reviews, and 27‑class GoEmotions. The authors test the full 671‑billion‑parameter DeepSeek‑R1 model as well as four distilled variants (8 B, 14 B, 32 B, 70 B) and systematically vary the number of in‑context examples from 0 to 50 shots. All models are prompted with a unified system prompt that requests a sentiment label and a brief explanation in JSON format, ensuring a fair comparison of both predictive performance and explainability.
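The paper's exact prompt wording is not reproduced in this summary, but a unified system prompt of the kind described (a sentiment label plus a brief explanation, returned as JSON) might be sketched as follows. The prompt text and the field names `sentiment` and `explanation` are illustrative assumptions, not the paper's actual prompt:

```python
import json

# Illustrative system prompt; the wording is an assumption, not the paper's exact text.
SYSTEM_PROMPT = (
    "You are a sentiment classifier. Given a review, respond with a JSON object "
    'of the form {"sentiment": "<label>", "explanation": "<one-sentence rationale>"}. '
    "Valid labels: Strongly Negative, Negative, Neutral, Positive, Strongly Positive."
)

def parse_response(raw: str) -> dict:
    """Parse a model reply, checking that both expected JSON fields are present."""
    obj = json.loads(raw)
    if "sentiment" not in obj or "explanation" not in obj:
        raise ValueError("reply missing required fields")
    return obj

# Example of a well-formed reply:
reply = '{"sentiment": "Positive", "explanation": "Praises the product despite minor flaws."}'
parsed = parse_response(reply)
```

Requesting a fixed JSON schema like this is what lets the authors compare predictive performance and explanation quality across models with a single parsing path.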

Key findings include: (1) Few‑shot efficiency – DeepSeek‑R1 reaches 91.39 % weighted F1 on the 5‑class task and 99.31 % accuracy on the binary task with only five examples, an eight‑fold improvement over GPT‑4o, which requires around 40 examples to achieve comparable scores. (2) Architecture‑driven distillation – The 32 B Qwen2.5‑based distilled model outperforms the larger 70 B Llama‑based variant by 6.69 percentage points, demonstrating that newer architectures can yield superior performance even with fewer parameters. (3) Throughput trade‑off – Because DeepSeek‑R1 generates step‑by‑step reasoning traces (≈730 tokens of intermediate text) before emitting the final JSON, its token‑per‑second throughput peaks at 334 t/s, far below GPT‑4o‑mini’s 2,124 t/s and GPT‑4o’s 1,220 t/s. (4) Explainability – The native reasoning output provides transparent, human‑readable justification for each prediction, enabling direct verification of model decisions. In contrast, GPT‑4o only offers a summarized rationale, limiting post‑hoc analysis.
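The weighted F1 reported above averages per-class F1 scores, weighting each class by its support (the number of true instances of that class). A minimal dependency-free sketch of that computation:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (n / total) * f1
    return score

# Perfect predictions give a weighted F1 of 1.0.
print(weighted_f1(["pos", "neg", "pos"], ["pos", "neg", "pos"]))  # → 1.0
```

Because the 5-class Amazon data is imbalanced, the weighted variant gives frequent classes proportionally more influence than macro-averaged F1 would.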

Error‑analysis via confusion matrices shows that DeepSeek‑R1 handles intensity modifiers (e.g., “Strongly Positive” vs. “Positive”) with bidirectional, symmetric confusion, whereas GPT‑4o adopts a conservative strategy that rarely upgrades sentiment intensity. This suggests that the reasoning‑centric architecture of DeepSeek‑R1 captures finer semantic nuances.
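The confusion-matrix analysis above rests on a simple tally of true versus predicted labels. A minimal sketch, assuming the 5-class label set implied by the task description (the label strings are assumptions):

```python
LABELS = ["Strongly Negative", "Negative", "Neutral", "Positive", "Strongly Positive"]

def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Build a confusion matrix; rows are true labels, columns are predicted labels."""
    idx = {lab: i for i, lab in enumerate(labels)}
    mat = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        mat[idx[t]][idx[p]] += 1
    return mat

# Symmetric intensity confusion ("Positive" <-> "Strongly Positive") shows up as
# comparable counts in the two off-diagonal cells flanking the diagonal.
m = confusion_matrix(["Positive", "Strongly Positive"],
                     ["Strongly Positive", "Positive"])
```

In this framing, GPT-4o's conservative behavior would appear as mass concentrated below the diagonal for intensity pairs, while DeepSeek-R1's bidirectional confusion fills both flanking cells.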

Methodologically, the authors split each dataset into 70 % training and 30 % testing using a fixed random seed (42) and ensure class‑balanced exemplar selection for few‑shot experiments. All models use identical inference settings (temperature = 1.0, top‑p = 1.0) and are evaluated on accuracy, weighted F1, mean evaluation time, and token throughput.
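The evaluation protocol described above (70/30 split with seed 42, class-balanced exemplar selection) can be sketched as follows. The function names and round-robin selection strategy are illustrative assumptions, not taken from the paper's code:

```python
import random

def split_70_30(examples, seed=42):
    """Shuffle with a fixed seed and split 70% train / 30% test."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def balanced_exemplars(train, n_shots, seed=42):
    """Pick few-shot exemplars round-robin across classes so each label is represented."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in train:
        by_class.setdefault(label, []).append((text, label))
    for pool in by_class.values():
        rng.shuffle(pool)
    picked, i = [], 0
    classes = sorted(by_class)
    while len(picked) < n_shots and any(by_class.values()):
        pool = by_class[classes[i % len(classes)]]
        if pool:
            picked.append(pool.pop())
        i += 1
    return picked
```

Fixing the seed makes the split and exemplar choice reproducible across all three models, which is what allows a like-for-like comparison of the few-shot learning curves.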

The paper concludes that DeepSeek‑R1 offers a compelling combination of high predictive performance, strong few‑shot learning capability, and built‑in explainability, making it attractive for domains where labeled data are scarce and model transparency is required (e.g., finance, content moderation). However, its higher computational cost due to extensive reasoning output may limit deployment in latency‑sensitive or large‑scale batch scenarios. Future work could explore token‑compression, hybrid summarization of reasoning, or more efficient prompting strategies to retain explainability while improving throughput. Additionally, the observed superiority of Qwen2.5‑based distilled models points to the importance of architecture selection in model compression pipelines.

