Title: Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis
ArXiv ID: 2512.22100
Date: 2025-12-26
Authors: Duygu Altinok
📝 Abstract
Evaluating the performance of various model architectures, such as transformers, large language models (LLMs), and other NLP systems, requires comprehensive benchmarks that measure performance across multiple dimensions. Among these, the evaluation of natural language understanding (NLU) is particularly critical as it serves as a fundamental criterion for assessing model capabilities. Thus, it is essential to establish benchmarks that enable thorough evaluation and analysis of NLU abilities from diverse perspectives. While the GLUE benchmark has set a standard for evaluating English NLU, similar benchmarks have been developed for other languages, such as CLUE for Chinese, FLUE for French, and JGLUE for Japanese. However, no comparable benchmark currently exists for the Turkish language. To address this gap, we introduce TrGLUE, a comprehensive benchmark encompassing a variety of NLU tasks for Turkish. In addition, we present SentiTurca, a specialized benchmark for sentiment analysis. To support researchers, we also provide fine-tuning and evaluation code for transformer-based models, facilitating the effective use of these benchmarks. TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations, with labels obtained through a semi-automated pipeline that combines strong LLM-based annotation, cross-model agreement checks, and subsequent human validation. This design prioritizes linguistic naturalness, minimizes direct translation artifacts, and yields a scalable, reproducible workflow. With TrGLUE, our goal is to establish a robust evaluation framework for Turkish NLU, empower researchers with valuable resources, and provide insights into generating high-quality semi-automated datasets.
📄 Full Content
Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis
Duygu Altinok
Independent Researcher, Berlin, Germany.
Contributing author: duygu@turkish-nlp-suite.com
Abstract
Evaluating the performance of various model architectures, such as transformers, large language models (LLMs), and other NLP systems, requires comprehensive benchmarks that measure performance across multiple dimensions. Among these, the evaluation of natural language understanding (NLU) is particularly critical as it serves as a fundamental criterion for assessing model capabilities. Thus, it is essential to establish benchmarks that enable thorough evaluation and analysis of NLU abilities from diverse perspectives. While the GLUE benchmark has set a standard for evaluating English NLU, similar benchmarks have been developed for other languages, such as CLUE for Chinese, FLUE for French, and JGLUE for Japanese. However, no comparable benchmark currently exists for the Turkish language. To address this gap, we introduce TrGLUE, a comprehensive benchmark encompassing a variety of NLU tasks for Turkish. In addition, we present SentiTurca, a specialized benchmark for sentiment analysis. To support researchers, we also provide fine-tuning and evaluation code for transformer-based models, facilitating the effective use of these benchmarks. TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations, with labels obtained through a semi-automated pipeline that combines strong LLM-based annotation, cross-model agreement checks, and subsequent human validation. This design prioritizes linguistic naturalness, minimizes direct translation artifacts, and yields a scalable, reproducible workflow. With TrGLUE, our goal is to establish a robust evaluation framework for Turkish NLU, empower researchers with valuable resources, and provide insights into generating high-quality semi-automated datasets.
Keywords: Turkish NLU, Turkish NLU benchmark, Turkish NLI, Turkish sentiment analysis, Turkish sentiment analysis datasets
1 Introduction
Given the advancements in transformer models such as BERT [1] and in large language models (LLMs) [2], it has become essential to benchmark these models from various perspectives. Evaluating natural language understanding (NLU) is particularly important, and having a benchmark for evaluating and analyzing NLU abilities is crucial. The General Language Understanding Evaluation (GLUE) benchmark [3] has been widely used for English. Similar benchmarks have also been developed for other languages, including CLUE for Chinese [4], FLUE for French [5], JGLUE for Japanese [6], and KLUE for Korean [7].
In the case of Turkish, several datasets have been created for tasks such as text classification, sentiment analysis [8], and hate speech detection [9]. However, datasets for paraphrase, similarity, and inference tasks are lacking, with the exception of a Semantic Textual Similarity (STS) dataset translated by Beken et al. [10]. As a result, there is currently no standard NLU benchmark for Turkish, and the existing datasets are scattered and not comprehensive. To address this gap, we have developed a centralized and easily accessible benchmark called TrGLUE, which is hosted on Hugging Face.
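To illustrate how such a benchmark can be consumed, the sketch below loads a TrGLUE task from the Hugging Face Hub and fine-tunes a BERTurk-style encoder on it with the datasets and transformers libraries. Note that the dataset identifier turkish-nlp-suite/TrGLUE, the configuration name nli, the split names, and the column names premise/hypothesis are illustrative assumptions rather than the released names; the published dataset card should be consulted for the actual identifiers.

```python
# Minimal sketch: load a TrGLUE task from the Hugging Face Hub and fine-tune a
# Turkish encoder on it. The dataset id "turkish-nlp-suite/TrGLUE", the config
# name "nli", the split names, and the columns "premise"/"hypothesis" are
# assumptions for illustration only; the released benchmark may differ.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("turkish-nlp-suite/TrGLUE", "nli")  # hypothetical id/config
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")  # BERTurk

def tokenize(batch):
    # Sentence-pair tasks encode both sentences together; single-sentence
    # tasks would pass only one text field here.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

encoded = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased",
    num_labels=3,  # e.g. entailment / neutral / contradiction
)

args = TrainingArguments(
    output_dir="trglue-nli",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate())
```

The same pattern would apply to the other TrGLUE tasks by switching the configuration name, the column mapping, and the number of labels.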
Despite the availability of transformer models like BERTurk [11] and even some LLMs, there is a scarcity of training and evaluation datasets in Turkish. Most models are trained on the OSCAR [12, 13] and/or mC4 [14] corpora, which lack variability, and the situation is even worse for evaluation datasets. BERTurk has been evaluated in limited areas such as sequence classification, named entity recognition (NER), part-of-speech (POS) tagging, and question answering. However, there is little information available about the construction process or corpus statistics of the question answering dataset used by BERTurk.
Existing evaluations in the field lack comprehensive coverage and reproducibility, either focusing superficially on a few dimensions or relying on local datasets that are not shared. Moreover, the absence of a standardized benchmarking dataset for Turkish makes it difficult to compare results across different models. Another issue is relying on translations instead of building datasets from natural Turkish text: this introduces low-quality translations and cultural biases, ignores the agglutinative nature of Turkish, and leads to defects in evaluation metrics.
To address these issues, we propose the creation of the Turkish General Language Understanding Evaluation (TrGLUE) benchmark. Similar to the widely used GLUE benchmark for English, TrGLUE aims to establish a standardized evaluation framework for Turkish NLP models. It comprises a diverse set of text classification and sentence-pair classification tasks, carefully designed to assess various aspects of nat