Approaches to Semantic Textual Similarity in Slovak Language: From Algorithms to Transformers


Semantic textual similarity (STS) plays a crucial role in many natural language processing tasks. While extensively studied in high-resource languages, STS remains challenging for under-resourced languages such as Slovak. This paper presents a comparative evaluation of sentence-level STS methods applied to Slovak, including traditional algorithms, supervised machine learning models, and third-party deep learning tools. We trained several machine learning models using outputs from traditional algorithms as features, with feature selection and hyperparameter tuning jointly guided by artificial bee colony optimization. Finally, we evaluated several third-party tools, including a fine-tuned model from NLPCloud, OpenAI’s embedding models, the GPT-4 model, and the pretrained SlovakBERT model. Our findings highlight the trade-offs between different approaches.


💡 Research Summary

The paper presents a comprehensive evaluation of semantic textual similarity (STS) methods for Slovak, a low‑resource language, covering three major categories: traditional algorithms, supervised machine‑learning models built on traditional algorithm outputs, and modern deep‑learning tools. For the traditional baseline, the authors implement a wide range of string‑based (e.g., Levenshtein, Jaro‑Winkler), term‑based (Jaccard, Ochiai), statistical (HAL, ESA, DISCO, FastText), and knowledge‑based (Wu‑Palmer, Leacock‑Chodorow) approaches. Experiments on machine‑translated Slovak versions of the STS Benchmark and SICK datasets show that term‑based methods achieve the highest Pearson correlations among the classic families, with Ochiai reaching 0.58 on the STS Benchmark. Statistical methods perform competitively when high‑quality word vectors are used; OpenAI word‑level embeddings obtain the best statistical score (≈0.55). Knowledge‑based algorithms lag far behind (≈0.18‑0.28), likely due to the limited coverage of the Slovak WordNet.
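To illustrate the term‑based family, here is a minimal sketch of the Jaccard and Ochiai coefficients computed over word sets. The naive whitespace tokenization below is an assumption for illustration; the paper's exact preprocessing (lemmatization, stop‑word handling) may differ.

```python
import math


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over word sets: |A ∩ B| / |A ∪ B|."""
    s1, s2 = set(a.lower().split()), set(b.lower().split())
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)


def ochiai(a: str, b: str) -> float:
    """Ochiai coefficient over word sets: |A ∩ B| / sqrt(|A| · |B|)."""
    s1, s2 = set(a.lower().split()), set(b.lower().split())
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / math.sqrt(len(s1) * len(s2))
```

For the pair "mačka sedí na rohožke" / "pes sedí na rohožke", three of the five distinct words overlap, giving a Jaccard score of 0.6 and an Ochiai score of 0.75; Ochiai's geometric‑mean denominator makes it less sensitive to length differences between the two sentences.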

In the machine‑learning stage, the outputs of all traditional algorithms are treated as features for regression models: linear, Bayesian ridge, support‑vector, decision‑tree, random‑forest, gradient‑boosting, and XGBoost. Feature selection and hyper‑parameter tuning are performed jointly using an Artificial Bee Colony (ABC) optimizer, with a 10‑fold cross‑validation fitness function. Each model‑dataset pair is optimized over 30 iterations with a population of 50 agents. Gradient‑boosting regression achieves the best correlations (0.685 on STS Benchmark, 0.702 on SICK), closely followed by XGBoost (0.678/0.696). XGBoost also offers shorter training times, making it a practical choice. Overall, the ML models improve over raw traditional scores by 0.07‑0.12 Pearson points, confirming that the algorithmic outputs contain useful signal when combined intelligently.
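The ABC‑guided selection step can be sketched in dependency‑free Python. This is a deliberately simplified stand‑in, not the paper's implementation: the fitness here is the Pearson correlation of a plain average of the selected algorithms' scores against gold labels, whereas the paper fits regression models under 10‑fold cross‑validation; the neighborhood move is a single bit flip; and only the population size (50) and iteration count (30) mirror the paper's setup.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def fitness(mask, features, gold):
    """Average the selected columns and correlate with gold scores."""
    cols = [i for i, on in enumerate(mask) if on]
    if not cols:
        return -1.0  # empty selection is invalid
    combined = [sum(row[i] for i in cols) / len(cols) for row in features]
    return pearson(combined, gold)


def abc_select(features, gold, n_bees=50, n_iters=30, limit=5, seed=0):
    """Simplified artificial-bee-colony search over binary feature masks."""
    rng = random.Random(seed)
    k = len(features[0])
    bees = [[rng.randint(0, 1) for _ in range(k)] for _ in range(n_bees)]
    stagnant = [0] * n_bees
    best = max(bees, key=lambda m: fitness(m, features, gold))
    for _ in range(n_iters):
        for i, bee in enumerate(bees):
            # Employed/onlooker phase: flip one random bit, keep if better.
            cand = bee[:]
            cand[rng.randrange(k)] ^= 1
            if fitness(cand, features, gold) > fitness(bee, features, gold):
                bees[i], stagnant[i] = cand, 0
            else:
                stagnant[i] += 1
            # Scout phase: abandon stagnant solutions for random ones.
            if stagnant[i] > limit:
                bees[i] = [rng.randint(0, 1) for _ in range(k)]
                stagnant[i] = 0
        cur = max(bees, key=lambda m: fitness(m, features, gold))
        if fitness(cur, features, gold) > fitness(best, features, gold):
            best = cur[:]
    return best
```

Given a feature matrix whose columns are per‑pair similarity scores from the traditional algorithms, `abc_select` returns the mask of algorithms whose combination best correlates with the gold scores; in the full pipeline the same search would also carry the regressor's hyper‑parameters.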

The third evaluation tier examines third‑party tools and large language models. OpenAI’s text‑embedding‑3‑large model, when applied at the sentence level with cosine similarity, yields 0.756 (STS Benchmark) and 0.718 (SICK). GPT‑4, prompted to output a numeric similarity score on a 0‑5 scale, outperforms all embeddings with 0.780 and 0.740 respectively. The commercial NLPCloud service, based on a fine‑tuned multilingual MPNet‑Base‑V2 (sentence‑BERT) architecture, achieves the highest scores of 0.824 and 0.778. The open‑source SlovakBERT model, fine‑tuned on a portion of the STS Benchmark, reaches ≈0.75 on both datasets, comparable to the best OpenAI embeddings but without licensing costs.
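All of the embedding‑based tools above share the same final scoring step: the cosine similarity between two sentence vectors. A minimal sketch follows; in practice the vectors would come from an embedding model such as text‑embedding‑3‑large or SlovakBERT, while the short toy vectors in the usage note are placeholders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # degenerate zero vector
    return dot / (norm_u * norm_v)
```

Identical vectors score 1.0 and orthogonal vectors 0.0; STS pipelines typically rescale this value onto the dataset's 0‑5 similarity scale before computing Pearson correlation against gold labels.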

The authors synthesize these findings into practical guidance. Traditional methods are inexpensive and easy to deploy but limited in accuracy, especially knowledge‑based techniques that suffer from sparse lexical resources. Supervised ML models add a modest performance boost at the cost of additional optimization effort. Large pre‑trained models and commercial APIs deliver the strongest correlations, yet they introduce financial, latency, and data‑privacy considerations. Consequently, the optimal solution depends on the specific application’s resource constraints, required throughput, and interpretability needs. Future work is suggested to expand Slovak‑specific lexical resources, explore multilingual transfer learning, and investigate more efficient fine‑tuning strategies for domain‑adapted transformers.

