Classifier-Based Text Simplification for Improved Machine Translation
Machine translation (MT) is a research field within computational linguistics. The objective of many MT researchers is to develop systems that produce high-quality, accurate translations while covering as many language pairs as possible. As internet use and globalization grow day by day, we need ways to improve translation quality. To this end, we have developed a classifier-based text simplification model for English-Hindi machine translation systems. We used Support Vector Machines and a Naïve Bayes classifier to build this model, and we also evaluated the performance of these classifiers.
💡 Research Summary
The paper addresses the need for improving machine translation (MT) quality in the context of increasing internet usage and globalization, focusing on English‑Hindi translation. It proposes a classifier‑based text simplification (TS) pipeline that first generates a simplified English version of a complex source sentence using a phrase‑based MT system (Moses). The original complex sentence and its simplified counterpart are then fed to a binary classifier that decides whether the simplification preserves the original meaning (label “Yes”) or not (“No”). If the simplification is judged good, the simplified sentence is sent to the English‑Hindi MT engine; otherwise, the original sentence is translated directly.
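The gating logic of this pipeline is straightforward to sketch. The function below is a minimal illustration, not the authors' code: `simplify`, `meaning_preserved`, and `translate_en_hi` are hypothetical stand-ins for the Moses-based simplifier, the trained binary classifier, and the English-Hindi MT engine.

```python
def translate_with_simplification(sentence, simplify, meaning_preserved, translate_en_hi):
    """Translate `sentence`, routing through simplification only when it is judged safe."""
    simplified = simplify(sentence)                  # phrase-based simplification (e.g. Moses)
    if meaning_preserved(sentence, simplified):      # binary classifier: "Yes" / "No"
        return translate_en_hi(simplified)           # good simplification -> translate it
    return translate_en_hi(sentence)                 # otherwise translate the original directly
```

The key design point is that the classifier acts as a filter: a bad simplification never reaches the MT engine, so the pipeline can only match or improve on translating the original sentence.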
To train the classifiers, the authors built a dataset of 3,000 complex English sentences, generated simplified English outputs, and obtained human annotations indicating meaning preservation. They extracted 17 features for each sentence pair, covering surface statistics (token counts, punctuation), language model probabilities for source and target trigrams, frequency‑based measures (low/high‑frequency words, bigrams, trigrams), and lexical coverage. These features are similar to those used in prior work on readability assessment and translation quality estimation.
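A few of the surface-level features can be sketched directly; the function below is a hypothetical subset of the paper's 17 features (token counts, punctuation counts, and a length ratio), not the authors' exact feature extractor. The language-model and frequency-based features would additionally require a trained trigram LM and corpus frequency tables.

```python
import string

def surface_features(original, simplified):
    """Extract a small, illustrative subset of surface features for a sentence pair."""
    feats = {}
    for tag, sent in (("orig", original), ("simp", simplified)):
        tokens = sent.split()
        feats[f"{tag}_tokens"] = len(tokens)                                   # token count
        feats[f"{tag}_punct"] = sum(ch in string.punctuation for ch in sent)   # punctuation count
        feats[f"{tag}_avg_tok_len"] = sum(len(t) for t in tokens) / max(len(tokens), 1)
    # Compression ratio: simplified sentences are usually shorter than the original.
    feats["len_ratio"] = feats["simp_tokens"] / max(feats["orig_tokens"], 1)
    return feats
```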
Two classic machine‑learning models were trained: a Naïve Bayes classifier, which assumes feature independence and is computationally lightweight, and a Support Vector Machine (SVM) with a linear kernel, which seeks a hyperplane that separates the “good” and “bad” simplifications in the high‑dimensional feature space. Both models were evaluated against a separate human‑annotated test set of 300 sentences. Standard metrics—precision, recall, F‑measure—were computed, as well as regression‑style errors (Mean Absolute Error, Root Mean Square Error) and Cohen’s kappa to assess agreement with human judgments.
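The training and evaluation loop described above can be sketched with scikit-learn; the paper does not state its exact toolkit or hyperparameters, so treat this as an assumed setup (Gaussian Naïve Bayes and a linear-kernel SVM over feature vectors `X` with binary labels `y`, 1 = meaning preserved).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             mean_absolute_error, cohen_kappa_score)

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit `model` and report the metrics used in the paper."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),                      # treating labels as 0/1
        "rmse": float(np.sqrt(np.mean((np.asarray(y_test) - pred) ** 2))),
        "kappa": cohen_kappa_score(y_test, pred),                      # agreement beyond chance
    }

# Comparing the two models on the same split:
#   evaluate(GaussianNB(), X_tr, y_tr, X_te, y_te)
#   evaluate(SVC(kernel="linear"), X_tr, y_tr, X_te, y_te)
```

Note that MAE and RMSE here reduce to functions of the misclassification rate, since both predictions and labels are binary; Cohen's kappa is the more informative agreement measure.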
Results show that the Naïve Bayes classifier slightly outperforms the SVM across all metrics: precision 0.562 vs. 0.527, recall 0.565 vs. 0.534, F‑measure 0.563 vs. 0.525, MAE 0.461 vs. 0.466, RMSE 0.517 vs. 0.682, and kappa 0.518 vs. 0.445. Confusion matrix analysis confirms higher agreement with human decisions for Naïve Bayes (1,694 matches, ≈56% of cases) compared to SVM (1,603 matches, ≈53%). The authors interpret these findings as evidence that a simple probabilistic model can effectively filter out poor simplifications, thereby improving downstream translation quality without requiring large computational resources.
The paper concludes that classifier‑based TS is a viable pre‑processing step for English‑Hindi MT, with Naïve Bayes offering a marginal advantage in this experimental setting. It acknowledges limitations such as the modest dataset size and the focus on a single language pair. Future work is suggested to explore larger, multilingual corpora, incorporate deep‑learning‑derived sentence embeddings as additional features, and experiment with multi‑class labeling (e.g., partial preservation) to refine the quality assessment further.