Extending new language in NLLB-200: language informal Malagasy

Francis Rakotomalala; Aimé Richard Hajalalaina; Ndaohialy Manda Vy Ravonimanantsoa

doi:10.52846/stccj.2025.5.1.70

Authors

Francis Rakotomalala, University of Fianarantsoa
Aimé Richard Hajalalaina, University of Fianarantsoa
Ndaohialy Manda Vy Ravonimanantsoa, University of Antananarivo

Keywords:

Informal Malagasy language, Machine translation, NLLB-200

Abstract

This study integrates informal Malagasy into the NLLB-200 machine translation model. The model underwent supervised pretraining, which quickly improved performance, as shown by a significant reduction in both loss and perplexity. This step allowed the model to adapt to the linguistic structures specific to Malagasy. Evaluation with BLEU, ROUGE, and BERTScore indicated that the model produces high-quality translations that combine fluency with semantic coherence. Although the BLEU score was moderate, the ROUGE and BERTScore results showed a high level of lexical and semantic fidelity. This work highlights the importance of developing translation systems for low-resource languages, which are often overlooked by traditional technologies. The study also demonstrates the model’s ability to capture the nuances of informal Malagasy, resulting in significant improvements over existing translation tools. In conclusion, this approach underlines the need to include informal language varieties in translation systems, paving the way for more inclusive and linguistically tailored applications.
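The abstract reports loss and perplexity together. As a minimal sketch of how the two quantities are usually related (perplexity taken as the exponential of the mean per-token cross-entropy loss, in nats), the snippet below may help. The function name `perplexity` and the loss values are hypothetical and purely illustrative; they are not results from the study.

```python
import math


def perplexity(token_losses):
    """Return exp(mean per-token cross-entropy), with losses given in nats."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))


# Hypothetical per-token losses early and late in training (illustrative only).
early = [2.3, 2.1, 2.5, 2.2]
late = [0.9, 1.1, 0.8, 1.0]

print(f"early perplexity: {perplexity(early):.2f}")
print(f"late perplexity:  {perplexity(late):.2f}")
```

Because the exponential is monotonic, any drop in mean loss appears as a drop in perplexity, which is why the two measures tend to be reported side by side.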
