Embedding model for the Malagasy informal language

Francis Rakotomalala; Aimé Richard Hajalalaina; Ndaohialy Manda Vy Ravonimanantsoa

doi:10.52846/stccj.2025.5.1.64

Authors

Francis Rakotomalala, University of Fianarantsoa
Aimé Richard Hajalalaina, University of Fianarantsoa
Ndaohialy Manda Vy Ravonimanantsoa, University of Antananarivo

Keywords:

BERT, Embedding model, Malagasy language, Informal language

Abstract

Processing informal Malagasy presents major challenges because of linguistic variation, abbreviations, and frequent code-switching in digital communication. This study proposes a text embedding model based on DistilBERT and XLM-RoBERTa, specifically adapted to informal Malagasy. Through fine-tuning on custom corpora, we observe a gradual improvement in performance, with a significant reduction in the loss function and lower perplexity, indicating a better grasp of linguistic structures. The evaluation shows that the generated embeddings effectively capture semantic similarities, even across varied formulations. DistilBERT outperforms XLM-RoBERTa, demonstrating better generalization. These results highlight the importance of adapting language processing models to low-resource languages and open up new perspectives for applications in the automatic understanding of informal language.
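The two quantities the abstract relies on, perplexity and embedding similarity, can be illustrated with a minimal sketch. It assumes that perplexity is computed as the exponential of the mean cross-entropy loss (in nats) and that sentence embeddings are obtained by mean-pooling token vectors and compared with cosine similarity. The abstract does not state which definitions the authors used, so the function names and toy vectors below are hypothetical and purely illustrative.

```python
import math


def perplexity(mean_cross_entropy_loss):
    """Perplexity as exp of the mean per-token cross-entropy loss (in nats).

    Assumption: for masked language models this is often called
    pseudo-perplexity; the abstract does not specify the variant used.
    """
    return math.exp(mean_cross_entropy_loss)


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [value / count for value in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy token vectors standing in for model outputs (hypothetical values).
    sentence_a = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
    sentence_b = mean_pool([[2.0, 3.0], [2.0, 3.0]], [1, 1])
    print("similarity:", cosine_similarity(sentence_a, sentence_b))
    # A lower loss gives a lower perplexity.
    print("perplexity:", perplexity(2.0), "->", perplexity(1.5))
```

Under these assumptions, a falling loss and a falling perplexity are the same signal on different scales, and two differently worded sentences count as semantically close when the cosine similarity of their pooled embeddings approaches 1.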
