Important Dates
Paper submission due |
Notification of acceptance | December 6 (Fri), 2024
Camera-ready due | December 13 (Fri), 2024
Workshop | January 20 (Mon), 2025, co-located with COLING 2025
* These dates are approximate, based on COLING 2025, and are subject to change.
Neural language models have revolutionised natural language processing (NLP) and have provided state-of-the-art results for many tasks. However, their effectiveness largely depends on the resources available for pre-training. Therefore, language models (LMs) often struggle with low-resource languages in both training and evaluation. Recently, there has been a growing trend in developing and adopting LMs for low-resource languages. This workshop aims to provide a forum for researchers to share and discuss their ongoing work on LMs for low-resource languages.
Background
Globally, there are approximately 7,000 spoken languages (van Esch et al., 2022), yet most NLP research focuses on only about 20 high-resource languages (Magueresse et al., 2020). The numerous remaining languages, which receive little research attention, are commonly known as low-resource languages. Even though these languages represent significant global communities, they generally lack sufficient digital data and resources to support NLP tasks or to benefit from recent advancements in the field (Ruder et al., 2022).
Neural language models, particularly transformers and large language models (LLMs), have revolutionised NLP, achieving state-of-the-art results in many tasks (Touvron et al., 2023; Minaee et al., 2024). However, since the capabilities of language models (LMs) are primarily determined by their pre-training corpora, disparities in language resources carry over into the models themselves. Therefore, despite their strong performance on high-resource languages, LMs often struggle with low-resource languages in both training and evaluation (Blasi et al., 2022).
Because this bias in NLP towards high-resource languages negatively affects a significant portion of the global community, there has been a growing trend of developing and adapting LMs for low-resource languages to promote linguistic fairness. To support and strengthen this movement, this workshop aims to provide a forum for researchers to share and discuss their ongoing work on LMs for low-resource languages. Our main aim is to encourage the development of LM-based approaches and to compile a research collection that supports ongoing and future research in this area, building on recent advancements in LMs.
References
Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. 2022. Systematic inequalities in language technology performance across the world's languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5486–5505, Dublin, Ireland. Association for Computational Linguistics.
Alexandre Magueresse, Vincent Carles, and Evan Heetderks. 2020. Low-resource languages: A review of past work and future challenges. arXiv preprint arXiv:2006.07264.
Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2024. Large language models: A survey. arXiv preprint arXiv:2402.06196.
Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2022. Square one bias in NLP: Towards a multidimensional exploration of the research manifold. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2340–2354, Dublin, Ireland. Association for Computational Linguistics.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Daan van Esch, Tamar Lucassen, Sebastian Ruder, Isaac Caswell, and Clara Rivera. 2022. Writing system and speaker metadata for 2,800+ language varieties. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5035–5046, Marseille, France. European Language Resources Association.
Topics
The workshop invites submissions on a broad range of topics related to the development and evaluation of neural language models for low-resource languages, including but not limited to the following.
- Building language models for low-resource languages.
- Adapting/extending existing language models/large language models for low-resource languages.
- Corpora creation and curation technologies for training language models/large language models for low-resource languages.
- Benchmarks to evaluate language models/large language models in low-resource languages.
- Prompting/in-context learning strategies for low-resource languages with large language models.
- Review of available corpora to train/fine-tune language models/large language models for low-resource languages.
- Multilingual/cross-lingual language models/large language models for low-resource languages.
- Applications of language models/large language models for low-resource languages (e.g. machine translation, chatbots, content moderation, etc.).
Supported by
The workshop is supported in part by CLARIN-UK, funded by the Arts and Humanities Research Council as part of the Infrastructure for Digital Arts and Humanities programme.
Contact us
Stay in touch to receive updates about LoResLM 2025