Abstract
Multilingual Large Language Models (MLLMs), built on the foundation of LLMs, have been developed to address the challenges of multilingual natural language processing and to enable knowledge transfer from high-resource to low-resource languages. However, significant limitations and challenges remain, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we provide a comprehensive analysis of MLLMs, delving deeply into these critical issues. First, we present an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Second, we explore the multilingual training corpora of MLLMs and the multilingual datasets for downstream tasks that are crucial for enhancing their cross-lingual capability. Third, we survey state-of-the-art studies of multilingual representations and investigate whether current MLLMs can learn a universal language representation. Fourth, we discuss bias in MLLMs, including its categories, evaluation metrics, and debiasing techniques. Finally, we discuss existing challenges and point out promising research directions for MLLMs.
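To make the question of cross-lingual representation alignment raised above concrete, the following minimal sketch (illustrative only; the choice of xlm-roberta-base and the mean-pooling probe are assumptions of this example, not methods taken from the survey) embeds a translation pair and an unrelated sentence with a multilingual encoder and compares them by cosine similarity:

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"  # assumed multilingual encoder; any MLLM encoder could be probed this way
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(sentence: str) -> torch.Tensor:
    # Mean-pool the last hidden layer over non-padding tokens.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)   # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# A translation pair should sit closer in representation space than an unrelated
# sentence if the model has learned a (partially) language-universal representation.
en = embed("The cat sleeps on the sofa.")
fr = embed("Le chat dort sur le canapé.")
other = embed("Stock prices fell sharply this morning.")

cos = torch.nn.functional.cosine_similarity
print("en-fr translation pair:", cos(en, fr).item())
print("en vs. unrelated:      ", cos(en, other).item())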
Keywords
multilingual large language model / corpora / alignment / bias / survey
Cite this article
Yuemei XU, Ling HU, Jiayi ZHAO, Zihan QIU, Kexin XU, Yuqi YE, Hanwen GU.
A survey on multilingual large language models: corpora, alignment, and bias.
Front. Comput. Sci., 2025, 19(11): 1911362. DOI: 10.1007/s11704-024-40579-4
RIGHTS & PERMISSIONS
The Author(s) 2025. This article is published with open access at link.springer.com and journal.hep.com.cn