Bridging the gap: adapting LLMs for Southeast Asian low-resource machine translation via hierarchical dynamic retrieval and matching

Zirui GUO , Hua LAI , Ying LI , Zhengtao YU , Shengxiang GAO , Yuxin HUANG , Cunli MAO

Front. Comput. Sci., 2027, 21(2): 2102327. DOI: 10.1007/s11704-025-51670-9

Artificial Intelligence
RESEARCH ARTICLE

Abstract

Retrieval-Augmented Generation (RAG) has proven effective at enhancing the generation capabilities of large language models (LLMs) across various natural language processing tasks. However, its effectiveness in low-resource machine translation drops sharply owing to noise introduced by the semantic mismatch between retrieved content and translation requirements. To alleviate this drawback, we propose a novel hierarchical dynamic retrieval and matching approach for Southeast Asian low-resource machine translation. First, we construct a hierarchical index structure over an existing parallel corpus, using high-frequency word statistics as key indices that associate bilingual short and long sentence pairs. Second, we dynamically match words between the source sentence and the hierarchical index to retrieve all associated short and long bilingual sentence pairs, and rerank the candidates by computing cross-lingual semantic similarity between the source sentence and the retrieved pairs. Finally, the sample with the highest semantic similarity is integrated into the prompt to guide the LLM toward more accurate translations. Experimental results show that our approach outperforms mainstream machine translation systems without fine-tuning any LLM parameters. Detailed analysis indicates that our method precisely matches fine-grained semantic information, reducing noise interference and improving low-resource translation performance.
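
The abstract describes a three-stage pipeline: hierarchical index construction, word-level retrieval with semantic reranking, and prompt assembly. The Python sketch below is a minimal, hypothetical rendering of those stages rather than the authors' implementation: the whitespace tokenizer, the 5000-word key vocabulary, the length-10 short/long cutoff, the sentence-transformers encoder, and the prompt template are all illustrative assumptions.

from collections import Counter, defaultdict

from sentence_transformers import SentenceTransformer, util  # assumed encoder backend

SHORT_MAX_LEN = 10  # assumed cutoff between "short" and "long" sentence pairs

def build_hierarchical_index(parallel_corpus, top_k_words=5000):
    """Index bilingual (source, target) pairs under high-frequency source words.

    Whitespace tokenization is a placeholder; Chinese and several Southeast
    Asian scripts would need a real word segmenter.
    """
    freq = Counter(w for src, _ in parallel_corpus for w in src.split())
    keys = {w for w, _ in freq.most_common(top_k_words)}
    index = defaultdict(lambda: {"short": [], "long": []})
    for src, tgt in parallel_corpus:
        bucket = "short" if len(src.split()) <= SHORT_MAX_LEN else "long"
        for w in set(src.split()) & keys:
            index[w][bucket].append((src, tgt))
    return index

def retrieve_and_rerank(query, index, encoder, top_n=1):
    """Collect every pair filed under a query word, then rerank candidates by
    semantic similarity; scoring against the source side is a simplification
    of the paper's cross-lingual similarity."""
    candidates = []
    for w in set(query.split()) & set(index):
        candidates.extend(index[w]["short"] + index[w]["long"])
    candidates = list(dict.fromkeys(candidates))  # drop duplicate pairs
    if not candidates:
        return []
    q_emb = encoder.encode([query], convert_to_tensor=True)
    c_emb = encoder.encode([src for src, _ in candidates], convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]  # one score per candidate
    ranked = sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1])
    return [pair for pair, _ in ranked[:top_n]]

def build_prompt(query, demo_pair, src_lang="Chinese", tgt_lang="Lao"):
    """Fold the best-matching bilingual pair into a translation prompt."""
    demo_src, demo_tgt = demo_pair
    return (f"Translate the following sentence from {src_lang} to {tgt_lang}.\n"
            f"Example: {demo_src} => {demo_tgt}\n"
            f"Sentence: {query}\nTranslation:")

In use, a multilingual sentence encoder (e.g., one loaded with SentenceTransformer) would be passed as encoder, and the pair returned by retrieve_and_rerank would be fed to build_prompt before querying the LLM; the Chinese-to-Lao defaults are purely illustrative.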

Keywords

large language model / machine translation / retrieval-augmented generation / Chinese / Southeast Asian languages

Cite this article

Zirui GUO, Hua LAI, Ying LI, Zhengtao YU, Shengxiang GAO, Yuxin HUANG, Cunli MAO. Bridging the gap: adapting LLMs for Southeast Asian low-resource machine translation via hierarchical dynamic retrieval and matching. Front. Comput. Sci., 2027, 21(2): 2102327. DOI: 10.1007/s11704-025-51670-9



RIGHTS & PERMISSIONS

Higher Education Press
