Bridging the gap: adapting LLMs for Southeast Asian low-resource machine translation via hierarchical dynamic retrieval and matching
Zirui GUO, Hua LAI, Ying LI, Zhengtao YU, Shengxiang GAO, Yuxin HUANG, Cunli MAO
Front. Comput. Sci., 2027, Vol. 21, Issue 2: 2102327
Retrieval-Augmented Generation (RAG) has proven effective in enhancing the generation capabilities of large language models (LLMs) across various natural language processing tasks. However, its performance in low-resource machine translation drops sharply because of noise introduced by semantic mismatches between retrieved content and translation requirements. To alleviate this drawback, we propose a novel hierarchical dynamic retrieval and matching approach for Southeast Asian low-resource machine translation. First, we construct a hierarchical index over an existing parallel corpus, using high-frequency words as keys that associate bilingual short and long sentence pairs. Second, we dynamically match words in the source sentence against the hierarchical index to retrieve all associated short and long bilingual sentence pairs, and rerank these candidates by computing cross-lingual semantic similarity between the source sentence and each retrieved pair. Finally, the sample with the highest semantic similarity is integrated into the prompt to guide the LLM toward more accurate translations. Experimental results show that our approach outperforms mainstream machine translation systems without fine-tuning LLM parameters. Detailed analysis indicates that our method precisely matches fine-grained semantic information, reducing noise interference and improving low-resource translation performance.
large language model / machine translation / retrieval-augmented generation / Chinese / Southeast Asian languages
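The retrieval pipeline described in the abstract can be sketched in a few lines. The sketch below is an illustrative reconstruction, not the authors' code: the function names, the `top_k` and `short_len` parameters, and the whitespace tokenization are all assumptions, and the `similarity` callable stands in for whatever cross-lingual semantic similarity model the paper uses.

```python
from collections import Counter, defaultdict

def build_hierarchical_index(parallel_corpus, top_k=1000, short_len=8):
    """Build a word-keyed hierarchical index over a bilingual corpus.

    Keys are high-frequency source-side words; each key maps to the
    short and long sentence pairs containing it. (Illustrative sketch
    of the paper's index structure; parameters are assumptions.)
    """
    freq = Counter(w for src, _ in parallel_corpus for w in src.split())
    keys = {w for w, _ in freq.most_common(top_k)}
    index = defaultdict(lambda: {"short": [], "long": []})
    for src, tgt in parallel_corpus:
        words = src.split()
        bucket = "short" if len(words) <= short_len else "long"
        for w in set(words) & keys:
            index[w][bucket].append((src, tgt))
    return index

def retrieve(index, source_sentence, similarity):
    """Dynamically match source words against the index, pool all
    associated pairs, and rerank by cross-lingual similarity."""
    candidates = []
    for w in set(source_sentence.split()):
        if w in index:
            candidates += index[w]["short"] + index[w]["long"]
    candidates = list(dict.fromkeys(candidates))  # de-duplicate pairs
    if not candidates:
        return None
    # The highest-scoring pair would then be placed into the LLM prompt.
    return max(candidates, key=lambda pair: similarity(source_sentence, pair[0]))
```

In practice `similarity` would be a cross-lingual sentence encoder; a simple word-overlap function is enough to exercise the control flow.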
Higher Education Press