Romanization-enhanced large language models for parallel corpus annotation

Siqi ZHANG, Kairong LIU, Ran SONG, Yuxin HUANG, Cunli MAO, Zhengtao YU

Front. Comput. Sci., 2027, 21(1): 2101311

DOI: 10.1007/s11704-025-50704-6
Artificial Intelligence
LETTER

Cite this article

Siqi ZHANG, Kairong LIU, Ran SONG, Yuxin HUANG, Cunli MAO, Zhengtao YU. Romanization-enhanced large language models for parallel corpus annotation. Front. Comput. Sci., 2027, 21(1): 2101311. DOI: 10.1007/s11704-025-50704-6



RIGHTS & PERMISSIONS

Higher Education Press
