Romanization-enhanced large language models for parallel corpus annotation

Siqi ZHANG, Kairong LIU, Ran SONG, Yuxin HUANG, Cunli MAO, Zhengtao YU

Front. Comput. Sci., 2027, 21(1): 2101311

DOI: 10.1007/s11704-025-50704-6
Artificial Intelligence
LETTER

Cite this article

Siqi ZHANG, Kairong LIU, Ran SONG, Yuxin HUANG, Cunli MAO, Zhengtao YU. Romanization-enhanced large language models for parallel corpus annotation. Front. Comput. Sci., 2027, 21(1): 2101311. DOI: 10.1007/s11704-025-50704-6



RIGHTS & PERMISSIONS

Higher Education Press
