Generating Chinese named entity data from parallel corpora
Ruiji FU, Bing QIN, Ting LIU
Annotating training corpora for named entity recognition (NER) is costly but necessary for supervised NER approaches. This paper presents a general framework for generating large-scale NER training data from parallel corpora. We first apply a high-performance NER system to one side of a bilingual corpus, then project the named entity (NE) labels to the other side according to the word-level alignments. Finally, we propose several strategies for selecting high-quality auto-labeled NER training data. We apply our approach to Chinese NER using an English-Chinese parallel corpus. Experimental results show that our approach can collect high-quality labeled data and can help improve Chinese NER.
named entity recognition / Chinese named entity / training data generation / parallel corpora
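The projection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the majority-vote rule for target tokens aligned to several source tokens, the BIO output scheme, and the "every entity token must be aligned" filter are all assumptions made for illustration. The abstract states only that labels are projected through word-level alignments and that selection strategies are applied afterwards.

```python
from collections import Counter


def project_labels(src_labels, tgt_len, alignments):
    """Project entity types from source tokens onto target tokens.

    src_labels: one entity type per source token ("O" for non-entities).
    tgt_len:    number of tokens in the target sentence.
    alignments: (source_index, target_index) pairs from a word aligner.
    Returns a list of BIO tags for the target sentence.
    """
    # Each target token collects votes from the entity-bearing source
    # tokens aligned to it (assumed rule: the majority type wins).
    votes = [Counter() for _ in range(tgt_len)]
    for s, t in alignments:
        if src_labels[s] != "O":
            votes[t][src_labels[s]] += 1
    types = [v.most_common(1)[0][0] if v else "O" for v in votes]

    # Convert per-token types to BIO tags. Limitation of this sketch:
    # adjacent target entities of the same type are merged into one span.
    tags, prev = [], "O"
    for ty in types:
        if ty == "O":
            tags.append("O")
        elif ty == prev:
            tags.append("I-" + ty)
        else:
            tags.append("B-" + ty)
        prev = ty
    return tags


def keep_sentence(src_labels, alignments):
    """Hypothetical selection rule: keep a sentence pair only if every
    source entity token has at least one alignment link."""
    aligned = {s for s, _ in alignments}
    return all(lab == "O" or i in aligned for i, lab in enumerate(src_labels))


# English side: "Obama visited Beijing", tagged by an English NER system.
en_labels = ["PER", "O", "LOC"]
# Chinese side: 奥巴马 / 访问 / 了 / 北京 (four tokens).
links = [(0, 0), (1, 1), (2, 3)]
zh_tags = project_labels(en_labels, 4, links)
```

Here `zh_tags` is `["B-PER", "O", "O", "B-LOC"]`, and `keep_sentence(en_labels, links)` returns `True`. Dropping the link `(2, 3)` leaves "Beijing" unaligned, so the assumed filter would discard the sentence pair instead of emitting a partially labeled target sentence.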