Generating Chinese named entity data from parallel corpora

Ruiji FU; Bing QIN; Ting LIU

doi:10.1007/s11704-014-3127-5

Front. Comput. Sci., 2014, Vol. 8, Issue 4: 629-641. DOI: 10.1007/s11704-014-3127-5

RESEARCH ARTICLE

Abstract

Annotating named entity recognition (NER) training corpora is a costly but necessary process for supervised NER approaches. This paper presents a general framework for generating large-scale NER training data from parallel corpora. In our method, we first apply a high-performance NER system to one side of a bilingual corpus. Then, we project the named entity (NE) labels to the other side according to the word-level alignments. Finally, we propose several strategies to select high-quality auto-labeled NER training data. We apply our approach to Chinese NER using an English-Chinese parallel corpus. Experimental results show that our approach can collect high-quality labeled data and can help improve Chinese NER.
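The projection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-token entity types, the BIO re-encoding, and the filter that keeps only sentence pairs whose entity tokens are all aligned are assumptions made for illustration, and the abstract does not specify the actual selection strategies.

```python
from typing import List, Tuple


def project_labels(src_types: List[str],
                   alignment: List[Tuple[int, int]],
                   tgt_len: int) -> List[str]:
    """Project entity types from source tokens to target tokens.

    src_types holds one entity type per source token ("O" for none).
    alignment is a list of (source_index, target_index) pairs.
    Hypothetical helper, not taken from the paper.
    """
    tgt_types = ["O"] * tgt_len
    for s, t in alignment:
        if src_types[s] != "O":
            tgt_types[t] = src_types[s]

    # Re-encode contiguous runs of the same type as BIO tags.
    # Simplification: adjacent entities of the same type are merged.
    tags = []
    prev = "O"
    for typ in tgt_types:
        if typ == "O":
            tags.append("O")
        elif typ == prev:
            tags.append("I-" + typ)
        else:
            tags.append("B-" + typ)
        prev = typ
    return tags


def all_entities_aligned(src_types: List[str],
                         alignment: List[Tuple[int, int]]) -> bool:
    """Assumed quality filter: every source entity token has an alignment."""
    aligned = {s for s, _ in alignment}
    return all(i in aligned for i, typ in enumerate(src_types) if typ != "O")


# English side: "Bill Gates visited Beijing", tagged by an English NER system.
src_types = ["PER", "PER", "O", "LOC"]
# Chinese side: 比尔 / 盖茨 / 访问 / 了 / 北京 (five tokens).
alignment = [(0, 0), (1, 1), (2, 2), (3, 4)]

if all_entities_aligned(src_types, alignment):
    print(project_labels(src_types, alignment, tgt_len=5))
```

In this toy pair the Chinese tokens receive the tags B-PER, I-PER, O, O, B-LOC. A sentence pair in which an entity token has no alignment link would be rejected by the filter rather than added to the training data.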

Keywords

named entity recognition / Chinese named entity / training data generating / parallel corpora

Cite this article

Ruiji FU, Bing QIN, Ting LIU. Generating Chinese named entity data from parallel corpora. Front. Comput. Sci., 2014, 8(4): 629-641. DOI: 10.1007/s11704-014-3127-5

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag Berlin Heidelberg
