Named entity recognition for Chinese construction documents based on conditional random field

Qiqi ZHANG, Cong XUE, Xing SU, Peng ZHOU, Xiangyu WANG, Jiansong ZHANG

Front. Eng, 2023, Vol. 10, Issue (2): 237–249.

DOI: 10.1007/s42524-021-0179-8
Construction Engineering and Intelligent Construction
RESEARCH ARTICLE

Abstract

Named entity recognition (NER) is essential to many natural language processing (NLP) tasks, such as information extraction and document classification. Construction documents usually contain critical named entities, and an effective NER method provides a solid foundation for downstream applications that improve construction management efficiency. This study presents an NER method for Chinese construction documents based on the conditional random field (CRF), comprising a corpus design pipeline and a CRF model. The corpus design pipeline identifies typical NER tasks in construction management, enables word-based tokenization, and controls annotation consistency with a newly designed annotation specification. The CRF model engineers nine transformation features and seven classes of state features, covering the impacts of word position, part-of-speech (POS), and word/character states within the context. The model achieves an F1-measure of 87.9% on a labeled construction dataset. Furthermore, as more domain knowledge features are infused, the marginal performance improvement from including POS information decreases, which points to POS customization as a promising research direction for improving NLP performance with limited data.
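
To make the feature description concrete, the following is a minimal Python sketch of how state features covering word position, POS, and word/character context might be built for a word-tokenized sentence, together with an F1 computation. It is not the authors' implementation: the function names (`state_features`, `extract_entities`, `f1_measure`), the feature keys, the one-token context window, the BIO tagging scheme, the POS tags in the example, and the use of entity-level exact-match scoring are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag set is illustrative only


def state_features(sentence: List[Token], i: int) -> Dict[str, object]:
    """Build hypothetical state features for the token at index i."""
    word, pos = sentence[i]
    feats: Dict[str, object] = {
        "word": word,                        # word state
        "pos": pos,                          # part-of-speech state
        "first_char": word[0],               # character states
        "last_char": word[-1],
        "is_first": i == 0,                  # word position in the sentence
        "is_last": i == len(sentence) - 1,
    }
    # Context window of one token per side (the window size is an assumption).
    if i > 0:
        feats["prev_word"], feats["prev_pos"] = sentence[i - 1]
    if i < len(sentence) - 1:
        feats["next_word"], feats["next_pos"] = sentence[i + 1]
    return feats


def extract_entities(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Convert a BIO tag sequence into a set of (start, end, type) spans."""
    spans: Set[Tuple[int, int, str]] = set()
    start, kind = None, None
    for idx, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((start, idx, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = idx, tag[2:]
    return spans


def f1_measure(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a predicted span counts only on an exact match."""
    gold_spans, pred_spans = extract_entities(gold), extract_entities(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = [("混凝土", "n"), ("浇筑", "v"), ("完成", "v")]
    print(state_features(sentence, 1))
    print(f1_measure(["B-ORG", "I-ORG", "O", "B-LOC"],
                     ["B-ORG", "I-ORG", "O", "O"]))
```

In a full pipeline, per-token dictionaries of this kind would be passed to a CRF toolkit, which learns weights for the state features alongside the label-to-label features; the toolkit choice is left open here because the abstract does not name one.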

Keywords

NER / NLP / Chinese language / construction document

Cite this article

Qiqi ZHANG, Cong XUE, Xing SU, Peng ZHOU, Xiangyu WANG, Jiansong ZHANG. Named entity recognition for Chinese construction documents based on conditional random field. Front. Eng, 2023, 10(2): 237–249. DOI: 10.1007/s42524-021-0179-8

RIGHTS & PERMISSIONS

Higher Education Press
