Extracting Named Entity Using Entity Labeling in Geological Text Using Deep Learning Approach
Qinjun Qiu, Miao Tian, Zhong Xie, Yongjian Tan, Kai Ma, Qingfang Wang, Shengyong Pan, Liufeng Tao
Journal of Earth Science, 2023, Vol. 34, Issue 5: 1406-1417.
Artificial intelligence (AI) is key to mining and enhancing the value of big data, and the knowledge graph is one of its cornerstones, serving as the core foundation for integrating statistical and physical representations. Named entity recognition is a fundamental task in building knowledge graphs and depends on a high-quality corpus, yet the field of geology currently lacks such a corpus, especially in Chinese. In this paper, based on the conceptual structure of geological ontology and an analysis of the characteristics of geological texts, a classification system of geological named entity types is designed with the guidance and participation of geological experts, a corresponding annotation specification is formulated, an annotation tool is developed, and the first named entity recognition corpus for the geological domain is annotated from real geological reports. In total, 698 512 words and 23 345 entities were annotated. The paper also explores the feasibility of a model pre-annotation strategy and presents a statistical analysis of the distribution of technical and term categories across genres and of the consistency of corpus annotation. Based on this corpus, two models are selected for experiments: A Lite Bidirectional Encoder Representations from Transformers (ALBERT)-Bidirectional Long Short-Term Memory (BiLSTM)-Conditional Random Fields (CRF), and ALBERT-BiLSTM. The results show that the F1-scores of the two models reach 0.75 and 0.65, respectively, providing a corpus basis and technical support for information extraction in the field of geology.
ontology / geological reports / named entity recognition / geological corpus construction / semi-automated annotation platforms / deep learning
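To make the reported F1-scores concrete, the following is a minimal sketch of how an entity-level F1 can be computed from tag sequences. It is not the authors' evaluation code. It assumes a BIO tagging scheme and exact-match scoring (an entity counts as correct only if both its type and its boundaries match), neither of which is stated in the abstract, and the labels `ROCK` and `MIN` are hypothetical placeholders rather than categories from the paper's classification system.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    Assumption: a stray I- tag with no matching open span is ignored.
    """
    spans: Set[Span] = set()
    start, label = None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match, entity-level precision, recall and F1."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: ROCK and MIN are placeholders, not the paper's categories.
gold_tags = ["B-ROCK", "I-ROCK", "O", "B-MIN", "O"]
pred_tags = ["B-ROCK", "I-ROCK", "O", "O", "B-MIN"]
print(precision_recall_f1(gold_tags, pred_tags))
```

In this toy case the prediction recovers one of the two gold entities and proposes one spurious entity, so precision, recall and F1 all equal 0.5. Exact-match scoring is a strict choice. A relaxed variant that credits partially overlapping boundaries would give higher numbers on the same output.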