A New Machine-Learning Extracting Approach to Construct a Knowledge Base: A Case Study on Global Stromatolites over Geological Time
Xiaobo Zhang, Hao Li, Qiang Liu, Zhenhua Li, Claire E. Reymond, Min Zhang, Yuangeng Huang, Hongfei Chen, Zhong-Qiang Chen
Journal of Earth Science, 2023, Vol. 34, Issue 5: 1358–1373.
In every scientific discipline, large amounts of data are buried in literature repositories and archives, which makes it difficult to extract useful information manually from these data swamps. Machine-learning extraction of data is therefore necessary for big-data-based studies. Here, we develop a new text-mining technique to reconstruct a global database of Precambrian to Recent stromatolites, providing a better understanding of secular changes in stromatolites through geological time. The step-by-step data extraction process is as follows. First, PDF documents of stromatolite-bearing literature were collected and converted into text format. Second, a glossary and tag-labeling system using NLP (Natural Language Processing) software was employed to search for all possible candidate pairs in each sentence of the collected papers. Third, each candidate pair and its features were represented in a factor graph model, using a series of heuristic procedures to score the weight of each pair feature. Occurrence data of stromatolites versus stratigraphical units (abbreviated as Strata), facies types, locations, and ages worldwide were extracted from the literature; the extraction accuracies are 92%/464, 87%/778, 92%/846, and 93%/405, respectively, from 3 750 scientific abstracts, and 90%/1 734, 86%/2 869, 90%/2 055, and 91%/857, respectively, from 11 932 papers. A total of 10 072 unique data items were identified. The newly obtained stromatolite dataset demonstrates that stratigraphical occurrences of stromatolites reached a pronounced peak during the Proterozoic (2 500–541 Ma), followed by a distinct fall during the Early Phanerozoic and overall fluctuations through the Phanerozoic (541–0 Ma). Globally, seven stromatolite hotspots were identified from the new dataset: the western United States, eastern United States, western Europe, India, South Africa, northern China, and southern China.
The proportional occurrences of inland aquatic stromatolites remain rather low (∼20%) in comparison with marine stromatolites from the Precambrian to the Jurassic, and then display a significant increase (30%–70%) from the Cretaceous to the present.
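The second step of the extraction process (glossary tagging followed by a search for candidate pairs within each sentence) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny `PATTERNS` glossary, the function names, and the regular-expression sentence splitter are hypothetical and do not reproduce the NLP software or glossary used in the study. The factor graph scoring of the third step is not shown; the sketch only records the character distance between two mentions as an example of a feature that such a model could weight.

```python
import re
from itertools import product

# Hypothetical mini-glossary, written as regular expressions.
# The study's real glossary and tag set are not reproduced here.
PATTERNS = {
    "Fossil": re.compile(r"\bstromatolites?\b", re.IGNORECASE),
    "Strata": re.compile(r"\b(?:[A-Z][a-z]+\s)+(?:Formation|Group|Member)\b"),
    "Age": re.compile(r"\b(?:Proterozoic|Cambrian|Jurassic|Cretaceous)\b"),
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_mentions(sentence):
    """Return (label, matched text, start offset) for every glossary hit."""
    mentions = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(sentence):
            mentions.append((label, match.group(0), match.start()))
    return mentions


def candidate_pairs(text):
    """Pair each stromatolite mention with every other tagged mention in the same sentence."""
    pairs = []
    for sentence in split_sentences(text):
        mentions = tag_mentions(sentence)
        fossils = [m for m in mentions if m[0] == "Fossil"]
        others = [m for m in mentions if m[0] != "Fossil"]
        for fossil, other in product(fossils, others):
            pairs.append({
                "relation": "stromatolite-" + other[0],
                "target": other[1],
                # Example feature only: character distance between the two mentions.
                "distance": abs(fossil[2] - other[2]),
                "sentence": sentence,
            })
    return pairs


if __name__ == "__main__":
    sample = ("Stromatolites occur in the Gaoyuzhuang Formation of Proterozoic age. "
              "The section is well exposed.")
    for pair in candidate_pairs(sample):
        print(pair["relation"], "|", pair["target"], "|", pair["distance"])
```

In this sketch a sentence without a stromatolite mention yields no candidates, which mirrors the sentence-level scope described above; a real system would use a much larger glossary and a proper tokenizer.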
machine learning / knowledge base construction / stromatolites / Precambrian / knowledge graph