1 Introduction
2 Related work
2.1 Natural language processing (NLP) in construction
2.2 Named entity recognition (NER) in construction
2.2.1 Rule-based NER method
Tab.1 Examples of word variation and/or abbreviation |
Word | Variation/Abbreviation |
---|---|
外悬挑梁 (cantilever beam) | 悬挑外梁 |
加气混凝土砌块 (aerated concrete block) | 混凝土加气块, 加气块 |
水泥粉喷搅拌桩 (cement powder spray pile) | 粉喷搅拌桩, 粉喷桩 |
2.2.2 Learning-based NER method
2.3 Conditional random field (CRF) model
3 CRF-based NER for Chinese construction documents
3.1 Corpus design strategy
3.1.1 Task definition
Tab.2 NER tasks, target NEs, and associated construction documents |
Task | Target NEs | Construction document |
---|---|---|
Identification of responsibility and legal issues | Organization, Party, Law/Regulation | Contract, Bidding document, Correspondence, Other formal communication |
Cost analysis | Material, Equipment, Building parts | Progress report, Project quota |
Progress analysis | Material, Equipment, Building parts | Construction plan, Progress report, Daily log, Meeting minutes |
Quality analysis | Material, Building parts, Construction techniques | Construction plan, Quality report, Daily log, Meeting minutes |
Safety analysis | Equipment, Building parts, Date, Body parts, Injury types | Construction plan, Safety report, Daily log, Meeting minutes |
3.1.2 Word segmentation
Tab.3 Procedures of the ensemble method |
// Prepare initial tag sequences before fusion: |
[1] For each sentence, treated as a sequence of characters, record the segmentation results of the three tools (LTP, Jieba, and THULAC) |
[2] In each of the three results, mark the first character of each word with tag “F” and every other character with tag “E” |
[3] Represent each result as a tag sequence over the sentence, holding one tag per character |
// Form the final tag sequence by fusion: |
[4] For each character in the sentence, collect the list of tags assigned to it by the three tools |
[5] Count the number of “F” tags and “E” tags in this list |
[6] Assign “F” or “E” as the final tag of the character on the basis of the two counts, and form the final tag sequence |
// Transform the final tag sequence into segmented tokens as the fusion result: |
[7] Scan the final tag sequence from left to right; mark an F that is followed by another F as an individual token |
[8] Mark an F followed by an E as the beginning of a token and an E followed by an F as the end of a token; for each end, search back for the nearest beginning and combine the beginning–end pair, together with every character in between, as a token |
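The fusion procedure in Tab.3 can be illustrated with a minimal Python sketch. Two points are assumptions rather than details stated in the table: the final tag is chosen by simple majority vote over the counts of “F” and “E”, and the three segmentations in the usage example are hypothetical outputs, not results actually produced by LTP, Jieba, or THULAC. The function names are illustrative only.

```python
from collections import Counter


def words_to_tags(words):
    """Tag the first character of each word 'F' and every other character 'E'."""
    tags = []
    for word in words:
        tags.extend(["F"] + ["E"] * (len(word) - 1))
    return tags


def fuse_segmentations(sentence, segmentations):
    """Fuse several segmentations of one sentence by per-character voting."""
    tag_sequences = [words_to_tags(words) for words in segmentations]
    for tags in tag_sequences:
        if len(tags) != len(sentence):
            raise ValueError("each segmentation must cover the sentence exactly")

    # Assumption: majority vote. With three tools and two tags, ties cannot occur.
    final_tags = [
        Counter(column).most_common(1)[0][0] for column in zip(*tag_sequences)
    ]

    # Every 'F' opens a new token; every 'E' extends the current token.
    tokens = []
    for char, tag in zip(sentence, final_tags):
        if tag == "F" or not tokens:
            tokens.append(char)
        else:
            tokens[-1] += char
    return tokens


# Hypothetical outputs of three segmentation tools for the same phrase.
candidates = [
    ["人工", "挖孔桩"],
    ["人工", "挖孔", "桩"],
    ["人工挖孔", "桩"],
]
fused = fuse_segmentations("人工挖孔桩", candidates)
print(fused)
```

Rebuilding tokens by opening a new token at every “F” gives the same result as the pairwise scan in steps [7] and [8], because each token consists of one “F” followed by zero or more “E” tags.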
Tab.4 Accuracy comparison |
Model | LTP | Jieba | THULAC | Ensemble |
---|---|---|---|---|
Accuracy | 0.905 | 0.914 | 0.918 | 0.963 |
3.1.3 Named entity (NE) annotation
Tab.5 Tag representation |
Tag location | Beginning | Inside | Outside |
---|---|---|---|
Tag representation | B | I | O |
Tab.6 Nested NE element types |
Types | Example (in Chinese) | Example (in English) |
---|---|---|
Location | 顶层, 一区 | top floor, zone one |
Building components | 梁, 墙, 10#塔吊 | beam, wall, 10# tower crane |
Building material | 混凝土, 砖, 钢 | concrete, brick, steel |
Tab.7 Annotation consistency matrix |
Annotator | A | B | C | Average Kappa |
---|---|---|---|---|
A | – | 92.8% | 93.9% | 93.4% |
B | 92.8% | – | 92.3% | 92.6% |
C | 93.9% | 92.3% | – | 93.1% |
Average | 93.4% | 92.6% | 93.1% | 93.0% |
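The averages in Tab.7 are consistent with taking the mean of each annotator's two pairwise agreement values. The short Python sketch below reproduces that arithmetic; rounding half up to one decimal place is an assumption, chosen because it matches the reported figures.

```python
from decimal import Decimal, ROUND_HALF_UP

# Pairwise agreement values (in percent) as reported in Tab.7.
pairwise = {
    ("A", "B"): Decimal("92.8"),
    ("A", "C"): Decimal("93.9"),
    ("B", "C"): Decimal("92.3"),
}


def one_decimal(value):
    """Round to one decimal place, half up (assumed rounding rule)."""
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def annotator_average(name):
    """Mean of the pairwise values that involve the given annotator."""
    values = [v for pair, v in pairwise.items() if name in pair]
    return one_decimal(sum(values) / len(values))


averages = {name: annotator_average(name) for name in ("A", "B", "C")}
overall = one_decimal(sum(pairwise.values()) / len(pairwise))
print(averages, overall)
```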
3.2 CRF-based NER model
3.2.1 Transformation features (TFs)
3.2.2 State features (SFs) — POS
3.2.3 State features (SFs) — Word state and character state
Tab.8 Types of words/characters in a sentence |
Word | Tag | Word type | Character type |
---|---|---|---|
请 | O | | |
施工 | O | | |
单位 | O | | |
尽快 | O | | |
上报 | O | Left-hand indicator | |
人工 | B | Modifier | “工” is a single modifier suffix and “人工” is a double modifier suffix |
挖孔 | I | Modifier | “孔” is a single modifier suffix and “挖孔” is a double modifier suffix |
桩 | I | Kernel | “桩” is a kernel suffix |
和 | O | Left-hand indicator and right-hand indicator | |
土钉 | B | Modifier | “钉” is a single modifier suffix and “土钉” is a double modifier suffix |
墙 | I | Modifier | “墙” is a modifier suffix |
锚杆 | I | Kernel | “杆” is a single kernel suffix and “锚杆” is a double kernel suffix |
的 | O | Right-hand indicator | |
变更 | O | | |
费用 | O | | |
。 | O | | |
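To make the tagging in Tab.8 concrete, the following Python sketch recovers the named entities from the B/I/O tags of the example sentence. The function name and the handling of a stray “I” tag (treated as outside an entity) are assumptions made for illustration.

```python
def extract_entities(tokens, tags):
    """Join each B token with the I tokens that follow it into one entity."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                entities.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # 'O', or an 'I' with no preceding 'B' (assumed to be outside).
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities


# Tokens and tags copied from Tab.8.
tokens = ["请", "施工", "单位", "尽快", "上报", "人工", "挖孔", "桩",
          "和", "土钉", "墙", "锚杆", "的", "变更", "费用", "。"]
tags = ["O", "O", "O", "O", "O", "B", "I", "I",
        "O", "B", "I", "I", "O", "O", "O", "O"]

entities = extract_entities(tokens, tags)
print(entities)
```

The two recovered entities are bounded by the words that Tab.8 labels as left-hand and right-hand indicators (上报, 和, and 的).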
4 Results analysis
Tab.9 Selected thresholds of statistical feature |
Threshold name | Value |
---|---|
| 3 |
| 1 |
| 10 |
4.1 Model performance
Tab.10 Performance of the CRF model |
Tags | Precision | Recall | F1-measure |
---|---|---|---|
B | 0.835 | 0.790 | 0.812 |
I | 0.892 | 0.816 | 0.853 |
O | 0.954 | 0.990 | 0.972 |
Average | | | 0.879 |
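The F1-measures in Tab.10 can be related to precision and recall through the standard harmonic-mean definition, and the average is consistent with an unweighted mean of the three per-tag F1-measures. The Python sketch below checks both. Because the table reports rounded precision and recall, a recomputed F1 can differ from the reported one in the third decimal place, so the comparison uses a tolerance (an assumption of this sketch).

```python
# Precision, recall, and F1-measure per tag, copied from Tab.10.
reported = {
    "B": (0.835, 0.790, 0.812),
    "I": (0.892, 0.816, 0.853),
    "O": (0.954, 0.990, 0.972),
}


def f1_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {tag: f1_measure(p, r) for tag, (p, r, _) in reported.items()}

# Unweighted (macro) average of the reported per-tag F1-measures.
macro_f1 = sum(f1 for _, _, f1 in reported.values()) / len(reported)

for tag, (_, _, f1) in reported.items():
    print(tag, round(recomputed[tag], 4), "reported:", f1)
print("macro average:", round(macro_f1, 4))
```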
Tab.11 Performance comparison |
Model | Introduced model | Bi-LSTM-CRF | BERT-Bi-LSTM-CRF |
---|---|---|---|
F1-measure | 0.879 | 0.813 | 0.827 |
4.2 Effects of POSF on performance
Tab.12 F1-measures of different feature combinations |
No. | Features | F1-measure of tag B | F1-measure of tag I | F1-measure of tag O | Average |
---|---|---|---|---|---|
1 | TF, CF | 0.653 | 0.766 | 0.944 | 0.788 |
2 | TF, CF, POSF | 0.742 | 0.809 | 0.957 | 0.836 |
3 | TF, WF | 0.745 | 0.814 | 0.975 | 0.845 |
4 | TF, WF, POSF | 0.781 | 0.839 | 0.977 | 0.866 |
5 | TF, WF, CF | 0.798 | 0.859* | 0.978* | 0.878 |
6 | TF, WF, CF, POSF | 0.812* | 0.853 | 0.972 | 0.879* |
Note: * marks the largest number in each column. |
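The contribution of POSF can be read from Tab.12 by pairing each feature combination with its counterpart that adds POSF. The Python sketch below computes the gain in average F1-measure for the three pairs (rows 1 and 2, rows 3 and 4, rows 5 and 6); the dictionary layout and variable names are illustrative.

```python
# Average F1-measure per feature combination, copied from Tab.12.
average_f1 = {
    ("TF", "CF"): 0.788,
    ("TF", "CF", "POSF"): 0.836,
    ("TF", "WF"): 0.845,
    ("TF", "WF", "POSF"): 0.866,
    ("TF", "WF", "CF"): 0.878,
    ("TF", "WF", "CF", "POSF"): 0.879,
}

# Gain from adding POSF to each combination that lacks it.
posf_gain = {}
for features, score in average_f1.items():
    if "POSF" in features:
        continue
    with_posf = features + ("POSF",)
    posf_gain[features] = round(average_f1[with_posf] - score, 3)

for features, gain in posf_gain.items():
    print("+".join(features), "->", gain)
```

On these figures the gain shrinks as the base feature set grows richer, which suggests that WF and CF together already carry much of the information that POSF provides.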
Tab.13 Parameters of the SFs (top six) |
Tag location | Weight | Feature |
---|---|---|
| 1.968 | |
| 1.472 | |
| 1.207 | |
| 1.201 | |
| 1.191 | |
| −1.635 | |
| 3.056 | |
| 1.601 | |
| 1.124 | |
| −1.296 | |
| −1.460 | |
| −1.574 | |
| 2.841 | |
| 2.064 | |
| 1.964 | |
| −2.168 | |
| −2.454 | |
| −3.077 | |
Notes: Definition of the POS tags: a – adjective, b – other noun-modifier, p – preposition, q – quantity, wp – punctuation, ws – foreign words, n – general noun, nd – direction noun, nh – person’s name, nz – other proper noun, nt – temporal noun. |