Introduction
Health records contain valuable clinical information which can be used for improvement of disease treatment and for medical research. Most information in health records is represented in an unstructured format, such as free text; and thus the knowledge is not easily obtained from the records for automatic analysis and decision.
In this research, we attempt to extract medical terms from health records of traditional Chinese medicine (TCM). Unlike terms in modern Western medicine, TCM medical terms are flexible and have unclear boundaries, which makes them hard to extract accurately [1].
Generally, current term extraction methods fall into three types: rule-based methods, statistical methods, and combinations of both. A rule-based method usually constructs a dictionary of common words and finds new terms by rule templates [2, 3]. A statistical method takes advantage of statistical characteristics of terms, such as mutual information or the chi-square statistic [4, 5]. Machine learning methods, such as CRF (conditional random field) and HMM (hidden Markov model), are also often used [6, 7].
A rule-based method is simple and has high precision, especially for low-frequency terms. However, it requires a good linguistic knowledge base, and the completeness and soundness of the rules are difficult to guarantee. In addition, unlike other types of terms, TCM terms are highly compressed and have numerous variations, which adds further difficulty for rule-based methods.
Machine learning methods require less domain knowledge and linguistic background. They are widely applicable and can be automated. By taking a variety of term characteristics into account, they can achieve good term recognition. However, without the semantic relations between terms, such methods often fail. In other words, machine learning methods perform very poorly when they confront low-frequency terms or non-normative linguistic expressions. Moreover, machine learning methods cannot be used for large-scale text processing.
The aim of this research is to build a method for extracting clinical terms from TCM health records. A record may contain much information, and our focus is mainly on three portions of a health record: 现病史 (history of present illness), 刻下症 (current symptoms) and 既往史 (history of past illness). Our method begins with some “seed” words. Using POS (part of speech) tagging, rules and the likelihood ratio, it finds new terms in an iterative manner and expands the seeds in each iteration.
The paper is organized as follows. First, the method of TCM term extraction is introduced. Second, the experimental results are presented. Third, the issues encountered in our method and experiments are discussed.
Methods
Word tokenization
The work began with word tokenization. At first, ICTCLAS, a Chinese tokenizer, was used [8]. However, due to the particular nature of TCM health records, the results were not satisfactory: sentences were often partitioned into single characters instead of phrases. To improve the results, we used a method based on the likelihood ratio [9]. In this procedure, adjacent words with high correlation were merged. As for POS tagging, considering the characteristics of the Chinese language, a merged word took the POS of its last component word as its new POS.
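The merging step can be illustrated with a minimal sketch. It assumes a Dunning-style log-likelihood ratio (the G² statistic) over adjacent word pairs, which may differ from the exact formulation in [9]; the function names, the threshold value and the greedy left-to-right merge order are all illustrative assumptions.

```python
import math
from collections import Counter


def _xlogx(k):
    """Return k * ln(k), with the usual convention 0 * ln(0) = 0."""
    return k * math.log(k) if k > 0 else 0.0


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic of a 2x2 contingency table (assumed form of the ratio)."""
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells - rows - cols + _xlogx(n))


def merge_correlated(sentences, threshold):
    """Merge adjacent (word, pos) pairs whose score reaches the threshold.

    The merged word takes the POS of its last component word.
    """
    bigrams, firsts, seconds = Counter(), Counter(), Counter()
    for sent in sentences:
        words = [w for w, _ in sent]
        for a, b in zip(words, words[1:]):
            bigrams[(a, b)] += 1
            firsts[a] += 1
            seconds[b] += 1
    n = sum(bigrams.values())

    def score(a, b):
        k11 = bigrams[(a, b)]
        # Ignore pairs that co-occur no more often than chance predicts.
        if k11 * n <= firsts[a] * seconds[b]:
            return 0.0
        k12 = firsts[a] - k11
        k21 = seconds[b] - k11
        return log_likelihood_ratio(k11, k12, k21, n - k11 - k12 - k21)

    merged = []
    for sent in sentences:
        out = []
        for word, pos in sent:
            if out and score(out[-1][0], word) >= threshold:
                out[-1] = (out[-1][0] + word, pos)  # POS of the last word
            else:
                out.append((word, pos))
        merged.append(out)
    return merged
```

In this sketch a merged word is not merged again in the same pass, because its counts are unknown; repeating the pass on the merged output would be one way to build longer units.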
Finding terms of body structures and clinical descriptions
Usually a TCM clinical term can be divided into two parts: a body structure and its description. The set of description terms is very large, whereas body structure terms have a limited scope. Thus, body structure terms served as the entry point for term extraction in our study, and it was necessary to prepare a dictionary of body structure terms in advance.
From the results of word tokenization, it was easy to find which words contain body structure terms, but difficult to judge whether such a word is a body structure term or a description term, because a body structure term and its description might have been merged together due to their high likelihood ratio.
Before this judgment, the boundaries of terms could be improved by two rules. The first rule states that if there is a qualificative prefix, such as “left” or “upper,” before a body structure term, the two are merged. The second rule states that if the number of characters from the end of a body structure term to the next punctuation mark is less than or equal to the length of the body structure term itself, then the body structure term together with the words from its end up to the punctuation mark is identified as a TCM term or part of a TCM term. The first rule is obvious. The reason for the second rule is that the expanded part often acts as a complement that elaborates on the body structure, and often acts as an attribute “term.”
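The two boundary rules can be sketched as follows. The punctuation set, the qualifier list, the "candidate" tag name and the choice to treat the end of the input as a boundary are assumptions made for illustration, not details given in the text.

```python
PUNCTUATION = set("，。；：、,.;:")  # assumed punctuation inventory


def refine_boundaries(words, bs_dict, qualifiers):
    """Apply the two boundary rules to a list of tokenized words."""
    # Rule 1: merge a qualificative prefix with the body structure term after it.
    tokens = []
    i = 0
    while i < len(words):
        if words[i] in qualifiers and i + 1 < len(words) and words[i + 1] in bs_dict:
            tokens.append((words[i] + words[i + 1], "bs-term"))
            i += 2
        else:
            tokens.append((words[i], "bs-term" if words[i] in bs_dict else None))
            i += 1

    # Rule 2: expand a body structure term up to the next punctuation mark
    # when the tail is no longer than the body structure term itself.
    out = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        if tag == "bs-term":
            j = i + 1
            while j < len(tokens) and tokens[j][0] not in PUNCTUATION:
                j += 1
            tail = "".join(w for w, _ in tokens[i + 1:j])
            if tail and len(tail) <= len(word):
                out.append((word + tail, "candidate"))  # hypothetical tag name
                i = j
                continue
        out.append((word, tag))
        i += 1
    return out
```

For example, with the hypothetical input 左 / 手 / 麻木 / ，, the first rule yields 左手 and the second expands it to 左手麻木, because the two-character tail is no longer than the two-character body structure term.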
There are also two rules used to judge the correctness of an extracted term, i.e., whether it is a body structure term or a descriptive one:
(1) If the extracted term occurs in the dictionary of body structure terms, then assign it the tag “bs-term” (i.e., body-structure term).
(2) If the extracted term ends with a body structure term from the dictionary, then assign it the tag “bs-term”; otherwise, assign it the tag “term.”
The second rule is reasonable because subject-verb phrases are more common in TCM terms than verb-object phrases and subordinate phrases.
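The two judging rules amount to a short lookup, sketched below. The function name and the tiny dictionary in the usage note are hypothetical.

```python
def tag_extracted_term(term, bs_dict):
    """Return "bs-term" or "term" for an extracted term, following rules (1) and (2)."""
    # Rule (1): the term itself is a dictionary entry.
    if term in bs_dict:
        return "bs-term"
    # Rule (2): the term ends with a dictionary entry.
    if any(term.endswith(bs) for bs in bs_dict):
        return "bs-term"
    return "term"
```

With a dictionary containing 头 and 手, this sketch tags 左手 as "bs-term" because it ends with 手, and tags 头痛 as "term" because it only begins with a body structure term.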
Iterated term extraction
At this point, we had obtained some body structure terms and description terms from the corpus, but this was not enough. Some terms might not contain any body structure term from the dictionary, or could not be found due to “noisy words.” To solve these problems, an iterative method was designed, which performed the following steps in each iteration.
(1) Use the following rules to find new terms or to improve the boundaries of body structure terms and description terms. If several adjacent words have the POSs (or tags) listed in a rule, merge them and give the merged word the new tag.
a) bs-term + bs-term = bs-term (e.g., “腰” + “腿” = “腰腿” (waist and legs))
b) qualificative prefix + bs-term = bs-term (e.g., “左” + “手” = “左手” (left hand))
c) bs-term + descriptive = term (e.g., “头” + “痛” = “头痛” (headache))
d) bs-term + conjunctive + descriptive = term (e.g., “脉” + “:” + “弦” = “脉:弦” (taut pulse))
e) bs-term + conjunctive + term = term (e.g., “左侧上肢” + “及” + “下肢偏瘫” = “左侧上肢及下肢偏瘫” (hemiplegia of left hand and left leg))
f) term + descriptive = term (e.g., “脉弦” + “细” = “脉弦细” (taut and thready pulse))
g) term + conjunctive + descriptive = term (e.g., “口干” + “且” + “苦” = “口干且苦” (dry and bitter mouth))
h) Expand terms: if the number of characters from the end (head) of a term to a punctuation mark is less than or equal to its length, expand the term to the punctuation mark.
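Rules a) to g) can be read as rewrite rules over tag sequences. The sketch below applies them repeatedly until nothing changes; the "qualifier" tag name, the rule ordering and the fixed-point loop are assumptions, and rule h) is left out.

```python
# (tag pattern, tag of the merged word); three-tag rules are tried first.
MERGE_RULES = [
    (("bs-term", "conjunctive", "descriptive"), "term"),  # rule d
    (("bs-term", "conjunctive", "term"), "term"),         # rule e
    (("term", "conjunctive", "descriptive"), "term"),     # rule g
    (("bs-term", "bs-term"), "bs-term"),                  # rule a
    (("qualifier", "bs-term"), "bs-term"),                # rule b
    (("bs-term", "descriptive"), "term"),                 # rule c
    (("term", "descriptive"), "term"),                    # rule f
]


def apply_merge_rules(tokens):
    """Merge adjacent (word, tag) pairs until no rule matches any more."""
    tokens = list(tokens)
    changed = True
    while changed:
        changed = False
        for pattern, new_tag in MERGE_RULES:
            size = len(pattern)
            i = 0
            while i + size <= len(tokens):
                window = tokens[i:i + size]
                if tuple(tag for _, tag in window) == pattern:
                    merged = "".join(word for word, _ in window)
                    tokens[i:i + size] = [(merged, new_tag)]
                    changed = True  # re-check the same position after merging
                else:
                    i += 1
    return tokens
```

Every merge shortens the token list, so the loop always terminates. On the example of rule g), 口 + 干 is first merged by rule c) into the term 口干, and the next pass merges 口干 + 且 + 苦 into 口干且苦.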
(2) Find descriptions from the known (and identified) terms and body structure terms.
As previously mentioned, a term could be a subject-verb phrase if it does not end with a body structure term. If so, take the right part of the term as a descriptive sample; otherwise, take the left part. Next, compare each sample with each word whose tag is neither “bs-term” nor “term.” If their minimum edit distance is larger than half of their maximum length, assign the word the tag “descriptive.” This is a method for finding descriptive words from terms. For a body structure term, if its adjacent word is an adjective, a verb or an adverb, assign that adjacent word the tag “descriptive” as well.
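The edit-distance comparison in this step can be sketched as below. The sketch assumes the standard Levenshtein distance with unit costs, and it follows the wording of the text literally ("larger than half of the maximum length"); a caller who wants a similarity test instead would negate the result. Both function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (unit insert/delete/substitute costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def exceeds_half_length(sample, word):
    """True when the edit distance is larger than half of the longer string's length."""
    return edit_distance(sample, word) > max(len(sample), len(word)) / 2
```

For instance, 不利 and 不遂 differ in one of two characters, so their distance of 1 does not exceed half of the maximum length, whereas 不利 and 疼痛 share no characters and their distance of 2 does.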
(3) Verify these descriptions and find new body structure terms.
For the same reason as in the case of body structure terms, some descriptive words could also be terms. So the following rule was used: if consecutive descriptive words lie between two punctuation marks, merge them into a new term, because they form an independent phrase that “looks like” a term.
New body structure terms are found in the same way as descriptive words are found from body structure terms: if the word adjacent to a descriptive word is a noun, assign it the tag “bs-term.”
In all, no matter how irregular the terms are, the expansion of body structure terms and descriptive words allows almost all possible terms to be found at this stage, as long as the corpus is big enough.
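The rule in step (3) that turns a punctuation-bounded run of descriptive words into a term can be sketched as follows. Treating the start and end of the input as boundaries, and requiring at least two words in the run, are assumptions of this sketch.

```python
def merge_descriptive_runs(tokens, punctuation):
    """Merge each punctuation-delimited segment made only of descriptive words."""
    out = []
    segment = []

    def flush():
        # A segment becomes a term only if it has two or more words, all descriptive.
        if len(segment) >= 2 and all(tag == "descriptive" for _, tag in segment):
            out.append(("".join(word for word, _ in segment), "term"))
        else:
            out.extend(segment)
        segment.clear()

    for word, tag in tokens:
        if word in punctuation:
            flush()
            out.append((word, tag))
        else:
            segment.append((word, tag))
    flush()
    return out
```

A segment that mixes tags, such as a body structure term followed by a descriptive word, is left untouched here, since the merge rules of step (1) already cover that case.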
Result optimization
Some terms in the corpus could still not be found, and two measures were taken to improve this situation. First, a sequence of consecutive words between punctuation marks is very likely to compose a term if one of the following three conditions is met: (1) its frequency is larger than three, (2) more than half of its length is covered by words with the tags “bs-term” or “descriptive,” and (3) its edit distance from a known term is larger than half of the maximum length of the two. Since the terms in health records have a high degree of aggregation, frequency and similarity can be strong evidence.
As another optimization, redundant parts of terms were removed. Some words in the terms are meaningless, such as 因 (because), 症见 (symptoms are), or 诊断为 (diagnosed as). So, at last, we built a list of unnecessary words and eliminated the redundancies from the results.
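A minimal sketch of this acceptance test follows. Conditions (1) and (2) are computed directly; condition (3) is passed in as a caller-supplied predicate, here called `matches_known_term`, which is a hypothetical hook rather than part of the described method.

```python
def coverage(segment):
    """Fraction of characters covered by "bs-term" or "descriptive" words."""
    total = sum(len(word) for word, _ in segment)
    covered = sum(len(word) for word, tag in segment
                  if tag in ("bs-term", "descriptive"))
    return covered / total if total else 0.0


def is_likely_term(segment, frequency, matches_known_term=lambda text: False):
    """Accept a punctuation-bounded segment if any of the three conditions holds."""
    text = "".join(word for word, _ in segment)
    return (
        frequency > 3                  # condition (1)
        or coverage(segment) > 0.5     # condition (2)
        or matches_known_term(text)    # condition (3), supplied by the caller
    )
```

In this sketch a segment such as 神志 + 清楚, tagged "bs-term" and "descriptive," is accepted on coverage alone, while a segment with no tagged words is accepted only if it occurs more than three times or the predicate fires.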
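The removal of redundant words can be sketched as below. The list holds only the three examples named above, and stripping them only from the front of a term is an assumption; the full list and the positions actually handled are not specified.

```python
REDUNDANT_WORDS = ("诊断为", "症见", "因")  # only the examples given in the text


def strip_redundant(term, redundant=REDUNDANT_WORDS):
    """Repeatedly strip redundant leading words, never emptying the term."""
    changed = True
    while changed:
        changed = False
        for word in redundant:
            if term.startswith(word) and len(term) > len(word):
                term = term[len(word):]
                changed = True
    return term
```

Stripping only leading occurrences avoids damaging terms in which a character such as 因 appears as part of a longer word.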
Results
The dictionary of body structure terms has 1817 terms from MeSH. The corpus consists of 42 health records of stroke taken from the related TCM literature, which contain 380 different phrases or terms in total. In our study, 406 terms were found, of which 48 were wrong. This gives a precision of 88.18% and a recall of 94.21%.
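The reported rates can be reproduced from the counts above, assuming that every correctly extracted term is one of the 380 reference terms. The variable names are illustrative.

```python
found = 406      # terms extracted by the method
wrong = 48       # extracted terms judged incorrect
reference = 380  # distinct phrases or terms in the corpus

correct = found - wrong          # 358 correctly extracted terms
precision = correct / found      # share of extracted terms that are correct
recall = correct / reference     # share of reference terms that were extracted

print(f"precision = {precision:.2%}, recall = {recall:.2%}")
```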
Table 1 summarizes the precision in each procedure. Table 2 shows some body structure terms found in descriptive terms. Table 3 shows optimized results. Table 4 gives some incorrect terms that the method found.
Discussion
The experiment achieved satisfactory recall and precision rates. It shows that the iterative method based on seed terms is very effective for TCM term extraction. The combined use of ICTCLAS and the likelihood ratio generates a good tokenization result. Specifically, ICTCLAS performed the conventional tokenization, while the likelihood ratio took advantage of the statistical characteristics of terms to ensure a reasonable tokenization, preventing terms from being split into single Chinese characters or misleading words.
Table 1 shows that the preprocessing and the first iteration have very high precision and cover nearly 70% of the terms. This indicates two things. First, most terms are consistent with our hypothesis that they are composed of body structure terms and descriptive terms. Second, the rules can deal with the one-to-many relations in terms. These data suggest that the current rules are sufficient to handle a term extraction task, provided that a perfect dictionary is available.
The precision rates of the second and third iterations decrease significantly. The reason is that the POSs tagged by ICTCLAS are not always correct, because TCM terms are highly condensed and abstract. In fact, ICTCLAS often tokenizes a conventional term into single Chinese characters, and the merging of words by the likelihood ratio sometimes makes the POS tagging even worse.
The results of finding new body structure terms are shown in Table 2. The method found some good body structure terms, but there are also some irrelevant results caused by wrong POSs. However, another way of finding new body structure terms, by the similarity of descriptions, performed well. For example, the body structure term “言语” (speech) was found from the similarity of “肢体活动不利” (limbs cannot move well) and “言语不利” (stammering) in the experiment. In most cases, however, similar descriptions only modify similar body structure terms.
Table 3 shows some improved results after optimization. This procedure addressed two weaknesses, noise and incompleteness of the dictionary, by using the removal word list, term similarity and frequency.
Table 4 reflects the limitations of the method. As for undiscovered terms, it is very hard to extract a term that contains no body structure term, and the complex interplay of modification relationships, POS tagging and tokenization hinders the discovery of such terms.
In our experiments, some incorrect terms were identified because the method cannot distinguish a general description or a complement of a term from a true term. For example, our method accepted 舌质淡胖有 (pale and swollen tongue with), 多由情志不遂 (most due to emotional disturbance) and 为左大脑 (is the left-brain) as correct terms, but obviously they are not. These incorrect terms imply that a more effective method of term boundary identification needs to be researched.
Conclusions
Term extraction is a basic task in health record processing, and in this paper we presented a practical iterative method for extracting terms from the records. The method is based on the assumption that TCM terms are composed of body-structure and descriptive terms, and uses a set of rules to find new terms. The method also uses the MeSH dictionary and the likelihood ratio technique, and achieves a precision rate of 88.18% and a recall rate of 94.21%.
Compliance with ethics guidelines
Cungen Cao and Meng Sun declare that they have no conflict of interest. This article does not contain any studies with human or animal subjects performed by any of the authors.
Higher Education Press and Springer-Verlag Berlin Heidelberg