INTRODUCTION
Traditional Chinese medicine (TCM) has a history of over 2,000 years and once played a central role in healthcare in pre-modern East Asia. As an important branch of alternative medicine, it has become increasingly popular worldwide in recent years and has attracted considerable attention from scientists of various disciplines. For example, Refs. [1–3] confirmed the unique treatment effects of acupuncture; Refs. [4–6] provided insights into how TCM prescriptions work via systematic interactions with biological regulation networks; and the 2015 Nobel Prize awarded to Prof. Youyou Tu for her contribution to the discovery of artemisinin in 1977 cast light on the great impact of TCM on human beings.
On the other hand, however, TCM is still mysterious to many people because of the unique philosophy, models and theoretical thinking behind it. Similar to any other healthcare system, TCM contains three basic components: (i) a toolbox of therapeutic technologies to treat patients, (ii) biomedical measurement instruments to observe and measure the physical status of patients, and (iii) a diagnosis-treatment model (DTM) that maps the biomedical observations and measurements of a patient to a “proper” therapy in the toolbox. However, due to the philosophical environment of ancient China and historical technical constraints, TCM developed these components in a unique way.
First, TCM therapies typically have complex internal structures. TCM prescriptions and acupuncture are the two primary therapies of TCM (although there are records of surgeries in its long history). A TCM prescription typically contains multiple ingredients, which may generate a mixture of hundreds of chemical compounds. An acupuncture therapy is usually composed of a series of needle insertions at different locations (called acupoints) on the patient’s body. The combinatorial or sequential nature of TCM therapies provides flexibility to tune treatment adaptively based on the status of patients, but also poses great challenges in quality control and efficacy evaluation. Second, due to historical technological constraints, biomedical measurements in TCM heavily depend on the subjective observations of doctors and rely on natural language to record the experience. The combination of subjective observations and the flexibility of natural language may introduce multiple levels of bias and noise into the measurements, leading to critical technical barriers in data analysis. Third, built on top of Chinese philosophy, the diagnosis-treatment model of TCM is described in a unique language involving many philosophical concepts of ancient China, whose concrete meanings may change over time and be interpreted in different ways. This phenomenon makes it challenging to decode and understand the diagnosis-treatment model of TCM from the perspective of positive science.
All these features shaped TCM into a healthcare system with a unique knowledge representation style and deduction logic, very different from the modern healthcare system developed in the western world on top of anatomy and cell/molecular biology. In the past decades, many efforts have been made to build connections between TCM and modern science, trying to evaluate, understand and reinterpret TCM in a modern way. These efforts can be roughly classified into two categories: (i) drug-discovery oriented research, which aims to identify potential drug candidates and validate them via randomized experiments [7–11]; and (ii) theory-understanding oriented research, which focuses on revealing causal mechanisms or association patterns of the diagnosis-treatment model behind TCM via data-driven approaches [4,12–18]. Although there are many difficult issues in practical implementation, drug-discovery oriented research enjoys a relatively straightforward logic. Theory-understanding oriented research, however, often faces critical challenges at both the methodology level and the data level.
At the methodology level, it is very challenging to design data models that can precisely reflect TCM thinking and/or appropriately approximate the generating procedure of TCM data. At the data level, a major problem is the lack of high-quality data carrying stable signals about the full spectrum of TCM clinical practice. It is not difficult to find a small-scale dataset with hundreds or thousands of patients from one TCM doctor, but such a dataset is often biased toward a small patient population with a certain disease. It is also possible to assemble many small-scale datasets into a large-scale one, but a dataset generated in this way is often a mixture of many inconsistent components, leaving many uncontrollable risks in downstream data analysis.
In this study, we introduce the Zhou Archive, a large-scale database of expert-specific Electronic Medical Records (EMRs), which contains comprehensive information about 73,000+ visits to one TCM doctor by 26,000+ distinct patients over the 35 years from 1980 to 2015. From many perspectives, the archive provides an ideal opportunity to understand TCM in a data-driven way. First, the scale of the archive is large enough to support many data-driven approaches. Second, the 73,000+ visits by 26,000+ patients cover 1,300+ diseases of 16 major disease categories, including cancers, digestive diseases, infectious diseases, neurological diseases, respiratory diseases, cardiovascular diseases, urinary diseases, rheumatism and so on, and are rich enough to reflect all aspects of TCM practice. Third, with data fields for the symptoms of patients, TCM diagnoses and TCM treatments, the archive records all key components of the diagnosis-treatment model behind TCM, making it possible to decode the model in a data-driven way. Moreover, as all EMRs in the archive come from one TCM doctor alone, the underlying logic of the diagnosis-treatment model is more likely to be self-consistent, which is extremely important to the success of data-driven approaches. Finally, in addition to classic TCM features, the archive also contains information about lab tests and diagnoses from the western medicine perspective, allowing us to connect TCM concepts with western medicine.
With the rise of medical big data and the popularity of precision medicine in recent years, real-world studies based on large-scale EMRs have become an important paradigm in healthcare research [19–25]. We hope this study can open a door to this paradigm for TCM-related studies. Like most EMR data in practice, the data in the archive is a mixture of structured data fields, which encode information with a well-designed feature table, and semi-structured/unstructured data fields, which deliver information via semi-structured or free texts. To transform the original EMR data into a well-structured feature table on which statistical analysis can be implemented, we need to discover many TCM-specific and archive-specific technical terms from the archive, map them to their standard feature codes, and properly process the semi-structured and free Chinese texts in the archive to decode information effectively. In this paper, we propose a systematic data processing framework to achieve this goal.
Based on the structured feature table obtained, a series of statistical analyses is implemented to learn principles of TCM clinical practice from the archive. Cross-category association patterns are discovered with various technical tools, and embedding analysis is applied to prescriptions and symptoms. Results from these analyses reveal insights for understanding TCM from a data-driven perspective.
The remainder of this paper is organized as follows. “Description Of The Data” briefly introduces the data structure of the Zhou Archive. “Transferring Semi-Structured EMRs Into A Structured Feature Table” proposes a data processing framework to transform the original semi-structured and unstructured data of the archive into a well-structured feature table. In “Statistical Learning Of The Structured Feature Table”, we analyze the resulting feature table with a series of statistical methods and extract some hidden patterns from the database. Finally, we summarize and discuss this study in the last section.
DESCRIPTION OF THE DATA
The archived EMRs contain 14 distinct data fields of 6 categories: (i) Patient ID and Demographics (ID, Gender, Age), (ii) Visit Date, (iii) Clinical Features (Symptoms, Tongue Picture, Pulse Type, Lab Tests), (iv) Western Medicine Diagnosis (Disease, Disease Category), (v) TCM Diagnosis (TCM Disease, TCM Pathogenesis) and (vi) TCM Treatments (TCM Therapy, TCM Prescription).
The 14 data fields can be classified into three types: 7 structured fields encoding information with well-designed codes (Patient ID, Gender, Age, Visit Date, Disease, Disease Category, TCM Disease), 6 semi-structured fields encoding information with semi-structured texts (Tongue Picture, Pulse Type, Lab Tests, TCM Pathogenesis, TCM Therapy, TCM Prescription), and 1 unstructured field delivering information with free texts (i.e., Symptoms). All these data fields contain missing values.
In the database, the column “Western Medicine Diagnosis” comes from the records of visits to western medicine doctors before the patient came to Prof. Zhou; these diagnoses were recorded by Prof. Zhou in the archive, one per visit. In total, 1,339 distinct diseases appear in the archive, which can be further classified into 16 disease categories: Cancers, Digestive Diseases (DD), Infectious Diseases (InD), Neurological Diseases (ND), Respiratory Diseases (RD), Cardiovascular Diseases (CD), Urinary Diseases (UD), Rheumatism, Gynopathy, Skin Diseases (SD), Hematopathy, Endocrine Diseases (ED), Orthopedic Diseases (OD), Ophthalmological and Otorhinolaryngological Diseases (OOD), Men’s Diseases (MD), and Miscellaneous Diseases (MiD). In terms of TCM diseases, however, only 394 distinct TCM diseases appear, partially due to the higher missing rate of the TCM Disease field (88.4%) compared with the Disease field (17.5%). More detailed information about the patients covered by the archive is provided in Supplementary Figure S1A–S1E.
One third of the patients in the archive visited Prof. Zhou multiple times. These patients with longitudinal records paid 4.7 visits on average within an average time span of 242 days, and the average time gap between two adjacent visits is 65 days. Supplementary Figures S1F–S1H give the detailed distributions of visit frequency, overall time span and time gap between adjacent visits for these patients. Researchers interested in this archive can check the website of the Zhou Archive for TCM Study for detailed information on data structure and data access.
TRANSFERRING SEMI-STRUCTURED EMRs INTO A STRUCTURED FEATURE TABLE
With both structured data fields and semi-structured/unstructured data fields, the original archive is difficult to analyze directly. In this section, we transform the semi-structured and unstructured EMRs of the archive into a well-structured feature table, on which statistical analysis can be conveniently implemented. To achieve this goal, we process the structured fields, the semi-structured fields and the unstructured Symptoms field separately, with different data processing strategies.
Figure 1 shows the route map of the data processing procedure, which takes the original archive as input and returns the following outputs: (i) a feature codebook $C$ which encodes all features generated from the archive, (ii) a term dictionary $D$ which fully covers the vocabulary specific to the archive (including all background words, common TCM terms and special terms used by Prof. Zhou), (iii) a term-feature map $M$ which links the terms in $D$ to the standard feature codes in $C$ they correspond to, and, most importantly, (iv) a well-organized structured feature table $T$ with columns for different features and rows for different records. Different from the raw data in the archive, which delivers information via semi-structured and unstructured texts, the transformed two-dimensional feature table encodes information with a well-designed data format and coding system.
There are a few critical challenges in this data processing procedure due to the semi-structured and unstructured texts in the archive. First, text segmentation and term discovery. As there are no visible word boundaries such as spaces in Chinese texts, the unstructured Chinese texts in the Symptoms field must be segmented into sequences of meaningful terms to decode information. However, because these texts contain many domain-specific words, phrases and technical terms that are previously unknown, text segmentation is entangled with term discovery in this study, and the coupling of these two problems poses great challenges in processing the free texts in the archive. Second, standardization of technical terms. Due to the flexibility of free texts, many technical terms in the archive have multiple variants; to sufficiently extract information from the data, we need to map the different variants of a technical term to its standard code. Third, semantic understanding. We also need to understand the semantic meaning of the semi-structured and free texts in the archive to decode information precisely.
Although many tools have been invented to process Chinese texts in the past decades, it is still not trivial to overcome the above challenges in this study. Here, we propose an integrative data processing framework as a preliminary solution to this important but challenging problem. As the same problem will be encountered in many similar studies in the future, we hope the framework we suggest can serve as a baseline solution for researchers in this field.
Processing the structured and semi-structured data fields
First, we process the structured and semi-structured data fields, transforming them into a feature table. Because the structured data fields already encode information with a well-designed feature codebook, it is straightforward to decode these fields to get the feature codebook $C_1$ and a feature table $T_1$.
For the semi-structured data fields, however, we need to make extra efforts to collect the technical terms in these fields and transform them into their standard feature codes. Taking advantage of the existing data structure of these semi-structured fields, a lot of technical terms can be conveniently extracted. For example, tongue/pulse-related terms and lab tests in the Clinical Feature fields, terms in the TCM Pathogenesis and TCM Therapy fields, as well as herb names in the Prescription field, can be obtained straightforwardly by enumerating the Chinese strings separated by commas or numbers in the corresponding data fields. In total, 5,000+ distinct terms are extracted in this way, forming a term dictionary denoted as $D_2$. Table 1A shows the most frequent terms extracted from each of these semi-structured fields.
These extracted terms need to be transformed into their standard codes before the downstream analysis can proceed. This can be achieved via two typical operations: splitting and mapping. Many terms extracted from these semi-structured fields abbreviate multiple concepts into a single term. For example, the term “taihouhuang” from the Tongue Picture field is the abbreviation of two terms, “thick tongue fur” and “yellow tongue fur”; the term “maixianhua” from the Pulse Type field is the abbreviation of “stringy pulse” and “slippery pulse”; the term “ganshenkuixu” from the TCM Pathogenesis field is the abbreviation of “deficiency of liver” and “deficiency of kidney”. Standardization of these terms can be achieved by identifying the multiple concepts compressed in one term and listing the standard feature codes of these concepts in parallel (e.g., “taihouhuang” → “thick tongue fur, yellow tongue fur”). We call this operation “splitting”, as it divides one technical term into multiple features. On the other hand, many extracted terms refer to the same concept. For example, the term “manyigan” is an abbreviation of “chronic hepatitis B”; the terms “dashengdi” and “xishengdi” refer to the same herb, “dried rehmannia root”. Standardization of these terms can be achieved by “mapping”, i.e., building a mapping table from these terms to their standard feature codes (e.g., “dashengdi” → “dried rehmannia root”, and “xishengdi” → “dried rehmannia root”). Please note that sometimes a combination of splitting and mapping is needed to standardize a term with a complex structure.
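To make the two standardization operations concrete, here is a minimal sketch in Python; the rule tables are small illustrative excerpts written from the examples above, not the archive’s actual term-feature map.

```python
# Illustrative "splitting" and "mapping" rules built from the examples above;
# the archive's real term-feature map is far larger.
SPLIT_RULES = {  # one term -> several standard features
    "taihouhuang": ["thick tongue fur", "yellow tongue fur"],
    "maixianhua": ["stringy pulse", "slippery pulse"],
    "ganshenkuixu": ["deficiency of liver", "deficiency of kidney"],
}
MAP_RULES = {    # several term variants -> one standard feature
    "manyigan": "chronic hepatitis B",
    "dashengdi": "dried rehmannia root",
    "xishengdi": "dried rehmannia root",
}

def standardize(term: str) -> list[str]:
    """Return the list of standard feature codes for one extracted term."""
    if term in SPLIT_RULES:
        return SPLIT_RULES[term]
    if term in MAP_RULES:
        return [MAP_RULES[term]]
    return [term]  # already standard, or handled by combined rules elsewhere

print(standardize("taihouhuang"))  # ['thick tongue fur', 'yellow tongue fur']
print(standardize("xishengdi"))    # ['dried rehmannia root']
```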
In total, 4,000+ features are generated for the 5,000+ extracted terms, resulting in a feature codebook $C_2$. The transformation rules from terms in $D_2$ to features in $C_2$ are summarized in a term-feature map $M_2$, based on which a structured feature table $T_2$ can be established from the semi-structured fields.
Unique properties of the free texts in the Archive
Next, we process the free texts in the unstructured Symptoms field. These free texts contain 1,177,007 Chinese character tokens, of which 2,678 are unique. From the text analysis perspective, these texts are unusual in multiple dimensions. First, they contain a lot of TCM-specific technical terms rarely used elsewhere, as well as many special terms invented by Prof. Zhou that are specific to this archive only. Second, some segments of these texts are highly repetitive. Cutting the free texts into small pieces separated by natural boundaries (such as punctuation marks, ends of lines and so on), we obtained ~241,000 segments, of which ~126,000 are unique. Most of these unique segments are short strings of no more than 10 Chinese characters, and many of them repeat heavily in the free texts: 3 segments appear more than 1,000 times each, and the 2,100+ segments that appear more than 10 times contribute ~90,000 repeats together, which is equivalent to 1/3 of the total number of segments generated from the free texts. We summarize these unique segments into a segment list $S$. Table 1B shows the top 100 segments with the highest repeat frequency in $S$. Third, these texts are written in a unique style that is very different from the classic training corpora for Chinese text mining, which are typically based on news articles.
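The segment statistics above can be reproduced with a few lines of code. The sketch below shows the general idea under the assumption that segments are cut at common Chinese/Western punctuation marks and line ends; the archive’s actual boundary set may differ in detail.

```python
# A sketch of the segment extraction: split each record at natural boundaries
# and count how often each unique segment repeats.
import re
from collections import Counter

def extract_segments(free_texts):
    boundary = re.compile(r"[，。；、！？,.;!?\n]+")  # assumed boundary set
    counts = Counter()
    for text in free_texts:
        for seg in boundary.split(text):
            seg = seg.strip()
            if seg:
                counts[seg] += 1
    return counts

segments = extract_segments(["胃脘胀，食后明显。夜寐差，大便干。", "夜寐差，口干。"])
print(segments.most_common(2))   # e.g., [('夜寐差', 2), ('胃脘胀', 1)]
```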
These facts mean that we need to capture the special technical terms in the archive to establish an archive-specific vocabulary, and that we need a style-robust tool to process the free Chinese texts in the archive. Moreover, as most segments (especially the highly repeated ones) in the segment list deliver one piece of intact information about patients, it is more efficient to achieve semantic understanding with these segments, instead of words or terms, as the basic language units.
Processing the unstructured Symptoms field
In the past decades, many tools for processing Chinese texts have been proposed. In this study, we tried four popular “supervised” methods, namely Jieba, Stanford Parser [26,27], Language Technology Platform (LTP) [28] and THU Lexical Analyzer for Chinese (THULAC) [29,30], and a recently proposed “unsupervised” method called TopWORDS [31], to process the free texts in the unstructured Symptoms field.
The four supervised methods emphasize precise text segmentation under the guidance of a preloaded vocabulary and a high-quality training corpus. They typically match the target texts against words from the preloaded vocabulary, and resolve ambiguous cases by statistical inference based on a model trained on manually segmented and labelled corpora. When the actual vocabulary is covered by the preloaded vocabulary and the writing style of the target texts is close to the training corpus, these supervised methods usually perform quite well. However, a previous study [31] showed that when the actual vocabulary of the target texts contains a lot of words beyond the preloaded vocabulary, they often fail to recognize many of these unregistered words, especially when the writing style of the target texts is very different from the training corpus.
The unsupervised method TopWORDS, however, pays more attention to efficient new word discovery, although it can also be used as a tool for text segmentation. It can effectively discover previously unknown words and phrases when no preloaded vocabulary and proper training corpus are available, or when the preloaded vocabulary and training corpus do not fit the target texts well. Detailed information about the five NLP tools for processing Chinese texts can be found in the Appendix.
As shown in Figure 1, each of these methods returns a term dictionary and a term boundary vector as the outputs for term discovery and text segmentation, respectively. The term dictionary is the set of terms identified by the method, and the term boundary vector $B$ is a vector with the same length as the target texts, whose $i$th element takes one of three values: $B_i=2$ if there is a natural boundary (e.g., punctuation mark, end of line and so on) behind the $i$th position of the target texts, $B_i=1$ if the method puts a term boundary there, and $B_i=0$ otherwise. The detailed term boundary vectors of the different segmentation tools can be found on the website of the Zhou Archive for TCM Study. Table 2 summarizes and compares their performance from multiple angles, from which we can see that both the reported term dictionary and the predicted term boundary vector vary significantly across methods, indicating the critical challenges in processing and understanding domain-specific Chinese texts.
Table 2A summarizes the number of nontrivial terms (i.e., terms with more than one Chinese character) discovered by the different methods. The number varies from the smallest 23,989 terms reported by Jieba to the largest 47,248 terms reported by TopWORDS. Ignoring rare terms that appear only once in the target texts, Table 2B recounts the number of frequent nontrivial terms. The number drops by half to 10,000+ for the four supervised methods, but drops by only 8% for the unsupervised TopWORDS. We note that the term dictionaries reported by the five methods do not match well with each other: the best-matched pair of dictionaries achieves an overlap ratio of only 60%–70% for nontrivial terms and 70%–80% for frequent nontrivial terms, and the overlap ratio of all other pairs varies from 20% to 70%. These facts reflect the critical challenges in term discovery from domain-specific Chinese texts.
Combining the five term dictionaries, we obtain a joint term dictionary $D_3$ of 80,000+ distinct terms: $D_3 = \bigcup_{m=1}^{5} D^{(m)}$, where $D^{(m)}$ denotes the term dictionary reported by the $m$th method.
Within $D_3$, we identified 20,000+ technical terms with clear medical meanings (including 14,300+ symptoms, 3,600+ body parts, 2,000+ disease names, 900+ lab tests and 400+ medical treatments), 60,600+ background terms (i.e., correct words and phrases with no medical meanings), and 5,700+ suspicious terms whose semantic meanings cannot be easily determined. Table 1C lists the top 30 terms of each of the 6 term categories in $D_3$. Table 2C shows the contributions of the different methods to the discovery of technical terms, background terms and suspicious terms, respectively. The results suggest that the supervised methods indeed missed a lot of meaningful technical terms in this study, while the unsupervised TopWORDS discovered 13,047 (61%) technical terms missed by the other methods. Figure 3 shows the length distribution and type distribution of the technical terms in $D_3$ discovered by each method. From these figures, we can see that TopWORDS tends to report longer words than the supervised methods and contributes the most to the discovery of technical terms. A term-feature map for these discovered technical terms is established to achieve term standardization, in a way similar to how $M_2$ was established.
The variation in term discovery naturally leads to variation in text segmentation. Table 2D and 2E compare the performance of the different methods on text segmentation based on the term boundary vectors. We propose two different criteria for comparing two methods: a less rigorous criterion based on segmentation sites, and a more rigorous criterion based on segmented terms. Let $B^{(1)}$ and $B^{(2)}$ be the term boundary vectors of the two methods to be compared. The segmentation site criterion simply counts the number of common sites segmented by both methods, i.e., $\#\{i:\ B^{(1)}_i>0 \text{ and } B^{(2)}_i>0\}$; the segmented term criterion, however, counts the number of common terms segmented by both methods, i.e., $\#\{(i,j):\ i<j;\ B^{(1)}_i, B^{(2)}_i, B^{(1)}_j, B^{(2)}_j>0;\ B^{(1)}_k=B^{(2)}_k=0 \text{ for all } i<k<j\}$. The degree of agreement of the five tested methods varies between 30% and 90% under the segmentation site criterion, and drops to 20%–85% under the more rigorous segmented term criterion. The supervised methods tend to segment the target texts into smaller pieces, while the unsupervised TopWORDS tends to cut the target texts at a larger granularity.
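To make the two criteria concrete, the following sketch computes both agreement counts from a pair of term boundary vectors, assuming the encoding introduced above (2 for a natural boundary, 1 for a method-placed boundary, 0 otherwise).

```python
# A minimal sketch of the two segmentation agreement criteria.
def common_sites(b1: list[int], b2: list[int]) -> int:
    """Segmentation site criterion: positions where both methods cut."""
    return sum(1 for x, y in zip(b1, b2) if x > 0 and y > 0)

def common_terms(b1: list[int], b2: list[int]) -> int:
    """Segmented term criterion: spans whose two endpoints are shared
    boundaries and whose interior has no boundary from either method."""
    count, clean = 0, True        # clean = no one-sided boundary seen so far
    for x, y in zip(b1, b2):
        if x > 0 and y > 0:       # shared boundary closes a candidate term
            if clean:
                count += 1
            clean = True          # a new candidate term starts here
        elif x > 0 or y > 0:      # one-sided boundary breaks the agreement
            clean = False
    return count

b1 = [0, 1, 0, 1, 2]              # method 1 cuts after positions 1, 3, 4
b2 = [0, 1, 1, 1, 2]              # method 2 additionally cuts after position 2
print(common_sites(b1, b2))       # 3 (positions 1, 3, 4)
print(common_terms(b1, b2))       # 2 (the span 2..3 is broken by method 2)
```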
Because many technical terms are missed by each of the tested methods, it is risky to proceed to the downstream analysis based on the texts segmented by any single method. To escape this dilemma, we feed the joint term dictionary $D_3$ to TopWORDS as the preloaded vocabulary, and refit the model on the free texts in the segment list $S$ to do text segmentation. We choose TopWORDS as the segmentation tool because its segmentation results enjoy a proper granularity for semantic understanding of the free texts in the archive.
In total, 8,000+ features are generated for the 20,000+ technical terms discovered, resulting in a feature codebook $C_3$. The transformation rules from terms in $D_3$ to features in $C_3$ are summarized in a term-feature map $M_3$. Mapping the technical terms in the segmented texts to their standard feature codes based on $M_3$, with all background terms ignored, we can transform the free texts in the unstructured Symptoms field into a structured feature table $T_3$ of binary features (with 1 for the presence of a feature, and 0 for its absence).
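As an illustration of this last step, the sketch below builds a small binary feature table from standardized records; the records and feature names are hypothetical.

```python
# A sketch of turning standardized records into the binary feature table T3.
import pandas as pd

records = [
    ["gastric distention", "poor sleep"],   # standardized features, record 1
    ["poor sleep", "dry mouth"],            # record 2
]
features = sorted({f for record in records for f in record})
T3 = pd.DataFrame(
    [[int(f in record) for f in features] for record in records],
    columns=features,
)
print(T3)   # 1 = feature present in the record, 0 = absent
```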
Generating a united feature table via data integration
The structured feature tables $T_1$, $T_2$ and $T_3$ generated from the structured, semi-structured and unstructured data fields of the archive can be further integrated into a united feature table $T$ as the final output of the data processing procedure. Some features may be shared by $T_1$, $T_2$ and $T_3$; we combine the information about these shared features via data integration.
In total, 14,000+ distinct features are involved in the united feature table for the 26,000+ visits, and 26,000+ transformation rules for technical terms were created to establish the table. Detailed contents of the united feature codebook $C$, the united term dictionary $D$, the united term-feature map $M$, and the united structured feature table $T$ can be found on the website of the Zhou Archive for TCM Study.
STATISTICAL LEARNING OF THE STRUCTURED FEATURE TABLE
Based on the structured feature table obtained, a series of statistical analyses can be implemented to learn potential principles of TCM clinical practice from a data-driven perspective. Considering that the missing rates of some data fields are very high, to avoid the potential influence of missing values, in this study we only select technical terms from the following 7 data fields, whose missing rates in the first-visit records are less than 30%: Symptoms, Tongue Picture, Pulse Type, Disease, Disease Category, TCM Pathogenesis and Herbs in TCM Prescription. In total, 7,743 features are involved in these selected data fields, among which 1,926 are rare features whose frequency in the first-visit records is no more than 1. We ignored these rare features in the downstream analysis and focused on the remaining 5,817 frequent features.
Correlation analysis
Our first effort is a correlation analysis to capture the overall correlation structure of all 5,817 selected cross-category features. A correlation matrix is obtained, and most correlation coefficients in the matrix fall into a narrow interval around zero, indicating relatively weak correlation. Figure 4 shows the correlation heat map of a few highly correlated features, each of which correlates with some other feature with a correlation coefficient beyond the threshold.
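A minimal sketch of this screening step is given below, assuming the selected features form a binary pandas DataFrame; the threshold value is illustrative, since the exact cutoff used for Figure 4 is not restated here.

```python
# Screen for features that are highly correlated with at least one other
# feature, then keep the corresponding submatrix for the heat map.
import numpy as np
import pandas as pd

def highly_correlated_block(T: pd.DataFrame, threshold: float = 0.3) -> pd.DataFrame:
    corr = T.corr()                        # Pearson correlations of binary features
    np.fill_diagonal(corr.values, 0.0)     # ignore self-correlation
    keep = corr.abs().max(axis=1) > threshold
    return corr.loc[keep, keep]            # submatrix to draw as a heat map

# sub = highly_correlated_block(T); e.g., seaborn.heatmap(sub) for the plot
```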
From the heat map, we can observe a few blocks of highly correlated features. For example, the largest feature block, highlighted in a black box, reveals that the disease category “cancers” is closely related to the TCM pathogeneses “toxic heat” and “phlegm stasis”, and to a group of herbs: “adenophorae radix”, “glehniae radix”, “buttercup root”, “pseudostellariae radix”, “ophiopogonis radix”, “appendiculate cremastra pseudobulb”, “herba euphorbiae helioscopiae”, “agrimony”, “hedyotis”, “barbated skullcap herb”, “herba ecliptae” and “ligustri lucidi fructus”. The smaller feature block next to the largest one shows that a few features on Pulse Type are closely related. The feature block located at the bottom-left corner reveals a group of herbs (i.e., “lignum phetimiae”, “orientvine stem”, “largeleaf gentian root”, “prepared monkshood mother root”, “aconiti preparata”, “asarum sieboldii”) closely related to the disease category “rheumatism”. These discoveries reflect meaningful TCM knowledge.
Enrichment analysis
To further investigate how TCM concepts such as TCM diseases, TCM pathogeneses and TCM therapies connect to each other and to other features such as diseases, symptoms and herbs, we performed the enrichment analysis below. For simplicity, let us take “stuffiness of stomach”, the most frequent TCM disease in the archive, as an example. First, we selected from the archive all first-visit records in which the feature “stuffiness of stomach” takes value 1; we denote this subpopulation of selected records as $P_1$, and the subpopulation of the other first-visit records as $P_0$. Next, we identified the top 5 diseases, symptoms, TCM pathogeneses and herbs with the highest relative frequency in $P_1$. Third, for each of the selected features, we calculated its odds ratio between $P_1$ and $P_0$ as its enrichment measure with respect to the feature “stuffiness of stomach”. Finally, we plotted the relative frequency and the enrichment measure of all features selected for “stuffiness of stomach” in a bar plot, as shown in Figure 5A. Such an enrichment plot conveys rich information about the TCM disease “stuffiness of stomach”: (i) it is associated with chronic gastritis, chronic superficial gastritis, chronic atrophic antral gastritis, headache and constipation, (ii) the symptom gastric distention is its major signature, (iii) “liver-stomach disharmony”, “dampness and heat resistance” and “stomach weakness with energy stagnation” are the major TCM pathogeneses behind it, and (iv) “processed pinellia preparata”, “cyperi rhizoma”, “perillae caulis”, “coptidis root” and “magnolia bark” are the primary herbs used to treat it. These messages help us understand the basic properties of the feature efficiently from multiple angles.
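The enrichment computation can be sketched as follows, assuming the first-visit records form a binary pandas DataFrame; the anchor and candidate feature names are placeholders.

```python
# Relative frequency and odds ratio of candidate features with respect to
# one anchor feature, following the P1/P0 construction described above.
import pandas as pd

def enrichment(T: pd.DataFrame, anchor: str, candidates: list[str], top: int = 5):
    P1 = T[T[anchor] == 1]                       # records with the anchor feature
    P0 = T[T[anchor] == 0]                       # all other first-visit records
    rows = []
    for f in candidates:
        p1, p0 = P1[f].mean(), P0[f].mean()      # relative frequencies in P1, P0
        odds1 = p1 / max(1.0 - p1, 1e-12)
        odds0 = max(p0, 1e-12) / max(1.0 - p0, 1e-12)
        rows.append((f, p1, odds1 / odds0))      # odds ratio = enrichment measure
    out = pd.DataFrame(rows, columns=["feature", "rel_freq", "odds_ratio"])
    return out.nlargest(top, "rel_freq")         # top features by relative frequency

# enrichment(T, "stuffiness of stomach", symptom_features)
```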
Figure 5 displays the enrichment plots for a few of the most frequent TCM diseases, TCM pathogeneses and TCM therapies in the archive. We can read many insightful messages from these figures. For example, the TCM pathogenesis “dampness-heat” is highly associated with liver-related diseases, and takes “impairment of liver and spleen” and “phellodendri chinensis cortex” as its signature symptom and herb, respectively; the TCM therapy “regulating and harmonizing the liver and spleen” is a regular treatment for liver-related diseases and TCM pathogeneses, and takes “barbary wolfberry fruit”, “pseudostellariae radix”, “deep-fried atractylodis macrocephalae rhizoma”, “salviae miltiorrhizae” and “paeoniae radix rubra” as the primary components of its prescriptions.
Embedding analysis
Considering that the correlation and enrichment analyses utilize the data information in a basically pairwise fashion, they may miss signals reflecting high-order structures in the data. In this section, we analyze the structured feature table obtained from the Zhou Archive from an alternative perspective via embedding methods [32–34]. Different from the previous correlation and enrichment analyses, embedding analysis considers the co-occurrence patterns of different features globally, and embeds features with no geometric meanings (e.g., symptoms and herbs) into a linear space with geometric interpretation.
Treating each feature as a “word” and each record as a “document”, we can naturally apply approaches designed for word embedding to the TCM data. Here, we selected the matrix factorization approach [32] as the primary tool for feature-level embedding, and applied it to the co-occurrence matrix $N=(N_{ij})$ of the frequent features, where $N_{ij}$ counts the number of first-visit records that contain both feature $i$ and feature $j$, to embed the frequent features into a 300-dimensional linear space. The detailed embedding vectors of the features can be found on the website of the Zhou Archive for TCM Study. Features that stay close to each other in the embedding space tend to associate closely or share similar functions. The geometric structure of these embedding vectors can be visualized in a 2-dimensional space by techniques such as multi-dimensional scaling (MDS) [34]. Figure 6 shows the MDS plot of the 50 feature pairs with the shortest within-pair distances in the embedding space, most of which precisely reflect TCM knowledge. For example, the two symptoms in the pair {painful forehead, dizzy forehead} are complications that often happen concurrently, the two herbs in the pair {processed herba pogostemonis, processed folium perillae} have similar functions in expelling cold and relieving vomiting, and the symptom-herb pair {chapped, sweet almond} corresponds to a well-known TCM treatment for chapped and irritated skin.
Beyond the feature-level embedding, we can also embed records into a linear space in a similar way. Representing each record by a vector of 4,776 binary variables, with 1s and 0s standing for the presence and absence of features in the record, we used t-distributed stochastic neighbor embedding (t-SNE) [33] to embed the high-dimensional records into a 2-dimensional representation space. Unlike principal component analysis (PCA), a linear dimension-reduction technique that preserves large pairwise distances by maximizing variance and thus fails on non-linear structures, t-SNE tries to retain the local structures of the data, modeling similarities in the low-dimensional space with a Student t-distribution. Figure 7 demonstrates the results from t-SNE, where each point corresponds to a record and its color stands for the disease category of the record. Among the 16 distinct disease categories in the archive, the 5 major categories, i.e., Cancers, Digestive Diseases (DD), Infectious Diseases (InD), Neurological Diseases (ND) and Respiratory Diseases (RD), together with Miscellaneous Diseases (MiD), contribute ~75% of the records. Interestingly, points associated with the 5 major categories cluster well in the embedding space, while the MiD-related points spread out everywhere. These phenomena reflect the heterogeneity of TCM practice among different disease categories and are consistent with the definition of MiD.
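A corresponding sketch of the record-level embedding is given below; the perplexity setting is an illustrative default rather than a value reported in the paper.

```python
# Record-level t-SNE embedding, colored by disease category as in Figure 7.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_records(T, categories):
    """T: binary records-by-features array; categories: one label per record."""
    coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(T)
    for cat in sorted(set(categories)):
        idx = [i for i, c in enumerate(categories) if c == cat]
        plt.scatter(coords[idx, 0], coords[idx, 1], s=3, label=cat)
    plt.legend(markerscale=3)
    plt.show()
```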
Association pattern discovery
Next, we try to discover association patterns of the selected features from the structured feature table. As all features in the feature table are binary, the data structure perfectly fits the classic Market Basket Analysis (MBA) problem [35] in machine learning, which aims to discover items that tend to be purchased together from a collection of baskets purchased by customers at a supermarket. Association Rule Mining (ARM) [35,36] is the classic solution to the MBA problem: it enumerates all frequent item sets whose frequency (sometimes called support) exceeds a given threshold, and generates association rules whose confidence exceeds a given threshold from these frequent item sets. Although computationally efficient for large-scale datasets and logically straightforward to understand, ARM often generates too many redundant association rules and tends to miss the high-order association patterns that are important in many practical problems.
Recently, Refs. [37,38] reformulated the MBA problem as a statistical model selection problem and proposed a novel solution from the statistical point of view. Assuming that each basket is composed of a collection of item modules (called themes) randomly selected by the customer with different selection probabilities, and that different baskets are generated independently by the same mechanism, Refs. [37,38] approximated the data generation procedure of the baskets via a Theme Dictionary Model (TDM). Starting with an over-complete initial theme dictionary composed of the frequent item sets generated by ARM, and pruning it based on statistical inference and model selection principles, the TDM-based method can effectively discover the themes (especially the high-order ones) in the true theme dictionary. Applying TDM to a collection of classic prescriptions in the history of TCM, Ref. [37] discovered hundreds of herb modules that tend to be used together in TCM practice. Many of these herb modules match well with TCM knowledge and successfully reveal the internal structure of TCM prescriptions from a data-driven perspective.
With the support of the EMRs in the Zhou Archive, which contain both symptoms and prescriptions, in this study we generalize this idea to learn association patterns between a module of symptoms and a module of herbs. Treating symptom-related features and herbs in the prescriptions as “items” and each EMR as a “basket” of these items, we obtained 23,000+ effective “baskets” from the first-visit records in the archive. The original TDM discovers themes of items from a single category. In this study, however, we are more interested in cross-category themes containing both symptoms and herbs, which connect a module of symptoms to a module of herbs and provide information on how TCM treatment is determined based on the observed symptoms. To meet this special request, we modified the original TDM approach by adding filters that label items from different categories and rule out all single-category themes via a pre-screening of the initial theme dictionary. Removing these redundant single-category themes from the initial theme dictionary largely reduces the number of possible partitions of the baskets. We refer to this variant of the original TDM approach as the Cross-category TDM (CTDM) approach. Compared to the original TDM, CTDM enjoys better computational efficiency, as many single-category themes are excluded from the model a priori.
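A minimal sketch of this cross-category pre-screening step is given below; the category sets and candidate themes are placeholders, and the downstream TDM fitting itself is omitted.

```python
# CTDM pre-screening: keep only candidate themes from the initial theme
# dictionary that mix both categories (symptoms and herbs).
def cross_category_themes(candidate_themes, symptom_items, herb_items):
    """Drop single-category candidates before fitting the theme model."""
    kept = []
    for theme in candidate_themes:
        has_symptom = any(item in symptom_items for item in theme)
        has_herb = any(item in herb_items for item in theme)
        if has_symptom and has_herb:
            kept.append(theme)
    return kept

symptoms = {"dizziness", "headache"}
herbs = {"gastrodiae", "ligusticum wallichii"}
candidates = [{"dizziness", "gastrodiae"}, {"dizziness", "headache"}]
print(cross_category_themes(candidates, symptoms, herbs))
# [{'dizziness', 'gastrodiae'}] -- the single-category theme is filtered out
```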
Please note that we deliberately excluded all diagnosis-related features here, so as to link symptoms to herbs directly. We also kept only the first-visit records, because the longitudinal records of the same patient are often highly correlated with each other (the physical condition of a patient typically does not change dramatically within a couple of months, leading to similar symptoms, diagnoses and treatments) and would seriously violate the independent-sample assumption behind TDM. In addition, we removed baskets containing more than 30 items, which would largely slow down the procedure. In total, 5,175 effective items survived this item-basket screening procedure, resulting in a collection of baskets with 20 items on average.
Like the TDM approach, the CTDM approach has two control parameters: the minimum theme frequency and the maximum theme length. With these two parameters properly specified, we discovered ~1,000 cross-category themes from the archive. Table 3 shows the top 60 cross-category themes discovered by the CTDM approach, each of which connects a module of symptoms to a module of herbs, revealing important insights into TCM treatment. For example, the connections between the herb module {asparagus cochinchinensis, lilium davidii} and the symptom modules {poor sleep} and {feel agitated}, the connection between the symptom module {dry mouth} and the herb modules {figwort root, flos chrysanthemi indici}, the connections between the herb module {tribulus terrestris, gastrodiae, ligusticum wallichii} and the symptom modules {dizziness} and {headache}, and the connection between the symptom module {oppression in chest, palpitation} and the herb module {salvia miltiorrhiza} all precisely reflect important principles of TCM practice.
Please note that we clustered these top themes based on their symptom modules and rearranged their locations in the table to deliver information more efficiently. The complete list of discovered themes can be found on the website of the Zhou Archive for TCM Study.
CONCLUSION AND DISCUSSION
In this study, we introduced the Zhou Archive, a large-scale database of expert-specific EMRs containing comprehensive information about 73,000+ visits to one TCM doctor by 26,000+ distinct patients over the 35 years from 1980 to 2015. Processing the text data in the archive via a series of data processing steps, with the help of multiple popular NLP tools for Chinese texts, we transformed the semi-structured EMRs in the archive into a well-structured feature table. A series of statistical analyses was then implemented on the resulting feature table to learn principles of TCM clinical practice from the archive, and the results reveal insights for understanding TCM from a data-driven perspective. Besides the statistical analyses demonstrated in this paper, many other methods and new tools can be applied or developed to dig deeper into this archive. We hope the data processing and analysis framework proposed in this paper can motivate further studies for understanding TCM based on large-scale EMR datasets.