Introduction
Clinical big data in dermatology: promise and paradox
The digitization of healthcare has endowed dermatology with vast repositories of electronic medical record (EMR) data, accumulated through decades of clinical practice across global institutions[1]. These data encapsulate the complete patient journey, including demographics, diagnostic codes, laboratory results, medication histories, imaging studies, and clinical narratives, representing an unprecedented resource for advancing dermatological science[2]. Yet the field confronts a fundamental paradox: while data volumes expand exponentially, the extraction of actionable knowledge remains severely constrained by structural and semantic heterogeneity[3].
This data paradox is particularly pronounced in dermatology compared with other medical specialties. The discipline’s clinical documentation inherently relies on descriptive characterization of skin lesions, introducing linguistic variability that complicates computational analysis[4]. Furthermore, in healthcare systems where traditional Chinese medicine (TCM) operates alongside Western medicine (WM), clinical records incorporate additional dimensions of complexity: syndrome differentiation patterns, pulse and tongue diagnostics, and herbal prescription data[5]. These multiple layers of heterogeneity collectively produce the “data-rich, information-poor” phenomenon that currently impedes the translation of clinical experience into research evidence and, ultimately, into improved patient outcomes.
Data standardization: a foundational enabler of data-driven dermatology
Data standardization constitutes the critical bridge between raw clinical data and transformative research applications. This process operates across three interrelated dimensions that must be addressed systematically. Syntactic standardization ensures consistent data formats, structures, and coding schemes across disparate source systems (e.g., harmonizing data formats such as HL7 FHIR or standardizing coding systems such as ICD and SNOMED CT). Semantic standardization establishes shared meanings for clinical concepts, enabling genuine interoperability between institutions with divergent documentation practices (e.g., aligning different representations of “eczema” or TCM syndrome classifications under unified ontological definitions). Pragmatic standardization focuses on ensuring that data are fit for specific analytical or clinical use contexts by defining rules for data processing, integration, and interpretation. This includes, for example, (i) constructing derived variables such as a composite “disease severity score” from heterogeneous clinical indicators; (ii) establishing rules for resolving temporally misaligned data in longitudinal analyses such as aligning laboratory results with visit dates; and (iii) defining inclusion criteria and data quality thresholds for specific research tasks. Through such context-sensitive transformations, pragmatic standardization bridges the gap between interoperable data and actionable insights[6,7]. In the context of integrated dermatology, standardization efforts must simultaneously accommodate WM constructs and TCM conceptual frameworks.
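As an illustration of the temporal alignment rule mentioned under pragmatic standardization, the following minimal Python sketch attaches each laboratory result to the closest visit date within a fixed window. The function name, the tuple layout, and the 14-day window are assumptions made for this example; they are not specified in the text, and an appropriate window would have to be chosen per study.

```python
from datetime import date


def align_labs_to_visits(visits, labs, max_gap_days=14):
    """Attach each lab result to the closest visit within max_gap_days.

    visits: list of date objects for one patient
    labs:   list of (date, test_name, value) tuples for the same patient
    Returns (aligned, unmatched): aligned maps each visit date to a list of
    (test_name, value); unmatched keeps labs with no visit inside the window
    so that they are flagged rather than silently dropped.
    """
    aligned = {v: [] for v in visits}
    unmatched = []
    for lab_date, name, value in labs:
        # Closest visit by absolute difference in days (None if no visits).
        nearest = min(visits, key=lambda v: abs((v - lab_date).days), default=None)
        if nearest is not None and abs((nearest - lab_date).days) <= max_gap_days:
            aligned[nearest].append((name, value))
        else:
            unmatched.append((lab_date, name, value))
    return aligned, unmatched
```

Returning the unmatched results separately keeps the alignment rule auditable: the window can be tightened or relaxed for a given analysis without losing track of what was excluded.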
The transformation of standardized EMR data enables downstream applications that can fundamentally reshape healthcare delivery and research. Clinical decision support systems can leverage standardized data to deliver evidence-based recommendations at the point of care, potentially enhancing diagnostic accuracy and treatment selection[8]. More significantly, artificial intelligence (AI) applications demand standardized, high-quality data for model training and validation, positioning standardization as a prerequisite for AI-enabled healthcare innovations[9]. Thus, data standardization is not merely a technical process but a fundamental enabler of precision dermatology, personalized therapeutics, and data-driven discovery.
Psoriasis and integrated medicine: an exemplary case for standardization
Psoriasis provides an ideal model for exploring EMR data standardization in integrated Chinese and Western medicine, for several compelling reasons. First, psoriasis exemplifies the therapeutic potential of integrated medicine. Clinical guidelines increasingly endorse combination approaches that leverage the strengths of both medical systems[10,11]. From the TCM perspective, psoriasis pathogenesis is attributed to heat-toxin in the blood phase (Xuefen), with common syndrome patterns including blood-heat (Xue Re Zheng), blood-stasis (Xue Yu Zheng), and blood-dryness (Xue Zao Zheng)[12]. WM conceptualizes the disease through inflammatory and metabolic mechanisms, particularly the IL-23/IL-17 axis and lipid metabolism dysregulation[13]. The conceptual intersections between these frameworks, such as the relationship between “phlegm-stasis interconnection” (Tan Yu Hu Jie) in TCM and hyperlipidemia in WM, offer rich opportunities for integrative research that can only be realized through standardized data collection and analysis[14–16].
Second, psoriasis reveals the complementary strengths of Western and Chinese therapeutic approaches. Biologic agents in WM face limitations including long-term safety concerns, high costs, and emerging drug resistance[17,18]. Their single-target blockade mechanisms may inadequately address the disease’s complex, multi-pathway pathophysiology. Conversely, TCM formulations such as Qing Bi Yin employ multi-component, multi-target strategies that regulate key immune pathways including Th17 differentiation and S1P/S1PR1 signaling[19–21]. Standardized data can illuminate how these approaches might be optimally sequenced or combined to advance therapeutic innovation.
Third, the treatment of psoriasis presents an urgent need for TCM data standardization. Currently, the absence of uniform standards for TCM syndrome classification constrains the translation of master practitioners’ experiential knowledge into evidence-based research that could benefit the broader medical community. Recent studies demonstrating the efficacy of formulations such as Xiaoyin Jiedu Decoction, which has been shown to modulate S1P/S1PR1 signaling and Th17 differentiation[12,22], underscore the importance of systematic data collection for advancing integrated practice. Standardization efforts can preserve and validate this clinical wisdom, making it accessible to rigorous scientific inquiry.
Fourth and finally, psoriasis carries a substantial global burden while offering well-validated outcome measures that anchor standardization efforts[23]. Established assessment tools, such as the Psoriasis Area and Severity Index (PASI), Body Surface Area (BSA) involvement, and the Dermatology Life Quality Index (DLQI), provide common metrics that enable meaningful comparisons across treatment approaches, patient populations, and healthcare systems. This combination of global relevance and measurement maturity positions psoriasis as an ideal vehicle for demonstrating how standardized integrated data can generate real-world evidence (RWE) applicable worldwide.
Standardization process and methods for psoriasis EMR data
Data source integration and governance
The integration of EMR data from multiple sources constitutes the foundational step in any standardization endeavor (Figure 1). Psoriasis patient data typically reside in disparate systems within healthcare institutions, including hospital information systems, laboratory information systems, pharmacy order systems, and clinical documentation platforms[24]. Each system operates with distinct data structures, coding schemes, documentation practices, and update frequencies, creating heterogeneity that mirrors the interdisciplinary complexity of integrated medicine itself. Establishing a robust governance framework is therefore essential to address data quality challenges that would otherwise propagate through subsequent analyses.
Data governance must address multiple dimensions of quality systematically. Missing value handling requires nuanced clinical judgment because absence patterns carry meaning. Certain TCM diagnostic elements may be absent in WM-focused encounters, while some laboratory tests may not be routinely ordered in TCM-predominant settings. Distinguishing data that are missing not at random from data that are missing completely at random is therefore essential for appropriate analytical handling. Outlier identification must account for the natural variability of psoriasis severity measures. For example, PASI scores can range from nearly 0 to over 50 in cases of erythrodermic psoriasis, necessitating clinical context to distinguish true outliers from legitimate extreme values. Temporal alignment is crucial for longitudinal analyses and demands consistent timestamp formats, time zone handling, and recognition of seasonality effects that may influence disease activity and treatment response. Patient deduplication algorithms must account for variations in name transcription, identification number assignment, and potential changes in demographic information over extended follow-up periods[1,25].
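The outlier rule described above can be sketched as a small triage function that separates impossible values from legitimate but extreme ones. This is a minimal illustration, not a prescribed procedure: the function name and the four status labels are hypothetical, and the review threshold of 50 is an assumption taken from the example in the text. The upper bound of 72 is the maximum score the PASI instrument can produce.

```python
def check_pasi(value, review_above=50.0):
    """Classify a recorded PASI score for quality control.

    Returns one of "missing", "invalid", "review", or "ok".
    """
    if value is None:
        return "missing"   # handled by the missing-data strategy, not here
    if not 0.0 <= value <= 72.0:
        return "invalid"   # outside the instrument's possible range
    if value > review_above:
        return "review"    # possible but rare: route to a clinician
    return "ok"
```

Values flagged for review are kept in the dataset pending clinical confirmation, which reflects the point that extreme scores in erythrodermic psoriasis are legitimate observations rather than errors.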
Core data elements
Western medicine core data elements
WM data elements for psoriasis encompass several critical categories that require systematic capture and standardization to enable meaningful, integrated research (Table 1). Diagnostic information focuses on the International Classification of Diseases, Tenth Revision (ICD-10) coding for psoriasis (L40.x series) and its common comorbidities, which reflect the systemic inflammatory nature of the disease. These comorbidities include hyperlipidemia (E78.x), hypertension (I10–I15), diabetes mellitus (E10–E14), and psoriatic arthritis (M07.x). Core laboratory indicators include lipid profiles, which are increasingly recognized as markers of metabolic dysregulation in psoriasis pathogenesis; inflammatory markers tracking disease activity; and liver function tests, which are essential for monitoring treatment safety[26,27]. Medication data capture systemic treatments, such as methotrexate, cyclosporine, acitretin, apremilast, and biologics that target TNF-α, IL-17, and IL-23, as well as topical therapies, including corticosteroids, vitamin D analogues, and calcineurin inhibitors. Disease severity measures include clinician-assessed metrics, such as PASI scores, BSA involvement percentages, and physician global assessments, as well as patient-reported outcomes, such as the DLQI. These measures enable a comprehensive evaluation of treatment effectiveness from clinical and quality-of-life perspectives.
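The ICD-10 ranges listed above lend themselves to a simple lookup. The sketch below, with hypothetical names, maps a recorded code to the diagnostic categories named in the text; it assumes codes follow the usual letter plus two digits pattern with an optional subcategory, and it does not attempt to handle national ICD-10 modifications.

```python
import re

# Category ranges as listed in the text: (chapter letter, first, last).
CATEGORY_RANGES = {
    "psoriasis": ("L", 40, 40),
    "hyperlipidemia": ("E", 78, 78),
    "hypertension": ("I", 10, 15),
    "diabetes_mellitus": ("E", 10, 14),
    "psoriatic_arthritis": ("M", 7, 7),
}


def categorize_icd10(code):
    """Return the category for an ICD-10 code such as 'L40.0', or None."""
    match = re.fullmatch(r"([A-Z])(\d{2})(?:\.?\w+)?", code.strip().upper())
    if match is None:
        return None  # not a recognizable ICD-10 code
    letter, number = match.group(1), int(match.group(2))
    for category, (chapter, low, high) in CATEGORY_RANGES.items():
        if letter == chapter and low <= number <= high:
            return category
    return None  # valid shape, but not one of the tracked categories
```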
Traditional Chinese medicine core data elements
TCM data elements present distinctive standardization challenges arising from their qualitative nature, hermeneutic complexity, and terminological variation across different schools of practice (Table 1). Syndrome differentiation (Bian Zheng) constitutes the most critical TCM variable, requiring extraction from admission diagnoses, discharge diagnoses, or clinical progress notes[12]. Common psoriasis syndrome patterns include: blood-heat syndrome (Xue Re Zheng), characterized by active inflammatory lesions; blood-stasis syndrome (Xue Yu Zheng), associated with chronic plaque morphology; blood-dryness syndrome (Xue Zao Zheng), presenting with prominent scaling; and dampness-heat syndrome (Shi Re Zheng), featuring exudative lesions[28]. Establishing a standardized keyword library is essential to identify these patterns in unstructured text, accounting for synonymous expressions, regional terminology variations, and evolving diagnostic conventions across generations of practitioners.
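A keyword library of this kind could, in its simplest form, look like the following Python sketch. The entries are illustrative assumptions limited to the four syndromes named above (Chinese terms plus their Pinyin forms); a production library would be far larger and curated by TCM experts. Plain substring matching is used here for brevity and does not handle negation or uncertainty (for example, a note stating that a syndrome was ruled out).

```python
# Illustrative keyword library: syndrome label -> surface forms.
SYNDROME_KEYWORDS = {
    "blood-heat": ["血热证", "血热", "xue re zheng"],
    "blood-stasis": ["血瘀证", "血瘀", "xue yu zheng"],
    "blood-dryness": ["血燥证", "血燥", "xue zao zheng"],
    "dampness-heat": ["湿热证", "湿热", "shi re zheng"],
}


def find_syndromes(text):
    """Return the sorted syndrome labels whose keywords occur in the text."""
    lowered = text.lower()  # makes Pinyin matching case-insensitive
    found = {
        label
        for label, keywords in SYNDROME_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    }
    return sorted(found)
```

Keeping the library as data rather than code allows synonyms and regional variants to be added by domain experts without touching the matching logic.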
Extracting information from the four diagnostic methods (Si Zhen) requires natural language processing (NLP) techniques that are specifically adapted to classical Chinese medical terminology[28]. Descriptors of tongue appearance, such as “red tongue”, “dark purple tongue with petechiae”, “yellow greasy coating”, and “white coating”, must be identified from physical examination records where they may be embedded within free-text descriptions. Pulse characteristics, including “slippery rapid pulse”, which indicates dampness-heat, and “wiry choppy pulse”, which suggests blood stasis, require sophisticated pattern matching from clinical narratives, which may use varied descriptive conventions. Symptom descriptions, such as “severe itching”, “burning skin lesions”, “dry constipation”, and “irritability”, need to be extracted from chief complaints and histories of present illness, where they may be expressed in either classical or vernacular terms. Herbal prescription data introduce additional complexity due to multiple nomenclature systems (including Latin binomials, Pinyin transliterations, and common Chinese names), dosage variations that reflect individual patient presentations, and diverse preparation methods (e.g., decoction, granules, and pills) that may affect bioavailability and therapeutic effects[29,30].
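The nomenclature problem for herbal prescriptions can be illustrated with a small alias table that maps Chinese names, Pinyin transliterations, and Latin binomials to one canonical identifier. The two herbs below are chosen only as examples, and the choice of the Pinyin form as the canonical key is an assumption of this sketch, not a recommendation from the text.

```python
# Illustrative alias table: canonical name -> known surface forms.
HERB_ALIASES = {
    "dan shen": ["丹参", "danshen", "dan shen", "salvia miltiorrhiza"],
    "huang qin": ["黄芩", "huangqin", "huang qin", "scutellaria baicalensis"],
}

# Inverted index for constant-time lookup of any alias.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in HERB_ALIASES.items()
    for alias in aliases
}


def normalize_herb(name):
    """Map a recorded herb name to its canonical form, or None if unknown."""
    key = " ".join(name.strip().lower().split())  # trim and collapse spaces
    return ALIAS_TO_CANONICAL.get(key)
```

Unknown names return None instead of a best guess, so that unmapped entries can be queued for expert review rather than merged incorrectly.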
Quality control and data iteration
Given the inherent subjectivity and terminological diversity of TCM elements across practitioners and institutions, quality control for integrated EMR data demands particular attention to these elements. For syndrome differentiation classification, dual independent review by clinical experts, ideally including both TCM specialists and dermatologists familiar with integrated practice, is recommended, along with structured adjudication procedures for discordant cases[29]. Validating against established clinical criteria ensures the semantic accuracy of extracted concepts. Benchmarking against expert-annotated corpora enables a quantitative assessment of the performance of extraction algorithms.
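Dual independent review naturally yields two label sequences per record set, and agreement between them can be quantified. The text does not name a specific statistic; Cohen's kappa is used below as one common choice for two raters, computed as (observed agreement − chance agreement) / (1 − chance agreement). The function names are hypothetical, and the second helper simply lists the record indices that would go to adjudication.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same records."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Proportion of records on which the two reviewers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each reviewer's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def discordant_cases(labels_a, labels_b):
    """Indices of records on which the two reviewers disagree."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
```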
Continuous monitoring of data quality metrics, including completeness rates (the proportion of expected data elements present), consistency checks (logical coherence across related variables), and temporal validity (the appropriate sequencing and timing of clinical events), enables iterative improvement of extraction algorithms and standardization procedures[31]. Machine learning approaches can identify patterns in discordant classifications by systematically analyzing sources of disagreement to refine extraction rules and improve automated accuracy over time. This creates a virtuous cycle of data enhancement: algorithms trained on expert-validated data progressively improve extraction quality, and the newly validated output expands the training corpus, which in turn improves future model iterations. These quality assurance mechanisms are essential for building the trustworthy data infrastructure required to support big data analysis and digital diagnostics.
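Two of the quality metrics described above can be sketched in a few lines of Python. The record layout (a list of dictionaries) and the field names `admission_date` and `discharge_date` are assumptions for illustration; the temporal rule shown, discharge before admission, is only one example of a sequencing check.

```python
from datetime import date


def completeness_rates(records, expected_fields):
    """Proportion of records in which each expected field is present."""
    total = len(records)
    rates = {}
    for field in expected_fields:
        # A field counts as present if it exists and is neither None nor "".
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        rates[field] = present / total if total else 0.0
    return rates


def temporal_violations(records):
    """Indices of records whose discharge date precedes the admission date."""
    return [
        i
        for i, r in enumerate(records)
        if r.get("admission_date") and r.get("discharge_date")
        and r["discharge_date"] < r["admission_date"]
    ]
```

Tracking these values per extraction run makes it possible to see whether a change to the extraction rules actually improved the data.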
Transformation application scenarios
Building an integrated Chinese and Western medicine psoriasis specialty database
Standardized data allows for the creation of extensive research databases that combine WM data with TCM syndrome patterns and herbal prescriptions. Compared with traditional EMR repositories, this specialized database offers distinct advantages. Standardization ensures consistency across different time periods and departments, enabling longitudinal analyses that were previously confounded by documentation variability. Organic integration of TCM syndrome information with WM clinical indicators provides complete data for integrated medicine research. This preserves the holistic perspective essential to TCM while enabling quantitative comparison with Western metrics. Conversion of unstructured text into structured data enables direct statistical analysis and machine learning applications. Strict quality control processes ensure accuracy and reliability sufficient for regulatory-grade RWE.
Future extensions point toward multimodal data integration. Incorporating clinical photographs of skin lesions and tongue appearances would enable computer vision analysis for objective severity assessment and diagnostic support. Integrating genomic data could support pharmacogenomic investigations into genetic predictors of treatment response to WM and TCM therapies, potentially revealing biomarkers that guide personalized treatment selection. These expansions would transform the database from a clinical repository into a comprehensive research platform capable of supporting discovery across multiple investigative modalities.
Supporting real-world studies of integrated medicine
The standardized database goes beyond evaluating therapeutic effects to enable the exploration of the underlying mechanisms of integrated TCM-WM psoriasis treatment through advanced data mining techniques. Studies exploring mechanisms benefit from the ability to analyze associations between TCM syndromes, laboratory parameters, and treatment responses on an unprecedented scale. Association rule mining, machine learning, and network analysis methods can reveal relationships between TCM syndrome types, lipid metabolism indicators, and medication treatments, generating hypotheses about psoriasis pathogenesis and integrated therapeutic mechanisms that can inform subsequent experimental investigations.
Machine learning algorithms can identify previously unrecognized patterns with direct clinical and public health implications. For example, do patients with blood-stasis syndrome exhibit elevated LDL cholesterol levels and increased cardiovascular risk? If so, this TCM phenotype could identify a subgroup requiring enhanced cardiovascular monitoring. Do heat-clearing and blood-cooling treatment approaches correlate with improvements in inflammatory markers and lipid profiles, indicating potential dual-benefit therapeutic strategies? These questions exemplify the translational potential of integrated data analysis. This mechanistic research illuminates the scientific basis of TCM syndrome differentiation and treatment and provides a rationale for developing novel, integrated treatment strategies that target both inflammatory pathways and metabolic abnormalities.
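A question such as whether blood-stasis syndrome co-occurs with elevated LDL cholesterol can be framed as an association rule and scored with the three standard measures: support (share of patients with both items), confidence (share of patients with the antecedent who also have the consequent), and lift (confidence divided by the overall share with the consequent). The sketch below is a minimal illustration with a hypothetical function name and item labels; it assumes each patient is represented as a set of items.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent.

    transactions: list of sets, one set of items per patient.
    """
    n = len(transactions)
    has_a = sum(1 for t in transactions if antecedent in t)
    has_c = sum(1 for t in transactions if consequent in t)
    has_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    if n == 0 or has_a == 0 or has_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = has_both / n
    confidence = has_both / has_a
    lift = confidence / (has_c / n)  # > 1 means more co-occurrence than chance
    return support, confidence, lift
```

A lift above 1 indicates co-occurrence beyond what independent frequencies would predict; it describes an association in the data and does not by itself establish a causal link.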
However, it is important to acknowledge that while RWE derived from standardized databases is invaluable for hypothesis generation and understanding real-world practice patterns, it is generally complementary to, rather than a replacement for, randomized controlled trials (RCTs). RWE can provide insights into treatment effects, patient subgroups, and long-term outcomes in diverse populations, often in settings where RCTs may be difficult or impractical to conduct. Yet, RCTs remain the gold standard for establishing causality and testing interventions under controlled conditions. Thus, integrating findings from both RWE and RCTs can provide a more comprehensive understanding of treatment efficacy and safety, leading to more informed clinical decision-making and the development of personalized therapies.
Empowering AI applications
The availability of large-scale, standardized datasets allows for the development of AI-assisted TCM diagnostic tools to support clinical practice, education, and research dissemination. Machine learning models can predict TCM syndrome patterns based on WM laboratory results and symptom descriptions. This provides decision support for practitioners who are less experienced in TCM diagnosis[32]. It also has the potential to improve diagnostic consistency across clinical settings. Recent studies have demonstrated the feasibility of using algorithms, such as LASSO regression, support vector machines, and random forests, to identify gene expression patterns characteristic of different TCM syndromes in patients with psoriasis[33]. This suggests that molecular signatures may eventually complement traditional diagnostic methods. NLP techniques can extract TCM diagnostic information from clinical narratives at scale, facilitating the analysis of practitioner experience and the identification of expert diagnostic patterns that would otherwise remain tacit. Clinical decision support systems are an advanced application in which AI models integrate comprehensive patient information to recommend treatments based on both Western medical guidelines and TCM syndromes. These systems have the potential to democratize access to integrated medical expertise, improving care consistency across practice settings and potentially reducing geographic disparities in access to integrated medicine. Recent advances in federated learning offer pathways for multi-center collaboration without compromising data privacy[34]. This enables the development of more generalizable models that reflect diverse patient populations while maintaining compliance with data protection regulations.
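To make the federated learning idea concrete, the sketch below shows the aggregation step of federated averaging, one widely used scheme in which each site trains locally and shares only model parameters, which a coordinator combines weighted by local sample counts. The text does not specify an aggregation method, so this choice is an assumption; the function name and data layout are hypothetical, and real deployments add safeguards such as secure aggregation that are omitted here.

```python
def federated_average(site_updates):
    """Combine per-site model parameters, weighted by sample count.

    site_updates: list of (n_samples, weights), where weights is a list of
    floats of equal length at every site. No patient-level data is exchanged.
    """
    total = sum(n for n, _ in site_updates)
    if total <= 0:
        raise ValueError("at least one site must contribute samples")
    dim = len(site_updates[0][1])
    averaged = [0.0] * dim
    for n, weights in site_updates:
        if len(weights) != dim:
            raise ValueError("all sites must share the same parameter shape")
        for i, w in enumerate(weights):
            averaged[i] += (n / total) * w  # larger sites contribute more
    return averaged
```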
Limitations and future directions
Current limitations and challenges
Several significant limitations in the current standardization of EMR data for integrated medicine must be acknowledged. These challenges define the boundaries of current capabilities and the priority areas for methodological innovation. First, the incompleteness of historical data constrains retrospective analyses. The lack of information from the four TCM diagnostic methods (Si Zhen) in existing records complicates efforts to construct complete diagnostic pictures and may introduce systematic bias.
Second, the subjective nature of TCM diagnosis introduces variability. Different practitioners may classify the same patient into different syndrome categories based on their training, lineage, and clinical experience. While this inter-rater variability reflects sophisticated contextual reasoning, it challenges attempts at standardization, particularly given the absence of a true gold standard for TCM syndrome classification. This absence has important implications: it limits the ability to objectively evaluate model performance, complicates benchmarking across studies, and introduces uncertainty in supervised learning tasks. In practice, this limitation may be partially mitigated through expert consensus approaches, multi-annotator agreement frameworks, and multi-center validation strategies, which can provide more robust and generalizable reference standards for both data standardization and algorithm development[35].
Third, technological tools for TCM NLP remain underdeveloped. Beyond general challenges such as classical linguistic constructions and metaphorical descriptions, TCM clinical text presents both syntactic and semantic complexities[36,37]. Syntactically, clinical notes often lack standardized sentence structures, include fragmented or telegraphic expressions, and vary widely across practitioners and institutions[35]. Semantically, many TCM terms are highly context-dependent, where a single term or character may carry multiple meanings depending on clinical context, and different expressions may refer to similar underlying concepts. In addition, synonymy across traditions and implicit knowledge embedded in narrative descriptions further complicate entity recognition and relation extraction[37]. These characteristics limit the effectiveness of conventional rule-based and statistical NLP approaches. However, recent advances in large language models (LLMs) offer promising avenues for addressing these challenges, particularly in handling contextual ambiguity, cross-domain knowledge integration, and low-resource language settings[31,38]. Future work integrating domain-specific pretraining and expert-informed fine-tuning may substantially improve NLP performance in TCM contexts.
Fourth, the lack of validated reference standards impedes method development. The absence of large-scale, expert-annotated corpora for TCM syndrome classification hinders the validation of automated extraction methods and the objective comparison of competing approaches. Fifth, data quality issues further affect the reliability of research. Inconsistent coding practices, missing values, and documentation errors are magnified in integrated medicine contexts, which require harmonization across two distinct medical frameworks. Finally, generalizability beyond single institutions remains uncertain: models developed at one institution may transfer poorly to others because of population-specific characteristics and local practice patterns.
Recommendations for future work
Addressing these limitations requires coordinated action across multiple levels. First, front-end interventions in EMR system design can improve data quality at the source. For example, structured data entry modules for TCM diagnosis, including dropdown menus for syndrome patterns and checkboxes for tongue and pulse characteristics, would dramatically improve the completeness and consistency of the data. However, these tools must be designed collaboratively with clinicians to capture diagnostic nuances while supporting efficient workflows. Second, the development of dermatology-specific integrated medicine data standards is a critical infrastructure priority. Standardized TCM terminology and coding systems developed through consensus processes involving professional societies and regulatory bodies would facilitate multi-center data sharing. Third, integration of emerging technologies offers methodological advances. LLMs could substantially improve the extraction of information from unstructured clinical narratives. Explainable AI methods are particularly relevant for TCM applications because clinical acceptance requires that algorithmic recommendations be interpretable within traditional diagnostic frameworks. Fourth, multi-center collaborative research through federated learning architectures offers a way to overcome limitations of single-center studies while preserving data privacy. Finally, prospective validation studies are needed to assess the clinical utility of AI-assisted TCM tools in real-world settings, evaluating not only technical performance but also the impact on diagnostic consistency, treatment appropriateness, and patient outcomes.
Conclusion
Standardizing EMR-based clinical data for the integrated treatment of psoriasis with both Western and Chinese medicine represents a foundational investment in the future of dermatology. By creating comprehensive frameworks that incorporate both WM and TCM data elements, we can transform heterogeneous clinical documentation into structured, analyzable resources. These resources will enable the generation of rigorous RWE and AI-assisted clinical decision support. Most importantly, this work develops rigorous methodologies for standardizing TCM data within an integrated framework. It creates a replicable model that showcases the value of integrated medicine to the international research community. This model contributes uniquely to worldwide efforts to improve skin health through innovation and interdisciplinary collaboration.
© The Author(s) 2026. This article is published by Higher Education Press on behalf of People’s Medical Publishing House.