1 Introduction
Multivariate time series (MTS) data are prevalent across various sectors, including healthcare [1,2], finance [3], and cybersecurity [4–6]. To enhance the efficiency and accuracy of analyzing MTS data, machine learning techniques are often deployed in downstream tasks, ranging from predicting stock market trends [7] to assessing patient data [8,9].
During MTS data collection, practitioners across various sectors often encounter issues such as irregularities in datasets [10,11]. An example can be seen in the healthcare sector, where the monitoring of patient health metrics via wearable devices may not always be uniform. This lack of uniformity can result from several factors, such as device malfunctions or irregular usage patterns by individuals, leading to data being captured at irregular intervals. Irregularity can be formally defined as the presence of uneven time intervals between observations. As shown in Fig.1, regular multivariate time series are observed at consistent, evenly spaced time intervals with no missing observations. Conversely, irregular time series are observed at inconsistent or uneven time intervals, with not all variables recorded at each timestamp, resulting in missing values. These irregularities pose challenges to recent advanced machine learning models [12,13], which typically rely on data being complete and uniformly sampled. Given this deviation from regularly and completely sampled data, there is a need to adapt machine learning methodologies to ensure effective analysis of irregular multivariate time series (IMTS) data.
Fig.1 Comparison of regular multivariate time series and irregular multivariate time series. Different colors represent different variables in the MTS. Solid circles denote observed data points, while hollow circles indicate missing data points
Over the years, IMTS data analysis has gone through an evolution in techniques. In [14], early IMTS methods were categorized into repairing methods [15,16], spectral methods [17,18], and kernel-based methods [19,20]. Yet these works fail to harness the full potential of deep models, particularly recurrent ones. To address this, models originally developed for natural language, such as Recurrent Neural Network (RNN)-based methods [2,11,21,22], attention-based methods [10,23,24], and transformer-based methods [25–27], were adapted to IMTS data. However, IMTS data differ fundamentally from natural language data: each dimension is typically recorded by a sensor, and the sensors carry inherent relationships between them. Recent works have addressed this gap by integrating inter-series relationships [23,26,28]. Among them, Graph Neural Networks (GNNs) have revolutionized the way we handle IMTS data, owing to their ability to leverage dependencies between sensors represented as nodes in a graph.
In GNN-based approaches, sensors with missing data, depicted as nodes, can assimilate insights from their connected neighbours. This enables GNNs to make informed predictions even with missing values. The pioneering works [29,30] applied a graph aggregation method after the first encoding stage in RNNs to enhance imputation accuracy or representation learning. Alternatively, [31] aimed to address this issue by learning the embeddings and dynamics of sensors through temporal-aware attention, with [32,33] expanding upon its foundation by considering extra interval information and a transfer learning strategy. Liu et al. [34] conceptualized the correlation of dynamic variables as a polynomial of a temporal matrix. These advancements indicate a growing interest in employing GNNs for analyzing IMTS. Unfortunately, these methods often presuppose static sensor relationships [29,35] or depend on attention mechanisms that require access to complete datasets [31–33]. Acquiring this prior knowledge or early access to complete data is challenging in streaming data scenarios, where data points are collected over time from multiple sources or sensors, each representing a distinct variable or dimension in the time series. To address the dynamic nature of streaming data in IMTS, ongoing research needs to develop GNN-based models capable of incrementally updating their understanding, without the prerequisite of future data.
In this paper, we introduce a novel instance-attention GNN methodology, specifically tailored to enhance the processing of IMTS data. Starting with minimal initial data, our approach proactively assesses node relationships, enabling instantaneous graph learning and adaptation. Central to our method is a dual strategy: firstly, reducing the initial data dependency to build model robustness, thus minimizing reliance on predefined graph structures or extensive datasets; secondly, implementing a continuous learning mechanism via instance-based attention, which fosters the real-time evolution of the dependency graph within an RNN framework. This innovative architecture not only facilitates the prompt updating of sensor embeddings, but also offers the versatility to cater to the specific needs of downstream tasks, whether data imputation or embedding extraction. Furthermore, our model is equipped with a frequency-specific handling strategy, ensuring it adeptly meets the varied demands of IMTS data analysis. Our contributions are as follows:
● We introduce a novel instance-attention strategy that learns and dynamically updates the graph structure in response to newly incoming data at each timestamp.
● Our model is versatile, efficiently managing both imputation and non-imputation tasks, with customized strategies for different sampling rates in IMTS.
● Experimental results show that our model’s performance surpasses current state-of-the-art approaches.
2 Related work
2.1 Classical methods for IMTS
Classical methods for handling irregular multivariate time series (IMTS) data have traditionally been categorized into three types [14]: spectral-based, repair-based, and kernel-based methods. Recognizing that spectral-based and repair-based methods often lacked accuracy or required specific spectral structures, [14] introduced a kernel-based approach. This method distinguished itself by employing a unique generalization of the inner product using non-parametric kernel functions. Subsequent advancements in kernel-based methodologies were presented by [19,20], which developed a novel approach for classifying sparse and irregularly sampled time series. However, these kernel-based methods, while innovative, faced limitations due to their reliance on kernel functions and the assumption of linear relationships, potentially restricting their applicability across diverse time series datasets.
2.2 Neural network methods for IMTS
Building on classical methodologies, the analysis of IMTS data has significantly evolved towards deep learning. GRU-D [21] marked this transition by employing the Gated Recurrent Unit (GRU) architecture to innovatively integrate missing patterns through masking and time intervals. BRITS [36] further developed data imputation techniques using bidirectional RNNs, avoiding strict assumptions on the data dynamics. Meanwhile, the use of Generative Adversarial Networks (GANs) with a modified GRU in [37] sought to learn the distribution of multivariate time series for missing value imputation, yet challenges remained in fully capturing complex interrelations, especially the continuous nature of time-series data.
To fully capture the continuous nature of time-series data, ordinary differential equations (ODEs) have been utilized. ODE-RNNs [11,22] signified a pivotal development in modeling continuous-time dynamics, particularly suited for non-uniformly sampled series. By integrating ODEs, these models proficiently managed varying time intervals. Building on this, Continuous Recurrent Units (CRUs) [2] advanced the concept further, allowing the hidden state to evolve through linear stochastic differential equations within an encoder-decoder architecture, thus enhancing the model’s capacity to interpret and process the temporal dynamics inherent in IMTS.
Recognizing the intrinsic nature of IMTS data, typically sourced from sensor measurements, there has been a shift towards leveraging attention [10,23,24] and transformer [25–27] techniques to address the nuanced inter-series relationships. Warpformer [26], an attention-based method, provided innovative solutions for intra-series irregularity and inter-series discrepancy by combining input representation with a warping module and attention mechanism, thereby significantly enhancing the management of sparse, irregularly sampled multivariate time series. In parallel, [24] developed an attention framework that embeds continuous time values, thus improving the handling of variable-length series and boosting performance in tasks like interpolation and classification. In contrast to these methods, GNNs [38] offer a more direct and interpretable approach by explicitly capturing the dependencies between sensors represented as nodes in a graph, providing a robust framework for addressing the challenges of irregularity in MTS data.
2.3 GNN methods for IMTS
GNNs stand out as a particularly promising framework in handling IMTS data, adept at capturing complex relationships due to their intrinsic design for graph-structured contexts, where nodes symbolize entities and edges depict interrelations. Central to GNN functionality is message passing, a process where nodes amalgamate received messages from neighbours to refine their representation, thus enriching it with local neighbourhood information.
Specifically, in the realm of MTS, frameworks such as METRO [39] exemplify the general GNN application, though challenges such as irregular patterns and missing values necessitate more tailored approaches for IMTS. Efforts in irregular time series modelling have seen advancements with methods such as Raindrop [31], employing temporal attention to discern pertinent temporal segments amidst irregular intervals, and subsequent enhancements by [32], integrating time intervals with decay functions for improved encoding and temporal attention.
Furthermore, adapting to evolving data structures in real-time scenarios, dynamic GNNs, as explored by [40,41], offer updated graph structures to accommodate shifting patterns, albeit necessitating further research on scalability and computational efficiency. Meanwhile, traditional methods such as those of Wei et al. [30] and GRIN [29] focus on predefined attention mechanisms and known graph topologies, respectively, revealing a gap in handling the dynamic graph structures inherent in real-world streaming. The dual-phase methodology of Barros et al. [35], merging RNN and GNN capabilities, exemplifies the effort to marry temporal feature extraction with graph-based synthesis, pointing to the need for models that can adapt to both evolving graph structures and the real-time demands of streaming data.
Despite GNNs’ proven effectiveness in leveraging static and prior known relationships, their full potential in dynamic, real-time environments remains fertile ground for further investigation.
2.4 Self-supervised learning in GNN methods
Self-supervised learning in Graph Neural Networks (GNNs) has recently gained attention due to its ability to enhance feature learning without extensive labeled data. [42] categorize current research in this direction into three classes based on the self-supervised learning (SSL) techniques explored.
The first class involves contrasting representations to distinguish between positive pairs (node level, graph level or sample level) and negative pairs, enhancing representation methods [43]. The second class masks data, training the model to predict the masked parts or some properties, leveraging the temporal aspect of time-series data [44]. In the third class, data is masked, but instead of predicting the masked parts, methods generate subgraphs or node features that resemble the masked data [45]. Self-supervised learning in GNNs has significant applications in recommendation systems, addressing common challenges including popularity bias and interaction noise. For example, AutoCF [46] employs a masked graph autoencoder to aggregate global information by reconstructing masked subgraph structures, thereby enhancing representation discrimination ability. CGI [47] adaptively learns to drop edges or nodes, creating optimized graph structures in an end-to-end manner and reducing interaction noise.
Our proposed method shares some similarities with the masking methods in self-supervised GNNs, particularly the second class that predicts masked values. However, we employ a different approach by using recurrent values to enable the imputation of missing data. In our study, when there are no missing values, we use the ground truth values as the masked values to guide the model’s predictions. When there are missing values at a particular timestamp, we leverage the already learned models in the recurrent structure to extract representations and use these representations for imputation.
3 The DynIMTS methodology
In this section, we introduce our proposed algorithm DynIMTS (Dynamic IMTS). We begin by describing the preliminary formulation of the problem. Following this, we delve into the overall framework of DynIMTS, describing each component in detail. Finally, we present our tailored strategy for handling data with high sampling rates, as well as the implementation for downstream tasks.
3.1 Preliminaries
In a regular MTS, each data point consists of $N$ individual series, and the $i$-th series is represented as a length-$T$ vector, denoted by $\mathbf{x}_i$. This vector is collected by a sensor over the time period from $1$ to $T$. The entire data point is represented as a matrix of these vectors, $\mathbf{X} \in \mathbb{R}^{N \times T}$, where $\mathbf{X} = [\mathbf{x}_{:,1}, \ldots, \mathbf{x}_{:,T}]$, with each column $\mathbf{x}_{:,t}$ corresponding to a vector collected at a specific time stamp $t$.
In IMTS, different series may record data at varying intervals, leading to the presence of missing values in the matrix $\mathbf{X}$. To manage these missing values, a mask matrix $\mathbf{M} \in \{0,1\}^{N \times T}$ is used, which indicates the presence or absence of data at each position in $\mathbf{X}$. Specifically, the $(i,t)$-th value of $\mathbf{M}$, denoted as $m_{i,t}$, is set to 1 if the value $x_{i,t}$ in $\mathbf{X}$ is not missing, and is set to 0 otherwise. The elementwise product $\mathbf{M} \odot \mathbf{X}$ can be used to zero out the missing values, resulting in a matrix $\tilde{\mathbf{X}}$.
Typically, a training dataset consists of $B$ data points. In imputation tasks, the primary objective is to fill in the missing values in each $\mathbf{X}$, with the dataset represented as $\{(\mathbf{X}^{(b)}, \mathbf{M}^{(b)})\}_{b=1}^{B}$. For classification tasks, imputation might not be necessary if a reliable representation of the data can be learned directly from incomplete data. In this case, the dataset is denoted as $\{(\mathbf{X}^{(b)}, \mathbf{M}^{(b)}, y^{(b)})\}_{b=1}^{B}$, where $y^{(b)}$ represents the class label associated with each data point $\mathbf{X}^{(b)}$.
In MTS, each time series is collected by a sensor, and these sensors may exhibit connections, such as geographical proximity. To model the pairwise relationships between sensors, a graph structure is employed, where each node represents a sensor. Specifically, a graph is defined as $\mathcal{G} = (\mathcal{V}, \mathbf{A})$, with $\mathcal{V}$ representing the set of nodes and $\mathbf{A} \in \mathbb{R}^{N \times N}$ being the adjacency matrix. Each node is associated with a sensor. In this matrix, the element $\mathbf{A}_{ij}$ quantifies the correlation between nodes $i$ and $j$, with a larger value indicating a higher correlation. As these relationships can change over time, the graphs corresponding to different time stamps are denoted as $\mathcal{G}_1, \ldots, \mathcal{G}_T$, representing the graphs dynamically learned at each time stamp from $1$ to $T$, respectively.
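To make the notation concrete, the following minimal PyTorch sketch builds a toy IMTS sample, its mask matrix, and the zeroed-out input matrix. The tensor shapes and variable names are our own choices for illustration, not the paper's released code.

```python
import torch

# A minimal sketch of the notation above (shapes and names are illustrative).
N, T = 4, 6                                  # N sensors, T time stamps
X = torch.randn(N, T)                        # raw readings, one row per sensor

# Mask matrix M: 1 where a value is observed, 0 where it is missing.
M = (torch.rand(N, T) > 0.3).float()

# The element-wise product zeroes out the missing entries, giving the matrix
# that is actually fed to the model.
X_tilde = M * X

print(X_tilde)
print("observed ratio:", M.mean().item())
```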
3.2 Framework
The DynIMTS framework, depicted in Fig.2, is a recurrent architecture designed to leverage spatial-temporal properties for IMTS data. At its core, each recurrent unit is a spatial-temporal encoder composed of three components: graph learning, embedding learning and spatial-temporal learning.
In the graph learning component, the framework utilizes the sensor embeddings learned at the previous timestamp $t-1$. Each instance within the batch contributes its own unique sensor embeddings, reflecting distinct data characteristics. The primary objective of this component is to integrate the individual information into a cohesive, unified graph for time stamp $t$. Initially, the edge weights for each data sample are carefully estimated to tailor the graph structure to the unique attributes of each instance. Following this, an instance-attention mechanism is employed to integrate these diverse graphs. This integration results in a unified graph that captures and reflects the inter-instance relationships, significantly enhancing the model’s capability to generalize across varied data samples.
The embedding learning begins by taking the current timestamp inputs for each node. These inputs are processed through an MLP which operates across all instances in batch mode, using shared graph structure and adjacency matrix learned in the graph learning component, although the embeddings are instance-specific. The MLP outputs new embeddings, which are then passed through a Graph Convolution Network (GCN) for message passing. This step leverages the graph structure to update each node’s representation by integrating neighbourhood embeddings, enriching the missing values’ information.
In the spatial-temporal learning component, the instance embeddings learned in the embedding learning component go through a recurrent graph convolution network to learn the sensor embedding, complemented by normalization and dropout techniques to stabilize and regularize the learning process. This produces a refined instance-wise sensor embedding for the current timestamp, which is fed back into the recurrent unit in the next time stamp. This cycle repeats, continuously evolving the graph’s structure and sensor embeddings.
Fig.2 Framework of DynIMTS. The model is a recurrent structure based on a spatial-temporal encoder and consists of three main components: embedding learning, spatial-temporal learning, and graph learning. $\mathbf{X}^{(b)}$ denotes a sample in a batch, and $\mathcal{G}_t$ denotes the corresponding graph structure at time stamp $t$. The final learned sensor embedding is used for classification
Ultimately, the final graph representation serves as the input for task-specific components. Depending on the application, this could involve: i) A spatial-temporal decoder for imputation tasks, aiming to fill missing values by leveraging neighbourhood embeddings and the learned graph structure; ii) Several classification layers for classification tasks, focusing on harnessing the spatial-temporal features encoded in the graph. The network is trained end-to-end, optimizing either an imputation loss for data recovery tasks or a classification loss for predictive tasks. This training approach ensures that the DynIMTS framework adapts to the specific characteristics of the provided spatial-temporal data, maximizing its performance on the designated tasks. Through its recurrent structure and integration of graph-based learning with instance attention mechanisms, the DynIMTS framework offers a robust solution for IMTS data on either imputation or classification tasks.
In our approach, missing values at timestamp $t$ are addressed by updating sensor embeddings using a combination of methods. We employ the representations from the encoder model at timestamp $t-1$ and then decode these representations to impute the missing values using the decoder model. Both the encoder and decoder models are trained with data from timestamps $1$ to $t-1$. More specifically, we propose a two-step embedding learning process: transformation and message-passing. In the message-passing step, the strategy integrates information from neighboring observed nodes, effectively representing missing values as embeddings with integrated information. Following this, the sensor embedding is further updated in the spatial-temporal learning process, ensuring accurate representation and handling of missing values. In summary, the model incorporates a recurrent structure that leverages past data and a graph neural network that leverages neighborhood data for imputation.
3.3 Graph learning
Graph Learning aims to learn the connections or dependencies that exist amongst the sensors. Initially, the instance-wise sensor embeddings learned at timestamp $t-1$ for instance $b$'s $i$-th dimension, denoted as $\mathbf{h}_{i,t-1}^{(b)}$, are input. Based on such input, this process first computes the cosine similarity between sensor embeddings, which is used as the initialization of the adjacency matrix, that is
$\mathbf{A}_t^{(b)}[i,j] = \cos\big(\mathbf{h}_{i,t-1}^{(b)}, \mathbf{h}_{j,t-1}^{(b)}\big) = \dfrac{\langle \mathbf{h}_{i,t-1}^{(b)}, \mathbf{h}_{j,t-1}^{(b)} \rangle}{\|\mathbf{h}_{i,t-1}^{(b)}\| \, \|\mathbf{h}_{j,t-1}^{(b)}\|},$
where the learned similarities form a matrix denoted as $\mathbf{A}_t^{(b)}$.
Instance Attention integrates the adjacency matrices learned for different instances into one. To do so, we utilize an instance-based attention mechanism based on the query, key and value matrices to calculate attention weights. This approach enhances the overall performance by acknowledging and responding to the inherent variation present across multiple instances. We use a single head, and denote by $\mathbf{W}_Q$, $\mathbf{W}_K$ and $\mathbf{W}_V$ the parameter matrices specific to the head and by $d$ the embedding dimension. Then we have
Then we have the integrated adjacency matrix as
in which $\mathbf{A}_t$ is the integrated graph at time point $t$, $\sigma(\cdot)$ is the sigmoid activation function, $\mathbf{W}_A$ denotes the linear transformation parameters, $\mathrm{norm}(\cdot)$ gives the max-min normalization of the matrix, and $\mathcal{B}$ is the batch used in the previous round.
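As a concrete illustration, the following PyTorch sketch shows one way the per-instance cosine-similarity graphs and their instance-attention fusion could be realized. The mean-pooling of embeddings, the single-head scoring, and the class and variable names are assumptions made for this sketch, not the released DynIMTS implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceGraphLearner(nn.Module):
    """Sketch: per-instance cosine-similarity graphs fused by a simplified
    single-head attention over instances, then sigmoid + max-min normalization."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (B, N, d) sensor embeddings from the previous time stamp.
        B, N, d = H.shape
        Hn = F.normalize(H, dim=-1)
        A_inst = Hn @ Hn.transpose(1, 2)                  # (B, N, N) cosine graph per instance

        # Instance attention: score each instance by its pooled embedding.
        pooled = H.mean(dim=1)                            # (B, d)
        scores = self.q(pooled) @ self.k(pooled).T / d ** 0.5
        weights = scores.softmax(dim=-1).mean(dim=0)      # (B,) weight per instance

        A = torch.einsum('b,bij->ij', weights, A_inst)    # fuse into one shared graph
        A = torch.sigmoid(A)
        return (A - A.min()) / (A.max() - A.min() + 1e-8) # max-min normalization

# Example: fuse the graphs of a batch of 8 instances with 10 sensors each.
A_t = InstanceGraphLearner(16)(torch.randn(8, 10, 16))
```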
3.4 Embedding learning
Embedding learning aims to capture a latent representation of the initial data through a two-step process: transformation and message-passing. The transformation step involves converting the value $\tilde{x}_{i,t}$ from the $i$-th sensor at timestamp $t$ into a latent space of dimension $d$. To incorporate the missing information, $\tilde{x}_{i,t}$ is concatenated with $m_{i,t}$, the $(i,t)$-th element of the mask matrix $\mathbf{M}$. This transformation is mathematically represented as:
$\mathbf{z}_{i,t}^{0} = \mathbf{W}_i \big[\tilde{x}_{i,t} \,\|\, m_{i,t}\big],$
where $\mathbf{W}_i$ denotes the transformation weights for the $i$-th sensor and $[\cdot \,\|\, \cdot]$ gives a column-wise concatenation.
To accommodate the irregularity often found in data, where some embeddings may be null, we employ a message-passing function within a Graph Convolution Network (GCN). This function facilitates the substitution of null embeddings with integrated information from neighbouring nodes. We define nodes $i$ and $j$ as neighbours if their connectivity in the adjacency matrix exceeds a pre-specified threshold. Message passing (MP) is then conducted as follows:
where $\mathbf{W}_g$ denotes the transformation parameters to be learned, $\rho(\cdot)$ represents the activation function (PReLU here), and $\mathbf{A}_t$ is the adjacency matrix learned in the graph learning component. We denote the final integrated embedding after the MP for the $i$-th dimension at timestamp $t$ as $\mathbf{z}_{i,t}$. Note that the learned $\mathbf{z}_{i,t}$ is specific to each instance, and we can also denote it as $\mathbf{z}_{i,t}^{(b)}$ if necessary.
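The two-step embedding learning described above can be sketched as follows in PyTorch. The per-sensor linear layers, the mean-based edge threshold, and the row-normalization of the adjacency matrix are illustrative assumptions rather than the exact published formulation.

```python
import torch
import torch.nn as nn

class EmbeddingLearning(nn.Module):
    """Sketch of the two-step embedding learning: per-sensor transformation of
    the (value, mask) pair, then one GCN-style message-passing step over the
    learned graph."""
    def __init__(self, num_sensors: int, dim: int):
        super().__init__()
        # One small transformation per sensor, taking [value, mask] as input.
        self.transform = nn.ModuleList([nn.Linear(2, dim) for _ in range(num_sensors)])
        self.msg = nn.Linear(dim, dim)
        self.act = nn.PReLU()

    def forward(self, x_t: torch.Tensor, m_t: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # x_t, m_t: (B, N) values and mask at time t;  A: (N, N) learned adjacency.
        B, N = x_t.shape
        inp = torch.stack([x_t, m_t], dim=-1)                  # (B, N, 2)
        z = torch.stack([self.transform[i](inp[:, i]) for i in range(N)], dim=1)  # (B, N, d)

        # Message passing: missing sensors absorb information from neighbours
        # whose edge weight in A is strong enough (mean threshold is an assumption).
        A_hat = A * (A > A.mean()).float()
        deg = A_hat.sum(dim=-1, keepdim=True).clamp(min=1e-8)
        return self.act(self.msg((A_hat / deg) @ z))           # (B, N, d)
```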
3.5 Spatial-temporal learning
This component mainly learns the sensor embeddings specific to each instance. In the following description, we will omit the instance index $b$ when the context is clear. In learning such embeddings, we employ a recurrent graph convolution network based on the GRU. This approach seeks to extract temporal patterns of IMTS, similar to traditional RNNs, yet relies on a learned graph structure for calculation. To do so, we initialize the sensor embedding as
where $[\cdot \,\|\, \cdot]$ gives a column-wise concatenation of two vectors. Overall, the initialization gives an integration of the sensor embedding and the missing pattern.
Then we follow the GRU procedure to define the reset gate, update gate and candidate hidden state respectively in
where $\odot$ is the element-wise multiplication, and $\mathbf{W}_r$, $\mathbf{W}_u$ and $\mathbf{W}_c$ are the weights to be learned. In the above equations, the graph convolution calculates the product between the embedding updated after MP and the learned adjacency matrix $\mathbf{A}_t$. Thus the final sensor embedding can be written as
where $\mathbf{h}_{i,t} \in \mathbb{R}^{d}$ and $d$ is the embedding size here.
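A hedged sketch of the recurrent graph convolution unit is given below. It follows a common GCRN-style GRU in which the linear maps are preceded by one-hop graph convolutions over the learned adjacency matrix; the exact gating layout of DynIMTS may differ, and the class and parameter names are our own.

```python
import torch
import torch.nn as nn

class GraphGRUCell(nn.Module):
    """Sketch of a GRU cell whose inputs are mixed over the learned graph
    before the gate computations (GCRN-style recurrent graph convolution)."""
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.gates = nn.Linear(in_dim + hid_dim, 2 * hid_dim)   # reset + update gates
        self.cand = nn.Linear(in_dim + hid_dim, hid_dim)        # candidate state

    def gconv(self, A: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # One-hop, row-normalized neighbourhood mixing: (N, N) @ (B, N, *)
        deg = A.sum(dim=-1, keepdim=True).clamp(min=1e-8)
        return (A / deg) @ x

    def forward(self, z: torch.Tensor, h: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # z: (B, N, in_dim) embeddings from the embedding-learning step
        # h: (B, N, hid_dim) sensor embeddings from the previous time stamp
        zh = self.gconv(A, torch.cat([z, h], dim=-1))
        r, u = torch.sigmoid(self.gates(zh)).chunk(2, dim=-1)
        c = torch.tanh(self.cand(self.gconv(A, torch.cat([z, r * h], dim=-1))))
        return (1.0 - u) * h + u * c                             # new sensor embedding h_t
```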
3.6 Data with high sampling frequency
For IMTS data with a high sampling rate, the historical data may be more reliable in providing essential information for the current timestamp. In this way, we modify the previously proposed learning strategies. Specifically, we modify three places: calculating the integrated adjacency matrix in the Graph Learning component, the transformation step in the embedding learning component, and the initialization of sensor embedding in the Spatial-temporal Learning component. Below we introduce them one by one.
For the initialization of the sensor embedding in the Spatial-temporal Learning component, instead of directly using the learned embedding, we first use a simple linear model to impute the missing data by
Then, to initialize the sensor embedding, we replace Eq. (5) by
in which $\hat{x}_{i,t}$ denotes the value imputed by the linear model.
For the integrated adjacency matrix in the Graph Learning component, we add a min-max normalization outside the sigmoid function. Specifically, we replace Eq. (2) with
For the transformation step in the embedding learning component, the historical information is added into Eq. (3), and we get the transformation by
where the transformation weight matrix is now enlarged to accommodate the additional historical input.
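The high-frequency modification to the transformation step can be sketched as follows. The inputs fed to the simple linear imputer and the way the imputed value enters the concatenation are assumptions for illustration, not the exact published design.

```python
import torch
import torch.nn as nn

class HighFreqTransform(nn.Module):
    """Sketch of the high-frequency variant: a simple linear model first imputes
    the current value from the previous observation, and the per-sensor
    transformation then takes [value, mask, imputed value] as input."""
    def __init__(self, num_sensors: int, dim: int):
        super().__init__()
        self.imputer = nn.Linear(1, 1)                            # simple linear imputation model
        self.transform = nn.ModuleList([nn.Linear(3, dim) for _ in range(num_sensors)])

    def forward(self, x_t, m_t, x_prev):
        # x_t, m_t, x_prev: (B, N) current values, mask, and last observed values.
        x_hat = self.imputer(x_prev.unsqueeze(-1)).squeeze(-1)    # (B, N) linear imputation
        x_filled = m_t * x_t + (1.0 - m_t) * x_hat                # keep observed, impute missing
        inp = torch.stack([x_filled, m_t, x_hat], dim=-1)         # (B, N, 3) enlarged input
        z = torch.stack([self.transform[i](inp[:, i]) for i in range(x_t.shape[1])], dim=1)
        return z                                                   # (B, N, dim)
```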
3.7 Imputation and classification
Below we introduce how we do imputation and classification based on the learned embeddings.
Imputation For the imputation task, we provide a decoding component to learn the missing values from the sensor embeddings $\mathbf{h}_{i,t}$. The most straightforward way of decoding is a linear transformation, for example,
In real implementations, the decoded representations may further undergo RGCN and MP processes [29], and we omit the details here. After obtaining the estimated values $\hat{x}_{i,t}$, we can then optimize the MAE loss to learn the model.
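A minimal sketch of the linear decoding and the MAE objective is given below. Restricting the loss to held-out entries via an evaluation mask is a common practice that we assume here, and the richer RGCN/MP decoding of [29] is omitted.

```python
import torch
import torch.nn as nn

class LinearDecoder(nn.Module):
    """Sketch: per-sensor linear readout from the sensor embedding."""
    def __init__(self, hid_dim: int):
        super().__init__()
        self.readout = nn.Linear(hid_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, N, hid_dim) sensor embeddings -> (B, N) imputed values
        return self.readout(h).squeeze(-1)

def masked_mae(x_hat, x_true, eval_mask):
    # eval_mask is 1 on the (artificially) held-out entries used to score imputation.
    diff = (x_hat - x_true).abs() * eval_mask
    return diff.sum() / eval_mask.sum().clamp(min=1.0)
```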
Classification To do classification, we concatenate all dimensions’ embeddings and train a generic classifier such as an MLP for the final prediction. Note that in our proposed framework, there are two kinds of embeddings: one is output by the Embedding Learning component, i.e., $\mathbf{z}_{i,t}$, and the other is the sensor embedding output by the Spatio-temporal Learning component, i.e., $\mathbf{h}_{i,t}$. We use both of them in the classification task and add their losses together. Specifically, we use the following loss function in classification
i.e., we concatenate the embeddings from different nodes and use an MLP to do the prediction.
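The classification readout can be sketched as follows, assuming the embeddings from the final time stamp are used. The hidden width and the use of two separate MLP heads (one per embedding type, with their cross-entropy losses summed) are illustrative choices rather than the released configuration.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch: concatenate all sensors' embeddings and classify with an MLP."""
    def __init__(self, num_sensors: int, dim: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_sensors * dim, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (B, N, dim) -> logits (B, num_classes)
        return self.mlp(emb.flatten(start_dim=1))

def classification_loss(head_z, head_h, z, h, y):
    # Sum of the losses computed from the two embedding types (z and h).
    ce = nn.CrossEntropyLoss()
    return ce(head_z(z), y) + ce(head_h(h), y)
```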
3.8 Time complexity
We analyze the time complexity component by component. For the spatial-temporal encoder, within one time stamp, the embedding learning has a complexity that grows quadratically with the dimensionality $N$ and linearly with the number of learnable parameters in the neural network models, including both the MLP and the GCN. The quadratic increase in complexity with respect to the dimensionality is essential in our proposed method because it involves graph integration among the individual series. The spatio-temporal learning follows a similar time complexity, with its number of parameters defined analogously. The graph learning component likewise has a time complexity that depends linearly on the number of parameters and quadratically on $N$. Adding these components together, we have an overall complexity of $O(N^2 P)$ per time stamp, where $P$ is the total number of learnable parameters in the proposed model. The decoding phase doubles the overall complexity, and we omit the constant in the Big-O notation.
4 Experiments and results
In this section, we conduct an empirical analysis of our proposed method, concentrating on imputation and classification tasks. We compare its performance with that of baseline methods. Additionally, we perform an ablation study. To illustrate the evolution of the learned graph, we provide visualizations. Our experiments were carried out using the PyTorch framework on NVIDIA RTX A5000 GPUs within a CUDA environment. The code for replicating the experiments is available online at github.com/Evan-Kun/DynIMTS.
4.1 Imputation
Datasets We use the following datasets:
● Air quality index (AQI) [29]: This time-series dataset, as detailed in [48], comprises hourly PM2.5 level recordings from 437 monitoring stations across 43 cities in China, spanning from May 2014 to April 2015. The dataset is irregular in nature and has been extensively utilized to assess the effects of imputation, as reported in [29,36,49]. For consistent evaluation, we follow the methodology described in [29] to simulate irregular data patterns and construct a geometric distance-based graph structure.
● AQI-36 [29]: This dataset is a reduced version of AQI that incorporates the measurements of only 36 monitoring stations. It has also been used extensively [29,36,49]. As with AQI, we follow [29] to simulate the irregular data patterns and construct the geometric distance-based graph.
● Exchange-rate [50]: Unlike the inherently irregular AQI and AQI-36 datasets, the Exchange-rate dataset is regular, containing daily currency exchange rates for one US dollar to 8 different currencies from 1990 to 2016. The graph structure for this dataset is developed by calculating the Pearson correlation between pairs of exchange-rate vectors. To introduce patterns of irregularity, data points are randomly omitted following a uniform distribution over the observation interval. Given the originally complete nature of this dataset, it serves as an ideal basis to evaluate how our proposed algorithm performs under varying patterns of irregularity. Specifically, we vary the missing data ratio from 20% to 80% to demonstrate the effectiveness of our approach under different conditions.
Baselines For the task of imputing missing values in IMTS, we compare our model against the following baselines:
● BRITS [36]: This method utilizes a bidirectional recurrent neural network tailored for imputing missing values in multivariate time series. Since future information is unavailable at the time of imputation, the model is modified to function unidirectionally. Therefore, instead of comparing against the original bidirectional BRITS, we benchmark against this unidirectional variant, which we refer to as RITS.
● GRIN [29]: This is a recurrent graph convolution neural network devised to impute missing values within multivariate time series. Similarly, only one direction is used to process data. Because GRIN requires a pre-defined graph structure, we use two versions of it: GRIN (all-ones) and GRIN (physical). GRIN (all-ones) uses a fully connected graph with an all-ones adjacency matrix, while GRIN (physical) uses the physical graph structure, i.e., the graph of geometric distances in AQI and AQI-36, and the Pearson correlation graph in Exchange-rate. Note that GRIN (physical) employs the ground-truth graph information in the AQI datasets, so it is expected to perform better than our proposed DynIMTS, which has no access to the ground-truth graph.
Settings For training the model, the dataset is partitioned into three subsets: 70% for training, 20% for validation, and 10% for testing. Each experiment is run 5 times, and we report the mean and standard deviation (std) of the results. For each neural network, we tune the learning rate, batch size, hidden layer size, and sensor embedding dimension (for GRIN and DynIMTS) over grids of candidate values. The number of epochs is capped at 200, with an early stopping mechanism that terminates the process if no improvement is observed on the validation set across 40 successive epochs.
Metrics Three metrics are used to evaluate the imputation effects: MAE (Mean Absolute Error), MSE (Mean Squared Error) and MRE (Mean Relative Error). Assuming the ground truth at time $t$, dimension $i$ is $x_{i,t}$ and the imputed value is $\hat{x}_{i,t}$, MAE calculates the average of $|x_{i,t} - \hat{x}_{i,t}|$, MSE calculates the average of $(x_{i,t} - \hat{x}_{i,t})^2$, and MRE calculates the average of $|x_{i,t} - \hat{x}_{i,t}| / |x_{i,t}|$.
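For reference, a small sketch of how these three metrics could be computed over the held-out (masked) entries is shown below. Reading MRE as the average of element-wise relative errors is our interpretation of the text; some works instead use the ratio of summed errors to summed ground-truth magnitudes.

```python
import torch

def imputation_metrics(x_hat, x_true, eval_mask, eps=1e-8):
    # eval_mask is 1 on the held-out entries used to score imputation.
    err = (x_hat - x_true) * eval_mask
    n = eval_mask.sum().clamp(min=1.0)
    mae = err.abs().sum() / n
    mse = (err ** 2).sum() / n
    mre = (err.abs() / (x_true.abs() + eps)).sum() / n
    return mae.item(), mse.item(), mre.item()
```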
Results Tab.1 demonstrates the performance of the proposed DynIMTS algorithm on the AQI-36 and AQI datasets. Among the methods without access to physical graph information, our DynIMTS consistently outperforms the others across all three metrics on both datasets. Even when compared to GRIN (physical), which benefits from full access to the geometric distance graph, our DynIMTS demonstrates comparable or superior results in most cases. The larger graph size required by the AQI dataset, which presents more significant learning challenges, may account for our method’s superior performance on AQI-36 as opposed to AQI. In contrast, the relative underperformance on AQI may be attributed to the lack of the essential graph information that GRIN (physical) utilizes.
Tab.1 Comparative performance of imputation models on AQI and AQI-36 datasets (mean ± std)
Model | AQI-36 MAE | AQI-36 MSE | AQI-36 MRE | AQI MAE | AQI MSE | AQI MRE
GRIN (physical) | 15.8 ± 0.6 | 792.0 ± 93.9 | 0.5 ± 0.0 | 17.4 ± 0.3 | 935.3 ± 34.3 | 0.5 ± 0.0
RITS | 16.8 ± 1.7 | 872.6 ± 184.5 | 0.7 ± 0.1 | 22.8 ± 0.7 | 1437.7 ± 99.3 | 0.6 ± 0.0
GRIN (all-ones) | 15.7 ± 0.5 | 710.7 ± 45.7 | 0.6 ± 0.0 | 19.8 ± 0.3 | 1089.0 ± 11.0 | 0.6 ± 0.0
DynIMTS | 14.9 ± 1.1 | 673.0 ± 149.7 | 0.5 ± 0.0 | 18.3 ± 0.3 | 997.8 ± 70.4 | 0.5 ± 0.0
Fig.3 illustrates the performance comparison of our DynIMTS against baseline methods under missing rates of 20% and 80% on the Exchange-rate dataset. The results lead us to several conclusions: (i) an elevated missing rate negatively impacts all methods; (ii) our DynIMTS outperforms the baseline models in terms of MAE and MSE; (iii) however, it lags behind the baselines on MRE by a small margin. The combination of a lower MAE and a higher MRE suggests that while our method excels at imputing larger values, it does so at the cost of precision with smaller values. It is worth noting that RITS performs less effectively than both the baselines and our DynIMTS, likely due to its disregard for graph structure. To clearly highlight the comparative advantages of DynIMTS over the GRIN methods, RITS is excluded from this visual illustration. We have also added Fig.4 to visualize the original time series and the predicted time series on the Exchange-rate dataset. The plots across eight dimensions show that the predicted values closely align with the ground-truth values, capturing the overall trends and fluctuations accurately. While minor deviations are present in some dimensions, the general pattern of the data is well represented.
Fig.3 Evaluation of imputation methods at 20% and 80% missing data ratios for the Exchange-rate dataset. (a) MAE by missing ratio and method; (b) MSE by missing ratio and method; (c) MRE by missing ratio and method
Fig.4 Visualization of the imputation results produced by our model with a 20% missing data ratio on the Exchange-rate dataset across eight dimensions. The black lines represent the ground truth values, while the orange lines represent the predicted values. (a) Dimension 1; (b) Dimension 2; (c) Dimension 3; (d) Dimension 4; (e) Dimension 5; (f) Dimension 6; (g) Dimension 7; (h) Dimension 8
4.2 Classification
Datasets For the classification task, we use the following datasets.
● P12 [51]: This dataset comprises binary classification data collected from 36 sensors monitoring 11,988 ICU patients over a 48-hour period. The data is inherently irregular, with the classification task focusing on determining whether patients’ ICU stays exceed three days. It is characterized as a low-frequency dataset due to the considerable time gaps between some measurements. The static features in this dataset were excluded from the experiments.
● P19 [52]: For this binary classification dataset, we preprocess it following the methodology outlined in [31]. The processed dataset is derived from data collected by 34 sensors monitoring 38,803 patients. The labels identify whether the patients develop sepsis within the subsequent 6 hours. Similar to P12, static features are excluded. Additionally, this dataset is characterized by its low-frequency data collection.
● PAM [53]: This multi-class dataset comprises eight classes, each representing a different daily living activity. Following the preprocessing methods outlined in [31], the resultant dataset includes 5,333 samples. Each sample is collected from 17 sensors over 600 observations of daily activities, with a time gap of 0.01 seconds between consecutive observations. The original dataset is regular; however, observations can be uniformly and randomly removed to simulate irregular patterns. This dataset is counted as high-frequency data.
● AWR [54,55]: The dataset captures the movements of the tongue and lips during speech, specifically focusing on the articulation of 25 words. Each word serves as a label, and the data is sampled at a rate of 0.005 seconds. The dataset comprises 575 instances, each with a sequence length of 144 across 9 dimensions. It is structured as a regular dataset but can be transformed into an irregular format through uniform and random sampling. This dataset is counted as high-frequency data.
Since none of these datasets contains geographical information, we simply use an all-ones adjacency matrix for baseline methods requiring a graph input.
Baselines Our model’s performance is evaluated against the imputation methods RITS and an all-ones variant of GRIN. Subsequently, two fully connected layers are utilized for classification purposes. Additionally, we compare our results with the baseline method Raindrop [31], which is designed to learn representations for downstream tasks. Notably, the graph used in Raindrop is based on an all-ones adjacency matrix.
Settings The outputs of all methods are processed through two fully connected layers designed specifically for the classification task. All hyperparameters are tuned according to the procedures outlined in Section 4.1, as is the train/validation/test split.
Metrics For classification tasks, we assess performance using AUROC [56], AUPRC [57], and Accuracy. Among them, AUROC and AUPRC are particularly well-suited for evaluating the imbalanced classification tasks P12 and P19. Accuracy is used to evaluate the multi-class balanced datasets PAM and AWR.
Results Tab.2 presents an empirical comparison of various models on the P12 and P19 datasets. It is evident that on the P12 dataset, DynIMTS consistently outperformed all baseline models across both evaluated metrics. In contrast, on the P19 dataset, while DynIMTS generally surpassed the other baselines, its performance was comparable to that of GRIN (all-ones). This variance in performance could be attributed to the fact that the P19 dataset, collected at a lower frequency, results in shorter time-series data compared to the P12 dataset (60 vs. 215). This shorter time series leaves less data available to accurately learn and model the interconnections between sensors. Consequently, the proposed method, which leverages these interconnections in the form of a graph, has limited effectiveness, leading to only marginal improvement due to insufficient and potentially less reliable data from the shorter observation period.
Tab.2 Evaluation on P12 and P19 datasets using AUROC% and AUPRC% (mean ± std)
Model | P12 AUROC | P12 AUPRC | P19 AUROC | P19 AUPRC
RITS | 71.2 ± 2.0 | 16.4 ± 2.7 | 86.7 ± 0.7 | 47.2 ± 3.1
Raindrop | 60.9 ± 5.3 | 14.7 ± 2.0 | 74.1 ± 0.8 | 33.6 ± 1.7
GRIN (all-ones) | 70.8 ± 2.0 | 12.5 ± 2.8 | 87.1 ± 1.1 | 48.4 ± 2.0
DynIMTS | 75.1 ± 1.6 | 25.2 ± 3.0 | 87.1 ± 1.1 | 48.4 ± 1.9
Tab.3 provides an evaluation of accuracy across two datasets at varying missing rates. It is apparent that the proposed DynIMTS model exhibits less sensitivity to increasing missing rates compared to baseline methods. In most cases, DynIMTS consistently outperforms the baseline models, demonstrating its robustness and superior performance even under conditions of high irregularity. The observation that classification performance improves when masking more data (such as masking 50% of the data compared to 40%) seems counterintuitive. Upon examining the dataset, we found that as the missing ratio increases, the observed data points become sparser, resulting in smoother and flatter temporal patterns. As a result, the correlation of weakly related variables may be lost due to the higher missing ratio, leading the model to “ignore” these less significant patterns. We think the high missing ratio acts as a feature selector and regularizer, helping the model to generalize slightly better by ignoring some patterns and data points that might otherwise introduce noise or less relevant information. We will further explore this phenomenon in future work.
Tab.3 Accuracy% (mean ± std) on PAM and AWR datasets with various missing rates
Missing ratio/% | Models | PAM | AWR |
10 | RITS | 87.8 ± 0.6 | 83.8 ± 2.8 |
10 | Raindrop | 75.5 ± 1.7 | 52.5 ± 4.8 |
10 | GRIN (all-ones) | 97.5 ± 0.6 | 79.1 ± 4.0 |
10 | DynIMTS | 97.4 ± 0.5 | 84.4 ± 3.7 |
20 | RITS | 79.1 ± 0.7 | 73.2 ± 16.8 |
20 | Raindrop | 75.5 ± 1.7 | 47.5 ± 5.8 |
20 | GRIN (all-ones) | 97.6 ± 0.2 | 80.0 ± 3.6 |
20 | DynIMTS | 98.3 ± 0.2 | 83.5 ± 3.2 |
30 | RITS | 73.5 ± 0.8 | 69.2 ± 14.5 |
30 | Raindrop | 70.4 ± 0.5 | 47.7 ± 6.6 |
30 | GRIN (all-ones) | 95.9 ± 1.2 | 77.4 ± 3.2 |
30 | DynIMTS | 97.2 ± 0.7 | 77.2 ± 5.7 |
40 | RITS | 71.9 ± 0.9 | 56.0 ± 24.6 |
40 | Raindrop | 64.0 ± 1.4 | 50.6 ± 5.2 |
40 | GRIN (all-ones) | 93.5 ± 0.6 | 76.5 ± 5.9 |
40 | DynIMTS | 94.0 ± 0.5 | 78.6 ± 2.2 |
50 | RITS | 61.1 ± 1.9 | 59.0 ± 14.0 |
50 | Raindrop | 58.3 ± 2.0 | 34.6 ± 3.8 |
50 | GRIN (all-ones) | 93.5 ± 0.6 | 76.5 ± 4.3 |
50 | DynIMTS | 95.8 ± 0.5 | 78.8 ± 6.6 |
4.3 Ablation study
This section presents an ablation study of our proposed DynIMTS, with results on the P12 and P19 datasets evaluated using AUROC and AUPRC, as reported in Tab.4. The findings demonstrate that: (i) Omitting the graph component leads to a significant decline in performance across both datasets, particularly in the smaller P12 dataset, underscoring the critical role of a well-structured graph. (ii) The absence of instance attention results in a performance drop, though it is less pronounced than the impact of excluding the graph component, especially in the P12 dataset. This further emphasizes the importance of the graph structure in the model. (iii) Including graph normalization also deteriorates performance, likely due to the disproportionate influence of edge values on the imputation process. Overall, these results confirm the necessity of each component in the DynIMTS architecture for achieving optimal performance.
Tab.4 Ablation study for the impact of graph, instance attention and graph normalisation
Model | P12 AUROC | P12 AUPRC | P19 AUROC | P19 AUPRC
w/o graph | 57.9 ± 1.3 | 8.6 ± 0.7 | 81.1 ± 1.6 | 39.1 ± 2.9
w graph normalisation | 59.3 ± 5.1 | 9.7 ± 1.6 | 83.4 ± 2.1 | 42.1 ± 6.4
w/o instance attention | 69.3 ± 8.3 | 18.7 ± 9.0 | 85.2 ± 2.4 | 46.0 ± 5.7
DynIMTS | 75.1 ± 1.6 | 25.2 ± 3.0 | 87.8 ± 0.5 | 47.5 ± 3.5
4.4 Graph visualization
This section provides the visualization of graphs learned by DynIMTS in Fig.5. Specifically, we present the learned graph’s adjacency matrix at epochs 0, 10, and 20. We can see that at the beginning of training, the edge values exhibit a high degree of similarity. However, as training progresses, a noticeable variation in colour intensity among the edges becomes apparent, indicating a shift in the dependencies across different nodes. Furthermore, the lighter colours tend to become even lighter, while the darker colours intensify, until the graph reaches convergence.
Fig.5 Visualization of the adjacency matrices of epochs 0, 10 and 20 on the P19 dataset. The darker the colour is, the larger the value on the corresponding point. (a) Graph at epoch 0; (b) graph at epoch 10; (c) graph at epoch 20
Analysis of the derived sensor dependency graph reveals discernible relationships among various sensors. Sensors 0 through 7 represent vital signs (0: Heart Rate, HR; 1: Pulse oximetry, O2Sat; 2: Temperature, Temp; 3: Systolic BP, SBP; 4: Mean arterial pressure, MAP; 5: Diastolic BP, DBP; 6: Respiration rate, Resp; 7: End tidal carbon dioxide, EtCO2) whereas the remaining sensors capture laboratory values. Initially, all sensors exhibit correlations with sensors 0 (HR) and 1 (O2Sat). Over time, the relationships evolve such that 0 (HR) and 2 (Temp) predominantly correlate with other vital sign sensors, suggesting that heart rate and body temperature maintain strong associations primarily with vital parameters. Conversely, sensor 7 (EtCO2), initially not correlated with laboratory values, begins to show significant relationships with all laboratory measurements as time progresses, indicating its potential utility in predicting certain laboratory outcomes such as sensor 11 (pH) and 27 (Troponin). Furthermore, sensor 10 (Fraction of Inspired Oxygen, FiO2) initially displays connections across the sensor array. However, as time advances, it becomes apparent that FiO2 maintains fewer associations, particularly excluding sensors like 14 (Aspartate transaminase, AST), 15 (Blood urea nitrogen, BUN), 18 (Chloride), and 19 (Creatinine), while still correlating with sensor 28 (Hematocrit, Hct) and other vital signs.
5 Conclusion
In this paper, we have proposed a novel dynamic GNN framework tailored to address the complexities of IMTS data, characterized by irregular sampling and missing values. Our approach leverages an instance-attention mechanism to dynamically learn and update graph structures based on real-time data input. Empirical evaluations on real-world datasets demonstrate that our method significantly outperforms existing state-of-the-art solutions. In the future, our aim is to provide a theoretical guarantee for the proposed method.