Multi-classifier information fusion for human activity recognition in healthcare facilities

Da HU , Mengjun WANG , Shuai LI

Front. Eng, 2025, Vol. 12, Issue 1: 99–116. DOI: 10.1007/s42524-024-4074-y
Information Management and Information Systems
RESEARCH ARTICLE


Abstract

In healthcare facilities, including hospitals, pathogen transmission can lead to infectious disease outbreaks, highlighting the need for effective disinfection protocols. Although disinfection robots offer a promising solution, their deployment is often hindered by their inability to accurately recognize human activities within these environments. Although numerous studies have addressed Human Activity Recognition (HAR), few have utilized scene graph features that capture the relationships between objects in a scene. To address this gap, our study proposes a novel hybrid multi-classifier information fusion method that combines scene graph analysis with visual feature extraction for enhanced HAR in healthcare settings. We first extract scene graphs, complete with node and edge attributes, from images and use a graph classification network with a graph attention mechanism for activity recognition. Concurrently, we employ Swin Transformer and convolutional neural network models to extract visual features from the same images. The outputs from these three models are then integrated using a hybrid information fusion approach based on Dempster-Shafer theory and a weighted majority vote. Our method is evaluated on a newly compiled hospital activity data set, consisting of 5,770 images across 25 activity categories. The results demonstrate an accuracy of 90.59%, a recall of 90.16%, and a precision of 90.31%, outperforming existing HAR methods and showing its potential for practical applications in healthcare environments.

Graphical abstract

Keywords

human activity classification / scene graph / graph neural network / multi-classifier fusion / healthcare facility

Cite this article

Da HU, Mengjun WANG, Shuai LI. Multi-classifier information fusion for human activity recognition in healthcare facilities. Front. Eng, 2025, 12(1): 99–116. DOI: 10.1007/s42524-024-4074-y


1 Introduction

Large gatherings, particularly in healthcare settings, are recognized as critical points for pathogen spread. The Centers for Disease Control and Prevention (CDC) reports numerous outbreaks of infectious diseases in hospitals, exerting significant strain on healthcare resources and resulting in a high number of fatalities (CDC, 2024). Annually, approximately 1.7 million people acquire infections during hospital treatment, leading to around 99,000 deaths (Haque et al., 2018). Moreover, hospital-acquired infections incur an estimated annual direct medical cost of about $30 billion and societal costs exceeding $10 billion due to premature mortality and reduced productivity. The recent global COVID-19 pandemic has further emphasized this issue, with over 449 million infections and 6 million deaths worldwide, and the numbers continue to rise due to more transmissible variants (Dong et al., 2020). In healthcare settings, maintaining proper sanitation and implementing effective disinfection practices are crucial in controlling the spread of infectious diseases (Assadian et al., 2021). This is where disinfection robots can play a pivotal role in effectively mitigating disease transmission.

Traditional cleaning and disinfection methods in healthcare facilities are manual, laborious, time-consuming, and prone to inconsistency due to human fatigue, potentially resulting in overlooked or insufficiently sanitized surfaces (Rutala and Weber, 2016). Robots can provide a solution to these challenges, operating continuously and consistently without endangering humans (Guettari et al., 2021). The COVID-19 pandemic has further emphasized the importance of disinfection robots in performing environmental cleaning tasks (Zemmar et al., 2020). In our prior research, we developed algorithms for object detection (Hu et al., 2023), contaminated area segmentation (Hu et al., 2020), and material classification (Hu and Li, 2022) specifically for disinfection robots in healthcare facilities. However, a significant limitation remains: current disinfection robots lack the ability to recognize human activities within healthcare environments. This limitation undermines their effectiveness and hinders their widespread implementation. Healthcare facilities include a wide range of human activities, each of which influences how a disinfection robot should operate. For example, if a doctor’s consultation is in progress, the robot should ideally postpone disinfection of that area to prevent disruptions. Therefore, the accurate recognition of human activities is a crucial prerequisite for the successful integration of disinfection robots into healthcare environments.

Human Activity Recognition (HAR) has been an area of study for over 20 years and has produced various methods for classifying activities from visual data such as videos or snapshots (Chen et al., 2021). Our study focuses on the classification of human activities from static images, aiming to enhance the operational efficiency of disinfection robots in healthcare environments. While video data provides a dynamic sequence of spatial and temporal cues, many actions can be discerned from single images or selected video frames, enabling prompt situational analysis (Guo and Lai, 2014). Additionally, the computational requirements for processing video data are significantly higher compared to still images (Rodríguez-Moreno et al., 2019), making image-based recognition more practical for disinfection robots. However, identifying human activities from a single frame presents distinct challenges, particularly in the presence of environmental noise or complex backdrops.

To address these challenges, our research proposes a novel approach that combines classifier techniques and leverages both scene structure and visual cues. Specifically, we have created a specialized image data set for healthcare environments, carefully annotated with precise activity labels. This data set serves as a valuable resource for testing activity classification models in healthcare settings, addressing the scarcity of specialized data available for HAR in the healthcare field. Additionally, our developed process transforms images into scene graphs, which highlight the contextual relationships within a scene using nodes and edges. This aids in the effective execution and evaluation of scene graph classifiers, as it enables the model to understand and utilize the spatial and relational context of activities. Furthermore, we integrate the strengths of Convolutional Neural Network (CNN), Visual Transformer (ViT), and Graph Neural Network (GNN) to extract a comprehensive range of features from images. Our fusion strategy combines Dempster-Shafer theory (DST) and weighted majority voting, resulting in a significant advancement in multi-classifier fusion for image classification.

The structure of this paper is as follows: Section 2 provides a literature review on the relevant work in the HAR field, while Section 3 presents the proposed model in detail. Section 4 discusses the experiments conducted and the results obtained, while Section 5 evaluates the proposed data set and model by comparing them with state-of-the-art methods. Finally, Section 6 concludes the study and discusses its potential applications.

2 Literature review

HAR plays a crucial role in studying human behavior and enhancing human-robot interfaces, and encompasses a wide range of activities from physical movements like jumping to daily tasks such as watching TV or making phone calls. There are two primary techniques in activity recognition: visual-based approaches and sensor-based approaches, each utilizing different types of data (Dang et al., 2020). Visual-based methodologies rely on image and video data captured by camera systems, such as RGB and RGB-D cameras. For instance, Uyguroğlu et al. (2024) investigated Alzheimer’s disease classification using the fusion of multiple 3D angular orientations through CNN. On the other hand, sensor-based approaches employ various sensors like accelerometers, WiFi, and radar to detect activities. Studies conducted by Mudiyanselage et al. (2021) and Raza et al. (2023) have demonstrated the potential of sensors in activity recognition and related risk control. In the field of robotics, cameras are commonly used to enable robots to perceive and interact with their surroundings. Visual-based HAR can be categorized into three main types: traditional machine learning techniques, deep learning models, and the fusion of different data types for feature extraction.

2.1 Machine learning for HAR

For several years, machine learning techniques like Support Vector Machine (SVM) and Random Forest have been widely used in the field of identifying human actions from images. SVM aims to find the optimal hyperplane for classification, whereas Random Forest utilizes an ensemble of decision trees. These methods have shown success in less complex scenarios with a smaller number of samples. Earlier studies, such as Wang et al. (2006), employed unsupervised methods to categorize actions by comparing silhouettes in images. Building upon this work, Ikizler et al. (2008) refined the approach by evaluating human poses through shape histograms and reducing data before classification. Yao et al. (2011) introduced the “Stanford 40 Action” benchmark data set and proposed a method that simultaneously models action attributes and parts using Locality-constrained Linear Coding (LLC) on dense SIFT features and a linear SVM classifier. In a study by Yun et al. (2013), hand-centered image patches were the focal point in classifying activities using SVMs. However, SVM encounters challenges with large-scale data sets due to increasing computational complexity, while the effectiveness of Random Forest may diminish due to the complex nature of decision trees and resource-intensive computational demands. Although these approaches have been effective, they often struggle with larger data sets because of their computational requirements and reliance on manually designed features, which may not scale well (Rashidi Nasab and Elzarka, 2023).

2.2 Advances in deep learning for HAR

With the availability of better computational resources, deep learning has emerged as a prominent method for image-based activity recognition, reducing the need for manual feature engineering (Yao et al., 2022). Oquab et al. (2014) found that CNNs trained on general images could enhance HAR. Following this trend, Gkioxari et al. (2015a) and Zhao et al. (2017b) advanced the field by incorporating contextual information and body part details. However, these methods often require additional manual annotations, which might limit their applicability in real-world scenarios.

To overcome limitations in HAR that arise from manually annotating bounding boxes around individuals, researchers have proposed innovative solutions. Khan et al. (2015) proposed a detection technique that utilizes features extracted by CNNs. Another approach, presented by Siyal et al. (2020), employed CNNs for feature extraction alongside SVMs for classification. Bera et al. (2021) took this further by integrating a CNN that specifically focuses on key areas of an image. They generated keypoints using a SIFT algorithm, grouped them using a Gaussian Mixture Model (GMM) to create salient regions, and then processed these regions with an attentional module for activity classification. Bas and Ikizler-Cinbis (2022) proposed an attentional deep multiple-instance learning network that identifies action-related regions and generates a pixel-level action map without irrelevant pixels. While these methodologies yield promising results, they primarily focus on visual features and neglect the valuable object and relationship information present in images. To address this, our study utilizes scene graphs extracted from images to capture complex relationships and object-specific information. We process these scene graphs using a GNN, a technique that has shown effectiveness in hyperspectral image classification (Hong et al., 2021, Ding et al., 2022, 2023). By combining visual aspects with the complex interplay of objects and relationships, our approach aims for a more holistic and accurate HAR.

2.3 Integrating multiple data types for HAR

Incorporating data from multiple sources has proven beneficial in interpreting complex human activities, as it provides rich semantic knowledge (Dang et al., 2020). Khaire et al. (2018) merged visual, depth, and structural data to classify actions, demonstrating improved resilience to suboptimal lighting conditions compared to using visual data alone. Guo et al. (2018) developed a technique that combines signals from WiFi and cameras to detect activities, proving particularly useful in challenging indoor environments with low light and obstructions. Singh et al. (2020) proposed a multi-view CNN approach that utilizes motion and depth information to enhance HAR, achieving high accuracy on demanding data sets such as MSR Daily Activity, UTD MHAD, and CAD 60. Although these methods have made significant advancements, they heavily rely on data from supplementary sensors. There is still a gap in leveraging the relationships between objects within an image for activity classification. A method that considers these relationships would provide a more comprehensive understanding of visual content and ultimately enhance the overall performance of HAR.

3 Methodology

This research presents a novel approach to decision fusion that combines visual cues and scene graph data to identify human actions within healthcare settings. The methodology, depicted in Fig.1, consists of four main steps. First, the robot’s camera acquires images. Second, three classifiers are utilized: a CNN using ConvNext (Liu et al., 2022), a ViT employing Swin Transformer (Liu et al., 2021), and a GNN based on an unbiased scene graph generation approach (Tang et al., 2020). These classifiers aim to recognize features from the image, focusing on spatial patterns, global context, and object relationships, respectively. Third, a hybrid decision fusion stage combines the probability distributions (PDs) of the activity categories output by these three classifiers using the DST and weighted majority vote, taking into account the level of conflict among the classifiers’ outputs. Finally, the fused output is used to recognize and classify the human activity depicted in the images, providing valuable insights for monitoring and managing healthcare facilities.

3.1 Scene graph-based activity classification

Scene graph features offer a more comprehensive contextual understanding by capturing the relationships between objects. This is crucial for distinguishing relevant activities from background clutter and representing the complex interactions commonly found in healthcare environments. To achieve this, we leverage the unbiased scene graph generation method developed by Tang et al. (2020). This method employs counterfactual causality to identify and mitigate unwanted biases. It integrates with existing scene graph generation models such as MOTIFS (Zellers et al., 2018), Iterative Message Passing (IMP) (Xu et al., 2017), and VCTree (Tang et al., 2019). In our approach, we specifically utilize the MOTIFS model, which has been trained on the extensive Visual Genome data set (Krishna et al., 2017), to generate scene graphs. By combining MOTIFS with causal inference methods, we enhance our ability to accurately detect and interpret scene graphs, allowing us to understand complex visual content without pre-existing biases.

For the construction of scene graphs, we establish connections and define nodes using pairs with confidence levels above 0.5. This approach ensures a balance between detail and computational load. The selection of this threshold is based on common practices in the field and preliminary experiments, which have shown that a threshold of 0.5 effectively filters out less reliable relationships while preserving valuable information.

However, it is important to note that our experiments and training were conducted on a workstation, not on the actual disinfection robots. These robots have significantly lower computational power, and therefore the 0.5 threshold may need to be reconfigured when transferring the system to accommodate their limited capabilities.

To improve the representation of our graph elements and conserve time and computational effort, we utilize the Skip-gram model from Word2Vec. This model is trained on the extensive Google News data set and creates 300-dimensional vectors that capture nuanced language patterns for a wide range of words and phrases. The embeddings derived from Word2Vec enrich our node and edge features, providing a strong foundation for classifying the scene graphs with image categories as labels.
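To make this step concrete, the sketch below (not the authors' code) keeps only objects and relationships above the 0.5 confidence threshold and turns their labels into 300-dimensional Word2Vec features. It assumes the gensim library for the pretrained Google News model; the `objects` and `relations` structures standing in for the scene graph generator's output are hypothetical.

```python
# Sketch: building node/edge attributes for the graph classifier.
# Assumes gensim; `objects` = [(label, confidence), ...] and
# `relations` = [(subj_idx, predicate, obj_idx, confidence), ...] are
# hypothetical stand-ins for the scene graph generator's output.
import numpy as np
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # 300-d Skip-gram vectors trained on Google News

def embed(label):
    """Average the Word2Vec vectors of the words in a (possibly multi-word) label."""
    words = [w for w in label.lower().split() if w in wv]
    return np.mean([wv[w] for w in words], axis=0) if words else np.zeros(300, dtype=np.float32)

def build_graph(objects, relations, conf_thresh=0.5):
    keep = [i for i, (_, conf) in enumerate(objects) if conf > conf_thresh]
    remap = {old: new for new, old in enumerate(keep)}
    node_feats = np.stack([embed(objects[i][0]) for i in keep])      # N x 300 node attributes
    edge_index, edge_feats = [], []
    for subj, pred, obj, conf in relations:
        if conf > conf_thresh and subj in remap and obj in remap:
            edge_index.append((remap[subj], remap[obj]))
            edge_feats.append(embed(pred))                            # 300-d edge attribute
    return node_feats, np.array(edge_index).T, np.stack(edge_feats)
```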

Fig.2 presents sample scene graphs generated from our image data. These examples are limited to the top 10 most confident objects and relationships for clarity. However, our method allows for more comprehensive extraction when necessary. This graph-based approach captures essential interaction data within images, providing valuable insights for effective HAR, such as identifying a group of medical professionals gathered for a conference.

The purpose of our study is to develop a graph classification network that can accurately classify human activity based on the generated scene graphs. Fig.3 provides an overview of the flowchart for the GNN, which includes feature encoding and feature classification stages. We utilize a graph attention mechanism in our network to learn effectively from graph-structured data by focusing on the neighboring nodes of each node. The Graph Attention Network (GAT), proposed by Veličković et al. (2017), has emerged as a popular architecture for representation learning on graphs. The GAT computes node similarity within a graph to uncover the hidden features of the graph nodes. A recent innovation is the dynamic graph attention variant proposed by Brody et al. (2021), a simple modification that enhances the model’s ability to fit the training data. In our study, we employ this dynamic graph attention mechanism to encode the graph features. The GAT operator attends to a node’s neighbors to learn a representation that captures the node’s local graph structure. This attention mechanism allows the network to focus on the most pertinent components of the graph, thereby facilitating more effective feature learning and classification. The details of the GAT operator are as follows:

The GAT takes node features $h = \{h_1, h_2, \ldots, h_N\}$, $h_i \in \mathbb{R}^F$, and edge features $e_{i,j} \in \mathbb{R}^D$ as inputs, where $N$ is the number of nodes, $F$ is the node feature dimension, and $D$ is the edge feature dimension. $e_{i,j}$ is considered only when node $i$ and node $j$ are connected. The GAT outputs an enhanced node feature set $h' = \{h'_1, h'_2, \ldots, h'_N\}$, $h'_i \in \mathbb{R}^{F'}$. The GAT process begins with a shared linear transformation applied to each node. In Eq. (1), the attention coefficient $\mu_{ij}$, signifying the influence of node $j$'s features on node $i$, is computed by concatenating the two node features with the edge feature and weighting the result by $W \in \mathbb{R}^{F' \times (2F + D)}$:

$$\mu_{ij} = W \left( h_i \,\|\, h_j \,\|\, e_{i,j} \right). \tag{1}$$

These coefficients are calculated only for the neighboring nodes $k \in \mathcal{N}_i$ to preserve the graph's structure. The coefficients are normalized with a softmax over the neighborhood after a LeakyReLU activation (negative slope 0.2), as given in Eq. (2), where $a \in \mathbb{R}^{F'}$ is a learnable weight vector, ensuring a balanced distribution of attention:

$$\alpha_{ij} = \frac{\exp\!\left( a^{\mathrm{T}} \,\mathrm{LeakyReLU}(\mu_{ij}) \right)}{\sum_{k \in \mathcal{N}_i} \exp\!\left( a^{\mathrm{T}} \,\mathrm{LeakyReLU}(\mu_{ik}) \right)}. \tag{2}$$

By employing a multi-head attention approach, the model averages $K$ distinct attention outputs to enhance performance and stabilize learning. Equation (3) defines this process, where $K$ is the number of attention heads and $\sigma$ denotes the ReLU function:

$$h'_i = \sigma\!\left( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} h_j \right). \tag{3}$$

After the GAT operations, global average pooling is applied to aggregate the node features, and a dense layer performs the classification. With only 0.11 million parameters, the GNN is efficient in both training and inference.
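For concreteness, a minimal sketch of such a graph classifier is given below. It assumes PyTorch Geometric; the hidden width, number of attention heads, and the use of a single dynamic-attention (GATv2) layer are illustrative choices that land roughly near the parameter count mentioned above.

```python
# Sketch of the scene-graph classifier: one dynamic graph attention layer
# (Brody et al., 2021) with averaged heads, global average pooling, and a
# dense classification head. Sizes are illustrative, not the authors' exact ones.
import torch
import torch.nn as nn
from torch_geometric.nn import GATv2Conv, global_mean_pool

class SceneGraphClassifier(nn.Module):
    def __init__(self, node_dim=300, edge_dim=300, hidden=32, heads=4, num_classes=25):
        super().__init__()
        self.gat = GATv2Conv(node_dim, hidden, heads=heads, concat=False,   # heads averaged, cf. Eq. (3)
                             edge_dim=edge_dim, negative_slope=0.2)         # LeakyReLU slope 0.2, cf. Eq. (2)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x, edge_index, edge_attr, batch):
        h = torch.relu(self.gat(x, edge_index, edge_attr=edge_attr))        # enhanced node features h'
        g = global_mean_pool(h, batch)                                       # global average pooling over nodes
        return self.head(g)                                                  # per-graph activity logits
```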

3.2 Transformer-based activity classification

Our research utilizes the Swin Transformer model, which is widely recognized for its effectiveness in visual feature learning for activity recognition tasks (Liu et al., 2021). The Swin_L architecture, an expanded version of the Swin Transformer, is summarized in Fig.4. It follows a patch partitioning mechanism to divide the input image into distinct patches. With a 4 × 4 patch size, the raw pixels of each patch form a 48-dimensional vector. These vectors are then refined through a series of transformer stages. The first stage consists of a linear embedding layer that converts the 48-dimensional vectors into 192-dimensional vectors, followed by two Swin Transformer blocks. Stages 2 through 4 each begin with a patch merging step, which combines features from neighboring patches and reduces spatial dimensions before entering additional transformer blocks; these stages contain 2, 18, and 2 Swin Transformer blocks, respectively. Each Swin Transformer block applies shifted-window Multi-head Self-Attention (MSA) followed by a two-layer Multilayer Perceptron (MLP) with GELU activation, with layer normalization applied before each module. This hierarchical structure enables comprehensive activity classification through tiered feature synthesis.
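A minimal loading sketch is shown below. It assumes the timm library (the paper specifies only PyTorch and ImageNet pretraining), and the 25-way head matches the hospital activity data set.

```python
# Sketch: pretrained Swin_L backbone with a 25-class head (assumes timm).
import timm

swin_l = timm.create_model(
    "swin_large_patch4_window7_224",  # Swin_L: 4x4 patches, 192-d embedding, (2, 2, 18, 2) blocks
    pretrained=True,                  # ImageNet-pretrained weights
    num_classes=25,                   # replace the ImageNet head with a 25-way activity classifier
)
```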

3.3 CNN-based activity classification

For activity classification using a CNN, we chose the State-of-the-Art (SotA) ConvNeXt model. Built on the well-established ResNet framework, ConvNeXt incorporates several refinements to improve performance. Fig.5 presents the ConvNeXt_L configuration, which includes an initial convolution, successive stages of ConvNeXt units, and a classification layer. The process begins with a 4 × 4 convolution that reduces the feature map size, followed by four stages of ConvNeXt units containing 3, 3, 27, and 3 units, respectively. As the layers deepen, the feature map resolution decreases while the channel count increases. After the final stage, an adaptive pooling layer condenses the feature map to 1 × 1, leading to a dense layer that performs the final activity classification. For a detailed explanation of the network, please refer to Liu et al. (2022).
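The sketch below shows one way to instantiate this backbone, assuming torchvision's pretrained ConvNeXt-Large; swapping the head for 25 activity classes is our illustrative choice.

```python
# Sketch: pretrained ConvNeXt_L with a 25-class head (assumes torchvision >= 0.13).
import torch.nn as nn
from torchvision.models import convnext_large, ConvNeXt_Large_Weights

convnext_l = convnext_large(weights=ConvNeXt_Large_Weights.IMAGENET1K_V1)
# Replace the ImageNet classifier with a 25-way hospital-activity head.
convnext_l.classifier[2] = nn.Linear(convnext_l.classifier[2].in_features, 25)
```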

3.4 Decision fusion

This section presents our decision fusion strategy, which effectively combines the outputs of the GNN, Swin_L, and ConvNeXt_L models to enhance robustness, accuracy, and adaptability in varying lighting conditions. The integration of multiple classifiers through multi-classifier fusion leverages their respective strengths. Specifically, this approach capitalizes on the CNN’s proficiency in identifying spatial patterns, the ViT’s ability to capture global context, and the GNN’s effectiveness in understanding object relationships and interactions. We propose a novel fusion method that combines DST with a weighted majority voting system to establish a coherent decision framework. This hybrid methodology incorporates a weighted majority vote mechanism to enhance the decision fusion process by capitalizing on the advantages of both DST and majority voting.

To tailor our decision fusion framework, we adapt the DST (Rogova, 2008), taking into account the classification strengths of the individual classifiers to gauge the reliability of their decisions. Decision templates based on typical classification patterns observed during training serve as reference points for evaluating new data. Equation (4) defines the decision template $DT_j$ for class $w_j$, where $Z$ represents the data set, $N_j$ denotes the number of elements of $Z$ belonging to $w_j$, $DP$ denotes the classifier's probabilistic decision output, and $c$ is the total number of classes:

$$DT_j = \frac{1}{N_j} \sum_{z_k \in w_j,\ z_k \in Z} DP(z_k), \quad j = 1, 2, \ldots, c. \tag{4}$$

The decision profile $DP(x)$ and the decision templates $DT_j$ are $L \times c$ matrices, where $L$ is the number of classifiers. Let $DT_j^i$ denote the $i$th row of the decision template $DT_j$, and let $D_i(x) = [d_{i,1}(x), \ldots, d_{i,c}(x)]$ denote the $i$th row of the decision profile $DP(x)$, i.e., the output of classifier $D_i$. The proximity between $DT_j^i$ and the output of classifier $D_i$ for input $x$ is given in Eq. (5), where $\|\cdot\|$ denotes the L1 norm and $\phi_{j,i}(x)$ is the proximity for class $j = 1, 2, \ldots, c$ and classifier $i = 1, 2, \ldots, L$:

$$\phi_{j,i}(x) = \frac{\left(1 + \left\| DT_j^i - D_i(x) \right\|\right)^{-1}}{\sum_{k=1}^{c}\left(1 + \left\| DT_k^i - D_i(x) \right\|\right)^{-1}}. \tag{5}$$

Equation (6) calculates the belief degree:

$$b_j\!\left(D_i(x)\right) = \frac{\phi_{j,i}(x)\prod_{k \neq j}\left(1 - \phi_{k,i}(x)\right)}{1 - \phi_{j,i}(x)\left[1 - \prod_{k \neq j}\left(1 - \phi_{k,i}(x)\right)\right]}. \tag{6}$$

Equation (7) calculates the final degrees of support, where $\beta$ is a normalizing constant:

$$\mu_j(x) = \beta \prod_{i=1}^{L} b_j\!\left(D_i(x)\right). \tag{7}$$

For high-conflict scenarios within our decision fusion process, we shift to a weighted majority vote to bolster the dependability of our fusion method. The conflict value $K$ is defined in Eq. (8), where $D_{i,k_i}(x)$ denotes the entry in the $i$th row and $k_i$th column of the decision profile $DP(x)$:

$$K = \sum_{k_1 \cap k_2 \cap \cdots \cap k_L = \varnothing} D_{1,k_1}(x)\, D_{2,k_2}(x) \cdots D_{L,k_L}(x). \tag{8}$$

Under the assumption that a conflict value $K$ greater than 0.6 indicates significant conflict between the classifiers (Daniel and Lauffenburger, 2011), the weighted majority vote is given by Eq. (9), where $M$ is the $1 \times L$ weight matrix computed from each classifier's accuracy on the validation set and $DP_j(x)$ is the $j$th column of the decision profile $DP(x)$. The threshold of 0.6 was determined empirically through extensive experimentation, balancing sensitivity and accuracy. Classifiers with higher validation accuracy receive proportionally greater influence in the final fusion decision.

$$\mu_j(x) = M \times DP_j(x). \tag{9}$$
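To make the hybrid rule concrete, the sketch below implements Eqs. (5)–(9) as reconstructed here (it is not the authors' code). Decision-template proximities and belief degrees feed the Dempster-Shafer support, and an accuracy-weighted majority vote takes over when the conflict exceeds 0.6; the conflict computation follows the standard Dempster-Shafer reading of Eq. (8), so its exact normalization is an assumption.

```python
# Sketch of the hybrid DST / weighted-majority fusion (Eqs. (5)-(9), as read here).
import numpy as np
from itertools import product

def ds_support(DTs, DP):
    """DTs: (c, L, c) stacked decision templates; DP: (L, c) decision profile."""
    L, c = DP.shape
    mu = np.ones(c)
    for i in range(L):
        # Eq. (5): proximity of classifier i's output to the i-th row of each template (L1 norm)
        prox = 1.0 / (1.0 + np.linalg.norm(DTs[:, i, :] - DP[i], ord=1, axis=1))
        phi = prox / prox.sum()
        for j in range(c):
            others = np.prod(1.0 - np.delete(phi, j))
            b = phi[j] * others / (1.0 - phi[j] * (1.0 - others))   # Eq. (6): belief degree
            mu[j] *= b                                              # Eq. (7): product over classifiers
    return mu / mu.sum()                                            # beta normalization

def conflict(DP):
    """Eq. (8) (standard Dempster reading): mass on label combinations that disagree."""
    L, c = DP.shape
    return sum(np.prod([DP[i, k] for i, k in enumerate(combo)])
               for combo in product(range(c), repeat=L) if len(set(combo)) > 1)

def fuse(DTs, DP, weights, threshold=0.6):
    """Hybrid rule: DST when conflict is low, accuracy-weighted majority vote otherwise (Eq. (9))."""
    if conflict(DP) > threshold:
        return weights @ DP        # weights: (L,) validation accuracies -> one score per class
    return ds_support(DTs, DP)
```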

In a demonstrative scenario using synthetic data, we present three decision templates (given in Eq. (10)) to clarify the fusion approach.

$$DT_1 = \begin{bmatrix} 0.6 & 0.3 & 0.1 \\ 0.7 & 0.2 & 0.1 \\ 0.8 & 0.1 & 0.1 \end{bmatrix}, \quad DT_2 = \begin{bmatrix} 0.2 & 0.6 & 0.2 \\ 0.1 & 0.7 & 0.2 \\ 0.2 & 0.7 & 0.1 \end{bmatrix}, \quad DT_3 = \begin{bmatrix} 0.1 & 0.3 & 0.6 \\ 0.2 & 0.2 & 0.6 \\ 0.2 & 0.1 & 0.7 \end{bmatrix}. \tag{10}$$

The decision profile of the three classifiers for input $x$ is given in Eq. (11):

$$DP(x) = \begin{bmatrix} 0.4 & 0.4 & 0.2 \\ 0.3 & 0.6 & 0.1 \\ 0.4 & 0.1 & 0.5 \end{bmatrix}. \tag{11}$$

The conflict value $K$ calculated with Eq. (8) is 0.55, which is below the 0.6 threshold, so DST is used. Using Eq. (5), the proximities for each decision template are calculated and shown in Tab.1.

Tab.2 shows the belief degrees and final degrees of support calculated using Eqs. (6) and (7). In this example, the fusion method assigns slightly higher support to class $w_2$.

4 Experiment and results

4.1 Data set description

Our data set creation involved two stages: downloading images and cleaning the data. We identified 25 common activities in healthcare settings, each with implications for robotic disinfection, as detailed in Tab.3. For example, the activity of “comforting” involves close physical contact, indicating a need for thorough disinfection in areas where such interactions are frequent. In contrast, during “doctor meetings,” the disinfection robot could focus on other high-risk areas due to the gathering of multiple individuals. Similarly, “infusing,” a routine procedure, requires regular disinfection in the areas where it takes place to maintain hygiene standards.

Images were sourced from Getty and Shutterstock, which offer extensive image libraries with textual descriptions corresponding to visual content. Approximately 1,000 images were acquired for each hospital activity. The data set was further refined through a verification process, where mislabeled images were either removed or reassigned to the correct categories. We also discarded images of poor quality, such as blurry or unclear ones that did not clearly depict the intended human activity. Despite these exclusions, our data set retains substantial diversity in terms of background, human pose, and appearance within each category. To improve the data set, the final selection of images was reviewed by an independent human labeler. The completed data set contains between 180 and 396 images per class, totaling 5,770 images.

Tab.4 provides a comprehensive breakdown of the data set used in this research. It categorizes the images into different classes, along with the quantity of images in each class. Additionally, brief descriptions are included for each class to highlight their unique characteristics and attributes. This detailed breakdown enhances the reader’s understanding of the data set’s scope and diversity.

Fig.6 shows a curated collection of images that represent the wide range of activities taking place in healthcare settings. Each of the 25 identified activities is depicted by a carefully chosen representative image from our data set. These selected images vividly portray the specific environments and scenarios associated with each activity, emphasizing the diversity and nuances captured in our study.

In our analysis of activity patterns, we employed the Barnes-Hut t-SNE method (Maaten and Hinton, 2008) to visualize the raw image data in a two-dimensional plane. t-SNE is widely recognized for its effectiveness in reducing the dimensionality of high-dimensional data and projecting it onto a lower-dimensional space. The process transforms the similarity between image samples into conditional probabilities and minimizes the Kullback-Leibler divergence between the conditional probabilities of the low-dimensional embedding and those of the high-dimensional data. Considering the large number of image features, we first used Principal Component Analysis (PCA) to reduce the dimensionality to 50, following the recommendation of Kobak and Berens (2019). These 50-dimensional features were then fed into the t-SNE algorithm to obtain a two-dimensional representation. Note that the input images were resized to 224 × 224 pixels with three channels, resulting in 150,528 features per image. The perplexity value for the t-SNE analysis was set at 30.
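A minimal sketch of this visualization pipeline, assuming scikit-learn (the paper does not name a specific implementation), is shown below with placeholder data.

```python
# Sketch: PCA to 50 dimensions, then Barnes-Hut t-SNE to 2-D (assumes scikit-learn).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Placeholder for the flattened 224 x 224 x 3 images (150,528 features each).
images = np.random.rand(500, 224 * 224 * 3).astype(np.float32)

feats_50d = PCA(n_components=50).fit_transform(images)                 # initial reduction to 50-D
embedding = TSNE(n_components=2, perplexity=30, method="barnes_hut",   # Barnes-Hut t-SNE, perplexity 30
                 random_state=0).fit_transform(feats_50d)               # (n_samples, 2) coordinates
```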

Fig.7 illustrates the t-SNE visualization, which displays the two-dimensional distribution of image data from a hospital setting. It reveals a convergence of features from different activities, indicating similarities in human activities that could pose challenges for automated recognition systems. This similarity is primarily attributed to the images being captured within healthcare facilities, which often share similar environmental backgrounds.

4.2 Implementation details

Our deep learning models were developed using the PyTorch framework (Paszke et al., 2019) and trained on an Ubuntu workstation equipped with two NVIDIA Quadro P5000 graphics cards. We initialized the Swin_L and ConvNeXt_L models by using pretrained weights from ImageNet. To ensure consistency, we employed a uniform set of training parameters for all models, including the GNN. The Stochastic Gradient Descent (SGD) optimizer (Ruder, 2017) was used for network optimization, with a weight decay of 5e-4 and a momentum of 0.9. These parameter values were chosen based on a combination of recommendations from the literature and empirical testing to achieve optimal performance. The initial learning rate for SGD was set at 1e-3, and it was reduced by half every 20 epochs to strike a balance between training convergence speed and stability. Considering hardware limitations, we set the batch size to 24 and trained the networks for a total of 50 epochs. During training, model weights were adjusted using cross-entropy loss, a commonly employed choice for classification models. The hospital activity data set was randomly divided into training (80%), validation (10%), and testing (10%) sets. To assess the effectiveness of our method, we generated five random splits and selected the most promising results from the validation phase to inform decision fusion. Finally, the performance of the fusion model was evaluated on the test set using standard evaluation metrics, including accuracy, recall, precision, and the F1 score (Powers, 2010).
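The sketch below summarizes this training configuration in PyTorch; the data set and model objects are placeholders, and validation, checkpointing, and data augmentation are omitted for brevity.

```python
# Sketch of the training setup described above (placeholders, not the authors' code).
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader, random_split

def train(model, dataset, device="cuda", epochs=50):
    n = len(dataset)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n - n_train - n_val])
    loader = DataLoader(train_set, batch_size=24, shuffle=True)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
    scheduler = StepLR(optimizer, step_size=20, gamma=0.5)   # halve the learning rate every 20 epochs

    model.to(device)
    for _ in range(epochs):
        model.train()
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)           # cross-entropy loss
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model, val_set, test_set
```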

4.3 Results

In this section, we present the results of our experiments. Fig.8 illustrates the progression of loss values for the Swin_L, ConvNeXt_L, and GNN models during training and validation. The loss trends for ConvNeXt_L and Swin_L converge, reaching low loss values by the 30th epoch, indicating successful learning of activity representation features. To enhance this learning process and improve model generalizability, we incorporated data augmentation techniques such as varying crops and image transformations to diversify the training data. The GNN, although initially exhibiting higher loss, consistently improved over the course of 50 epochs, suggesting potential benefits from extending the training duration. During the validation phase, ConvNeXt_L attained lower loss values earlier than Swin_L, with GNN stabilizing slightly later. This indicates an opportunity for performance optimization in the GNN when compared to the other models.

Tab.5 presents the performance metrics of each classifier, including our proposed method, on the testing set. The reported means and standard deviations are computed over five random splits of the data set to ensure robust evaluation. Among the individual classifiers, ConvNeXt_L stands out, with accuracy, F1 score, precision, and recall of 89.51%, 89.50%, 90.17%, and 89.30%, respectively. The GNN records lower scores, with accuracy, F1 score, precision, and recall of 63.84%, 62.78%, 63.54%, and 63.16%, respectively. This difference may be attributed to the similar environmental contexts across different categories in healthcare facilities, leading to similarities in the scene graphs representing these images.

Fig.9 displays the confusion matrix for the three classifiers on the testing data set. The confusion matrix aggregates the results across five random splits to account for variability among different splits. It is row-normalized, with the diagonal representing the predictive power of the model for positive classes. For the GNN, recall varies significantly across different human activities, with ‘measuring blood pressure’ and ‘comforting patients’ being the most frequently misclassified activities, having recalls of 0.26 and 0.33, respectively. In contrast, ‘online meeting’ shows the best performance under GNN, with a recall of 0.95. Swin_L’s classifier stands out, showing perfect recall in activities like ‘infusing,’ ‘operating,’ and several others, demonstrating its robustness in accurate activity classification. Similarly, ConvNeXt_L achieves a recall of 1 in a broad range of activities, from ‘infusing’ to ‘analyzing,’ exhibiting its capability in consistently recognizing varied actions within healthcare settings.

Tab.5 also reports the average performance of our proposed approach, which surpasses 90% in accuracy, F1 score, precision, and recall, as computed from test data across the different splits. This reflects a slight but consistent improvement in all metrics over the ConvNeXt_L model. Fig.10 displays the confusion matrix of our proposed method, revealing a recall spectrum ranging from moderate to perfect across activities, with ‘comforting patients’ at the lower end. This suggests that activities involving close patient interaction, such as ‘comforting,’ are harder to distinguish because of their contextual similarity to other patient-care actions.

4.4 Ablation study

In the ablation study, we evaluated the individual and combined contributions of classifier pairs using our fusion method. The results in Tab.6 show that combining Swin_L and ConvNeXt_L achieves an accuracy of 90.49%, outperforming the best individual classifier by 0.98 percentage points. Adding the GNN to either Swin_L or ConvNeXt_L leads to further slight accuracy gains, indicating the complementary nature of these classifiers. Moreover, the combined use of all three classifiers highlights the strength of collaborative fusion in improving classification results.

However, it is important to note that despite GNN’s fast inference time of 1.46 ms, the scene graph generation process takes approximately 100 ms, making the overall process slower compared to other methods. Nevertheless, the enhanced representation of scenes provided by GNN justifies its inclusion. This study demonstrates the potential of combining diverse classifier types in a hybrid model, where each classifier captures unique aspects of image data, resulting in a more comprehensive scene analysis. Furthermore, this research could serve as a foundation for future efforts to refine and optimize this approach, potentially leading to notable accuracy enhancements. The balance between accuracy and processing time can be adjusted to cater to specific use case requirements. Additionally, future work could explore faster scene graph generation methods, such as CV-SGG (Jin et al., 2023), to enhance the overall efficiency of the system.

5 Discussion

5.1 Comparison with state-of-the-arts

This study proposes a novel approach for recognizing human activities in hospitals. We achieved this by constructing a comprehensive data set and generating scene graphs from images. Our approach includes a specialized GNN for scene graph analysis and a hybrid multi-classifier fusion approach that synthesizes the outputs of the GNN, Swin_L, and ConvNeXt_L. The system demonstrates promising results on a dedicated hospital activity data set. Furthermore, we evaluated our model’s performance on two benchmark data sets:

Stanford 40 action data set. This data set consists of 9,532 images across 40 action categories, depicting various everyday human activities such as running, walking, and applauding. Each category contains 180 to 300 images, ensuring a diverse and extensive collection. The data set provides a predefined train-test split, with the training set comprising 100 images per action category.

PASCAL VOC 2012. This data set includes 10 human actions, such as playing an instrument, riding a bike, and using a computer. It consists of 2,296 training images and 2,292 validation images. To align with the methods being compared against, we evaluate our model’s performance on the validation set. Notably, images depicting multiple actions are excluded from both the training and validation sets, ensuring that our networks are trained and tested on images associated with a single action.

Performance on Stanford 40 action data set. Tab.7 compares our method with SotA approaches on the Stanford 40 action data set. Many existing action recognition methods rely on bounding-box annotations, as provided by data sets like Stanford 40, to enhance their performance. These methods utilize contextual information and semantic parts around or within person bounding boxes to improve results. In contrast, our method requires neither bounding-box annotations nor the detection of humans and objects in images. Instead, it autonomously learns visual and graph features for action representation. Our model achieves the highest performance on the Stanford 40 data set, with a mean average precision (mAP) of 96.6%, surpassing the SotA Attend and Guide method by 0.4%. The Multi-Scale Context Network (MSCNet), which learns human bodies and action-specific semantic parts, achieves the second-best performance with a mAP of 94.6%. Notably, our approach shows a significant improvement over MSCNet, with a 2% increase in mAP. Furthermore, our model outperforms the remaining methods by a substantial margin, ranging from 5.4% upwards.

Performance on PASCAL VOC 2012 action data set. Tab.8 presents a comparison between our method and SotA approaches. Many existing action recognition methods heavily rely on manually annotated bounding boxes to enhance their predictive power. For example, R*CNN achieves a mAP of 87.9% by incorporating annotated human bounding boxes, but its performance drops to 84.9% without such annotations. However, depending on manual annotations is impractical in real-world applications. A significant improvement in performance on the PASCAL VOC 2012 data set was only achieved with the introduction of the Top-Down + Bottom-Up Attention network (Bas and Ikizler-Cinbis, 2022), which achieves a mAP of 95.0%. Our method matches the performance of this network, placing it among the SotA methods. Notably, our approach outperforms the Top-Down + Bottom-Up Attention network by 5.6% on the Stanford 40 action data set. This comparison with SotA methods highlights the effectiveness and efficiency of our approach for human action recognition tasks.

5.2 Performance of GNN

The performance of GNN is significantly lower compared to the other two classifiers, Swin_L and ConvNeXt_L. Specifically, GNN achieves accuracies of 63.84%, 59.38%, and 66.15% on our hospital activity data set, Stanford 40, and PASCAL VOC 2012, respectively. The relatively lower performance on our hospital activity data set can be attributed to the similarities in the environment of certain activities. For instance, activities such as “Infusing” and “Injecting” occur in similar healthcare settings, where a healthcare professional interacts with a patient near a bed or medical workstation, with essential medical equipment present like an IV stand, needles, or syringes. These common elements are captured by the scene graph, posing a challenge for the GNN model to differentiate between these activities. To demonstrate this, we specifically selected ‘scanning’, ‘Eating’, ‘Operating’, ‘Carrying’, ‘Analyzing’, and ‘Wheelchair_1’ for classification, as they have distinct environmental backgrounds. The results indicate that GNN achieves an average accuracy of 88.81% on this subset of activities. The relatively lower performance on the Stanford 40 data set is due to certain inherent characteristics of this data set. Actions like ‘blowing bubbles’, ‘smoking’, ‘running’, or ‘waving hands’ mainly focus on a single human with minimal background context. From the perspective of the scene graph, these scenarios primarily capture the human figure, providing limited additional contextual information for effective classification. The same issue is observed with the PASCAL VOC 2012 data set, where the lack of contextual information in many images similarly hampers the performance of the GNN model.

In summary, the lower performance of the GNN model across different data sets can be attributed to various factors. For our hospital activity data set, the GNN model struggles to effectively differentiate certain activities due to their environmental similarities. Similarly, the Stanford 40 and PASCAL VOC 2012 data sets pose challenges because they often feature images with a single human figure and minimal background context, which deprives the scene graph of crucial additional information for accurate classification. Moving forward, our work suggests potential avenues for future research. Exploring methods to improve activity differentiation in similar environments and enhancing the handling of images with minimal background information could lead to better performance of scene-graph based HAR models. Additionally, developing and integrating advanced techniques for interpreting and contextualizing human figures and their actions in scene graphs holds promise for future research. Despite some limitations, incorporating scene graph analysis into activity recognition presents a promising research direction that offers a structured approach to understanding and interpreting complex visual data. With further refinement and development, these techniques have the potential to significantly advance the field of automated activity recognition.

5.3 Model performance under varied lighting conditions

To assess the robustness of our HAR system, it is crucial to evaluate its sensitivity to varying lighting conditions. Hence, we extended our evaluation to examine the impact of lighting on our proposed model. To assess the model’s performance in different lighting scenarios, we conducted tests on the first split of our data set, which was originally one of five random splits used to calculate the mean performance value. This split was modified with gamma correction, which changes the pixel values of an image using a power-law expression (Eq. (12)) to simulate different lighting conditions.

$$I_{\mathrm{out}} = I_{\mathrm{in}}^{\gamma}, \tag{12}$$

where $I_{\mathrm{in}}$ is the input value (ranging from 0 to 1), $I_{\mathrm{out}}$ is the adjusted value, and $\gamma$ is the gamma value. If the gamma value is less than 1, the image becomes lighter; if it is greater than 1, the image becomes darker. Fig.11 shows example test images under different gamma values.
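A small sketch of this gamma adjustment, assuming images are loaded with Pillow and scaled to [0, 1] before applying Eq. (12):

```python
# Sketch: power-law (gamma) adjustment of an image, as in Eq. (12).
import numpy as np
from PIL import Image

def adjust_gamma(path, gamma):
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0  # I_in in [0, 1]
    out = np.power(img, gamma)      # gamma < 1 lightens, gamma > 1 darkens
    return (out * 255.0).astype(np.uint8)
```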

The results, presented in Tab.9, show our system’s performance across the various altered scenarios through confusion matrices. These matrices provide valuable insight into the model’s accuracy in processing images affected by different gamma adjustments, thereby simulating a wide range of realistic lighting environments. By systematically modifying the gamma settings and analyzing the resulting changes in the confusion matrices, we can draw conclusions about the model’s robustness to lighting variations. This experimental setup enables us to identify any potential degradation in recognition performance caused by changes in image brightness and contrast, both of which are critical factors for practical applications in diverse and dynamic real-world conditions.

5.4 Future work

This study has made significant advancements in HAR for healthcare applications. However, there are several areas for future research that could enhance system performance. One such area is the refinement of scene graph extraction. Currently, our scene graphs contain unnecessary details that may impact the performance of the GNN. Developing a more sophisticated method to filter out these details could enhance the accuracy and speed of our system. Additionally, expanding our existing graph classification network to a multi-layer network could improve our ability to recognize activities. However, this expansion may give rise to challenges such as nodes becoming too similar and a learning bottleneck resulting from excessive information compression (Li et al., 2018, Alon and Yahav, 2020). Therefore, future work could focus on addressing these issues to strike a balance between network depth and learning efficiency.

In addition, addressing the challenges of variation and noise in images used for HAR is crucial. To tackle these challenges, we propose adapting techniques from other fields, such as the Augmented Linear Mixing Model (Hong et al., 2019), and applying them to enhance the stability of recognition algorithms. Testing and implementing our proposed methods in real-world, uncontrolled environments, particularly in healthcare settings, could lead to more efficient procedures, including cleaning and disinfecting. To validate and refine our method for live environments, we emphasize the importance of enhancing our data set with real images from healthcare settings. Additionally, we are exploring methods for robots to accurately determine their location (Aggarwal, 2020a), using a variety of evaluation measures (Aggarwal, 2020b, Xiao et al., 2023). Implementing these systems in real-time within healthcare facilities will provide practical insights and opportunities for improvement. To enhance computational efficiency, we propose applying optimization techniques such as structured pruning, model quantization, and knowledge distillation to our Swin Transformer and CNN models. These techniques have the potential to reduce computational load without significant performance loss. Furthermore, we suggest utilizing sparse computations for graph analysis, Fast Fourier Transforms, transfer learning, and specialized hardware such as GPUs and TPUs to improve efficiency. In summary, our research aims to advance HAR systems, making them more precise, efficient, and applicable in various settings, particularly in complex environments like healthcare facilities.

6 Conclusions

In this study, we present a composite decision fusion strategy for HAR that effectively integrates information from scene graphs and visual data. We have created a comprehensive hospital activity data set comprising 5,770 images, including 25 diverse categories of activities. To analyze these images, we have applied a technique to transform them into scene graphs and utilized a GNN for accurate classification. Simultaneously, we have utilized two advanced deep learning architectures, namely Swin_L and ConvNeXt_L, for visual feature-based HAR. Our decision fusion method combines the DST with a weighted majority vote to consolidate the decisions obtained from the GNN, Swin_L, and ConvNeXt_L models. This approach has yielded an overall accuracy of 90.59%, with a precision of 90.31% and similar metrics for F1 score and recall. However, the significance of our approach goes beyond these performance metrics. By accurately recognizing human activities, our system can be seamlessly integrated with existing medical interventions, forming a comprehensive strategy to mitigate pathogen transmission in healthcare facilities. This integration enables enhanced infection control measures by providing healthcare professionals with detailed insights into patient and staff movements and interactions. Ultimately, our approach has the potential to play a crucial role in reducing the spread of infections and improving patient safety in healthcare environments, thereby demonstrating its extensive practical implications.

Appendix

References

[1]

Aggarwal A K, (2020a). Enhancement of GPS position accuracy using machine vision and deep learning techniques. Journal of Computational Science, 16( 5): 651–659

[2]

Aggarwal A K (2020b). Fusion and enhancement techniques for processing of multispectral images. In: Ran A, Teiji W (eds.), Unmanned Aerial Vehicle: Applications in Agriculture and Environment. Cham: Springer International Publishing. 159–175

[3]

Alon U, Yahav E (2020). On the bottleneck of graph neural networks and its practical implications. ArXiv: 2006.05205

[4]

Assadian O, Harbarth S, Vos M, Knobloch JK, Asensio A, Widmer AF, (2021). Practical recommendations for routine cleaning and disinfection procedures in healthcare institutions: A narrative review. Journal of Hospital Infection, 113: 104–114

[5]

Bas C, Ikizler-Cinbis N, (2022). Top-down and bottom-up attentional multiple instance learning for still image action recognition. Signal Processing Image Communication, 104: 116664

[6]

Bera A, Wharton Z, Liu Y, Bessis N, Behera A, (2021). Attend and guide (AG-Net): A keypoints-driven attention-based deep network for image recognition. IEEE Transactions on Image Processing, 30: 3691–3704

[7]

Brody S, Alon U, Yahav E (2021). How attentive are graph attention networks? In: 2022 International Conference on Learning Representations (ICLR)

[8]

CDC (2024). [Internet]. Available from the website of CDC

[9]

Chen K, Zhang D, Yao L, Guo B, Yu Z, Liu Y (2021). Deep learning for sensor-based human activity recognition: Overview, challenges, and opportunities. ACM Computing Surveys, 54: 77:1–77:40

[10]

Daniel J, Lauffenburger J P (2011). Conflict management in multi-sensor Dempster-Shafer fusion for speed limit determination. In: 2011 IEEE Intelligent Vehicles Symposium (IV). 987–992

[11]

Ding Y, Zhang Z, Zhao X, Hong D, Cai W, Yang N, Wang B, (2023). Multi-scale receptive fields: Graph attention neural network for hyperspectral image classification. Expert Systems with Applications, 223: 119858

[12]

Ding Y, Zhang Z, Zhao X, Hong D, Cai W, Yu C, Yang N, Cai W, (2022). Multi-feature fusion: Graph neural network and CNN combining for hyperspectral image classification. Neurocomputing, 501: 246–257

[13]

Dong E, Du H, Gardner L, (2020). An interactive web-based dashboard to track COVID-19 in real time. Lancet. Infectious Diseases, 20( 5): 533–534

[14]

Feng W, Zhang X, Huang X, Luo Z (2017). Attention focused spatial pyramid pooling for boxless action recognition in still images. In: Lintas A, Rovetta S, Verschure P F M J, Villa A E P (eds.), Artificial Neural Networks and Machine Learning – ICANN 2017. Cham: Springer International Publishing. 574–581

[15]

Gkioxari G, Girshick R, Malik J (2015a). Actions and attributes from wholes and parts. In: 2015 IEEE International Conference on Computer Vision (ICCV). 2470–2478

[16]

Gkioxari G, Girshick R, Malik J (2015b). Contextual action recognition with R*CNN. In: 2015 IEEE International Conference on Computer Vision (ICCV). 1080–1088

[17]

Guettari M, Gharbi I, Hamza S, (2021). UVC disinfection robot. Environmental Science and Pollution Research International, 28( 30): 40394–40399

[18]

Guo G, Lai A, (2014). A survey on still image based human action recognition. Pattern Recognition, 47( 10): 3343–3361

[19]

Guo L, Wang L, Liu J, Zhou W, Lu B, (2018). HuAc: Human activity recognition using crowdsourced WIFI signals and skeleton data. Wireless Communications and Mobile Computing, 2018: 1–15

[20]

Haque M, Sartelli M, McKimm J, Abu Bakar M B, (2018). Health care-associated infections—An overview. Infection and Drug Resistance, 11: 2321–2333

[21]

Hong D, Gao L, Yao J, Zhang B, Plaza A, Chanussot J, (2021). Graph convolutional networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 59( 7): 5966–5978

[22]

Hong D, Yokoya N, Chanussot J, Zhu X, (2019). An augmented linear mixing model to address spectral variability for hyperspectral unmixing. IEEE Transactions on Image Processing, 28: 1923–1938

[23]

Hu D, Li S, (2022). Recognizing object surface materials to adapt robotic disinfection in infrastructure facilities. Computer-Aided Civil and Infrastructure Engineering, 37( 12): 1521–1546

[24]

Hu D, Li S, Wang M, (2023). Object detection in hospital facilities: A comprehensive dataset and performance evaluation. Engineering Applications of Artificial Intelligence, 123: 106223

[25]

Hu D, Zhong H, Li S, Tan J, He Q, (2020). Segmenting areas of potential contamination for adaptive robotic disinfection in built environments. Building and Environment, 184: 107226

[26]

Ikizler N, Cinbis R G, Pehlivan S, Duygulu P (2008). Recognizing actions from still images. In: 2008 19th International Conference on Pattern Recognition (ICPR). 1–4

[27]

Jin T, Guo F, Meng Q, Zhu S, Xi X, Wang W, Mu Z, Song W (2023). Fast contextual scene graph generation with unbiased context augmentation. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 6302–6311

[28]

Khaire P, Kumar P, Imran J, (2018). Combining CNN streams of RGB-D and skeletal data for human activity recognition. Pattern Recognition Letters, 115: 107–116

[29]

Khan F S, Xu J, Van De Weijer J, Bagdanov A D, Anwer R M, Lopez A M, (2015). Recognizing actions through action-specific person detection. IEEE Transactions on Image Processing, 24( 11): 4422–4432

[30]

Kobak D, Berens P, (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications, 10( 1): 5416

[31]

Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L J, Shamma D A, Bernstein M S, Fei-Fei L, (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123( 1): 32–73

[32]

Li Q, Han Z, Wu X, (2018). Deeper insights into graph convolutional networks for semi-supervised learning. Proceedings of the AAAI Conference on Artificial Intelligence, 32( 1): 3538–3545

[33]

Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 9992–10002

[34]

Liu Z, Mao H, Wu C Y, Feichtenhofer C, Darrell T, Xie S (2022). A ConvNet for the 2020s. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11966–11976

[35]

Dang L M, Min K, Wang H, Piran M J, Lee C H, Moon H, (2020). Sensor-based and vision-based human activity recognition: A comprehensive survey. Pattern Recognition, 108: 107561

[36]

Mudiyanselage S E, Nguyen P H D, Rajabi M S, Akhavian R, (2021). Automated workers’ ergonomic risk assessment in manual material handling using semg wearable sensors and machine learning. Electronics, 10( 20): 2558

[37]

Oquab M, Bottou L, Laptev I, Sivic J (2014). Learning and transferring mid-level image representations using convolutional neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. 1717–1724

[38]

Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019). PyTorch: An imperative style, high-performance deep learning library

[39]

Powers D M W (2010). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. ArXiv

[40]

Rashidi Nasab A, Elzarka H, (2023). Optimizing machine learning algorithms for improving prediction of bridge deck deterioration: A case study of ohio bridges. Buildings, 13( 6): 1517

[41]

Raza Usmani A, Kotowski S E, Davis K G, (2023). The impact of hospital bed height and gender on fall risk during bed egress. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 67( 1): 2434–2436

[42]

Rodríguez-Moreno I, Martínez-Otzeta J M, Sierra B, Rodriguez I, Jauregi E, (2019). Video activity recognition: State-of-the-Art. Sensors, 19( 14): 3160

[43]

Rogova G (2008). Combining the results of several neural network classifiers. In: Yager R R, Liu L (eds.), Classic Works of the Dempster-Shafer Theory of Belief Functions. Berlin, Heidelberg: Springer. 683–692

[44]

Ruder S (2017). An overview of gradient descent optimization algorithms. ArXiv

[45]

Rutala W A, Weber D J, (2016). Monitoring and improving the effectiveness of surface cleaning and disinfection. American Journal of Infection Control, 44( 5): e69–e76

[46]

Singh R, Khurana R, Kushwaha A K S, Srivastava R, (2020). Combining CNN streams of dynamic image and depth data for action recognition. Multimedia Systems, 26( 3): 313–322

[47]

Siyal A R, Bhutto Z, Muhammad S, Iqbal A, Mehmood F, Hussain A, Ahmed S, (2020). Still image-based human activity recognition with deep representations and residual learning. International Journal of Advanced Computer Science and Applications, 11( 5): 471–477

[48]

Tang K, Niu Y, Huang J, Shi J, Zhang H (2020). Unbiased scene graph generation from biased training. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3713–3722

[49]

Tang K, Zhang H, Wu B, Luo W, Liu W (2019). Learning to compose dynamic tree structures for visual contexts. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6612–6621

[50]

Uyguroğlu F, Toygar Ö, Demirel H, (2024). CNN-based Alzheimer’s disease classification using fusion of multiple 3D angular orientations. Signal, Image and Video Processing, 18(3): 2743–2751

[51]

Maaten L van der, Hinton G, (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9: 2579–2605

[52]

Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y (2017). Graph attention networks

[53]

Wang Y, Jiang H, Drew M S, Li Z N, Mori G (2006). Unsupervised discovery of action classes. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 1654–1661

[54]

Xiao J, Aggarwal A K, Rage U K, Katiyar V, Avtar R, (2023). Deep learning-based spatiotemporal fusion of unmanned aerial vehicle and satellite reflectance images for crop monitoring. IEEE Access: Practical Innovations, Open Solutions, 11: 85600–85614

[55]

Xu D, Zhu Y, Choy C B, Fei-Fei L (2017). Scene graph generation by iterative message passing. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition. 3097–3106

[56]

Yan S, Smith J S, Lu W, Zhang B, (2018). Multibranch attention networks for action recognition in still images. IEEE Transactions on Cognitive and Developmental Systems, 10( 4): 1116–1125

[57]

Yao B, Jiang X, Khosla A, Lin A L, Guibas L, Fei-Fei L (2011). Human action recognition by learning bases of action attributes and parts. In: 2011 IEEE International Conference on Computer Vision. 1331–1338

[58]

Yao J, Cao X, Hong D, Wu X, Meng D, Chanussot J, Xu Z, (2022). Semi-Active convolutional neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 99: 1–1

[59]

Yun Y, Gu I Y H, Aghajan H (2013). Riemannian manifold-based support vector machine for human activity classification in images. In: 2013 20th IEEE International Conference on Image Processing. 3466–3469

[60]

Zellers R, Yatskar M, Thomson S, Choi Y (2018). Neural motifs: Scene graph parsing with global context. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5831–5840

[61]

Zemmar A, Lozano A M, Nelson B J, (2020). The rise of robots in surgical environments during COVID-19. Nature Machine Intelligence, 2( 10): 566–572

[62]

Zhang H B, Lei Q, Zhong B N, Du J X, Peng J, (2016). A survey on human pose estimation. Intelligent Automation & Soft Computing, 22( 3): 483–489

[63]

Zhao Z, Ma H, Chen X, (2016). Semantic parts based top-down pyramid for action recognition. Pattern Recognition Letters, 84: 134–141

[64]

Zhao Z, Ma H, Chen X, (2017a). Generalized symmetric pair model for action classification in still images. Pattern Recognition, 64: 347–360

[65]

Zhao Z, Ma H, You S (2017b). Single image action recognition using semantic body part actions. In: 2017 IEEE International Conference on Computer Vision. 3411–3419

[66]

Zheng X, Gong T, Lu X, Li X, (2022). Human action recognition by multiple spatial clues network. Neurocomputing, 483: 10–21

RIGHTS & PERMISSIONS

Higher Education Press
