RESEARCH ARTICLE

Weakly supervised action anticipation without object annotations

  • Yi ZHONG 1,
  • Jia-Hui PAN 1,
  • Haoxin LI 1,
  • Wei-Shi ZHENG 1,2
  • 1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
  • 2. Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, Guangzhou 510006, China

Received date: 10 Apr 2021

Accepted date: 18 Nov 2021

Published date: 15 Apr 2023

Copyright

2023 Higher Education Press

Abstract

Anticipating future actions without observing any part of them is an important yet challenging task in action prediction. To obtain abundant information for action anticipation, some methods integrate multimodal contexts, including scene object labels. However, extensively labelling each frame in video datasets requires considerable effort. In this paper, we develop a weakly supervised method that integrates global motion and local fine-grained features from current action videos to predict the next action label without the need for specific scene context labels. Specifically, we extract diverse types of local features with weakly supervised learning, including object appearance and human pose representations without ground truth. Moreover, we construct a graph convolutional network for exploiting the inherent relationships between humans and objects in the current scene. We evaluate the proposed model on two datasets, the MPII-Cooking dataset and the EPIC-Kitchens dataset, and we demonstrate the generalizability and effectiveness of our approach for action anticipation.

Cite this article

Yi ZHONG, Jia-Hui PAN, Haoxin LI, Wei-Shi ZHENG. Weakly supervised action anticipation without object annotations[J]. Frontiers of Computer Science, 2023, 17(2): 172313. DOI: 10.1007/s11704-022-1167-9

1 Introduction

Human action anticipation has developed substantially in recent years and has received increasing attention because of its importance and promising real-world applications. Most early methods for anticipation tasks focus on observing a partial action and then predicting the label of the complete action. Later, several methods aim to predict the action that will occur at a specific future time, such as one or two seconds later. However, an action usually does not change much within one or two seconds. Therefore, some approaches pursue a different goal and anticipate the next, different action, which follows the current action and does not occur in the observations. In this case, the task is more difficult because no information about the future action is available. To address this problem, several recent approaches [1,2] utilize previous action motions and scene context cues. In addition, existing methods show the potential of scene context, including object labels, for the anticipation task.
Unfortunately, fine-grained labelling requires considerable effort, such as annotating what objects appear in each frame and where, and action anticipation requires a large-scale video repository. To address this problem, we develop a weakly supervised modelling approach. We utilize existing object detection algorithms to mine scene context cues, including object proposals and human poses. Unlike the supervised setting, we cannot verify the accuracy of the detected objects because no ground truth is available. Nevertheless, our framework can still exploit these noisy features to uncover additional context cues, such as the relations between a human and nearby objects. Therefore, our weakly supervised learning method greatly reduces the labelling effort and can be widely applied to large-scale databases that contain only action labels.
Our proposed framework is composed of several branches. We first apply a 3D CNN to action videos to obtain global action motions, taking only the optical flow frames of the current action as input. Next, we take advantage of more local fine-grained information in the current action. Compared to the method [1] that uses context labels directly, such as the tools, ingredients and containers that appear in the action, as shown in Fig.1, we exploit context features in a weakly supervised manner without using any ground truth of context cues. Fortunately, there are many reliable algorithms for extracting fine-grained features from frames. For the appearance modality, we utilize object proposal appearance features obtained by efficient object detection algorithms, and then we construct a graph convolutional network to discover their internal relationships, including but not limited to human-object interactions. Furthermore, since our approach focuses on human action analysis, we intuitively utilize human pose information and integrate skeleton features detected by existing pose estimation algorithms in each frame. Finally, we fuse the global motion features and local fine-grained features to anticipate the future behaviour.
Fig.1 Weakly supervised learning setting without object annotations. The left model [1] is a strongly supervised setting that directly integrates context labels, such as the tools, ingredients and containers observed in the action. The right model, our weakly supervised setting, utilizes context cues without such elaborate labels

We evaluate the proposed model on the MPII-Cooking dataset [3] and the EPIC-Kitchens dataset [4]. The experimental results show that our model achieves competitive performance. Note that the MPII-Cooking dataset consists of third-person view videos, whereas the EPIC-Kitchens dataset consists of first-person view videos. This difference shows that our model can be widely applied to anticipation tasks in both third-person and first-person views.
In summary, our main contribution is that we develop a novel model for action anticipation that integrates global motion and local fine-grained features from the current action. In particular, we extract various types of local features in a weakly supervised way, which is important for real-world applications. Furthermore, we demonstrate that the proposed model is appropriate for different viewpoints and achieves good performance.

2 Related work

Early action prediction Research on human action prediction has made great progress in recent decades. Many methods predict actions from partially observed videos, which is called early action prediction. Many traditional methods focus on designing hand-crafted features for action prediction [5-9], such as the integral bag-of-words (IBoW) and dynamic bag-of-words (DBoW) models [5], and the max-margin action prediction machine [7], which formulates the prediction task as a structured SVM learning problem and incorporates composite kernels to capture nonlinear classification boundaries. A hierarchical framework [8] was employed to describe human movements at multiple levels of granularity. Recently, deep neural network representations have also been pursued for this task [10-14], such as C3D [10] and I3D [11], which extend 2D convolutional networks to simultaneously extract spatial and temporal features. Furthermore, Zolfaghari et al. [15] integrated both 2D and 3D convolutional networks to predict actions. Singh et al. [16] fused object detector outputs from the appearance and optical flow modalities to localize and predict actions. Kong et al. [12] exploited rich sequential context features to improve the representation of partial actions and further projected them to full-video information. Qin et al. [13] reconstructed full-video features from deep features of partial actions and learned compact binary codes for prediction. Lai et al. [17] designed a global-local distance model to learn the distance between subsequences both as a whole and individually. Kong et al. [18] proposed a memory module that records hard-to-predict samples, aiming to improve prediction performance at the beginning stage of actions.
Action anticipation Recently, many methods for prediction tasks aim to predict the action that will occur at a specific future time, such as one or two seconds later [19-21]. Vondrick et al. [19] anticipated future action representations one second in advance from a single frame with a deep convolutional network. However, the model in [19] cannot embed temporal information, and Zhong and Zheng [20] further considered observing multiple frames to embed temporal information. Furnari and Farinella [21] considered multiple modalities to form a modality attention mechanism and proposed two LSTMs to summarize the past and predict the future. Reinforcement learning has also been embedded into the prediction task. Gao et al. [22] utilized an encoder-decoder framework with reinforcement learning for faster anticipation. Zeng et al. [23] formulated the observation as a state and the future transformation as a policy for visual forecasting.
However, an action usually does not change much within one or two seconds. Therefore, some other methods have a different purpose: they aim to predict the next, different action that follows the current action. Mahmud et al. [1,2] integrated three consecutive actions with an LSTM to infer the next action, and they adopted scene context, including the objects that appear in the scene. Ng and Fernando [24] utilized a GRU encoder-decoder architecture and an optimal transport loss to predict unseen actions. Pirri et al. [25] proposed an internal state memory module to store previous action states and integrated a Prototypical Network [26] to obtain context embeddings; their method needs not only action labels but also activity labels, which represent a series of actions, whereas our approach uses only action labels. Farha et al. [27] made long-term predictions with RNN-based and CNN-based methods, which encode previous action labels as sequence features and further generate a future sequence coding. Ke et al. [28] also focused on action label modelling, but they additionally considered time information so that they can predict a future action at any specified moment in one shot. Wu et al. [29] proposed an on-wrist motion-triggered sensing system, composed of a recurrent neural network and a policy network, for anticipating future intentions. Sun et al. [30] adopted a recurrent graph model in a weakly supervised way; however, they focused on multi-person action forecasting and treated each person as a graph node.
Relation learning Visual relationships [31-33] play an important role in action tasks, including human-object, human-human, and object-object relations. Zhang et al. [31] detected proposal bounding boxes with region proposal networks (RPNs) and used pairs of related regions to train a relationship proposer. Hu et al. [32] proposed an object relation module that considers the relations between one proposal and all other proposals through an attention strategy. Gkioxari et al. [33] presented a human-centric model based on an object detection framework to detect human-object interactions in images. Recently, graph models [34-38] have been used in relation tasks. Kato et al. [35] used an external knowledge graph and graph convolutional networks (GCNs) to address the zero-shot problem of detecting unseen combinations of interactions. Similarly, Qi et al. [37] introduced a graph parsing network that iteratively computes adjacency matrices and node labels representing the interactions between humans and objects. For videos, Wang and Gupta [36] further considered temporal shape dynamics and proposed space-time region graphs to represent video actions.
The human pose skeleton is also useful in action prediction tasks [39-43]. Given a single RGB image, Chao et al. [39] generated a sequence of human body poses with an encoder-decoder network and an RNN. Similar networks were also applied to multi-frame inputs for future pose generation [41,42]. Li et al. [40] presented a hierarchical CNN structure to predict short-term and long-term human motion.

3 Methodology

In this section, we introduce our approach in detail. Compared to previous methods [19-21] that predict the action appearing one or two seconds later, our approach has a different purpose and anticipates the next, different action, which may happen a few seconds or minutes later. Unlike the methods in [1,2], we have no knowledge about which objects, such as tools, ingredients and containers, appear with the human, as shown in Fig.1. That is, the scene context used in [1,2] is unavailable here, and we use a weakly supervised approach to anticipate the future action. Our framework consists of a global motion branch and a local fine-grained branch, as shown in Fig.2.
Fig.2 Overview of our framework. Our model consists of two parts with different roles. The upper part is the Global Motion branch. In the local feature extractor, the upper branch is the Appearance Relation branch and the other one is the Human Skeleton branch

Fig.3 Sample frames of two datasets. The left frames are from the MPII-Cooking dataset, and the right frames are from the EPIC-Kitchens dataset

3.1 Global motion learning

Global feature learning always plays an important role in action tasks. The motion features from observed videos help us evaluate the present motion, which indicates how the motion will progress in the future under different viewpoints. Fortunately, deep learning technology provides a convenient and efficient way to extract video features.
As demonstrated in [10,11], the 3D CNN architecture, consisting of multiple convolutional layers, can generate confident and robust high-level features. These features incorporate spatial and temporal contexts, which are useful for various action tasks, such as recognition, detection, and prediction. In our approach, we adopt the I3D [11] architecture, followed by a fully connected layer. Moreover, while observing an action clip $X_{1:n}=\{x_1,x_2,\ldots,x_n\}$, we feed only its optical flow frame clip $X^{optical}_{1:n}$ into the global model to obtain the motion feature
$$z_{global}=f_{global}(X^{optical}_{1:n}),$$
because it contains abundant information about the global motion, where $x_i$ is the $i$th frame in the action clip.
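
To make this branch concrete, the following minimal PyTorch sketch wraps a pooled I3D flow feature with the 128-node fully connected layer described in Section 4.2; the 1024-dimensional I3D output size is an assumption for illustration, and the I3D backbone itself is treated as a pre-computed feature extractor.

```python
import torch
import torch.nn as nn

class GlobalMotionBranch(nn.Module):
    """Global motion branch: pooled I3D flow feature followed by a fully connected layer.

    The I3D backbone is assumed to be applied beforehand; the 1024-d input size
    is an illustrative assumption, while the 128 hidden nodes follow Section 4.2.
    """
    def __init__(self, i3d_feat_dim=1024, hidden_dim=128):
        super().__init__()
        self.fc = nn.Linear(i3d_feat_dim, hidden_dim)

    def forward(self, i3d_flow_feature):
        # i3d_flow_feature: (batch, i3d_feat_dim), extracted from X^{optical}_{1:n}
        return self.fc(i3d_flow_feature)

# usage sketch: z_global for a batch of 4 clips
z_global = GlobalMotionBranch()(torch.randn(4, 1024))   # -> (4, 128)
```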

3.2 Weakly supervised learning for local fine-grained features

According to many existing studies, global motion provides many credible and distinguishable features for action analysis. However, in some real scenarios, we need more local fine-grained cues about local scenes to enhance reliability.

3.2.1 Appearance relation learning

As mentioned above, we utilize only optical flow frames to obtain the motion feature, without any appearance context. In action anticipation, the future is uncertain, and appearance cues are useful for avoiding ambiguity. For example, if a person is washing an apple, he/she may eat the apple next. However, if we discover a cutting board and a knife nearby, we prefer to predict that he/she will cut the apple. Experimental studies of some methods [1,2] show that prediction performance improves dramatically after integrating scene context, which demonstrates its potential. However, the labels and locations of objects are not always available in the scene. To overcome this problem, we use a weakly supervised method, that is, we adopt an existing object detection algorithm to detect object proposals in frames, as shown in Fig.2. Furthermore, we take advantage of a graph convolutional network to extract the underlying relationships between object proposals [34].
For frame $x_i$ in an action video clip $X_{1:n}$, we compute its object proposals $\{O^i_j\}$ with a detection algorithm. However, the number of detected proposals may differ between frames, and many proposals are redundant. Considering computational efficiency and the integration of effective information, we perform a selection and obtain $K$ proposals $\{\hat{O}^i_j\}_{j=1:K}$ for frame $x_i$. In detail, among the detected proposals, we first extract person bounding boxes according to the labels predicted by the detection algorithm and select the top-1 person bounding box, because we only consider scenes containing one person. Around the chosen person, we further select the other $K-1$ object proposal bounding boxes according to their spatial distance from the chosen person. Then, we construct the appearance relational graph.
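
As a rough illustration of this selection step (not the authors' exact procedure), the sketch below keeps the highest-scoring person detection and the $K-1$ object boxes whose centres are closest to it; the box format and detector output structure are assumptions.

```python
import numpy as np

def select_proposals(boxes, labels, scores, k=5, person_label=0):
    """Keep one person box plus the K-1 spatially nearest object boxes.

    boxes:  (N, 4) array of [x1, y1, x2, y2] detections (e.g., from YOLOv3).
    labels: (N,) class ids; `person_label` marks person detections (assumed id).
    scores: (N,) detection confidences.
    Returns indices of the selected proposals, person first.
    """
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
    person_idx = np.where(labels == person_label)[0]
    person = person_idx[np.argmax(scores[person_idx])]        # top-1 person box
    others = np.array([i for i in range(len(boxes)) if i != person])
    dists = np.linalg.norm(centers[others] - centers[person], axis=1)
    nearest = others[np.argsort(dists)[:k - 1]]               # K-1 closest objects
    return np.concatenate([[person], nearest])
```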
Modelling the nodes We crop the $K$ proposals in the frame and feed the cropped images into a deep convolutional architecture to obtain proposal appearance features $h^{i,(0)}=\{h^{i,(0)}_1,h^{i,(0)}_2,\ldots,h^{i,(0)}_K\}$, where $h^{i,(0)}_j\in\mathbb{R}^d$ denotes the $j$th proposal feature of frame $x_i$ at the initial step (simplified to $h_j$ in the following) and $d$ is the feature dimension. We consider the chosen person and object proposals to be the nodes and their appearance features to be the node features.
Modelling the edges As in [34], we apply a shared linear transformation matrix $W\in\mathbb{R}^{\tilde{d}\times d}$ to each node to obtain a high-level feature, where $\tilde{d}$ is the new dimension of the node features after the graph network. Then, for node $i$ and node $j$, we combine their node features to generate the edge feature
$$e_{ij}=\mathrm{Concat}(Wh_i,Wh_j)\in\mathbb{R}^{2\tilde{d}}.$$
Updating the graph In this step, the main operation is to update the node states. Clearly, human-object relations are crucial for forecasting the future, but object-object relations can also provide potential clues. Consequently, while updating node $i$, our approach takes all the other nodes into consideration by combining neighbours with an attention scheme. We compute the attention coefficients related to node $i$ through a Softmax layer with the LeakyReLU function:
$$\alpha_{ij}=\frac{\exp\left(\mathrm{LeakyReLU}(a^{\mathrm{T}}e_{ij})\right)}{\sum_{j=1}^{K}\exp\left(\mathrm{LeakyReLU}(a^{\mathrm{T}}e_{ij})\right)},$$
where $i,j\in[1:K]$ are the node indices, $a$ is a parameter vector with the same dimension as $e_{ij}$, and $\mathrm{T}$ denotes the transpose. We obtain the attention weight $\alpha_{ij}\in(0,1)$, which measures the correlation between node $j$ and node $i$. Furthermore, the updated state of node $i$ in frame $x_i$ is calculated by the following weighted fusion:
$$h^{i,(1)}_i=\sum_{j=1}^{K}\alpha_{ij}h^{i,(0)}_j.$$
After $M$ update iterations, our relational graph generates the latest node states $h^{i,(M)}=\{h^{i,(M)}_1,h^{i,(M)}_2,\ldots,h^{i,(M)}_K\}$ from $h^{i,(0)}$ as above. Next, we combine the $K$ node features through a fusion function $f_{graph\_fus}(\cdot)$ for each graph frame and then apply frame-level feature fusion to obtain the appearance relation feature:
$$z_{relate}=f_{frame\_fus}\left(\{f_{graph\_fus}(h^{i,(M)})\}_{i=1:P}\right).$$
After the feature fusion in each graph frame, we obtain a frame-level feature of the appearance relation. Note that we select $P$ frames equidistantly in each action clip. In brief, we construct $f_{graph\_fus}(\cdot)$ and $f_{frame\_fus}(\cdot)$ so that their output dimension matches that of $h^{i,(M)}$.
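
A minimal single-head sketch of the graph updates above is given below: $W$ is used only to build edge features, the attention follows the LeakyReLU-softmax form, and nodes are updated as attention-weighted sums over $M$ iterations. Dimensions follow Section 4.2 ($d=2048$, $\tilde{d}=128$, $M=2$), but other details of the GAT implementation [34] (dropout, batching, multi-head attention) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AppearanceRelationGraph(nn.Module):
    """Single-head attention graph over the K proposal nodes of one frame."""
    def __init__(self, d=2048, d_tilde=128, num_iters=2):
        super().__init__()
        self.W = nn.Linear(d, d_tilde, bias=False)       # shared projection W
        self.a = nn.Parameter(torch.randn(2 * d_tilde))  # attention vector a
        self.num_iters = num_iters                       # M update iterations

    def forward(self, h):
        # h: (K, d) initial proposal appearance features h^{i,(0)}
        for _ in range(self.num_iters):
            wh = self.W(h)                               # (K, d~)
            k = h.size(0)
            # e_ij = Concat(W h_i, W h_j) for every node pair
            e = torch.cat([wh.unsqueeze(1).expand(k, k, -1),
                           wh.unsqueeze(0).expand(k, k, -1)], dim=-1)
            alpha = F.softmax(F.leaky_relu(e @ self.a, negative_slope=0.2), dim=-1)
            h = alpha @ h                                # attention-weighted node update
        return h                                         # h^{i,(M)}, shape (K, d)

# usage sketch: one frame with K = 5 proposals; graph/frame fusion (concatenation
# followed by a 128-node fully connected layer, or taking the person node) is applied on top
nodes = AppearanceRelationGraph()(torch.randn(5, 2048))
```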

3.2.2 Human skeleton feature

We have already described the global motion above. However, the motion within a large field of view is occasionally blurry, which makes the future more uncertain. To reduce this impact, in the third-person view, we take the human skeleton into consideration, as shown in Fig.2. In recent years, skeleton detection algorithms have achieved impressive performance, and we apply such an algorithm in our framework.
For frame $x_i$, we detect the human skeleton $S^i\in\mathbb{R}^{r\times 2}$ with a confidence score $u$, where $r$ is the number of key points and each key point consists of an x-location and a y-location. Note that our local fine-grained method is weakly supervised, namely, no ground truth of skeleton points is available in the dataset. One problem is that partial skeletons may not be recognized due to occlusion or illumination. Therefore, we keep only the skeleton frames whose confidence score $u$ is larger than a threshold $\delta$, and we adopt the upper-body skeleton to avoid occlusion, because a person generally stands behind a table that hides his/her lower body.
Consequently, we obtain a new skeleton clip $\{\hat{S}^i \mid \hat{S}^i\in\mathbb{R}^{\hat{r}\times 2}\}_{i=1:n}$ from the video clip $X_{1:n}$, where $\hat{r}$ is the number of upper-body key points. Then, our approach fuses the pose information as follows:
$$z_{pose}=f_{pos\_fus}\left(\mathrm{Concat}\left(\mathrm{Sample}(\{\hat{S}^i\})\right)\right).$$
Notably, we apply a sampling operation to the remaining skeleton frames, selecting $\hat{P}$ frames equidistantly for each action sample. Then, we concatenate their coordinates into a vector, followed by a fusion function. Finally, we integrate all local features:
$$z_{local}=\mathrm{Concat}(z_{relate},z_{pose}).$$
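
The skeleton pipeline (confidence thresholding, upper-body selection, equidistant sampling and concatenation) can be sketched as follows; the keypoint ordering that places the $\hat{r}$ upper-body points first is an assumption, and $f_{pos\_fus}(\cdot)$ is taken to be the 128-node fully connected layer of Section 4.2.

```python
import torch

def skeleton_feature(skeletons, confidences, delta=6.0, r_hat=13, p_hat=10):
    """Build the input of f_pos_fus from per-frame skeleton detections.

    skeletons:   list of (r, 2) tensors of keypoint coordinates, one per frame.
    confidences: list of per-frame detection confidences u.
    Frames with u <= delta are dropped, the first r_hat (assumed upper-body)
    keypoints are kept, p_hat frames are sampled equidistantly and concatenated.
    """
    kept = [s[:r_hat] for s, u in zip(skeletons, confidences) if u > delta]
    idx = torch.linspace(0, len(kept) - 1, steps=p_hat).long()   # equidistant sampling
    return torch.cat([kept[i].reshape(-1) for i in idx])         # (p_hat * r_hat * 2,)

# z_pose = pos_fus_fc(skeleton_feature(...)), with pos_fus_fc a 128-node linear layer;
# z_local = torch.cat([z_relate, z_pose])
```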

3.3 Overall

Finally, we integrate the global motion branch and the local fine-grained branch to construct the entire framework, as shown in Fig.2. We fuse the previous features, followed by a softmax layer, to obtain the predicted output:
$$z_{pred}=\mathrm{Softmax}\left(f_{fus}\left(\mathrm{Concat}(z_{global},z_{local})\right)\right).$$
The goal of our framework is to forecast the next future action $y$, and our training loss is formulated as
$$l=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{c}y^{(i)}_j\log\left(z^{(i)}_{pred,j}\right),$$
where $N$ is the number of training samples and $c$ is the number of action classes; $y^{(i)}_j=1$ if the ground-truth future action of sample $i$ is class $j$, and $y^{(i)}_j=0$ otherwise.
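
A minimal sketch of the final fusion and training objective is given below. The feature sizes follow Section 4.2 and the 65 classes correspond to MPII-Cooking, but the explicit classification layer on top of $f_{fus}$ is an assumption (the text folds prediction into the fusion and softmax step), and PyTorch's CrossEntropyLoss combines the softmax and the log term of the loss above.

```python
import torch
import torch.nn as nn

class AnticipationHead(nn.Module):
    """Fuse global and local features and predict the next-action distribution."""
    def __init__(self, global_dim=128, local_dim=256, num_classes=65):
        super().__init__()
        self.fus = nn.Linear(global_dim + local_dim, 128)   # f_fus
        self.cls = nn.Linear(128, num_classes)              # assumed classifier layer

    def forward(self, z_global, z_local):
        z = self.fus(torch.cat([z_global, z_local], dim=-1))
        return self.cls(z)    # logits; softmax is applied inside the loss below

# training sketch: cross-entropy over a batch of 32 samples
head = AnticipationHead()
logits = head(torch.randn(32, 128), torch.randn(32, 256))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 65, (32,)))
```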

4 Experiments

4.1 Datasets

The MPII-Cooking dataset [3] is a dataset of fine-grained cooking activities; it contains 65 cooking action categories, such as wash hands, close drawer and cut part, recorded by 12 participants, as shown in Fig.3. It has 44 videos with a total length of more than 8 hours and 5,609 annotated action segments. Following the setting in [1], we adopt a 5-fold leave-one-person-out cross-validation: in each fold, we select one subject from the 4 test participants as the testing set, leaving the other 11 participants for training.
The EPIC-Kitchens dataset [4] is a large-scale egocentric video dataset recorded by 32 participants in their native kitchens across 4 cities, capturing only single-person activities. It contains 39,564 action segments with a total length of 55 hours, covering 125 verb classes, 331 noun classes and 2,513 action classes constructed from verb-noun combinations. Since the annotations of the test set are not available, we focus on the 272 training videos containing 28,561 action segments and split them into training and testing sets as in [21].

4.2 Implementation details

As mentioned above, our global motion feature extractor $f_{global}(\cdot)$ consists of an I3D architecture followed by a fully connected layer with 128 hidden nodes, and its input is an optical flow frame clip; optical flow frames are already provided in both datasets. For I3D, we adopt the Inception-v1 I3D model trained on the Kinetics training split. We further apply the YOLO-v3 [44] model trained on COCO to extract object proposals. After cropping the proposals, we use ResNet-101 pre-trained on ImageNet to obtain the initial node features, whose dimension $d$ is 2,048. For the graph model, we utilize a single-head GAT [34] network with a dropout of 0.2 and an alpha of 0.2. Moreover, we set the number of iterations $M$ to 2 and the hidden size $\tilde{d}$ to 128. The sampled frame number $P$ is 3, and we use a 128-node fully connected layer as the frame-level fusion function $f_{frame\_fus}(\cdot)$. During training, we adopt the SGD optimizer with a momentum of 0.5. We train and test our framework in PyTorch and report the average of 20 runs. Since the datasets have different viewpoints, the experimental settings of the two datasets differ slightly.
The MPII-Cooking dataset is a third-person view dataset with single-person activities, and we use the AlphaPose [45,46] model pre-trained on the COCO dataset to detect human pose skeletons. We obtain 17-point human skeletons ($r=17$) with 13 points in the upper body ($\hat{r}=13$). The sampled frame number $\hat{P}$ is set to 10, and the threshold $\delta$ is 6.0. We use a 128-node fully connected layer as the fusion function $f_{pos\_fus}(\cdot)$. For the graph model, we set the proposal number $K$ to 5 and concatenate all node features in a frame, followed by a 128-node fully connected layer, as the graph fusion function $f_{graph\_fus}(\cdot)$. For the last fusion function $f_{fus}(\cdot)$, we use another 128-node fully connected layer. During prediction model training, the learning rate is set to 1e−3 and the batch size to 32. We use a dropout layer with a coefficient of 0.2.
The EPIC-Kitchens dataset contains several types of classes, and we predict only the verb and the action under the weakly supervised setting. For the graph model, we set the proposal number $K$ to 3, and as the graph fusion function we choose the latest node feature that represents the person proposal bounding box in a frame. After obtaining the outputs $z^{v}_{pred}$ and $z^{a}_{pred}$ from the last layer, we use two independent softmax layers for the verb and the action, respectively. In the same way, we obtain two losses $l^{v}$ and $l^{a}$, which we sum with a 1:2 weight coefficient; in other words, we double $l^{a}$ and add it to $l^{v}$ to obtain the final loss. Since the EPIC-Kitchens dataset has many more classes than the MPII-Cooking dataset, we extend the hidden nodes in $f_{fus}(\cdot)$ to a 256-node fully connected layer. During prediction model training, we set the learning rate to 1e−2 and the batch size to 128. The coefficient of the dropout layer is set to 0.5.
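
The verb/action weighting described above can be sketched as follows; the two linear heads over the 256-dimensional fused feature and the class counts (125 verbs, 2,513 actions) follow the text, while everything else is illustrative.

```python
import torch
import torch.nn as nn

verb_head, action_head = nn.Linear(256, 125), nn.Linear(256, 2513)
ce = nn.CrossEntropyLoss()

def epic_loss(z_fused, verb_target, action_target):
    """Combined loss l = l_v + 2 * l_a over two independent softmax heads."""
    l_v = ce(verb_head(z_fused), verb_target)
    l_a = ce(action_head(z_fused), action_target)
    return l_v + 2.0 * l_a

# usage sketch with a batch of 8 fused features
loss = epic_loss(torch.randn(8, 256),
                 torch.randint(0, 125, (8,)), torch.randint(0, 2513, (8,)))
```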

4.3 Experiment on MPII-Cooking dataset

4.3.1 Compared methods

We compared the results of the following methods:
- Contexts [1] This method incorporates different context attributes, such as the sequential activity context, scene context and inter-activity time context; we drop the scene context branch for the weakly supervised task. It takes bag-of-words motion boundary histograms (MBH) [47] as motion features and utilizes three previous actions to predict the next action with an LSTM framework.
- CNFAS [2] This method is an improved version of [1]. It adopts pre-trained convolutional 3D (C3D) [10] features to replace the motion features and pays more attention to future activity sequences. As above, we ignore the scene context branch.
- Contexts (I3D) Compared to the Contexts method above, we use three consecutive I3D features to replace the MBH features in the sequential activity context branch. We also use the LSTM model to combine these optical flow I3D features and adopt the inter-activity time context branch. Because object labels are unavailable in our setting, we omit the scene context branch.

4.3.2 Results

The results on the MPII-Cooking dataset are shown in Tab.1, where we report two evaluation measures, top-1 accuracy and top-5 accuracy. The Contexts (I3D) method, which follows the same framework as Contexts [1] and CNFAS [2], obtains a modest 1.1% improvement in top-1 accuracy over the previous state-of-the-art method, likely because it adopts the I3D model. Similarly, CNFAS [2], which uses C3D features, outperforms Contexts [1], which uses MBH features. Our proposed method surpasses the state-of-the-art method in top-1 accuracy by over 5%, reaching 39.67%, and our fusion framework achieves an over 4% improvement compared to the Contexts (I3D) method. In terms of top-5 accuracy, our approach improves by about 7% and reaches 66.12%. Previous methods adopt multiple continuous actions to capture the evolution of actions; in contrast, our approach considers a single action and explores fine-grained local information in a weakly supervised way. The results demonstrate the effectiveness of our framework.
Tab.1 Results on the MPII-Cooking dataset
Method Top-1 accuracy/% Top-5 accuracy/%
Contexts [1] 33.1 –
CNFAS [2] 33.7 –
Contexts (I3D) 34.82 59.11
Ours 39.67 66.12
However, there is still much room for improvement. A qualitative evaluation is shown in Fig.4. We find that the similarity between some actions, such as cut part, cut slice and cut dice, can mislead our model, although the true label appears among the top-5 predictions. On the other hand, our model sometimes takes unnecessary objects into consideration, such as background objects, which influences the anticipation.
Fig.4 Qualitative evaluation on MPII-Cooking Dataset. The top-5 predicted labels are shown on the right

4.3.3 Ablation study

We also perform multiple ablation experiments on the MPII-Cooking dataset to evaluate the effectiveness of each branch; the results are shown in Tab.2. The I3D-GAT and I3D-POSE methods both outperform the I3D branch on both evaluation measures, showing that the local features, whether the appearance relation feature or the human pose feature, enhance the model. We believe that the object appearance relations and the human skeleton pose are modalities different from the global motions, and they complement each other to yield comprehensive features. We also observe that our full model performs better than both I3D-GAT and I3D-POSE on each evaluation measure. Consequently, the object appearance relation and the human skeleton pose cannot replace each other; these two modalities carry specific information that individually contributes to our prediction model.
Tab.2 Ablation study on the MPII-Cooking dataset
Components Top-1 accuracy/% Top-5 accuracy/%
I3D 34.49 60.29
POSE 37.35 64.27
GAT 37.37 60.74
I3D-POSE 37.43 64.13
I3D-GAT 37.71 65.02
GAT-POSE 38.42 64.11
Ours (full model) 39.67 66.12

4.4 Experiment on EPIC-Kitchens dataset

4.4.1 Compared methods

We compared the results of the following methods:
- I3D In this method, we only adopt the global motion branch, taking optical flow frame clips of the current action as input.
- Contexts (I3D) As mentioned above, we follow the framework of Contexts [1] and CNFAS [2], using three consecutive I3D features to replace the MBH and C3D features. Note that we adopt only the sequential activity context branch and the inter-activity time context branch, owing to the lack of object labels in the weakly supervised setting.
- GAT We only use the relational graph branch in our framework.

4.4.2 Results

The results on the EPIC-Kitchens dataset are shown in Tab.3. On this first-person view dataset, our proposed method also outperforms the compared methods. Compared with the Contexts framework in particular, we show that a single action contains many available details, such as the objects in the scene and the human-object relations, and we can make good use of them for anticipating the future. Furthermore, we evaluate our model on the EPIC-Kitchens test sets, where it also obtains good performance, as shown in Tab.4. Together with the results on the previous dataset, this shows that our prediction framework is effective and competitive not only in the third-person view but also in the first-person view. The qualitative evaluation is shown in Fig.5. Our model can integrate the objects around the person to predict the next action. However, due to the lack of object annotations, we cannot confirm which object is crucial; sometimes the model focuses on irrelevant objects and misses the relevant ones. On the other hand, the various possibilities of the future make the task more difficult, which prompts our model to extract more discriminative information.
Tab.3 Results on the EPIC-Kitchens dataset
Method Top-1 accuracy Top-5 accuracy
VERB ACTION VERB ACTION
Contexts (I3D) 27.41 05.26 73.01 15.22
I3D 27.82 05.62 74.22 16.89
GAT 29.29 07.46 74.88 20.67
Ours 31.13 08.41 75.82 22.34
Tab.4 Results on the EPIC-Kitchens test sets
Method Top-1 accuracy Top-5 accuracy
VERB ACTION VERB ACTION
S1 Contexts (I3D) 27.40 04.96 70.57 15.09
I3D 28.42 05.59 72.23 16.20
GAT 29.89 06.33 74.36 19.16
Ours 31.27 08.13 74.86 21.54
S2 Contexts (I3D) 23.52 03.79 62.92 10.07
I3D 24.96 04.03 66.00 11.98
GAT 25.50 04.44 65.79 11.78
Ours 27.04 05.50 67.12 14.27
Fig.5 Qualitative evaluation on EPIC-Kitchens Dataset. The top-5 predicted labels are shown on the right

4.5 Effect of model settings

We conduct further experiments on our approach with different model settings, such as the frame selection number, the feature fusion methods in the appearance relation graph branch, and the modality selection in the global motion branch.

4.5.1 Effect of frame number selection on relation learning

We consider the impact of different frame selection numbers on appearance relation graph learning. Generally, objects appearing on the screen stay for a while. Therefore, instead of taking all frames to construct graphs, we sample P frames equidistantly for each action. Note that we consider only odd numbers so that our model always takes the middle frame of the action. The results on the two datasets are shown in Tab.5. On the MPII-Cooking dataset, our proposed model performs best when we take three frames, which contain only the beginning, middle and ending frames of the action. Sufficient action information may be lacking when P = 1; meanwhile, redundant cues occur when P = 5, 7 or 9. Although these input frames also contain the previous three frames, they increase the complexity of the model; thus, the performance decreases. On the EPIC-Kitchens dataset, our model performs better overall with three frames, being only slightly inferior on individual evaluation measures. When P = 5, 7 or 9, the results are close to those obtained at P = 3. The reason could be that, in the first-person view, the scene changes rapidly even within one action; therefore, despite the increase in complexity, more frames may bring some useful cues.
Tab.5 Results of different frame selections on two datasets. P: the number of selection frames on appearance relation graph learning
P MPII-Cooking EPIC-Kitchens
Top-1 Acc. Top-5 Acc. Top-1 Acc. Top-5 Acc.
VERB ACTION VERB ACTION
1 37.41 64.73 28.36 06.78 74.61 18.63
3 39.67 66.12 31.13 08.41 75.82 22.34
5 38.05 65.06 30.49 07.84 75.34 21.31
7 37.46 63.60 30.55 08.18 75.99 22.13
9 37.95 63.85 30.56 08.04 75.76 22.16

4.5.2 Effect of proposal number selection

We also consider the impact of different proposal selection numbers on appearance relation graph learning. The results on the two datasets are shown in Tab.6. On the EPIC-Kitchens dataset, our proposed model performs better when K = 3, whereas on the MPII-Cooking dataset it performs best with five proposals. The reason could be that, on the MPII-Cooking dataset, the related proposals are farther away from the camera and look smaller in the picture, as shown in Fig.4, so we need to take more proposals. In addition, on both datasets, our graph-based method performs better than the method that simply concatenates object features; the graph model helps us to obtain more appearance relation features about the actions.
Tab.6 Results of different proposal number selections on two datasets. K: the number of proposals in one frame on appearance relation graph learning. Obj: We concatenate proposal appearance features to replace the relational graph in each frame in our framework
K MPII-Cooking EPIC-Kitchens
Top-1 Acc. Top-5 Acc. Top-1 Acc. Top-5 Acc.
VERB ACTION VERB ACTION
3 37.64 65.34 31.13 08.41 75.82 22.34
5 39.67 66.12 30.27 07.68 75.68 21.10
7 38.20 66.00 30.64 08.06 75.78 21.76
Obj 38.55 65.42 29.46 06.77 74.88 18.79

4.5.3 Effect of modality selection

We consider the impact of different modalities in the global motion branch. The results on the two datasets are shown in Tab.7. We observe that, on both datasets, our proposed model performs better when we make use of the optical flow modality, which contains motion information, and the performance decreases in most cases when we fuse in the RGB modality. The reason could be that, when we straightforwardly take both modalities, the graph models for both modalities may contain similar information, which results in redundancy. Especially on the MPII-Cooking dataset, the actions occupy only a small spatial portion of a frame, as shown in Fig.4, and the RGB modality contains much background appearance. Note that the FLOW modality performs better than the RGB modality because the FLOW branch also fuses appearance information from the graph models, whereas the RGB modality lacks motion information.
Tab.7 Results of different modality selections on two dataset. R: RGB modality. F: Optical Flow modality
M MPII-Cooking EPIC-Kitchens
Top-1 Acc. Top-5 Acc. Top-1 Acc. Top-5 Acc.
VERB ACTION VERB ACTION
R 38.65 65.36 30.24 08.03 76.07 22.05
F 39.67 66.12 31.13 08.41 75.82 22.34
R+F 38.58 65.69 30.47 08.45 75.90 22.15

4.5.4 Effect of fusion function selection

We further consider the feature fusion functions in the appearance relation graph. For the graph feature fusion function $f_{graph\_fus}(\cdot)$, denoted as $f_g$ for brevity, we can concatenate all graph node features followed by a fully connected layer, denoted as Fc, or choose only the latest node feature that represents the person proposal bounding box, denoted as Pe. Note that the latter operation considers only the person node, but that node has already integrated other object features through the graph updating. For the frame-level fusion function $f_{frame\_fus}(\cdot)$, denoted as $f_f$ for brevity, we can apply a fully connected layer to the concatenated frame features (Fc), or use mean pooling (Me) or max pooling (Ma). The results on the two datasets are presented in Tab.8. On the MPII-Cooking dataset, we find that "Pe + Ma" performs slightly better than the other combinations in top-5 accuracy, but "Fc + Fc" performs best in top-1 accuracy. On the EPIC-Kitchens dataset, we observe that "Pe + Fc" performs better in most cases. The reason could be that the hand proposal usually occupies a large spatial portion of a frame, as shown in Fig.5, and contains the major information.
Tab.8 Results of different fusion methods on two datasets. Pe: node feature that represents person proposal bounding box. Me: mean pooling layer. Ma: max pooling layer. Fc: a fully connected layer. Pe+Me means that we choose node feature that represents a person proposal bounding box as the output of graph feature fusion function fgraph_fus() and apply a mean pooling layer as the frame-level fusion function fframe_fus()
fg+ ff MPII-Cooking EPIC-Kitchens
Top-1 Acc. Top-5 Acc. Top-1 Acc. Top-5 Acc.
VERB ACTION VERB ACTION
Pe+Me 38.86 66.40 28.83 08.43 75.66 21.77
Pe+Ma 39.02 66.48 28.95 07.55 74.90 20.58
Pe+Fc 38.95 65.73 31.13 08.41 75.82 22.34
Fc+Me 38.00 64.87 29.34 08.41 75.67 22.26
Fc+Ma 38.08 64.52 29.19 08.10 75.51 21.77
Fc+Fc 39.67 66.12 30.42 07.80 75.82 21.32
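The fusion variants compared in Tab.8 can be sketched as follows; placing the person node first is an assumed ordering, and the fully connected variants reuse the 128-node layers of Section 4.2.

```python
import torch

def graph_fusion(nodes, mode="Pe", fc=None):
    """nodes: (K, d) updated node features of one frame; person node assumed first."""
    if mode == "Pe":                       # person node only (already mixes object cues)
        return nodes[0]
    return fc(nodes.reshape(-1))           # "Fc": concatenate all nodes + FC layer

def frame_fusion(frames, mode="Fc", fc=None):
    """frames: (P, d) per-frame graph features."""
    if mode == "Me":                       # mean pooling over frames
        return frames.mean(dim=0)
    if mode == "Ma":                       # max pooling over frames
        return frames.max(dim=0).values
    return fc(frames.reshape(-1))          # "Fc": concatenate frames + FC layer
```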

5 Conclusion

In this paper, we have presented an effective weakly supervised framework for anticipating future actions that considers not only global motion but also local fine-grained features. While labelling local context normally requires tremendous effort, we overcome this issue with a weakly supervised approach for extracting local fine-grained information. Our model further constructs a graph convolutional network over human-object interactions to enrich the anticipation cues. We evaluate the proposed model on two challenging datasets, and the experimental results show its effectiveness under the weakly supervised setting. In the future, we plan to integrate more past actions to obtain richer action sequence information for inferring the next action, and to take a fixed-length observation as input to recognize the current action and predict the future action for more general applications.

Acknowledgements

This work was supported partially by the National Natural Science Foundation of China (NSFC) (Grant Nos. U1911401 and U1811461), Guangdong NSF Project (2020B1515120085, 2018B030312002), Guangzhou Research Project (201902010037), Research Projects of Zhejiang Lab (2019KD0AB03), and the Key-Area Research and Development Program of Guangzhou (202007030004).
1
Mahmud T, Hasan M, Roy-Chowdhury A K. Joint prediction of activity labels and starting times in untrimmed videos. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 5784– 5793

2
Mahmud T, Billah M, Hasan M, Roy-Chowdhury A K. Captioning near-future activity sequences. 2019, arXiv preprint arXiv: 1908.00943

3
Rohrbach M, Amin S, Andriluka M, Schiele B. A database for fine grained activity detection of cooking activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2012, 1194– 1201

4
Baradel F, Neverova N, Wolf C, Mille J, Mori G. Object level visual reasoning in videos. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 106– 122

5
Ryoo M S. Human activity prediction: early recognition of ongoing activities from streaming videos. In: Proceedings of the IEEE International Conference on Computer Vision. 2011, 1036– 1043

6
Xu Z, Qing L, Miao J. Activity auto-completion: predicting human activities from partial videos. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, 3191– 3199

7
Kong Y, Fu Y . Max-margin action prediction machine. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38( 9): 1844– 1858

8
Lan T, Chen T C, Savarese S. A hierarchical representation for future action prediction. In: Proceedings of the 13th European Conference on Computer Vision. 2014, 689– 704

9
Hu J F, Zheng W S, Ma L, Wang G, Lai J. Real-time RGB-D activity prediction by soft regression. In: Proceedings of the 14th European Conference on Computer Vision. 2016, 280– 296

10
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, 4489– 4497

11
Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 4724– 4733

12
Kong Y, Tao Z, Fu Y. Deep sequential context networks for action prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 3662– 3670

13
Qin J, Liu L, Shao L, Ni B, Chen C, Shen F, Wang Y. Binary coding for partial action analysis with limited observation ratios. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6728– 6737

14
Lee D G, Lee S W . Prediction of partially observed human activity based on pre-trained deep representation. Pattern Recognition, 2019, 85: 198– 206

15
Zolfaghari M, Singh K, Brox T. ECO: efficient convolutional network for online video understanding. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 713– 730

16
Singh G, Saha S, Sapienza M, Torr P, Cuzzolin F. Online real-time multiple spatiotemporal action localisation and prediction. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 3657– 3666

17
Lai S, Zheng W S, Hu J F, Zhang J . Global-local temporal saliency action prediction. IEEE Transactions on Image Processing, 2018, 27( 5): 2272– 2285

18
Kong Y, Gao S, Sun B, Fu Y. Action prediction from videos via memorizing hard-to-predict samples. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 2018, 7000– 7007

19
Vondrick C, Pirsiavash H, Torralba A. Anticipating visual representations from unlabeled video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, 98– 106

20
Zhong Y, Zheng W S. Unsupervised learning for forecasting action representations. In: Proceedings of the 25th IEEE International Conference on Image Processing. 2018, 1073– 1077

21
Furnari A, Farinella G M. What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, 6251– 6260

22
Gao J, Yang Z, Nevatia R. RED: reinforced encoder-decoder networks for action anticipation. In: Proceedings of the British Machine Vision Conference. 2017

23
Zeng K H, Shen W B, Huang D A, Sun M, Niebles J C. Visual forecasting by imitating dynamics in natural sequences. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 3018– 3027

24
Ng Y B, Fernando B. Forecasting future action sequences with attention: a new approach to weakly supervised action forecasting. 2019, arXiv preprint arXiv: 1912.04608

25
Pirri F, Mauro L, Alati E, Ntouskos V, Izadpanahkakhk M, Omrani E. Anticipation and next action forecasting in video: an end-to-end model with memory. 2019, arXiv preprint arXiv: 1901.03728

26
Snell J, Swersky K, Zemel R. Prototypical networks for few-shot learning. In: Proceedings of the 31st Conference on Neural Information Processing Systems. 2017, 4080– 4090

27
Farha Y A, Richard A, Gall J. When will you do what?-Anticipating temporal occurrences of activities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 5343– 5352

28
Ke Q, Fritz M, Schiele B. Time-conditioned action anticipation in one shot. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 9917– 9926

29
Wu T Y, Chien T A, Chan C S, Hu C W, Sun M. Anticipating daily intention using on-wrist motion triggered sensing. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 48– 56

30
Sun C, Shrivastava A, Vondrick C, Sukthankar R, Murphy K, Schmid C. Relational action forecasting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 273– 283

31
Zhang J, Elhoseiny M, Cohen S, Chang W, Elgammal A. Relationship proposal networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 5226– 5234

32
Hu H, Gu J, Zhang Z, Dai J, Wei Y. Relation networks for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 3588– 3597

33
Gkioxari G, Girshick R, Dollár P, He K. Detecting and recognizing human-object interactions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 8359– 8367

34
Veličković P, Casanova A, Lio P, Cucurull G, Romero A, Bengio Y. Graph attention networks. In: Proceedings of the 6th International Conference on Learning Representations. 2018

35
Kato K, Li Y, Gupta A. Compositional learning for human object interaction. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 247– 264

36
Wang X, Gupta A. Videos as space-time region graphs. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 413– 431

37
Qi S, Wang W, Jia B, Shen J, Zhu S C. Learning human-object interactions by graph parsing neural networks. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 407– 423

38
Zhang Q, Chang J, Meng G, Xu S, Xiang S, Pan C . Learning graph structure via graph convolutional networks. Pattern Recognition, 2019, 95: 308– 318

39
Chao Y W, Yang J, Price B, Cohen S, Deng J. Forecasting human dynamics from static images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 3643– 3651

40
Li C, Zhang Z, Lee W S, Lee G H. Convolutional sequence to sequence model for human dynamics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 5226– 5234

41
Martinez J, Black M J, Romero J. On human motion prediction using recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 4674– 4683

42
Bütepage J, Black M J, Kragic D, Kjellström H. Deep representation learning for human motion prediction and classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 1591– 1599

43
Bloom V, Argyriou V, Makris D . Linear latent low dimensional space for online early action recognition and prediction. Pattern Recognition, 2017, 72: 532– 547

44
Redmon J, Farhadi A. YOLOv3: an incremental improvement. 2018, arXiv preprint arXiv: 1804.02767

45
Fang H S, Xie S, Tai Y W, Lu C. RMPE: regional multi-person pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 2353– 2362

46
Xiu Y, Li J, Wang H, Fang Y, Lu C. Pose flow: efficient online pose tracking. In: Proceedings of the British Machine Vision Conference. 2018

47
Dalal N, Triggs B, Schmid C. Human detection using oriented histograms of flow and appearance. In: Proceedings of the 9th European Conference on Computer Vision. 2006, 428– 441
