1 Introduction
Fig.1 Weakly supervised learning setting without object annotations. The left model [1] operates under a strongly supervised setting and integrates context labels directly, such as the tools, ingredients, and containers observed in the action. The right model, our weakly supervised setting, uses context cues without elaborate labels
2 Related work
3 Methodology
3.1 Global motion learning
3.2 Weakly supervised learning for local fine-grained features
3.2.1 Appearance relation learning
3.2.2 Human skeleton feature
3.3 Overall
4 Experiments
4.1 Datasets
4.2 Implementation details
4.3 Experiment on MPII-Cooking dataset
4.3.1 Compared methods
4.3.2 Results
4.3.3 Ablation study
Tab.2 Ablation study on the MPII-Cooking dataset
Components | Top-1 accuracy/% | Top-5 accuracy/% |
---|---|---|
I3D | 34.49 | 60.29 |
POSE | 37.35 | 64.27 |
GAT | 37.37 | 60.74 |
I3D-POSE | 37.43 | 64.13 |
I3D-GAT | 37.71 | 65.02 |
GAT-POSE | 38.42 | 64.11 |
Ours (full model) | 39.67 | 66.12 |
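The Top-1 and Top-5 accuracies reported above follow the usual definition: a sample counts as correct if its true class is among the k highest-scoring predictions. The snippet below is a minimal sketch of that metric, not the authors' evaluation code; the function name `top_k_accuracy` and the toy scores and labels are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Percentage of samples whose true label is among the k top-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in ranked:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy data (hypothetical): 4 samples, 6 classes.
toy_scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],
    [0.05, 0.10, 0.10, 0.10, 0.15, 0.50],
    [0.20, 0.20, 0.10, 0.40, 0.05, 0.05],
]
toy_labels = [1, 2, 0, 3]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top5 = top_k_accuracy(toy_scores, toy_labels, k=5)
```

On this toy data, two of four samples are correct at k=1 and three of four at k=5, which illustrates why Top-5 accuracy is never lower than Top-1 accuracy for the same model.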
4.4 Experiment on EPIC-Kitchens dataset
4.4.1 Compared methods
4.4.2 Results
Tab.3 Results on the EPIC-Kitchens dataset
Method | Top-1 accuracy (VERB) | Top-1 accuracy (ACTION) | Top-5 accuracy (VERB) | Top-5 accuracy (ACTION) |
---|---|---|---|---|
Contexts (I3D) | 27.41 | 05.26 | 73.01 | 15.22 |
I3D | 27.82 | 05.62 | 74.22 | 16.89 |
GAT | 29.29 | 07.46 | 74.88 | 20.67 |
Ours | 31.13 | 08.41 | 75.82 | 22.34 |
Tab.4 Results on the EPIC-Kitchens test sets
Test set | Method | Top-1 accuracy (VERB) | Top-1 accuracy (ACTION) | Top-5 accuracy (VERB) | Top-5 accuracy (ACTION) |
---|---|---|---|---|---|
S1 | Contexts (I3D) | 27.40 | 04.96 | 70.57 | 15.09 |
S1 | I3D | 28.42 | 05.59 | 72.23 | 16.20 |
S1 | GAT | 29.89 | 06.33 | 74.36 | 19.16 |
S1 | Ours | 31.27 | 08.13 | 74.86 | 21.54 |
S2 | Contexts (I3D) | 23.52 | 03.79 | 62.92 | 10.07 |
S2 | I3D | 24.96 | 04.03 | 66.00 | 11.98 |
S2 | GAT | 25.50 | 04.44 | 65.79 | 11.78 |
S2 | Ours | 27.04 | 05.50 | 67.12 | 14.27 |
4.5 Effect of model settings
4.5.1 Effect of frame number selection on relation learning
Tab.5 Results of different frame selections on the two datasets. P: the number of frames selected for appearance relation graph learning
P | MPII-Cooking Top-1 Acc. | MPII-Cooking Top-5 Acc. | EPIC-Kitchens Top-1 Acc. (VERB) | EPIC-Kitchens Top-1 Acc. (ACTION) | EPIC-Kitchens Top-5 Acc. (VERB) | EPIC-Kitchens Top-5 Acc. (ACTION) |
---|---|---|---|---|---|---|
1 | 37.41 | 64.73 | 28.36 | 06.78 | 74.61 | 18.63 |
3 | 39.67 | 66.12 | 31.13 | 08.41 | 75.82 | 22.34 |
5 | 38.05 | 65.06 | 30.49 | 07.84 | 75.34 | 21.31 |
7 | 37.46 | 63.60 | 30.55 | 08.18 | 75.99 | 22.13 |
9 | 37.95 | 63.85 | 30.56 | 08.04 | 75.76 | 22.16 |
4.5.2 Effect of proposal number selection
Tab.6 Results of different proposal numbers on the two datasets. K: the number of proposals per frame used for appearance relation graph learning. Obj: the proposal appearance features are concatenated in place of the relational graph in each frame of our framework
K | MPII-Cooking Top-1 Acc. | MPII-Cooking Top-5 Acc. | EPIC-Kitchens Top-1 Acc. (VERB) | EPIC-Kitchens Top-1 Acc. (ACTION) | EPIC-Kitchens Top-5 Acc. (VERB) | EPIC-Kitchens Top-5 Acc. (ACTION) |
---|---|---|---|---|---|---|
3 | 37.64 | 65.34 | 31.13 | 08.41 | 75.82 | 22.34 |
5 | 39.67 | 66.12 | 30.27 | 07.68 | 75.68 | 21.10 |
7 | 38.20 | 66.00 | 30.64 | 08.06 | 75.78 | 21.76 |
Obj | 38.55 | 65.42 | 29.46 | 06.77 | 74.88 | 18.79 |
4.5.3 Effect of modality selection
Tab.7 Results of different modality selections on the two datasets. R: RGB modality. F: optical flow modality
Modality | MPII-Cooking Top-1 Acc. | MPII-Cooking Top-5 Acc. | EPIC-Kitchens Top-1 Acc. (VERB) | EPIC-Kitchens Top-1 Acc. (ACTION) | EPIC-Kitchens Top-5 Acc. (VERB) | EPIC-Kitchens Top-5 Acc. (ACTION) |
---|---|---|---|---|---|---|
R | 38.65 | 65.36 | 30.24 | 08.03 | 76.07 | 22.05 |
F | 39.67 | 66.12 | 31.13 | 08.41 | 75.82 | 22.34 |
R+F | 38.58 | 65.69 | 30.47 | 08.45 | 75.90 | 22.15 |
4.5.4 Effect of fusion function selection
Tab.8 Results of different fusion methods on the two datasets. Pe: the node feature of the person proposal bounding box. Me: mean pooling layer. Ma: max pooling layer. Fc: fully connected layer. Each entry is written as graph feature fusion function + frame-level fusion function; for example, Pe+Me uses the person proposal node feature as the output of the graph feature fusion function and applies a mean pooling layer as the frame-level fusion function
Fusion | MPII-Cooking Top-1 Acc. | MPII-Cooking Top-5 Acc. | EPIC-Kitchens Top-1 Acc. (VERB) | EPIC-Kitchens Top-1 Acc. (ACTION) | EPIC-Kitchens Top-5 Acc. (VERB) | EPIC-Kitchens Top-5 Acc. (ACTION) |
---|---|---|---|---|---|---|
Pe+Me | 38.86 | 66.40 | 28.83 | 08.43 | 75.66 | 21.77 |
Pe+Ma | 39.02 | 66.48 | 28.95 | 07.55 | 74.90 | 20.58 |
Pe+Fc | 38.95 | 65.73 | 31.13 | 08.41 | 75.82 | 22.34 |
Fc+Me | 38.00 | 64.87 | 29.34 | 08.41 | 75.67 | 22.26 |
Fc+Ma | 38.08 | 64.52 | 29.19 | 08.10 | 75.51 | 21.77 |
Fc+Fc | 39.67 | 66.12 | 30.42 | 07.80 | 75.82 | 21.32 |
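To make the Pe, Me, and Ma options in the Tab.8 caption concrete, the following is a minimal sketch of how the two fusion stages could compose. It is an illustration under stated assumptions, not the authors' implementation: the function names, the convention that the person proposal is node 0, and the toy feature values are all hypothetical. The Fc variants are omitted because they depend on learned weights.

```python
from typing import List

Vector = List[float]


def person_node_fusion(node_features: List[Vector], person_index: int = 0) -> Vector:
    """Pe: take the person proposal's node feature as the graph-level feature.

    Assumption: the person proposal sits at `person_index` (0 here).
    """
    return list(node_features[person_index])


def mean_pool(frame_features: List[Vector]) -> Vector:
    """Me: element-wise mean over the per-frame features."""
    n = len(frame_features)
    return [sum(dim) / n for dim in zip(*frame_features)]


def max_pool(frame_features: List[Vector]) -> Vector:
    """Ma: element-wise maximum over the per-frame features."""
    return [max(dim) for dim in zip(*frame_features)]


# Toy input (hypothetical): 3 frames, 3 proposal nodes per frame, 2-dim features.
frames = [
    [[1.0, 2.0], [0.0, 0.0], [5.0, 5.0]],
    [[3.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[2.0, 4.0], [9.0, 9.0], [0.0, 1.0]],
]

# Stage 1: graph feature fusion, one vector per frame.
per_frame = [person_node_fusion(nodes) for nodes in frames]

# Stage 2: frame-level fusion, one vector per clip.
pe_me = mean_pool(per_frame)
pe_ma = max_pool(per_frame)
```

Here `pe_me` corresponds to the Pe+Me setting and `pe_ma` to Pe+Ma; swapping the stage-1 or stage-2 function is all that distinguishes the rows of the table in this simplified view.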