Duet of ViT and CNN: multi-scale dual-branch network for fine-grained image classification of marine organisms
Guangzhe Si, Zhaorui Gu, Haiyong Zheng
Intelligent Marine Technology and Systems, 2024, Vol. 2, Issue 1
Fine-grained image classification of marine organisms involves distinguishing subcategories within a larger category, for instance specific species of fish or types of algae. This task is more intricate than regular image classification because the minor feature differences between subcategories are often concentrated in one or a few specific areas. Accurately identifying these critical regions and effectively using local features are therefore crucial to improving the accuracy of fine-grained image classification. Existing methods for fine-grained image classification primarily rely on single-branch models based on convolutional neural networks (CNNs) or vision transformers (ViTs). CNNs are generally strong at extracting local features, whereas ViTs are better at modelling global relationships, so the two are complementary. Consequently, merging them allows for a more comprehensive understanding of marine organism images. In addition, marine organism images are affected by the distance and angle of the shot, making it challenging to capture detailed local nuances at a single scale. To address these challenges, we propose a multi-scale dual-branch network (MSDBN) that combines the strengths of ViT and CNN for fine-grained image classification of marine organisms. Our model uses a novel two-stage selection module to select discriminative regions from the ViT branch. The CNN branch then performs a more detailed feature extraction on these local regions. To effectively utilise the multi-scale information of marine organisms, we introduce a multi-scale shift-window self-attention designed specifically for the ViT branch. MSDBN demonstrates improved performance compared to existing classical methods and the best-performing dual-branch methods on three marine datasets. Our code is released publicly at https://github.com/Xiaosigz/MSDBN.
Fine-grained image classification / Marine organisms / Convolutional neural network / Vision transformer / Dual-branch / Multi-scale
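The abstract describes a ViT branch that selects discriminative regions, which a CNN branch then examines in more detail. As a rough illustration of that general idea only, the sketch below takes a grid of per-patch importance scores, keeps the highest-scoring patches, and crops the pixel bounding box around them. The function names `select_region` and `crop`, the toy scores, and the top-k bounding-box rule are all assumptions made for illustration; they are not taken from the paper, whose two-stage selection module is not detailed here.

```python
def select_region(scores, patch, k=4):
    """Return the pixel box (y0, y1, x0, x1) around the k highest-scoring patches.

    scores: 2D list of per-patch importance values (hypothetical input,
            for example attention weights aggregated per patch).
    patch:  side length of one square patch in pixels.
    """
    # Flatten the grid into (score, row, col) triples and keep the top k.
    flat = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    top = sorted(flat, reverse=True)[:k]
    rows = [r for _, r, _ in top]
    cols = [c for _, _, c in top]
    # Convert the patch-index extent into pixel coordinates.
    return (min(rows) * patch, (max(rows) + 1) * patch,
            min(cols) * patch, (max(cols) + 1) * patch)


def crop(image, box):
    """Crop a 2D list-of-rows image to the given pixel box."""
    y0, y1, x0, x1 = box
    return [row[x0:x1] for row in image[y0:y1]]


# Toy example: a 64x64 image split into a 4x4 grid of 16-pixel patches,
# with the four central patches scored highest (assumed values).
image = [[0] * 64 for _ in range(64)]
scores = [[0.0] * 4 for _ in range(4)]
scores[1][1], scores[1][2], scores[2][1], scores[2][2] = 0.9, 0.8, 0.7, 0.6

box = select_region(scores, patch=16, k=4)
region = crop(image, box)  # this crop would be passed to a local (CNN) branch
```

In a dual-branch setting, a crop like `region` would typically be resized and fed to the local branch, while the full image continues through the global branch.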
National Natural Science Foundation of China (62171421)
TaiShan Scholars Youth Expert Program of Shandong Province (tsqn202306096)