Duet of ViT and CNN: multi-scale dual-branch network for fine-grained image classification of marine organisms

Guangzhe Si; Zhaorui Gu; Haiyong Zheng

doi:10.1007/s44295-023-00019-8

Intelligent Marine Technology and Systems, 2024, Vol. 2, Issue 1. DOI: 10.1007/s44295-023-00019-8

Research Paper

Abstract

Fine-grained image classification of marine organisms involves distinguishing subcategories within a larger category, for instance specific species of fish or types of algae. This task is harder than ordinary image classification because the subtle feature differences between subcategories are often concentrated in one or a few specific areas. Accurately identifying these critical regions and making effective use of local features are therefore crucial to improving accuracy. Existing methods for fine-grained image classification primarily rely on single-branch models based on either convolutional neural networks (CNNs) or vision transformers (ViTs). CNNs tend to be strong at extracting local detail, whereas ViTs are better at modelling global context, so merging them allows for a more comprehensive understanding of marine organism images. In addition, marine organism images are affected by the distance and angle of the shot, which makes it difficult to capture detailed local nuances at a single scale. To address these challenges, we propose a multi-scale dual-branch network (MSDBN) that combines the strengths of ViT and CNN for fine-grained image classification of marine organisms. Our model uses a novel two-stage selection module to select discriminative regions from the ViT branch. The CNN branch then performs more detailed feature extraction on these local regions. To make effective use of the multi-scale information in marine organism images, we design a multi-scale shift-window self-attention for the ViT branch. MSDBN outperforms both existing classical methods and the best-performing dual-branch methods on three marine datasets. Our code is released publicly at https://github.com/Xiaosigz/MSDBN.
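The general idea of handing attention-selected regions to a second branch can be illustrated with a toy example. The sketch below is not the MSDBN two-stage selection module, whose details are not given in the abstract; it is a minimal, assumed stand-in that ranks patches by a hypothetical per-patch attention score, takes the bounding box around the top-k patches, and crops that region so a CNN-style branch could process it. All function names, the grid size, and the scores are invented for illustration.

```python
from typing import List, Tuple


def top_k_patches(scores: List[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring patches (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def patch_bbox(indices: List[int], grid_w: int, patch: int) -> Tuple[int, int, int, int]:
    """Pixel bounding box (top, left, bottom, right) that encloses the given patches."""
    rows = [i // grid_w for i in indices]
    cols = [i % grid_w for i in indices]
    return (min(rows) * patch, min(cols) * patch,
            (max(rows) + 1) * patch, (max(cols) + 1) * patch)


def crop(image: List[List[float]], box: Tuple[int, int, int, int]) -> List[List[float]]:
    """Cut the boxed region out of a 2-D image stored as a list of rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Toy input: an 8x8 single-channel "image" split into a 4x4 grid of 2x2 patches.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]

# Hypothetical attention weights, one per patch (assumed, not from the paper).
scores = [0.0] * 16
scores[5], scores[6], scores[10] = 0.9, 0.7, 0.8

selected = top_k_patches(scores, k=3)          # patches judged most discriminative
box = patch_bbox(selected, grid_w=4, patch=2)  # region enclosing those patches
region = crop(image, box)                      # local region for a second branch
```

In a real dual-branch model the scores would come from the transformer's attention maps and the cropped region would typically be resized before being fed to the CNN; both steps are omitted here to keep the sketch dependency-free.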

Keywords

Fine-grained image classification / Marine organisms / Convolutional neural network / Vision transformer / Dual-branch / Multi-scale

Cite this article

Guangzhe Si, Zhaorui Gu, Haiyong Zheng. Duet of ViT and CNN: multi-scale dual-branch network for fine-grained image classification of marine organisms. Intelligent Marine Technology and Systems, 2024, 2(1). DOI: 10.1007/s44295-023-00019-8
