Patching the visual ability of large multimodal models by collaborating with small models

Hao LIANG; Xiaolong ZHANG; Meina KAN; Shiguang SHAN; Xilin CHEN

Front. Comput. Sci., 2026, Vol. 20, Issue 9: 2009705. DOI: 10.1007/s11704-025-41126-5

Image and Graphics

RESEARCH ARTICLE

Abstract

Large multimodal models (LMMs) have demonstrated significant success across various tasks, but they still fall short on some basic visual functions: for example, they count objects inaccurately and localize them imprecisely. These limitations restrict the range of scenarios in which LMMs can be applied. To enhance the capabilities of LMMs, we propose a novel method that patches their visual perceptual abilities through collaboration with small task-specific models. Our method first uses an LMM to decompose the user query into a series of visual functions. For each function, the appropriate model, either the LMM itself or a small task-specific model, is invoked. To determine whether to patch the LMM with a small task-specific model, we design a novel question-answering-based reinforcement learning strategy that optimizes the decision process. Finally, the LMM generates the answer from the visual perceptual results. The proposed method is evaluated on two standard visual question-answering datasets and two specialized datasets. The experimental results demonstrate that our method effectively enhances the visual abilities of LMMs.
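
The control flow described in the abstract (decompose the query, choose a model per visual function, then let the LMM answer) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `VisualFunction`, `answer_query`, and the `toy_*` stand-ins are hypothetical, and the learned patching decision is replaced by a hand-written rule. It is not the authors' implementation and does not include the reinforcement learning step.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class VisualFunction:
    """One perceptual sub-task obtained by decomposing the query (hypothetical type)."""
    name: str      # e.g. "count" or "describe"
    argument: str  # e.g. the object category the function applies to


# A perceiver maps (visual function, image) to a textual perceptual result.
Perceiver = Callable[[VisualFunction, Any], str]


def answer_query(
    query: str,
    image: Any,
    decompose: Callable[[str], List[VisualFunction]],
    use_small_model: Callable[[VisualFunction, Any], bool],
    lmm_perceive: Perceiver,
    small_models: Dict[str, Perceiver],
    lmm_answer: Callable[[str, List[Tuple[str, str]]], str],
) -> str:
    """Route each visual function to the LMM or a small model, then answer."""
    results: List[Tuple[str, str]] = []
    for fn in decompose(query):
        # Patch only when a small model exists for this function AND the
        # decision policy (a stub here, learned in the paper) says to use it.
        patched = fn.name in small_models and use_small_model(fn, image)
        model = small_models[fn.name] if patched else lmm_perceive
        results.append((fn.name, model(fn, image)))
    # The LMM composes the final answer from the collected perceptual results.
    return lmm_answer(query, results)


# --- Toy stand-ins (assumptions, not real models) ---------------------------
def toy_decompose(query: str) -> List[VisualFunction]:
    return [VisualFunction("count", "apple"), VisualFunction("describe", "scene")]


def toy_policy(fn: VisualFunction, image: Any) -> bool:
    # Hand-written rule standing in for the learned decision process.
    return fn.name == "count"


def toy_lmm_perceive(fn: VisualFunction, image: Any) -> str:
    return f"lmm:{fn.name}({fn.argument})"


toy_small_models: Dict[str, Perceiver] = {
    "count": lambda fn, image: f"small:{fn.name}({fn.argument})",
}


def toy_lmm_answer(query: str, results: List[Tuple[str, str]]) -> str:
    return "; ".join(text for _, text in results)
```

Calling `answer_query("How many apples are there?", None, toy_decompose, toy_policy, toy_lmm_perceive, toy_small_models, toy_lmm_answer)` sends the counting function to the small-model stub and leaves the description to the LMM stub. Passing the decision as a callable keeps the routing loop unchanged if a trained policy is substituted for the toy rule.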

Keywords

model collaboration / patching visual ability / large multimodal models

Cite this article

Hao LIANG, Xiaolong ZHANG, Meina KAN, Shiguang SHAN, Xilin CHEN. Patching the visual ability of large multimodal models by collaborating with small models. Front. Comput. Sci., 2026, 20(9): 2009705. DOI: 10.1007/s11704-025-41126-5
