Vision-language model-based human-robot collaboration for smart manufacturing: A state-of-the-art survey

Junming FAN, Yue YIN, Tian WANG, Wenhang DONG, Pai ZHENG, Lihui WANG

Front. Eng, 2025, Vol. 12, Issue (1): 177–200. DOI: 10.1007/s42524-025-4136-9
Industrial Engineering and Intelligent Manufacturing
REVIEW ARTICLE


Abstract

Human–robot collaboration (HRC) is set to transform the manufacturing paradigm by combining human flexibility with robot precision. Recent breakthroughs in Large Language Models (LLMs) and Vision-Language Models (VLMs) have motivated preliminary explorations and adoptions of these models in the smart manufacturing field. However, despite considerable effort, existing research has mainly focused on individual components and lacks a comprehensive perspective on the full potential of VLMs, especially for HRC in smart manufacturing scenarios. To fill this gap, this work offers a systematic review of the latest advancements and applications of VLMs in HRC for smart manufacturing. It covers the fundamental architectures and pretraining methodologies of LLMs and VLMs, their applications in robotic task planning, navigation, and manipulation, and their role in enhancing human–robot skill transfer through multimodal data integration. Lastly, the paper discusses current limitations and future research directions for VLM-based HRC, highlighting the trend toward fully realizing the potential of these technologies in smart manufacturing.
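As a concrete illustration of the contrastive image-text pretraining objective that underpins many of the VLMs surveyed here (CLIP-style alignment of visual and language embeddings), the following is a minimal sketch, assuming PyTorch; the embedding dimension, batch size, and temperature are illustrative placeholders rather than values from any specific model, and the random tensors stand in for real encoder outputs.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over a
# batch of matched image-text pairs. Any vision/text backbones that project
# into a shared embedding space could supply image_emb and text_emb.
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for N matched (image, text) pairs."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # N x N similarity matrix; the diagonal holds the matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


# Random embeddings standing in for encoder outputs (batch of 8, dim 512).
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(contrastive_loss(img, txt).item())
```

During pretraining, minimizing this loss pulls each image embedding toward its paired caption and pushes it away from the other captions in the batch, which is what later enables open-vocabulary grounding of language instructions in robotic perception.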

Keywords

vision-language models / large language models / human–robot collaboration / smart manufacturing

Cite this article

Junming FAN, Yue YIN, Tian WANG, Wenhang DONG, Pai ZHENG, Lihui WANG. Vision-language model-based human-robot collaboration for smart manufacturing: A state-of-the-art survey. Front. Eng, 2025, 12(1): 177–200. DOI: 10.1007/s42524-025-4136-9



RIGHTS & PERMISSIONS

The Author(s). This article is published with open access at link.springer.com and journal.hep.com.cn.
