
Developing ChatGPT for biology and medicine: a complete review of biomedical question answering
Qing Li, Lei Li, Yu Li
Biophysics Reports, 2024, Vol. 10, Issue 3: 152–171.
ChatGPT exemplifies a strategic blueprint for question answering (QA) that delivers medical diagnoses, treatment recommendations, and other healthcare support. This is achieved through the growing incorporation of medical domain data via natural language processing (NLP) and multimodal paradigms. By shifting the distribution of text, images, videos, and other modalities from the general domain to the medical domain, these techniques have accelerated progress in medical domain question answering (MDQA). They bridge the gap between human natural language and sophisticated medical domain knowledge or expert-provided manual annotations, and they handle large-scale, diverse, imbalanced, or even unlabeled data in medical contexts. Our central focus is the use of language models and multimodal paradigms for medical question answering, with the aim of guiding the research community toward mechanisms suited to their specific medical research needs. Specialized unimodal tasks such as question answering, reading comprehension, reasoning, diagnosis, relation extraction, and probability modeling, as well as multimodal tasks such as visual question answering, image captioning, cross-modal retrieval, and report summarization and generation, are discussed in detail, with each section examining the specifics of the method under consideration. This paper contrasts the architectures and advances of medical-domain approaches with general-domain methods, emphasizing their applications across different tasks and datasets, and outlines current challenges and opportunities for future medical-domain research, paving the way for continued innovation in this rapidly evolving field. This comprehensive review thus serves both as an academic resource and as a roadmap for future research and application in medical question answering.
ChatGPT / Medical question answering / Natural language processing / Multimodal paradigms / Large language models
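
To make the domain-transfer recipe described above concrete, the following minimal sketch (our illustration, not drawn from any specific system surveyed in the paper) continues training a general-domain causal language model on medical question-answer pairs with the Hugging Face transformers library. The base model, the toy QA pair, and the hyperparameters are illustrative assumptions only.

    # Minimal sketch: adapting a general-domain language model to medical QA
    # by continued training on question-answer pairs. All names below
    # (model, data, learning rate) are placeholders, not the paper's recipe.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in for any general-domain base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Hypothetical medical QA pair; real pipelines train on large
    # biomedical corpora and benchmarks such as PubMedQA or MedQA.
    pairs = [
        ("What does an elevated troponin level suggest?",
         "It suggests myocardial injury, such as myocardial infarction."),
    ]

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()
    for question, answer in pairs:
        text = f"Question: {question}\nAnswer: {answer}{tokenizer.eos_token}"
        batch = tokenizer(text, return_tensors="pt", padding=True)
        # Causal LM loss: passing input_ids as labels makes the model
        # learn to continue medical questions with domain-specific answers.
        out = model(**batch, labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

In practice, the systems surveyed in the review scale this recipe to large biomedical corpora and extend it with instruction tuning, human feedback, or multimodal encoders for tasks such as visual question answering and report generation.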