Large sequence models for sequential decision-making: a survey

Muning WEN, Runji LIN, Hanjing WANG, Yaodong YANG, Ying WEN, Luo MAI, Jun WANG, Haifeng ZHANG, Weinan ZHANG

Front. Comput. Sci., 2023, 17(6): 176349. DOI: 10.1007/s11704-023-2689-5
Excellent Young Computer Scientists Forum
REVIEW ARTICLE


Abstract

Transformer architectures have facilitated the development of large-scale and general-purpose sequence models for prediction tasks in natural language processing and computer vision, e.g., GPT-3 and Swin Transformer. Although originally designed for prediction problems, it is natural to ask whether such models also suit sequential decision-making and reinforcement learning (RL) problems, which are typically beset by long-standing issues of sample efficiency, credit assignment, and partial observability. In recent years, sequence models, especially the Transformer, have attracted increasing interest in the RL community, spawning numerous approaches with notable effectiveness and generalizability. This survey presents a comprehensive overview of recent works that solve sequential decision-making tasks with sequence models such as the Transformer: it discusses the connection between sequential decision-making and sequence modeling, and categorizes existing methods by how they utilize the Transformer. Moreover, the paper puts forth potential avenues for future research intended to improve the effectiveness of large sequence models for sequential decision-making, encompassing theoretical foundations, network architectures, algorithms, and efficient training systems.
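To make the central framing concrete, the sketch below recasts an RL trajectory as a token sequence for a causal Transformer, in the spirit of the Decision Transformer line of work covered by this survey: each timestep contributes three tokens (return-to-go, state, action), and actions are predicted autoregressively from the state-token positions. This is a minimal PyTorch illustration, not any surveyed method's actual implementation; all names and dimensions (TrajectorySequenceModel, d_model, and so on) are assumptions chosen for the example.

import torch
import torch.nn as nn

class TrajectorySequenceModel(nn.Module):
    # Illustrative sketch of "RL as sequence modeling", Decision-Transformer style.
    def __init__(self, state_dim, act_dim, d_model=128, n_layers=3, n_heads=4, max_len=64):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)            # return-to-go "tokenizer"
        self.embed_state = nn.Linear(state_dim, d_model)  # state "tokenizer"
        self.embed_action = nn.Linear(act_dim, d_model)   # action "tokenizer"
        self.embed_timestep = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        B, T = states.shape[0], states.shape[1]
        t_emb = self.embed_timestep(timesteps)  # shared per-timestep positional embedding
        # Interleave tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...).
        tokens = torch.stack(
            [self.embed_rtg(rtg) + t_emb,
             self.embed_state(states) + t_emb,
             self.embed_action(actions) + t_emb],
            dim=2,
        ).reshape(B, 3 * T, -1)
        # Causal mask: every token attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(3 * T)
        h = self.transformer(tokens, mask=mask)
        # Read out action predictions at the state-token positions (indices 1, 4, 7, ...).
        return self.predict_action(h[:, 1::3, :])

# Smoke test on random data; in practice one would train with, e.g., an MSE loss
# against logged actions and, at evaluation time, condition on a desired return-to-go.
model = TrajectorySequenceModel(state_dim=17, act_dim=6)
rtg = torch.randn(2, 8, 1)
states = torch.randn(2, 8, 17)
actions = torch.randn(2, 8, 6)
timesteps = torch.arange(8).unsqueeze(0).expand(2, 8)
print(model(rtg, states, actions, timesteps).shape)  # torch.Size([2, 8, 6])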


Keywords

sequential decision-making / sequence modeling / the Transformer / training system

Cite this article

Muning WEN, Runji LIN, Hanjing WANG, Yaodong YANG, Ying WEN, Luo MAI, Jun WANG, Haifeng ZHANG, Weinan ZHANG. Large sequence models for sequential decision-making: a survey. Front. Comput. Sci., 2023, 17(6): 176349 https://doi.org/10.1007/s11704-023-2689-5

Muning Wen is currently working toward his PhD degree at Shanghai Jiao Tong University, China. His research interests include reinforcement learning and multi-agent systems. He has served as a reviewer for NeurIPS.

Runji Lin is currently pursuing his MSc degree at the School of Artificial Intelligence, University of Chinese Academy of Sciences, China. His research interests include reinforcement learning, multi-agent systems, and game theory.

Hanjing Wang is currently a PhD candidate at Shanghai Jiao Tong University, China. His research interests include scalable reinforcement learning and machine learning systems.

Yaodong Yang is currently an assistant professor at Peking University, China. His research focuses on reinforcement learning and multi-agent systems. He has maintained a track record of more than forty publications at top conferences and journals, along with the best system paper award at CoRL 2020 and the best blue-sky paper award at AAMAS 2021. Before joining Peking University, he was an assistant professor at King's College London (KCL); before KCL, he was a principal research scientist at Huawei UK.

Ying Wen is a tenure-track assistant professor at the John Hopcroft Center for Computer Science, Shanghai Jiao Tong University, China. His research interests include machine learning, multi-agent systems, and human-centered interactive systems. He has served as a PC member for ICML, NeurIPS, ICLR, AAAI, IJCAI, and ICAPS, and as a reviewer for TIFS, Operational Research, etc. He received the Best Paper Award in the AAMAS 2021 Blue Sky Track and the Best System Paper Award at CoRL 2020.

Luo Mai is an assistant professor (UK Lecturer) in the School of Informatics at the University of Edinburgh, UK. He is a member of the Institute of Computing Systems Architecture, where he leads the Edinburgh System-X Group. His research group designs scalable, adaptive, and efficient system software to support emerging data-centric applications and utilize novel computing platforms.

Jun Wang is a chair professor of Computer Science at University College London, UK, and the founding director of the MSc in Web Science and Big Data Analytics. His main research interests lie in AI and intelligent systems, including (multi-agent) reinforcement learning, deep generative models, and their diverse applications in information retrieval, recommender systems and personalization, data mining, smart cities, bot planning, computational advertising, etc. He has served as an area chair for ACM CIKM and ACM SIGIR.

Haifeng Zhang is an associate professor at the Institute of Automation, Chinese Academy of Sciences (CASIA), China. His research areas include reinforcement learning, game AI, game theory, and computational advertising. He has published research papers at international conferences including ICML, NeurIPS, AAAI, IJCAI, and AAMAS. He has served as a reviewer for AAAI, IJCAI, TNNLS, and Acta Automatica Sinica, and as a co-chair for the IJCAI competition, IJTCS, the DAI Workshop, etc.

Weinan Zhang is an associate professor at Shanghai Jiao Tong University, China. His research interests include reinforcement learning, deep learning, and data science with various real-world applications. He has published over 150 research papers at international conferences and in journals, and has served as an area chair or (senior) PC member for ICML, NeurIPS, ICLR, KDD, AAAI, IJCAI, SIGIR, etc., and as a reviewer for JMLR, TOIS, TKDE, TIST, etc.


Acknowledgements

The SJTU team was partially supported by the "New Generation of AI 2030" Major Project (2018AAA0100900), the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and the National Natural Science Foundation of China (Grant No. 62076161). Muning Wen is supported by the Wu Wen Jun Honorary Scholarship, AI Institute, Shanghai Jiao Tong University.

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.

RIGHTS & PERMISSIONS

© 2023 Higher Education Press