Model gradient: unified model and policy learning in model-based reinforcement learning
Chengxing JIA, Fuxiang ZHANG, Tian XU, Jing-Cheng PANG, Zongzhang ZHANG, Yang YU
Model-based reinforcement learning is a promising direction for improving the sample efficiency of reinforcement learning by learning a model of the environment. Previous model learning methods aim to fit the transition data and commonly employ supervised learning to minimize the distance between the predicted state and the real state. Such supervised model learning, however, diverges from the ultimate goal of model learning, i.e., optimizing the policy that is learned inside the model. In this work, we investigate how model learning and policy learning can share the same objective of maximizing the expected return in the real environment. We find that model learning towards this objective yields a target of enhancing the similarity between the policy gradient estimated on model-generated data and the gradient estimated on real data. We thus derive the gradient of the model from this target and propose the Model Gradient algorithm (MG), which integrates this model learning approach with policy-gradient-based policy optimization. Experiments on multiple locomotion control tasks show that MG not only achieves high sample efficiency but also leads to better convergence performance than traditional model-based reinforcement learning approaches.
reinforcement learning / model-based reinforcement learning / Markov decision process
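For readers who want a concrete picture of the gradient-matching objective sketched in the abstract, the following minimal PyTorch snippet is our own illustration, not the authors' algorithm or released code: it trains a stand-in reward model so that a one-step REINFORCE-style policy gradient computed with model-predicted returns aligns, in cosine similarity, with the gradient computed from real returns. The network sizes, the one-step reward-only surrogate, and the cosine-similarity loss are all assumptions made for brevity; the full MG algorithm derives the model gradient from the expected-return objective and works with complete model-generated rollouts.

```python
# Illustrative sketch only: a one-step, reward-model-only variant of gradient matching.
# Names, dimensions, and the cosine-similarity loss are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM = 4, 2

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, ACTION_DIM))
# A reward model stands in for the full dynamics model to keep the sketch short.
reward_model = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
model_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)


def policy_grad_vector(states, actions, returns):
    """Flattened REINFORCE-style gradient of the policy surrogate w.r.t. policy parameters."""
    log_prob = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    surrogate = (log_prob * returns.squeeze(-1)).mean()
    grads = torch.autograd.grad(surrogate, list(policy.parameters()), create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])


# Synthetic stand-in for a batch of real transitions and their Monte-Carlo returns.
states = torch.randn(32, STATE_DIM)
actions = torch.randint(0, ACTION_DIM, (32,))
real_returns = torch.randn(32, 1)

# Gradient estimated on real data (treated as the fixed target).
g_real = policy_grad_vector(states, actions, real_returns).detach()

# Gradient estimated with model-predicted returns; it stays differentiable w.r.t. the model.
model_returns = reward_model(torch.cat([states, F.one_hot(actions, ACTION_DIM).float()], dim=-1))
g_model = policy_grad_vector(states, actions, model_returns)

# Model update: increase the similarity between the two policy gradients.
model_loss = 1.0 - F.cosine_similarity(g_real, g_model, dim=0)
model_opt.zero_grad()
model_loss.backward()
model_opt.step()
```

The point the sketch preserves is that the model's prediction stays inside the computation graph of the policy gradient, so the similarity loss can be differentiated back into the model parameters; this is a second-order gradient, obtained here via create_graph=True.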
Chengxing Jia received a BSc degree from Shandong University, China in 2020. He is working towards a PhD in the National Key Lab for Novel Software Technology at the School of Artificial Intelligence, Nanjing University, China. His research interests include model-based reinforcement learning, offline reinforcement learning and meta reinforcement learning. He has served as a reviewer for NeurIPS, ICML, etc.
Fuxiang Zhang received a BSc degree from Nanjing University, China in 2021. He is pursuing a master's degree in the National Key Lab for Novel Software Technology, Nanjing University, China. His research interests include offline reinforcement learning and multi-agent reinforcement learning.
Tian Xu received a BSc degree from Northwestern Polytechnical University, China in 2019. He is working towards a PhD in the National Key Lab for Novel Software Technology at the School of Artificial Intelligence, Nanjing University, China. His research interests include imitation learning and model-based reinforcement learning.
Jing-Cheng Pang received a BSc degree from the University of Electronic Science and Technology of China, China in 2019. He has been pursuing his PhD with the School of Artificial Intelligence at Nanjing University, China, since 2019. His current research interests include reinforcement learning and machine learning. He has served as a reviewer for TETCI, ICML, UAI, etc. He is also a student committee member of the RLChina Community.
Zongzhang Zhang received his PhD degree in computer science from the University of Science and Technology of China, China in 2012. He was a research fellow at the School of Computing, National University of Singapore, Singapore from 2012 to 2014, and a visiting scholar at the Department of Aeronautics and Astronautics, Stanford University, USA from 2018 to 2019. He is currently an associate professor at the School of Artificial Intelligence, Nanjing University, China. He has co-authored more than 60 research papers. His research interests include reinforcement learning, intelligent planning, and multi-agent learning.
Yang Yu received his BSc and PhD degrees in computer science from Nanjing University, China in 2004 and 2011, respectively. Currently, he is a professor at the School of Artificial Intelligence, Nanjing University, China. His research interests include machine learning, and he is currently working on real-world reinforcement learning. His work has been published in Artificial Intelligence, IJCAI, AAAI, NIPS, KDD, etc. He received several conference best paper awards, including IDEAL'16, GECCO'11, PAKDD'08, etc. He also received the CCF-IEEE CS Young Scientist Award in 2020, was recognized as one of the AI's 10 to Watch by IEEE Intelligent Systems in 2018, and received the PAKDD Early Career Award in 2018. He was invited to give an Early Career Spotlight Talk at IJCAI'18.