LIDAR: learning from imperfect demonstrations with advantage rectification
Xiaoqin ZHANG, Huimin MA, Xiong LUO, Jian YUAN
In actor-critic reinforcement learning (RL) algorithms, function estimation errors are known to cause ineffective random exploration at the beginning of training and to lead to overestimated values and suboptimal policies. We address this problem by performing advantage rectification with imperfect demonstrations, thereby reducing the function estimation errors. Pretraining with expert demonstrations has been widely adopted to accelerate the learning process of deep RL when simulations are expensive to obtain. However, existing methods, such as behavior cloning, often require additional labels or assumptions about the quality of the demonstrations, such as optimality, which rarely hold in the real world. In this paper, we explicitly handle imperfect demonstrations within actor-critic RL frameworks and propose a new method, learning from imperfect demonstrations with advantage rectification (LIDAR). LIDAR uses a rectified loss function to learn only from selectively chosen demonstrations, derived from the minimal assumption that the demonstrating policies perform better than the current policy. LIDAR learns from the contradictions caused by estimation errors and, in turn, reduces those errors. We apply LIDAR to three popular actor-critic algorithms, DDPG, TD3, and SAC, and experiments show that our method observably reduces function estimation errors, effectively leverages demonstrations far from optimal, and consistently outperforms state-of-the-art baselines in all scenarios.
learning from demonstrations / actor-critic reinforcement learning / advantage rectification
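To make the idea of advantage rectification more concrete, the sketch below shows one way a demonstration loss could be gated by a critic's advantage estimate, following the abstract's minimal assumption that the demonstrating policy outperforms the current policy. This is a hypothetical PyTorch illustration, not the authors' exact formulation: the names `rectified_demo_loss`, `actor`, and `critic`, the behavior-cloning error, and the hard zero/one gating are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Small feed-forward network; the architecture is illustrative only."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, x):
        return self.net(x)


def rectified_demo_loss(actor, critic, demo_states, demo_actions):
    """Demonstration loss gated by the critic's advantage estimate (a sketch,
    not the paper's exact loss)."""
    with torch.no_grad():
        # Advantage of the demonstrated action over the current policy's action,
        # as estimated by the critic (held fixed for this term).
        pi_actions = actor(demo_states)
        q_demo = critic(torch.cat([demo_states, demo_actions], dim=-1)).squeeze(-1)
        q_pi = critic(torch.cat([demo_states, pi_actions], dim=-1)).squeeze(-1)
        adv = q_demo - q_pi
        # Rectification: only demonstrations the critic currently rates as
        # improvements are imitated; a non-positive advantage contradicts the
        # assumption that the demonstrator outperforms the current policy and
        # is attributed to estimation error rather than used as a target here.
        weight = (adv > 0).float()
    # Behavior-cloning error on demonstration actions, applied where weight > 0.
    bc_error = ((actor(demo_states) - demo_actions) ** 2).mean(dim=-1)
    return (weight * bc_error).mean()


# Toy usage: state_dim = 3, action_dim = 2, batch of 16 demonstration transitions.
actor = MLP(3, 2)
critic = MLP(3 + 2, 1)
states = torch.randn(16, 3)
demo_actions = torch.randn(16, 2)
loss = rectified_demo_loss(actor, critic, states, demo_actions)
loss.backward()
```

In this reading, the gating weight is what makes the loss "rectified": demonstrations that the current critic judges worse than the policy contribute no imitation gradient, so disagreement between the critic and the better-than-current-policy assumption is exposed rather than imitated blindly.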