Clustered Reinforcement Learning
Xiao MA, Shen-Yi ZHAO, Zhao-Heng YIN, Wu-Jun LI
Exploration strategy design is a challenging problem in reinforcement learning (RL), especially when the environment has a large state space or sparse rewards. During exploration, the agent tries to discover unexplored (novel) areas or high-reward (quality) areas. Most existing methods perform exploration by utilizing only the novelty of states; the novelty and quality in the neighboring area of the current state have not been well utilized to simultaneously guide the agent's exploration. To address this problem, this paper proposes a novel RL framework, called clustered reinforcement learning (CRL), for efficient exploration. CRL adopts clustering to divide the collected states into several clusters, based on which a bonus reward reflecting both the novelty and the quality of the neighboring area (cluster) of the current state is given to the agent. CRL leverages these bonus rewards to guide the agent toward efficient exploration. Moreover, CRL can be combined with existing exploration strategies to improve their performance, since the bonus rewards employed by those strategies capture only the novelty of states. Experiments on four continuous control tasks and six hard-exploration Atari-2600 games show that our method outperforms other state-of-the-art methods.
deep reinforcement learning / exploration / count-based method / clustering / K-means
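Below is a minimal sketch of the kind of cluster-based bonus the abstract describes, assuming K-means clustering (as the keywords suggest) over collected states, a per-cluster visit count as the novelty signal, and a per-cluster mean extrinsic reward as the quality signal. The class name ClusterBonus, the weights beta_n and beta_q, and the 1/sqrt(count) bonus form are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of a clustering-based exploration bonus (not the authors' code).
import numpy as np
from sklearn.cluster import KMeans

class ClusterBonus:
    def __init__(self, n_clusters=16, beta_n=0.1, beta_q=0.1):
        self.n_clusters = n_clusters
        self.beta_n = beta_n          # weight of the novelty term (assumed)
        self.beta_q = beta_q          # weight of the quality term (assumed)
        self.kmeans = None
        self.counts = None            # visits per cluster
        self.reward_sums = None       # summed extrinsic reward per cluster

    def fit(self, states, rewards):
        """Cluster the collected states and accumulate per-cluster statistics."""
        states = np.asarray(states, dtype=np.float64)
        rewards = np.asarray(rewards, dtype=np.float64)
        self.kmeans = KMeans(n_clusters=self.n_clusters, n_init=10).fit(states)
        labels = self.kmeans.labels_
        self.counts = np.bincount(labels, minlength=self.n_clusters)
        self.reward_sums = np.bincount(labels, weights=rewards,
                                       minlength=self.n_clusters)

    def bonus(self, state):
        """Bonus for the cluster containing `state`: novelty term + quality term."""
        c = int(self.kmeans.predict(np.asarray(state, dtype=np.float64)[None, :])[0])
        novelty = 1.0 / np.sqrt(self.counts[c] + 1.0)            # rarely visited cluster -> large
        quality = self.reward_sums[c] / max(self.counts[c], 1)   # high-reward cluster -> large
        return self.beta_n * novelty + self.beta_q * quality

# Usage (assumed): shape the reward before a policy update, e.g.
#   r_shaped = r_extrinsic + cluster_bonus.bonus(s)
```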
Xiao Ma received the BSc degree in information and computing science from Xi'an Jiaotong University, China. She is currently working toward the PhD degree in the Department of Computer Science and Technology, Nanjing University, China. Her research interests are in reinforcement learning, machine learning, and artificial intelligence.
Shen-Yi Zhao received the BSc degree in mathematics and the PhD degree in computer science from Nanjing University, China. His research interests are in machine learning, convex optimization, and parallel and distributed optimization.
Zhao-Heng Yin received the BSc degree in computer science from Nanjing University, China, and the MPhil degree in electronic and computer engineering from The Hong Kong University of Science and Technology, China. He is currently working toward the PhD degree in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley, USA. His research interests include machine learning and robotics.
Wu-Jun Li received the BSc and MEng degrees in computer science from Nanjing University, China, and the PhD degree in computer science from The Hong Kong University of Science and Technology, China. He started his academic career as an assistant professor in the Department of Computer Science and Engineering, Shanghai Jiao Tong University, China, and then joined Nanjing University, where he is currently a professor in the Department of Computer Science and Technology. His research interests are in machine learning, big data, and artificial intelligence.