Derivative-free reinforcement learning: a review
Hong QIAN, Yang YU
Reinforcement learning is about learning agent models that make the best sequential decisions in unknown environments. In an unknown environment, the agent needs to explore the environment while exploiting the collected information, which usually forms a sophisticated problem to solve. Derivative-free optimization, meanwhile, is capable of solving sophisticated problems. It commonly uses a sampling-and-updating framework to iteratively improve the solution, where exploration and exploitation also need to be well balanced. Therefore, derivative-free optimization deals with a similar core issue as reinforcement learning, and has been introduced into reinforcement learning approaches under the names of learning classifier systems and neuroevolution/evolutionary reinforcement learning. Although such methods have been developed for decades, derivative-free reinforcement learning has recently been attracting increasing attention. However, a recent survey on this topic is still lacking. In this article, we summarize the methods of derivative-free reinforcement learning to date, and organize them along aspects including parameter updating, model selection, exploration, and parallel/distributed methods. Moreover, we discuss some current limitations and possible future directions, hoping that this article can bring more attention to this topic and serve as a catalyst for developing novel and efficient approaches.
reinforcement learning / derivative-free optimization / neuroevolution reinforcement learning / neural architecture search
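To make the sampling-and-updating framework mentioned in the abstract concrete, the minimal Python sketch below performs a cross-entropy-style direct policy search over a parameter vector. It is only an illustration, not an algorithm from the surveyed literature; the `evaluate_policy` surrogate, the population size, and the elite fraction are illustrative assumptions. In a real application, `evaluate_policy` would roll out the parameterized policy in the environment and return the episodic reward, using only function values and no gradients.

```python
import numpy as np

def evaluate_policy(theta):
    """Placeholder for a black-box policy evaluation.

    In practice this would roll out a policy parameterized by `theta`
    (e.g., a neural network) in the environment and return the average
    episodic return; no gradient with respect to `theta` is used.
    A simple surrogate objective stands in here so the sketch runs.
    """
    return -np.sum(theta ** 2)  # surrogate: higher is better, optimum at 0

def sampling_and_updating(dim, iterations=100, pop_size=20, elite_frac=0.2, sigma=0.5):
    """Minimal cross-entropy-style sampling-and-updating loop.

    Each iteration samples candidate parameter vectors around the current
    mean (exploration), evaluates them as black boxes, and moves the mean
    toward the best-performing samples (exploitation).
    """
    mean = np.zeros(dim)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(iterations):
        population = mean + sigma * np.random.randn(pop_size, dim)   # sampling
        returns = np.array([evaluate_policy(p) for p in population])
        elite = population[np.argsort(returns)[-n_elite:]]           # keep the best samples
        mean = elite.mean(axis=0)                                     # updating the search distribution
        sigma = float(elite.std(axis=0).mean()) + 1e-3                # adapt the search radius
    return mean

if __name__ == "__main__":
    best = sampling_and_updating(dim=10)
    print("best parameters found:", best)
```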