Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation

Shan ZHONG, Quan LIU, Zongzhang ZHANG, Qiming FU

Front. Comput. Sci., 2019, Vol. 13, Issue 1: 106–126. DOI: 10.1007/s11704-017-6222-6
RESEARCH ARTICLE


Abstract

Dyna is an effective reinforcement learning (RL) approach that combines value-function evaluation with model learning. However, existing work on Dyna has mostly examined its efficiency only in RL problems with discrete action spaces. This paper proposes a novel Dyna variant, Dyna-LSTD-PA (Dyna based on least-squares temporal difference (LSTD) and policy approximation), designed to handle problems with continuous action spaces. Dyna-LSTD-PA consists of two simultaneous, interacting processes. The learning process defines the probability distribution over the action space with a Gaussian distribution, estimates the underlying value function, policy, and model with linear representations, and updates their parameter vectors online by LSTD(λ). The planning process updates the value-function parameter vector again with offline LSTD(λ). Dyna-LSTD-PA also uses the Sherman–Morrison formula to improve the efficiency of LSTD(λ), and weights the value-function parameter vectors of the two processes to bring them together. Theoretically, a global error bound is derived that accounts for approximation, estimation, and model errors. Experimentally, Dyna-LSTD-PA outperforms two representative methods in convergence rate, success rate, and stability on four benchmark RL problems.
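To illustrate the two ingredients named above, a short sketch follows. It is not the authors' implementation: the class name, the regularization constant `epsilon`, the toy reward, and the Gaussian-policy weights `w` are assumptions introduced only to show how a Sherman–Morrison rank-one inverse update keeps each LSTD(λ) step at O(d²) cost, and how a continuous action can be drawn from a Gaussian whose mean is linear in the state features.

```python
import numpy as np

class RecursiveLSTD:
    """Linear value estimation with LSTD(lambda), maintaining A^{-1} directly
    via the Sherman-Morrison rank-one inverse update (O(d^2) per transition)."""

    def __init__(self, n_features, gamma=0.99, lam=0.9, epsilon=1e-2):
        self.gamma = gamma                          # discount factor
        self.lam = lam                              # trace-decay parameter lambda
        self.A_inv = np.eye(n_features) / epsilon   # inverse of the regularized A matrix
        self.b = np.zeros(n_features)               # reward-weighted trace sums
        self.z = np.zeros(n_features)               # eligibility trace

    def reset_trace(self):
        self.z[:] = 0.0                             # call at the start of each episode

    def update(self, phi, reward, phi_next, done=False):
        """Process one transition (phi, r, phi_next) and return the new theta."""
        self.z = self.gamma * self.lam * self.z + phi
        v = phi if done else phi - self.gamma * phi_next
        # A grows by the rank-one term z v^T; update its inverse in place:
        # (A + z v^T)^{-1} = A^{-1} - (A^{-1} z)(v^T A^{-1}) / (1 + v^T A^{-1} z)
        Az = self.A_inv @ self.z
        vA = v @ self.A_inv
        self.A_inv -= np.outer(Az, vA) / (1.0 + v @ Az)
        self.b += reward * self.z
        return self.A_inv @ self.b                  # theta = A^{-1} b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8                                           # assumed feature dimension
    lstd = RecursiveLSTD(n_features=d)
    w = np.zeros(d)                                 # hypothetical Gaussian-policy mean weights
    phi = rng.random(d)                             # stand-in for state features
    for _ in range(100):
        action = rng.normal(loc=w @ phi, scale=0.5) # Gaussian exploration over a 1-D action
        reward = -abs(action)                       # toy reward, for illustration only
        phi_next = rng.random(d)
        theta = lstd.update(phi, reward, phi_next)
        phi = phi_next
    print("estimated value weights:", theta)
```

Maintaining A^{-1} incrementally avoids the O(d³) cost of solving the linear system at every step, which is the efficiency gain the Sherman–Morrison formula provides in this setting.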

Keywords

problem solving / control methods / heuristic search methods / dynamic programming

Cite this article

Shan ZHONG, Quan LIU, Zongzhang ZHANG, Qiming FU. Efficient reinforcement learning in continuous state and action spaces with Dyna and policy approximation. Front. Comput. Sci., 2019, 13(1): 106‒126 https://doi.org/10.1007/s11704-017-6222-6


RIGHTS & PERMISSIONS

2018 Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature