Efficient policy evaluation by matrix sketching

Cheng CHEN; Weinan ZHANG; Yong YU

doi:10.1007/s11704-021-0354-4

PDF(2745 KB)

Front. Comput. Sci. ›› 2022, Vol. 16 ›› Issue (5) : 165330. DOI: 10.1007/s11704-021-0354-4

Artificial Intelligence

RESEARCH ARTICLE

Efficient policy evaluation by matrix sketching

Author information +

History +

Abstract

In the reinforcement learning, policy evaluation aims to predict long-term values of a state under a certain policy. Since high-dimensional representations become more and more common in the reinforcement learning, how to reduce the computational cost becomes a significant problem to the policy evaluation. Many recent works focus on adopting matrix sketching methods to accelerate least-square temporal difference (TD) algorithms and quasi-Newton temporal difference algorithms. Among these sketching methods, the truncated incremental SVD shows better performance because it is stable and efficient. However, the convergence properties of the incremental SVD is still open. In this paper, we first show that the conventional incremental SVD algorithms could have enormous approximation errors in the worst case. Then we propose a variant of incremental SVD with better theoretical guarantees by shrinking the singular values periodically. Moreover, we employ our improved incremental SVD to accelerate least-square TD and quasi-Newton TD algorithms. The experimental results verify the correctness and effectiveness of our methods.

Graphical abstract

Keywords

temporal difference learning / policy evaluation / matrix sketching

Cite this article

EndNote

Ris (Procite)

Bibtex

Download citation ▾

Cheng CHEN, Weinan ZHANG, Yong YU. Efficient policy evaluation by matrix sketching. Front. Comput. Sci., 2022, 16(5): 165330 https://doi.org/10.1007/s11704-021-0354-4

This is a preview of subscription content, contact us for subscripton.

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	Sutton R S, Barto A G. Reinforcement Learning: an Introduction. 2nd ed. London: MIT Press, 2018

[2]	Bertsekas D P, Tsitsiklis J N. Neuro-dynamic programming: an overview. In: Proceedings of the 34th IEEE Conference on Decision and Control. 1995, 560–564

[3]	Lagoudakis M G , Parr R . Least-squares policy iteration. Journal of Machine Learning Research, 2003, 4 : 1107– 1149

[4]	Dann C , Neumann G , Peters J . Policy evaluation with temporal differences: a survey and comparison. The Journal of Machine Learning Research, 2014, 15( 1): 809– 883

[5]	Geist M , Scherrer B . Off-policy learning with eligibility traces: a survey. The Journal of Machine Learning Research, 2014, 15( 1): 289– 333

[6]	Liang Y T, Machado M C, Talvitie E, Bowling M. State of the art control of Atari games using shallow reinforcement learning. In: Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems. 2016, 485– 493

[7]	Tagorti M, Scherrer B. On the rate of convergence and error bounds for LSTD (λ). In: Proceedings of the 32nd International Conference on International Conference on Machine Learning. 2015, 1521−1529

[8]	Sutton R S . Learning to predict by the methods of temporal differences. Machine Learning, 1988, 3( 1): 9– 44

[9]	Sutton R S, Szepesvári C, Maei H R. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In: Proceedings of the 21st International Conference on Neural Information Processing Systems. 2008, 1609−1616

[10]	Boyan J A. Technical update: least-squares temporal difference learning. Machine Learning, 2002, 49( 2– 3): 2– 3

[11]	Geramifard A, Bowling M H, Sutton R S. Incremental least-squares temporal difference learning. In: Proceedings of the 21st National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference. 2006, 356– 361

[12]	Geramifard A, Bowling M H, Zinkevich M, Sutton R S. iLSTD: eligibility traces and convergence analysis. In: Proceedings of the 20th Annual Conference on Neural Information Processing Systems. 2006, 441– 448

[13]	Pan Y C, White A M, White M. Accelerated gradient temporal difference learning. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017, 2464−2470

[14]	Ghavamzadeh M, Lazaric A, Maillard O A, Munos R. LSTD with random projections. In: Proceedings of Advances in Neural Information Processing Systems: 24th Annual Conference on Neural Information Processing Systems 2010. 2010, 721– 729

[15]	Pan Y C, Azer E S, White M. Effective sketching methods for value function approximation. In: Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence. 2017

[16]	Gehring C, Pan Y C, White M. Incremental truncated LSTD. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence. 2016, 1505−1511

[17]	Li H F, Xia Y C, Zhang W S. Finite sample analysis of LSTD with random projections and eligibility traces. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. 2018, 2390−2396

[18]	Woodruff D P. Sketching as a tool for numerical linear algebra. Foundations and Trends^® in Theoretical Computer Science , 2014, 10( 1– 2): 1– 2

[19]	Bertsekas D P. Dynamic Programming and Optimal Control. Volume 1. Belmont: Athena Scientific, 1995

[20]	Kolter J Z, Ng A Y. Regularization and feature selection in least-squares temporal difference learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, 521– 528

[21]	Bradtke S J, Barto A G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 1996, 22( 1– 3): 1– 3

[22]	Liberty E. Simple and deterministic matrix sketching. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013, 581– 588

[23]	Ghashami M , Liberty E , Phillips J M , Woodruff D P . Frequent directions: simple and deterministic matrix sketching. SIAM Journal on Computing, 2016, 45( 5): 1762– 1792

[24]	Kuzborskij I, Cella L, Cesa-Bianchi N. Efficient linear bandits through matrix sketching. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics. 2019, 177– 185

[25]	Luo H P, Agarwal A, Cesa-Bianchi N, Langford J. Efficient second order online learning by sketching. In: Proceedings of Advances in Neural Information Processing Systems. 2016, 902– 910

[26]	Luo L , Chen C , Zhang Z H , Li W J , Zhang T . Robust frequent directions with application in online learning. Journal of Machine Learning Research, 2019, 20( 45): 1– 41

[27]	Mroueh Y, Marcheret E, Goel V. Co-occurring directions sketching for approximate matrix multiply. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. 2017, 567– 575

[28]	Brand M. Fast online SVD revisions for lightweight recommender systems. In: Proceedings of the 2003 SIAM International Conference on Data Mining. 2003, 37– 46

[29]	Sarwar B, Karypis G, Konstan J, Riedl J. Incremental singular value decomposition algorithms for highly scalable recommender systems. In: Proceedings of the 5th International Conference on Computer and Information Science. 2002, 27– 28

[30]	Ross D A, Lim J, Lin R S, Yang M H. Incremental learning for robust visual tracking. International Journal of Computer Vision, 2008, 77( 1– 3): 1– 3

[31]	Hall P M, Marshall D, Martin R R. Incremental eigenanalysis for classification. In: Proceedings of the British Machine Vision Conference. 1998, 1– 10

[32]	Brand M. Incremental singular value decomposition of uncertain data with missing values. In: Proceedings of the 7th European Conference on Computer Vision. 2002, 707– 720

[33]	Brand M . Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 2006, 415( 1): 20– 30

[34]	Salas D F , Powell W B . Benchmarking a scalable approximate dynamic programming algorithm for stochastic control of grid-level energy storage. INFORMS Journal on Computing, 2018, 30( 1): 106– 123

Acknowledgements

The corresponding author Weinan Zhang was supported by the “New Generation of AI 2030” Major Project (2018AAA0100900) and the National Natural Science Foundation of China (Grant Nos. 62076161, 61772333, 61632017).