A perspective on off-policy evaluation in reinforcement learning
Lihong LI
A perspective on off-policy evaluation in reinforcement learning
[1] |
Bottou L, Peters J, Quiñonero-Candela J, Charles D X, Chickering D M, Portugaly E, Ray D, Simard P, Snelson E. Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research, 2013, 14(1): 3207–3260
|
[2] |
Hofmann K, Li L, Radlinski F. Online evaluation for information retrieval. Foundations and Trends in Information Retrieval, 2016, 10(1): 1–117
CrossRef
Google scholar
|
[3] |
Li L, Chu W, Langford J, Schapire R E. A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th International Conference on World Wide Web. 2010, 661–670
CrossRef
Google scholar
|
[4] |
Dudík M, Langford J, Li L. Doubly robust policy evaluation and learning. In: Proceedings of the 28th International Conference on Machine Learning. 2011, 1097–1104
|
[5] |
Swaminathan A, Joachims T. The selfnormalized estimator for counterfactual learning. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. 2015, 3231–3239
|
[6] |
Wang Y X, Agarwal A, Dudík M. Optimal and adaptive off-policy evaluation in contextual bandits. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 3589–3597
|
[7] |
Jiang N, Li L. Doubly robust off-policy evaluation for reinforcement learning. In: Proceedings of the 33rd International Conference on Machine Learning. 2016, 652–661
|
[8] |
Li L, Munos R, Szepesvári C. Toward minimax off-policy value estimation. In: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics. 2015, 608–616
|
[9] |
Precup D, Sutton R S, Singh S P. Eligibility traces for off-policy policy evaluation. In: Proceedings of the 17th International Conference on Machine Learning. 2000, 759–766
|
[10] |
Liu Q, Li L, Tang Z, Zhou D. Breaking the curse of horizon: infinitehorizon off-policy estimation. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2018, 5361–5371
|
/
〈 | 〉 |