A perspective on off-policy evaluation in reinforcement learning

Lihong LI

Front. Comput. Sci., 2019, 13(5): 911-912. DOI: 10.1007/s11704-019-9901-7
PERSPECTIVE

RIGHTS & PERMISSIONS

© 2019 Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature