Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs

Yifan Lin, Yuhao Wang, Enlu Zhou

Journal of Systems Science and Systems Engineering, 2023, Vol. 32, Issue 3: 267-288. DOI: 10.1007/s11518-022-5541-9


Abstract

In this paper we consider the contextual multi-armed bandit problem for linear payoffs under a risk-averse criterion. At each round, contexts are revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward. In particular, we consider mean-variance as the risk criterion, and the best arm is the one with the largest mean-variance reward. We apply the Thompson sampling algorithm for the disjoint model, and provide a comprehensive regret analysis for a variant of the proposed algorithm. For $T$ rounds, $K$ actions, and $d$-dimensional feature vectors, we prove a regret bound of $O\left(\left(1+\rho+\frac{1}{\rho}\right) d \ln T \ln\frac{K}{\delta} \sqrt{dKT^{1+2\epsilon}\ln\frac{K}{\delta}\frac{1}{\epsilon}}\right)$ that holds with probability $1-\delta$ under the mean-variance criterion with risk tolerance $\rho$, for any $0<\epsilon<\frac{1}{2}$ and $0<\delta<1$. The empirical performance of our proposed algorithms is demonstrated via a portfolio selection problem.
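To make the setup concrete, below is a minimal sketch of Thompson sampling for the disjoint linear model scored by a mean-variance criterion. It assumes the common convention $\mathrm{MV}_\rho(X) = \rho\,\mathbb{E}[X] - \mathrm{Var}(X)$ from the mean-variance bandit literature and a per-arm Normal-Inverse-Gamma conjugate prior; the class name, prior choice, and hyperparameters are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

# Illustrative sketch (not the paper's exact algorithm): Thompson sampling
# for a disjoint linear contextual bandit, where each arm keeps its own
# Bayesian linear regression with a Normal-Inverse-Gamma (NIG) prior and
# arms are scored by the sampled mean-variance MV_rho = rho * mean - variance.

class MeanVarianceLinTS:
    def __init__(self, n_arms, dim, rho=1.0, a0=1.0, b0=1.0):
        self.rho = rho                                   # risk tolerance rho
        self.b0 = b0
        self.B = [np.eye(dim) for _ in range(n_arms)]    # per-arm precision I + X^T X
        self.f = [np.zeros(dim) for _ in range(n_arms)]  # per-arm X^T y
        self.a = [a0] * n_arms                           # IG shape for noise variance
        self.b = [b0] * n_arms                           # IG scale for noise variance
        self.sq = [0.0] * n_arms                         # running sum of r^2

    def select(self, contexts):
        """contexts: one d-dimensional feature vector per arm."""
        scores = []
        for i, x in enumerate(contexts):
            # Sample the noise variance sigma^2 from its Inverse-Gamma posterior
            # (reciprocal of a Gamma(a, scale=1/b) draw) ...
            sigma2 = 1.0 / np.random.gamma(self.a[i], 1.0 / self.b[i])
            # ... then the arm's coefficients from the conditional Gaussian posterior.
            mean = np.linalg.solve(self.B[i], self.f[i])
            theta = np.random.multivariate_normal(mean, sigma2 * np.linalg.inv(self.B[i]))
            # Mean-variance score of the sampled model for this context.
            scores.append(self.rho * float(x @ theta) - sigma2)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        # Standard NIG conjugate update for Bayesian linear regression.
        self.B[arm] += np.outer(x, x)
        self.f[arm] += reward * x
        self.sq[arm] += reward ** 2
        self.a[arm] += 0.5
        mean = np.linalg.solve(self.B[arm], self.f[arm])
        self.b[arm] = self.b0 + 0.5 * (self.sq[arm] - float(self.f[arm] @ mean))
```

In a simulation loop, one would call select with the K context vectors revealed at round t, pull the returned arm, observe the reward, and feed it back through update. Larger rho shifts the sampled score toward expected reward; smaller rho favors arms with low estimated variance.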

Keywords

Multi-armed bandit / context / risk-averse / Thompson sampling

Cite this article

Yifan Lin, Yuhao Wang, Enlu Zhou. Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs. Journal of Systems Science and Systems Engineering, 2023, 32(3): 267-288. DOI: 10.1007/s11518-022-5541-9

