Abstract
In this paper we consider the contextual multi-armed bandit problem with linear payoffs under a risk-averse criterion. At each round, contexts are revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward. In particular, we consider mean-variance as the risk criterion, and the best arm is the one with the largest mean-variance reward. We apply the Thompson sampling algorithm to the disjoint model, and provide a comprehensive regret analysis for a variant of the proposed algorithm. For $T$ rounds, $K$ actions, and $d$-dimensional feature vectors, we prove a regret bound of $O\left(\left(1 + \rho + \frac{1}{\rho}\right) d \ln T \ln\frac{K}{\delta} \sqrt{dKT^{1+2\epsilon} \ln\frac{K}{\delta}\frac{1}{\epsilon}}\right)$ that holds with probability $1 - \delta$ under the mean-variance criterion with risk tolerance $\rho$, for any $0 < \epsilon < \frac{1}{2}$ and $0 < \delta < 1$. The empirical performance of our proposed algorithms is demonstrated via a portfolio selection problem.
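The disjoint-model Thompson sampling approach described in the abstract can be sketched in a few lines. The sketch below is illustrative only, not the paper's exact algorithm (whose posterior is modified for the regret analysis): it assumes a per-arm Gaussian posterior for Bayesian linear regression, a plug-in empirical estimate of each arm's reward variance, and the mean-variance score $\rho \cdot \text{mean} - \text{variance}$ with risk tolerance $\rho$. All dimensions, priors, and the simulated environment are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, rho, v = 5, 3, 1000, 1.0, 0.5  # illustrative dimensions and parameters

# Hypothetical environment for simulation only: per-arm parameters and noise.
theta_true = rng.normal(size=(K, d))
noise_sd = rng.uniform(0.1, 0.5, size=K)

# Disjoint model: each arm keeps its own Bayesian linear-regression state.
B = np.stack([np.eye(d) for _ in range(K)])  # per-arm precision matrices
f = np.zeros((K, d))                         # per-arm running sums of x * r
n = np.zeros(K)                              # pull counts
s = np.zeros(K)                              # sums of rewards
sq = np.zeros(K)                             # sums of squared rewards

for t in range(T):
    X = rng.normal(size=(K, d))              # one context per arm this round
    scores = np.empty(K)
    for a in range(K):
        mu = np.linalg.solve(B[a], f[a])     # posterior mean of arm a's parameter
        theta = rng.multivariate_normal(mu, v**2 * np.linalg.inv(B[a]))
        # Plug-in empirical reward variance (a simplification for this sketch).
        var_hat = sq[a] / n[a] - (s[a] / n[a]) ** 2 if n[a] > 1 else 0.0
        scores[a] = rho * (X[a] @ theta) - var_hat  # mean-variance score
    a = int(np.argmax(scores))               # pull the arm with the best score
    r = X[a] @ theta_true[a] + noise_sd[a] * rng.normal()
    # Update the chosen arm's sufficient statistics.
    B[a] += np.outer(X[a], X[a])
    f[a] += r * X[a]
    n[a] += 1
    s[a] += r
    sq[a] += r * r
```

The per-arm state (`B`, `f`) is the standard ridge-regression posterior; only the scoring rule differs from mean-only linear Thompson sampling, penalizing arms with high estimated reward variance.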
Keywords
Multi-armed bandit / context / risk-averse / Thompson sampling
Cite this article
Yifan Lin, Yuhao Wang, Enlu Zhou.
Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs.
Journal of Systems Science and Systems Engineering, 2023, 32(3): 267-288. DOI: 10.1007/s11518-022-5541-9
References
[1] Abbasi-Yadkori Y, Pál D, Szepesvári C (2011). Improved algorithms for linear stochastic bandits. Proceedings of the 25th International Conference on Neural Information Processing Systems. Granada, Spain, December 12–14, 2011.
[2] Agrawal S, Goyal N (2013). Thompson sampling for contextual bandits with linear payoffs. Proceedings of the 30th International Conference on Machine Learning. Atlanta, USA, June 16–21, 2013.
[3] Ang M L, Lim E Y, Chang J Q (2021). Thompson sampling for Gaussian entropic risk bandits. arXiv preprint arXiv:2105.06960.
[4] Auer P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 2002, 3: 397-422.
[5] Auer P, Cesa-Bianchi N, Freund Y, Schapire R E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002, 32(1): 48-77.
[6] Baudry D, Gautron R, Kaufmann E, Maillard O (2021). Optimal Thompson sampling strategies for support-aware CVaR bandits. Proceedings of the 38th International Conference on Machine Learning. Virtual Event, July 18–24, 2021.
[7] Cassel A, Mannor S, Zeevi A (2018). A general approach to multi-armed bandits under risk criteria. Proceedings of the 31st Conference on Learning Theory. Stockholm, Sweden, July 6–9, 2018.
[8] Chang J Q, Zhu Q, Tan V (2020). Risk-constrained Thompson sampling for CVaR bandits. arXiv preprint arXiv:2011.08046.
[9] Chapelle O, Li L (2011). An empirical evaluation of Thompson sampling. Proceedings of the 25th International Conference on Neural Information Processing Systems. Granada, Spain, December 12–14, 2011.
[10] Chu W, Li L, Reyzin L, Schapire R (2011). Contextual bandits with linear payoff functions. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics. Fort Lauderdale, USA, April 11–13, 2011.
[11] Galichet N, Sebag M, Teytaud O (2013). Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. Proceedings of the 5th Asian Conference on Machine Learning. Canberra, ACT, Australia, November 13–15, 2013.
[12] Huo X, Fu F. Risk-aware multi-armed bandit problem with application to portfolio selection. Royal Society Open Science, 2017, 4(11): 171377.
[13] Khajonchotpanya N, Xue Y, Rujeerapaiboon N. A revised approach for risk-averse multi-armed bandits under CVaR criterion. Operations Research Letters, 2021, 49(4): 465-472.
[14] Kleijn B J K, van der Vaart A W. The Bernstein-von Mises theorem under misspecification. Electronic Journal of Statistics, 2012, 6: 354-381.
[15] Kullback S, Leibler R A. On information and sufficiency. The Annals of Mathematical Statistics, 1951, 22(1): 79-86.
[16] Li L, Chu W, Langford J, Schapire R E (2010). A contextual-bandit approach to personalized news article recommendation. Proceedings of the 19th International Conference on World Wide Web. Raleigh, North Carolina, USA, April 26–30, 2010.
[17] Liu X, Derakhshani M, Lambotharan S, van der Schaar M. Risk-aware multi-armed bandits with refined upper confidence bounds. IEEE Signal Processing Letters, 2020, 28: 269-273.
[18] Maillard O (2013). Robust risk-averse stochastic multi-armed bandits. Proceedings of the 24th International Conference on Algorithmic Learning Theory. Singapore, October 6–9, 2013.
[19] Markowitz H. Portfolio selection. The Journal of Finance, 1952, 7(1): 77-91.
[20] Nagy L, Ormos M (2018). Review of Global Industry Classification. Proceedings of the 32nd European Conference on Modelling and Simulation. Wilhelmshaven, Germany, May 22–25, 2018.
[21] Rockafellar R T, Uryasev S. Optimization of conditional value-at-risk. Journal of Risk, 2000, 2: 21-41.
[22] Sani A, Lazaric A, Munos R (2012). Risk-aversion in multi-armed bandits. Proceedings of the 26th International Conference on Neural Information Processing Systems. Lake Tahoe, USA, December 3–6, 2012.
[23] Schlaifer R, Raiffa H. Applied Statistical Decision Theory, 1961, Cambridge: Harvard University.
[24] Shen W, Wang J, Jiang Y, Zha H (2015). Portfolio choices with orthogonal bandit learning. Proceedings of the 24th International Joint Conference on Artificial Intelligence. Buenos Aires, Argentina, July 25–31, 2015.
[25] Thompson W R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933, 25(3–4): 285-294.
[26] Tran M, Nguyen T, Dao V (2021). A practical tutorial on variational Bayes. arXiv preprint arXiv:2103.01327.
[27] Vakili S, Zhao Q (2015). Mean-variance and value at risk in multi-armed bandit problems. Proceedings of the 53rd Annual Allerton Conference on Communication, Control, and Computing. Illinois, USA, September 29–October 2, 2015.
[28] Vakili S, Zhao Q. Risk-averse multi-armed bandit problems under mean-variance measure. IEEE Journal of Selected Topics in Signal Processing, 2016, 10(6): 1093-1111.
[29] Wang Y, Blei D (2019). Variational Bayes under model misspecification. Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, BC, Canada, December 8–14, 2019.
[30] Xu P, Zheng H, Mazumdar E, Azizzadenesheli K, Anandkumar A (2022). Langevin Monte Carlo for contextual bandits. Proceedings of the 39th International Conference on Machine Learning. Maryland, USA, July 17–23, 2022.
[31] Zhu M, Zheng X, Wang Y, Liang Q, Zhang W (2020). Online portfolio selection with cardinality constraint and transaction costs based on contextual bandit. Proceedings of the 29th International Joint Conference on Artificial Intelligence. Virtual Event, January 7–15, 2021.
[32] Zhu Q, Tan V (2020). Thompson sampling algorithms for mean-variance bandits. Proceedings of the 37th International Conference on Machine Learning. Virtual Event, July 13–18, 2020.
[33] Zimin A, Ibsen-Jensen R, Chatterjee K (2014). Generalized risk-aversion in stochastic multi-armed bandits. arXiv preprint arXiv:1405.0833.