Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs
Yifan Lin, Yuhao Wang, Enlu Zhou
Journal of Systems Science and Systems Engineering, 2023, Vol. 32, Issue 3: 267–288.
In this paper we consider the contextual multi-armed bandit problem for linear payoffs under a risk-averse criterion. At each round, contexts are revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward. In particular, we consider mean-variance as the risk criterion, and the best arm is the one with the largest mean-variance reward. We apply the Thompson sampling algorithm for the disjoint model, and provide a comprehensive regret analysis for a variant of the proposed algorithm. For $T$ rounds, $K$ actions, and $d$-dimensional feature vectors, we prove a regret bound of $O\left(\left(1 + \rho + \frac{1}{\rho}\right) d \ln T \ln\frac{K}{\delta} \sqrt{dKT^{1+2\epsilon}\ln\frac{K}{\delta}\frac{1}{\epsilon}}\right)$ that holds with probability $1-\delta$ under the mean-variance criterion with risk tolerance $\rho$, for any $0 < \epsilon < \frac{1}{2}$ and $0 < \delta < 1$. The empirical performance of our proposed algorithms is demonstrated via a portfolio selection problem.
Multi-armed bandit / context / risk-averse / Thompson sampling
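The disjoint-model Thompson sampling described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: it scores arms by the plug-in criterion ρ·mean − variance (one common convention for the mean-variance objective; others use mean − ρ·variance), estimates the per-arm variance from observed rewards, and all names here (`mv_thompson_disjoint`, `get_contexts`, `pull`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mv_thompson_disjoint(get_contexts, pull, T, K, d, rho, v=1.0):
    """Thompson sampling for the disjoint linear model: each arm keeps its
    own Bayesian linear-regression posterior; arms are ranked by a plug-in
    mean-variance score rho * (sampled mean reward) - (empirical variance)."""
    B = [np.eye(d) for _ in range(K)]      # per-arm posterior precision (ridge prior)
    f = [np.zeros(d) for _ in range(K)]    # per-arm sum of context * reward
    rewards_hist = [[] for _ in range(K)]  # observed rewards, for variance estimates
    total = 0.0
    for t in range(T):
        X = get_contexts(t)                # contexts for this round, shape (K, d)
        scores = np.empty(K)
        for i in range(K):
            Binv = np.linalg.inv(B[i])
            mu_hat = Binv @ f[i]
            # Posterior sample of the arm's parameter vector
            theta = rng.multivariate_normal(mu_hat, v**2 * Binv)
            mean_est = X[i] @ theta
            var_est = np.var(rewards_hist[i]) if len(rewards_hist[i]) > 1 else 0.0
            scores[i] = rho * mean_est - var_est
        a = int(np.argmax(scores))         # pull the arm with the best MV score
        r = pull(a, X[a])
        # Rank-one posterior update for the pulled arm only (disjoint model)
        B[a] += np.outer(X[a], X[a])
        f[a] += r * X[a]
        rewards_hist[a].append(r)
        total += r
    return total
```

On a toy two-arm instance where one arm's linear payoff dominates, the sampler concentrates on the better arm once its posterior tightens; the prior scale `v` controls how aggressively it explores early on.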