Measuring Policy Performance in Online Pricing with Offline Data: Worst-case Perspective and Bayesian Perspective
Yue Wang, Zeyu Zheng
Journal of Systems Science and Systems Engineering, 2023, Vol. 32, Issue 3: 352–371.
Problems of online pricing with offline data, like other online decision-making problems with offline data, aim at designing and evaluating online pricing policies in the presence of a certain amount of pre-existing offline data. To evaluate pricing policies when offline data are available, the decision maker can position herself either at the time point when the offline data have already been observed and are viewed as deterministic, or at the time point when the offline data have not yet been generated and are viewed as stochastic. We develop a framework to discuss how and why these two positions are relevant to online policy evaluation, from a worst-case perspective and from a Bayesian perspective. We then use a simple online pricing setting with offline data to illustrate the construction of optimal policies under the two approaches and to discuss their differences, in particular whether the search for the optimal policy can be decomposed into independent subproblems that are optimized separately, and whether a deterministic optimal policy exists.
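As a purely illustrative sketch (not taken from the paper), the toy Python simulation below contrasts the two evaluation positions for a simple single-product pricing problem with linear expected demand: a worst-case measure that holds one observed offline data set fixed and takes the worst case over a finite set of candidate demand slopes, and a Bayesian measure that also averages over the random generation of the offline data. All names and modeling choices (the known intercept A, the candidate slope set PRIOR_B, the plug-in policy in estimate_policy_regret) are assumptions for illustration only.

# A minimal toy sketch (not the paper's model): single-product pricing with
# linear expected demand d(p) = A - b * p and revenue p * d(p), where the
# slope b is unknown. "Offline data" are noisy demand observations collected
# at a single historical price before the online horizon starts.
import numpy as np

rng = np.random.default_rng(0)

A = 10.0                     # known demand intercept (assumption)
PRIOR_B = [0.8, 1.0, 1.2]    # candidate slopes, uniform prior (assumption)
NOISE_SD = 0.5
OFFLINE_PRICE = 4.0
N_OFFLINE = 20               # number of offline observations
T = 50                       # online horizon

def optimal_revenue(b):
    """Best achievable per-period expected revenue when the slope b is known."""
    p_star = A / (2 * b)
    return p_star * (A - b * p_star)

def estimate_policy_regret(b, offline_data, rng):
    """Expected regret over T periods of a simple plug-in policy: estimate b
    from all data seen so far (offline + online) by least squares through the
    known intercept, then price greedily against the estimate."""
    prices = [OFFLINE_PRICE] * len(offline_data)
    demands = list(offline_data)
    regret = 0.0
    for _ in range(T):
        p_arr, d_arr = np.array(prices), np.array(demands)
        b_hat = max(np.sum(p_arr * (A - d_arr)) / np.sum(p_arr ** 2), 1e-3)
        p = A / (2 * b_hat)                       # greedy price
        d = A - b * p + rng.normal(0, NOISE_SD)   # realized demand
        regret += optimal_revenue(b) - p * (A - b * p)
        prices.append(p)
        demands.append(d)
    return regret

# Position 1: offline data already observed and viewed as deterministic.
# Hold the single observed offline data set fixed and take the worst case
# of the (online-noise-averaged) regret over the unknown slope.
offline_obs = A - 1.0 * OFFLINE_PRICE + rng.normal(0, NOISE_SD, N_OFFLINE)
worst_case = max(
    np.mean([estimate_policy_regret(b, offline_obs, rng) for _ in range(50)])
    for b in PRIOR_B
)

# Position 2: offline data not yet generated and viewed as stochastic.
# Average regret over the prior on b AND over the offline data it generates.
bayes = np.mean([
    estimate_policy_regret(
        b, A - b * OFFLINE_PRICE + rng.normal(0, NOISE_SD, N_OFFLINE), rng)
    for b in PRIOR_B for _ in range(50)
])

print(f"worst-case regret (offline data fixed): {worst_case:.2f}")
print(f"Bayesian regret (offline data random):  {bayes:.2f}")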
Online pricing / offline data / performance measure / worst-case approach / Bayesian approach