Measuring Policy Performance in Online Pricing with Offline Data: Worst-case Perspective and Bayesian Perspective

Yue Wang, Zeyu Zheng

Journal of Systems Science and Systems Engineering, 2023, Vol. 32, Issue 3: 352-371. DOI: 10.1007/s11518-023-5557-9


Abstract

The problem of online pricing with offline data, like other online decision-making problems with offline data, concerns designing and evaluating online pricing policies in the presence of a certain amount of existing offline data. To evaluate pricing policies when offline data are available, the decision maker can position herself either at the time point when the offline data have already been observed and are viewed as deterministic, or at the time point when the offline data have not yet been generated and are viewed as stochastic. We develop a framework to discuss how and why these two positions matter for online policy evaluation, from a worst-case perspective and from a Bayesian perspective. We then use a simple online pricing setting with offline data to illustrate the construction of optimal policies under the two approaches and to discuss their differences, in particular whether the search for an optimal policy can be decomposed into independent subproblems that are optimized separately, and whether a deterministic optimal policy exists.
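One illustrative way to formalize the two evaluation positions is sketched below; this is a generic formulation under our own notation, not necessarily the exact definitions used in the paper. Here π is an online pricing policy, θ ∈ Θ indexes the unknown demand model, 𝒟 denotes the offline data, p_t is the price charged in online period t, r(p; θ) is the expected single-period revenue, and r*(θ) = max_p r(p; θ).

Worst-case perspective (offline data already observed and treated as deterministic):
\[
R^{\mathrm{wc}}(\pi \mid \mathcal{D}) \;=\; \sup_{\theta \in \Theta}\; \mathbb{E}_{\theta}\!\left[\, \sum_{t=1}^{T} \bigl( r^{*}(\theta) - r(p_t;\theta) \bigr) \;\middle|\; \mathcal{D} \right].
\]

Bayesian perspective (offline data not yet generated and treated as stochastic, with a prior \(\Pi\) on \(\theta\)):
\[
R^{\mathrm{Bayes}}(\pi) \;=\; \mathbb{E}_{\theta \sim \Pi}\, \mathbb{E}_{\mathcal{D} \mid \theta}\, \mathbb{E}_{\theta}\!\left[\, \sum_{t=1}^{T} \bigl( r^{*}(\theta) - r(p_t;\theta) \bigr) \right].
\]

Under the first measure the policy can, in principle, be chosen separately for each realized offline dataset \(\mathcal{D}\), whereas the second measure couples all possible datasets through the outer expectations; this is the kind of decomposability question the abstract refers to.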

Keywords

Online pricing / offline data / performance measure / worst-case approach / Bayesian approach

Cite this article

Yue Wang, Zeyu Zheng. Measuring Policy Performance in Online Pricing with Offline Data: Worst-case Perspective and Bayesian Perspective. Journal of Systems Science and Systems Engineering, 2023, 32(3): 352-371. DOI: 10.1007/s11518-023-5557-9


