A Monte Carlo Neural Fictitious Self-Play approach to approximate Nash Equilibrium in imperfect-information dynamic games

Li ZHANG, Yuxuan CHEN, Wei WANG, Ziliang HAN, Shijian LI, Zhijie PAN, Gang PAN

Front. Comput. Sci., 2021, Vol. 15, Issue 5: 155334. DOI: 10.1007/s11704-020-9307-6
RESEARCH ARTICLE


Abstract

Solving the optimization problem of approaching a Nash Equilibrium point plays an important role in imperfect-information games, e.g., StarCraft and poker. Neural Fictitious Self-Play (NFSP) is an effective algorithm that learns an approximate Nash Equilibrium of imperfect-information games purely from self-play, without prior domain knowledge. However, it needs to train a neural network in an off-policy manner to approximate the action values. For games with large search spaces, the training may suffer from unnecessary exploration and sometimes fails to converge. In this paper, we propose a new Neural Fictitious Self-Play algorithm that combines Monte Carlo tree search with NFSP, called MC-NFSP, to improve performance in real-time zero-sum imperfect-information games. With experiments and empirical analysis, we demonstrate that the proposed MC-NFSP algorithm can approximate a Nash Equilibrium in games with large search depth, whereas NFSP cannot. Furthermore, we develop an Asynchronous Neural Fictitious Self-Play framework (ANFSP). It uses an asynchronous, parallel architecture to collect game experience and improves both training efficiency and policy quality. Experiments on games with hidden state information (Texas Hold'em) and on FPS (first-person shooter) games demonstrate the effectiveness of our algorithms.
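To make the fictitious self-play loop that NFSP and MC-NFSP build on concrete, the following is a minimal, runnable sketch, not the paper's implementation. It uses rock-paper-scissors as a stand-in zero-sum game; the game, the anticipatory parameter value, and all function names are our own illustrative choices. The exact tabular best response below is what NFSP approximates with an off-policy action-value network and what MC-NFSP computes with Monte Carlo tree search, while the tabular action counts stand in for the supervised-learning "average strategy" network.

# Minimal sketch (Python): tabular fictitious self-play with NFSP-style anticipatory mixing.
# Not the paper's implementation; rock-paper-scissors is a stand-in zero-sum game.
import numpy as np

# Row player's payoff matrix (rows: own action, cols: opponent action; 0=rock, 1=paper, 2=scissors).
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

def best_response(opponent_avg):
    """Exact best response to the opponent's average strategy (a probability vector)."""
    return int(np.argmax(PAYOFF @ opponent_avg))

def fictitious_self_play(iterations=100_000, eta=0.25, seed=0):
    """With probability eta an agent plays (and records) its best response, mirroring how
    NFSP trains its average-policy network on its own best-response behaviour; otherwise
    it plays its current average strategy."""
    rng = np.random.default_rng(seed)
    # Deliberately biased starting counts so the drift toward equilibrium is visible.
    counts = np.array([[10.0, 1.0, 1.0],
                       [ 1.0, 1.0, 10.0]])
    for _ in range(iterations):
        for p in range(2):
            opp_avg = counts[1 - p] / counts[1 - p].sum()
            if rng.random() < eta:
                counts[p, best_response(opp_avg)] += 1.0   # best-response round: update average strategy
            else:
                own_avg = counts[p] / counts[p].sum()
                _ = rng.choice(3, p=own_avg)  # average-strategy round; in the full algorithm
                                              # these rounds still generate data for the best-response learner
    return counts / counts.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    print(fictitious_self_play())  # both rows end up close to the uniform Nash equilibrium (1/3, 1/3, 1/3)

In the full MC-NFSP and ANFSP algorithms described in the paper, the experience used for these two updates is gathered by neural networks playing complete games (in parallel, asynchronous workers for ANFSP) rather than by the single tabular loop above.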

Keywords

approximate Nash Equilibrium / imperfect-information games / dynamic games / Monte Carlo tree search / Neural Fictitious Self-Play / reinforcement learning

Cite this article

Li ZHANG, Yuxuan CHEN, Wei WANG, Ziliang HAN, Shijian LI, Zhijie PAN, Gang PAN. A Monte Carlo Neural Fictitious Self-Play approach to approximate Nash Equilibrium in imperfect-information dynamic games. Front. Comput. Sci., 2021, 15(5): 155334. https://doi.org/10.1007/s11704-020-9307-6


RIGHTS & PERMISSIONS

© 2021 Higher Education Press