A Monte Carlo Neural Fictitious Self-Play approach to approximate Nash Equilibrium in imperfect-information dynamic games
Li ZHANG, Yuxuan CHEN, Wei WANG, Ziliang HAN, Shijian LI, Zhijie PAN, Gang PAN
Solving the optimization problem of approaching a Nash Equilibrium point plays an important role in imperfect-information games, e.g., StarCraft and poker. Neural Fictitious Self-Play (NFSP) is an effective algorithm that learns an approximate Nash Equilibrium of imperfect-information games from self-play alone, without prior domain knowledge. However, it trains a neural network in an off-policy manner to approximate the action values. For games with large search spaces, this training may suffer from unnecessary exploration and sometimes fails to converge. In this paper, we propose a new Neural Fictitious Self-Play algorithm that combines Monte Carlo tree search with NFSP, called MC-NFSP, to improve performance in real-time zero-sum imperfect-information games. With experiments and empirical analysis, we demonstrate that the proposed MC-NFSP algorithm can approximate the Nash Equilibrium in games with large search depth, whereas NFSP cannot. Furthermore, we develop an Asynchronous Neural Fictitious Self-Play framework (ANFSP), which uses an asynchronous, parallel architecture to collect game experience and improves both training efficiency and policy quality. Experiments on games with hidden state information (Texas Hold'em) and on FPS (first-person shooter) games demonstrate the effectiveness of our algorithms.
approximate Nash Equilibrium / imperfect-information games / dynamic games / Monte Carlo tree search / Neural Fictitious Self-Play / reinforcement learning
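The abstract describes MC-NFSP only at a high level. The following is a minimal, self-contained sketch, not the authors' implementation, of the core idea: each player mixes a Monte Carlo, search-based best response against the opponent's average policy with a supervised average policy, as in NFSP. Every name here (the toy TinyMatchingPennies game, the tabular AveragePolicy, mcts_policy, and the anticipatory parameter eta) is an illustrative assumption; the paper uses neural networks and full tree search on much larger games.

# Minimal MC-NFSP-style self-play sketch on a one-shot zero-sum toy game.
# Assumptions: tabular policies stand in for the neural networks, and the
# search-based best response is reduced to Monte Carlo rollouts at the root.
import random
from collections import defaultdict

class TinyMatchingPennies:
    """One-shot zero-sum game used only to keep the sketch runnable."""
    ACTIONS = (0, 1)
    def payoff(self, a0, a1):
        return 1.0 if a0 == a1 else -1.0   # player 0 wins on a match

class AveragePolicy:
    """Tabular stand-in for the supervised average-policy network."""
    def __init__(self):
        self.counts = defaultdict(lambda: [1.0, 1.0])  # Laplace-smoothed action counts
    def update(self, state, action):
        # supervised step on the agent's own best-response behaviour
        self.counts[state][action] += 1.0
    def sample(self, state):
        c = self.counts[state]
        return 0 if random.random() < c[0] / sum(c) else 1

def mcts_policy(game, player, opponent_avg, state, n_sims=200):
    """Monte Carlo estimate of a best response: simulate each action against the
    opponent's average policy and pick the action with the highest mean return.
    This replaces NFSP's off-policy Q-network with search-based rollouts."""
    returns = [0.0, 0.0]
    for a in game.ACTIONS:
        for _ in range(n_sims):
            b = opponent_avg.sample(state)
            r = game.payoff(a, b) if player == 0 else -game.payoff(b, a)
            returns[a] += r / n_sims
    return max(game.ACTIONS, key=lambda a: returns[a])

def self_play(episodes=20000, eta=0.1):
    game = TinyMatchingPennies()
    avg = [AveragePolicy(), AveragePolicy()]        # one average policy per player
    for _ in range(episodes):
        for p in (0, 1):
            state = "root"                          # single decision point in this toy game
            if random.random() < eta:               # anticipatory mode: act as best response
                a = mcts_policy(game, p, avg[1 - p], state)
                avg[p].update(state, a)             # record best-response action as supervised target
            else:                                   # otherwise act with the average policy
                avg[p].sample(state)
    # In matching pennies the Nash Equilibrium mixes 50/50; the average policy
    # should approach that.
    c = avg[0].counts["root"]
    print("player 0 average policy:", [round(x / sum(c), 3) for x in c])

if __name__ == "__main__":
    self_play()

In the paper's setting, the tabular average policy would be a neural network trained from a buffer of best-response actions, the rollout step would be a full Monte Carlo tree search, and (for ANFSP) several such self-play workers would collect experience asynchronously in parallel.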