1. Key Laboratory of Biomedical Information Engineering of Ministry of Education, Key Laboratory of Neuro-informatics & Rehabilitation Engineering of Ministry of Civil Affairs, and Institute of Health and Rehabilitation Science, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
2. Lanzhou Center for Theoretical Physics and Key Laboratory of Theoretical Physics of Gansu Province, Lanzhou University, Lanzhou 730000, China
huangzg@xjtu.edu.cn
History: Received 2023-09-19; Accepted 2023-11-01; Revised 2024-01-19; Published 2024-08-15
Abstract
Whether a complex game system composed of a large number of artificial intelligence (AI) agents empowered with reinforcement learning can produce highly favorable collective behaviors purely through agent self-exploration is a question of practical importance. In this paper, we address this question by combining a canonical theoretical model of resource allocation systems, the minority game, with reinforcement learning. Each individual participating in the game is endowed with a certain degree of intelligence based on a reinforcement learning algorithm. In particular, we demonstrate that as the AI agents gradually become familiar with the unknown environment and try to provide optimal actions to maximize their payoff, the whole system continues to approach the optimal state under certain parameter combinations, and herding is effectively suppressed by an oscillating collective behavior that is a self-organizing pattern requiring no external interference. An interesting phenomenon revealed by the numerical results is a first-order phase transition in our multi-agent system with reinforcement learning. To further understand the dynamical behavior of agent learning, we define and analyze the conversion paths of belief modes and find that self-organizing condensation of belief modes appears for the given trial-and-error rates in the AI system. Finally, we provide a detection method for the emergence of the period-two oscillation collective pattern based on the Kullback−Leibler divergence and identify the parameter region where the period-two oscillation appears.
The emergence of collective behaviors [1–10] is an intriguing enigma in self-organizing systems and is ubiquitous in living systems (e.g., starling flocks, ant colonies, fish schools, and microbial swarms), human social systems (e.g., the formation of social norms, political behavior, and collective decision-making), and computer simulation systems (e.g., the Game of Life). As typical self-organizing systems, resource allocation systems, which are a basis of the modern economy and society, exhibit diverse collective behaviors in which the minority-wins mechanism [11] often dominates individual behavior, and the patterns of individuals' collective behavior are critical to the efficiency and functional realization of such systems.
To uncover the dynamical mechanism of collective behavior evolution in resource allocation systems, a large number of theoretical models involving minority games have been proposed [12] and developed [13–23] over the last 20 years, aiming to reveal various collective behavior patterns and their corresponding dynamical mechanisms. The idea of the minority game has also been widely used in the theoretical modeling and analysis of various real resource allocation systems, such as resource allocation in cloud manufacturing [24], collective decision-making in fish schools [25], and arbitrageurs' decision-making behavior in financial markets [26]. Another core problem is how to control the dynamical evolution of the system by means of external intervention to suppress herd behavior, e.g., pinning control [15], because suppressing the herd effect can not only effectively improve the efficiency of system resource utilization but also maintain a long-term, stable, and sustainable resource supply.
In recent years, machine learning has advanced rapidly; in particular, reinforcement learning, as one of its learning paradigms, plays an important role in artificial intelligence research. Reinforcement learning is an algorithm that makes optimal decisions through continuous interactive learning with the environment by trial and error. So far, it has been widely and successfully applied in finance [27–29], disease prediction [30, 31], game decision systems [22, 32–35], and other research fields. It is exciting to note that reinforcement learning can also be applied to complex systems with multiple agents. For example, cooperative behavior can be promoted by Lévy noise, and interesting collective modes of oscillatory cooperation have emerged in evolutionary game systems with reinforcement learning [36–38]. This provides a new perspective on the challenge of revealing the emergence mechanisms of collective behavior. As shown in our previous work [39–41], resource allocation systems that combine reinforcement learning with delayed rewards and the minority game can exhibit self-restoring collective behavior that resists outside interference, and the system can evolve to the optimal resource allocation state without external intervention.
However, a series of social surveys [42] shows that the reward mechanism, or the delay before a reward arrives, can make a big difference to collective social behavior; the authors also emphasized that weaker performers need to be rewarded immediately, rather than in the distant future, to address their poverty. Inspired by this observation, we further investigate the emergence of collective behavior in resource allocation systems in which agents learn by reinforcement under immediate rewards. Without loss of generality, the agents in our multi-agent AI system compete for limited resources to maximize their own payoff through the Q-learning algorithm; only the reward mechanism differs from our previous work [39]. Typical questions are whether resource allocation can still be optimized under immediate rewards, whether herding can be effectively inhibited, and what kind of collective behavior pattern can emerge. In this paper, we address the question of how a multi-agent AI system can regulate the operation of the entire game system by suppressing adverse dynamical behavior. In addition, we analyze the effects of the various learning parameters and of the system size on the oscillatory evolution of collective behavior in AI systems. Addressing these issues is of paramount importance because such a framework can further our understanding of collective behavior emergence and its mechanisms in large-scale multi-agent complex game systems. It is also the first step toward designing human-machine systems, which are an inevitable trend in the future.
This paper is organized as follows. In Section 2, we introduce reinforcement learning with an immediate reward mechanism into minority game systems (RLMG systems). To show the evolution of collective behavior in the AI systems, the main numerical simulation results for multiple combinations of learning parameters are presented in Section 3. In Section 4.1, we reveal an interesting and important phenomenon: a first-order phase transition appears as the exploration rate varies in the multi-agent AI system, and two phases can be observed (a period-two oscillatory mode and a non-periodic oscillatory mode). To further understand the learning process of the agents, we analyze the conversion paths between belief modes in Section 4.2 and describe the self-organizing condensation of belief modes in the appendix. Moreover, Section 4.3 shows that ergodicity breaking occurs and that, for a small exploration rate, resource allocation performs better than in a random system without external interference. Then, in Section 4.4, a detection method for the emergence of the period-two oscillation collective pattern based on the Kullback−Leibler (KL) divergence is introduced and used to locate the parameter region where the period-two oscillation appears. Finally, a discussion and conclusion are provided in Section 5.
2 RLMG model
Without loss of generality, our system contains N agents competing for N_r limited resources, denoted by R_1, R_2, …, R_{N_r}. The capacity of resource R_r is C_r, the maximum number of agents that the resource can hold. For simplicity, the resource capacities are taken to be identical, C_1 = C_2 = … = C_{N_r} ≡ C. The vector ρ(t) = (ρ_1(t), …, ρ_{N_r}(t)) reflects the agents' preference for resources at time t, where the component ρ_r(t) is the proportion of agents choosing resource R_r. If N ρ_r(t) ≤ C_r, the individuals who choose resource R_r are winners, because the number of agents does not exceed the maximum load of the resource in this round. On the contrary, if N ρ_r(t) > C_r, those agents are losers due to resource overload in this round.
Q-learning is a typical reinforcement learning algorithm [43], and it is combined with the minority game in our model. The goal of an agent empowered by Q-learning is to learn an optimal strategy through continuous interaction with the environment: starting from the current state, it maximizes the expected value of the total reward over all subsequent steps. The relationship between an agent's state, the available actions, and the rewards obtained is parameterized by the Q-function. Strategies that obtain higher rewards are strengthened by updating the Q-function in each round of the game. To seek higher rewards, agents continuously improve their knowledge via the Q-function and via trial and error during the course of the game. The updates of the agents' states and Q-functions are performed synchronously, equivalent to the synchronous updating of Monte Carlo simulations [44, 45]. Each agent is empowered by the Q-learning algorithm. The state set S contains the labels of the resources that each agent can occupy, and the action set A contains the actions that each agent can choose in any state. In most typical Q-learning tasks, the elements of S and A are distinguished by their environmental properties; in our minority game model, however, the action set and the state set are identical, S = A = {R_1, …, R_{N_r}}.
The Q-function of each agent is a time-dependent memory matrix Q(t) with elements Q_{s,a}(t), s ∈ S, a ∈ A. The rows of this matrix represent the possible states of the agent, and the columns represent the actions. If the agent is in state s and takes action a at time step t, then, according to the Bellman equation, the corresponding element of this matrix is updated as follows:

Q_{s,a}(t+1) = Q_{s,a}(t) + α [ r(t) + γ max_{a'} Q_{s',a'}(t) − Q_{s,a}(t) ],        (1)

where the subscripts s and a denote the current state of the agent and the action that the agent takes, respectively, α is the learning rate, and r(t) is the reward obtained by the agent for taking action a at time step t. The parameter γ is the discount factor, which measures the impact of future rewards: agents with γ = 0 are short-sighted and care only about their current payoff, while agents with larger γ take a longer-term view. The term max_{a'} Q_{s',a'}(t) is the optimal estimate of the future value at the next state s', which results from taking action a in state s at the current time step. According to Eq. (1), the Q-function accumulates historical experience through the iterative update of the memory matrix elements and reflects the quantitative relationship between state, action, and reward. As the time spent exploring the environment increases, the Q-function performs better and better based on the feedback of the action rewards.
In our model, each agent in the system has its own memory matrix Q. As the agent starts to interact with the environment, the adaptability of its Q-function increases rapidly. As in the basic Q-learning algorithm, agents make decisions mostly based on their own memory matrix, but they must also select actions with a certain randomness for trial and error. In summary, for a given set of learning parameters, the Q-learning procedure, illustrated in Fig. 1, is as follows:
1) Randomly initialize all agents' memory matrices Q and their states s.
2) With probability 1 − ε, each agent takes the greedy action suggested by its Q-function, that is, a(t) = argmax_{a'} Q_{s(t),a'}(t); otherwise, with probability ε (the exploration rate), it randomly selects an action for this round. Meanwhile, if the action leads to the current winning (minority) state, the agent receives the winning reward from the environment; if instead the resulting state is the failed (majority) state, it receives the losing reward.
3) The agent updates the corresponding element of its memory matrix following Eq. (1), and its state is updated to the resource it has just chosen, s(t+1) = a(t).
4) Repeat steps 2)−3) until the system becomes stable or a preset termination condition is reached.
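As an illustration of steps 1)−4), the following minimal Python sketch implements one possible realization of the RLMG loop for the two-resource case studied below. The parameter values, the reward of 1 for winners and 0 for losers, and the uniform random initialization are illustrative assumptions rather than the exact settings of the paper; only the ε-greedy choice, the minority rule, and the update of Eq. (1) follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

N, N_R = 1001, 2                      # number of agents and resources (illustrative sizes)
alpha, gamma, eps = 0.1, 0.9, 0.02    # learning rate, discount factor, exploration rate (assumed values)
T = 10_000                            # number of game rounds

Q = rng.random((N, N_R, N_R))         # one N_R x N_R memory matrix per agent, random initialization
state = rng.integers(N_R, size=N)     # current resource (state) of each agent
idx = np.arange(N)

history = np.empty(T)
for t in range(T):
    # epsilon-greedy action: follow the Q-function with prob. 1 - eps, explore with prob. eps
    greedy = Q[idx, state].argmax(axis=1)
    explore = rng.random(N) < eps
    action = np.where(explore, rng.integers(N_R, size=N), greedy)

    # minority rule with equal capacities: a resource occupied by at most N / N_R agents wins
    counts = np.bincount(action, minlength=N_R)
    reward = (counts[action] <= N / N_R).astype(float)   # assumed reward: 1 for winners, 0 for losers

    # synchronous Q-learning update, Eq. (1); the next state s' is the chosen resource
    best_next = Q[idx, action].max(axis=1)
    q_sa = Q[idx, state, action]
    Q[idx, state, action] = q_sa + alpha * (reward + gamma * best_next - q_sa)
    state = action

    history[t] = counts[0] / N        # occupation ratio rho_1(t) of the first resource
```

The vectorized update evaluates all rewards from the same round before any matrix element is overwritten, which reproduces the synchronous updating described above.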
As stated in Ref. [42], differences in reward mechanisms can lead to very different dynamical behavior. When the feedback reward was delayed by one time step, as in our previous work [39], an explosive oscillatory collective behavior that tends toward the optimal state emerged. In our current minority game system, however, the agent empowered by reinforcement learning receives an immediate reward for an action while it learns about the environment. A remarkable feature is that the collective behavior then presents a gentle oscillation around the optimal state without the explosion phenomenon, and as the exploration rate increases the system undergoes a phase transition. In addition, our minority game system combined with RL also differs from other studied game systems [15], because each agent carries a complex coupled memory of its past actions, encoded in the Q-function and adjusted by the learning parameters and the reward, in order to optimize the allocation of resources.
3 Simulation results
In this paper, we focus on the case of two resources (N_r = 2), the simplest minority game model with Q-learning agents. Each agent can thus be in one of two resources (states), R_1 or R_2, and it can choose either resource as its action. The Q-function can be expressed as the 2 × 2 matrix

Q(t) = [ Q_{R_1,R_1}(t)  Q_{R_1,R_2}(t) ; Q_{R_2,R_1}(t)  Q_{R_2,R_2}(t) ].

In fact, the Q-function is defined on the Cartesian product of the state set S and the action set A. The two resources have equal capacity. At the beginning of the evolution, all the elements of the agents' Q-functions are randomly initialized, and the agents' states are randomly assigned from S = {R_1, R_2}.
Here, the occupation ratio of resource R_1, denoted ρ_1(t), is defined as

ρ_1(t) = n_1(t)/N = (1/N) Σ_{i=1}^{N} δ_{s_i(t), R_1},

where n_1(t) is the number of agents selecting resource R_1, s_i(t) is the state of agent i, and δ is the Kronecker delta, with δ_{s_i(t), R_1} = 1 if the state of agent i is R_1 at time t and 0 otherwise. Obviously, the optimal resource allocation in the RLMG system is ρ_1 = ρ_2 = 1/2. To measure the performance of the RLMG system, a commonly used metric is the variance

σ² = (1/T) Σ_{t=1}^{T} [ n_1(t) − N/2 ]²,

which reflects the fluctuation of the system away from the optimal allocation of resources over a long period [15]. In order to eliminate the influence of the finite system size, the normalized form σ²/N is used.
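Given the occupation-ratio time series produced by a simulation such as the sketch above, σ²/N can be estimated directly; the averaging window and the discarded transient below are illustrative choices, and the normalized variance form follows the verbal definition given in the text.

```python
import numpy as np

def sigma2_over_N(rho1, N, discard=1000):
    """Fluctuation of the resource load around the optimal allocation rho_1 = 1/2,
    normalized by the system size N. `discard` drops an initial transient (assumed length)."""
    n1 = np.asarray(rho1[discard:]) * N          # number of agents on resource R_1 per round
    return np.mean((n1 - N / 2) ** 2) / N

# example usage with the arrays from the simulation sketch above:
# print(sigma2_over_N(history, N))
```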
Fig.2 shows σ²/N for the RLMG systems as a function of the exploration rate ε. A striking result is that σ²/N jumps discontinuously at a specific exploration rate, the transition point, which is observed for many different parameter combinations, as shown in Fig.2(a)−(c). This discontinuous jump indicates that the dynamical behavior of the system undergoes a first-order phase transition; the transition is discussed further in Section 4.1. In addition, for the fixed discount factors used in Fig.2(a) and (b), the transition point decreases monotonically as the learning rate α increases. Fig.2(c) shows that, for a fixed learning rate, the transition point also depends on the discount factor γ, but without a monotonic relation. For all combinations of the learning parameters α and γ, σ²/N is independent of the system size except in the region where ε is close to 1, where the occupation behavior of the agents is almost random and breaks the collective mode, as shown in Fig.2(d). Simulations with different learning parameters show that the qualitative behavior of σ²/N with respect to ε is always the same; only the position of the transition point changes.
The discontinuous phase transition breaks the regular oscillation of the time series ρ_1(t). Taking the learning-parameter combination marked by the orange diamond in Fig.2(b) as a concrete example, the evolution of ρ_1(t) near and far from the transition point is presented in Fig.3. A stable period-two oscillation appears for exploration rates above the transition point, while below it this oscillation is broken and turns into an irregular oscillation. In fact, a stable period-two oscillation is a common collective behavior in minority game models [14, 15, 46]; it is a kind of herd behavior that produces inefficient resource allocation, and suppressing it usually requires external control, such as pinning control [46]. However, the RLMG system composed of AI agents can effectively suppress herding at very low exploration rates and organizes itself into an optimal resource allocation state by breaking the regular oscillation, as shown in Fig.3(a). As ε increases, the amplitude of the irregular oscillation grows rapidly, eventually leading to stronger herding behavior. Fig.3(d)−(h) show temporal evolutions of ρ_1(t) to the right of the transition point. As ε increases further, the amplitude of the regular oscillation first increases and then decreases, which corresponds to the non-monotonic behavior of σ²/N in Fig.2; the exploration rate at which σ²/N peaks is referred to below as the peak position. Further simulations show that the amplitude of the regular oscillation also depends on the learning rate α, while no obvious pattern is found for the discount factor γ. For any given combination of learning parameters, the amplitude and the peak position are independent of the system size N. As ε → 1, the amplitude of the periodic oscillation gradually decreases until it is dominated by noise and completely degenerates into a random oscillation, since all AI agents then access the resources at random.
4 Analysis
4.1 The discontinuous transition
In this subsection, we investigate the jump of the resource-load fluctuation σ²/N. A major question is whether this jump means that the intelligent system undergoes a first-order phase transition. To answer it, the collective oscillation modes on the two sides of the transition point, shown in Fig.4(a) and (b), have to be distinguished. On the right side of the transition point, ρ_1(t) exhibits a relatively stable period-two oscillation mode (PTOM), as in Fig.4(a), and this mode maintains itself until the noise intensity overwhelms it as ε approaches 1. This is a robust dynamical phase of the RLMG system, because the oscillation pattern cannot be destroyed by finite noise, as discussed in Section 4.4. It is worth mentioning that the amplitude of this oscillation mode varies non-monotonically as ε increases, while the period-two oscillation itself is maintained. The non-monotonic behavior of σ²/N (a single oscillation mode characterized only by its amplitude) results from the competition between two driving forces, namely the AI agents' exploration (reflected by ε) and their pursuit of profit maximization (reflected by the memory matrix, i.e., the Q-function). In the interval between the transition point and the peak position of σ²/N, the agents' revenue dominates the evolution of the system, and the degree of herding increases with ε. Beyond the peak position, the agents' exploration plays the leading role, so that the elements of the Q-function are randomly selected and updated; the herd effect is then suppressed to some extent, but the optimization performance is not better than that of a random system.
On the left side of the transition point, the RLMG system exhibits abundant oscillatory evolution modes, which differ mainly in oscillation frequency and amplitude; these oscillations are collectively called the non-periodic oscillation mode (non-POM). It should be emphasized that, for a given exploration rate ε, the evolution of ρ_1(t) switches flexibly between different oscillatory modes instead of following a fixed oscillatory pattern, as shown in Fig.4(b) and (c). To characterize the oscillatory behavior, we investigate the power spectrum of ρ_1(t), defined as

S(f) = |F_T[ρ_1](f)|² / T,

where F_T[ρ_1](f) is the Fourier transform of ρ_1(t) over a time series of length T. Fig.4(e) shows two examples of how the power spectrum is distributed over the oscillation periods on the left of the transition point. This is a special dynamical phase of MG systems with reinforcement learning agents. As the exploration rate approaches the transition point, an obvious switch occurs between the two phases, i.e., the PTOM phase and the non-POM phase, as shown more intuitively in Fig.4(d). One can see from Fig.4 that self-organized collective behavior emerges in the system below the transition point. In the interval of larger exploration rates, the period-two oscillation mode persists while its amplitude changes; the key factor is that in this interval the Q-functions self-organize into a very robust structure. In contrast, for small exploration rates, the structure of the Q-matrices is fragile, so that a variety of structures emerge, differing between systems of different sizes.
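For concreteness, a power spectrum of this kind can be estimated from a simulated time series with a simple periodogram; the normalization used below is one common convention and may differ from the one adopted in the paper.

```python
import numpy as np

def power_spectrum(rho1):
    """Periodogram of the occupation-ratio time series rho_1(t).
    Returns frequencies (in units of 1/time step) and spectral power."""
    x = np.asarray(rho1, dtype=float)
    x = x - x.mean()                          # remove the mean to suppress the zero-frequency peak
    T = len(x)
    spec = np.abs(np.fft.rfft(x)) ** 2 / T    # |FFT|^2 / T, a standard periodogram normalization
    freq = np.fft.rfftfreq(T, d=1.0)
    return freq, spec

# a period-two oscillation appears as a sharp peak at frequency 0.5 (a period of two time steps)
```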
To analyze the transition between the PTOM and the non-POM, an order parameter m is defined in terms of the sign function sgn(x), which equals 1 for positive x and −1 for negative x, evaluated over a time series of length T. If the order parameter takes its maximal value, the system oscillates with period two (PTOM), i.e., it is in the ordered phase; otherwise, the period-two oscillation is broken and the system has turned into the non-POM phase. Taking a representative combination of learning parameters as an example, the order parameter changes suddenly at the transition point, as shown in the inset of Fig.5(a), which is consistent with the jump of σ²/N in Fig.2(b). As ε increases further, the order parameter decreases gradually, reflecting the destructive effect of noise on the periodic oscillation, until, at a sufficiently large exploration rate, the period-two oscillation is completely broken by the randomness of exploration.
According to Binder's work on phase transitions [47, 48], a transition is first order if the Binder cumulant of the order parameter becomes negative at the transition point. For the transition that breaks the PTOM, the Binder cumulant of the order parameter m is defined as

U = 1 − ⟨m⁴⟩ / (3⟨m²⟩²),

and U is shown as a function of ε in the main panel of Fig.5(a). Obviously, when ε reaches the transition point, where the discontinuous jump of σ²/N occurs, U changes suddenly to a negative value. This is significant evidence that the system undergoes a first-order phase transition there. However, we also notice that U decays to a negative value as ε approaches 0. One possible reason is that additional, more complex oscillatory-mode switching occurs at very low noise intensity; unfortunately, this switching is difficult to detect in more detail.
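The sketch below shows one plausible way to compute these quantities from simulation data. The specific period-two order parameter (the sign of the product of consecutive increments of ρ_1(t)) is an assumed form consistent with the description above, not necessarily the exact definition used in the paper; the Binder cumulant follows the standard fourth-order form.

```python
import numpy as np

def period_two_order_parameter(rho1):
    """Assumed period-two order parameter built from the sign function:
    under a perfect period-two oscillation, consecutive increments of rho_1(t)
    alternate in sign, so m = 1 in the PTOM and m < 1 otherwise."""
    d = np.diff(np.asarray(rho1, dtype=float))
    return -np.mean(np.sign(d[1:] * d[:-1]))

def binder_cumulant(samples):
    """Fourth-order Binder cumulant U = 1 - <m^4> / (3 <m^2>^2) over independent
    samples of the order parameter; U turning negative at the transition point
    signals a first-order transition."""
    m = np.asarray(samples, dtype=float)
    return 1.0 - np.mean(m ** 4) / (3.0 * np.mean(m ** 2) ** 2)
```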
This transition also has other general characteristics of a first-order phase transition. By evolving the system while slowly increasing (or decreasing) the exploration rate ε, a hysteresis loop of σ²/N near the transition point is observed, as shown in Fig.5(c), in accordance with other discontinuous phase transitions [49–51]. Ref. [52] emphasized that a more severe slowing-down occurs at a first-order transition point, because free-energy barriers separate the ordered and disordered phases; correspondingly, the dynamics of our system slow down exponentially with the system size near the transition point. To detect the form of this slowing-down, the lifetime of the PTOM for a fixed system size is defined as

τ = (1/M) Σ_{i=1}^{M} T_i,

where T_i is the number of time steps of the i-th complete PTOM episode between two adjacent switches of the oscillation mode, and M is the total number of PTOM episodes in a sufficiently long time series of ρ_1(t). In our system, this lifetime indeed exhibits exponential slowing-down at the transition point: the duration of the PTOM (or non-POM) grows exponentially with the system size N, i.e., τ ∼ e^{cN} [see Fig.5(b)]. This indicates that the RLMG system has a bistable structure around the transition point, which is a typical feature of a first-order transition. Fig.5(d) demonstrates that the gap of σ²/N between the PTOM and the non-POM at the transition point does not vanish as the system size N increases.
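As a rough illustration, PTOM episode durations can be extracted from a long run by classifying successive windows of the time series with the order-parameter function from the previous sketch; the window length and threshold below are illustrative choices, not values taken from the paper.

```python
import numpy as np

def ptom_lifetimes(rho1, window=100, threshold=0.9):
    """Estimate PTOM episode durations from a long time series of rho_1(t).
    A window is labeled PTOM when the (assumed) period-two order parameter
    exceeds `threshold`; consecutive PTOM windows form one episode."""
    labels = []
    for start in range(0, len(rho1) - window, window):
        seg = rho1[start:start + window]
        labels.append(period_two_order_parameter(seg) > threshold)

    durations, run = [], 0
    for is_ptom in labels:
        if is_ptom:
            run += 1
        elif run > 0:
            durations.append(run * window)    # episode length in time steps
            run = 0
    if run > 0:
        durations.append(run * window)
    return durations                           # np.mean(durations) estimates the lifetime tau
```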
4.2 The formation and transformation of belief modes
At any time, each AI agent occupies one of only two states, R_1 or R_2, and in either state it can take one of two actions, choosing R_1 or R_2. Each AI agent must take an action in every round of the minority game to maximize its payoff based on feedback from the environment through the reinforcement learning algorithm, and it then updates its Q-function and its state. In a typical Q-learning algorithm, the AI agent constantly explores the optimal strategy by trial and error with probability ε, or takes the action suggested by its Q-function with probability 1 − ε (see Fig.1). If an AI agent makes its decision based on the Q-function, its Q-function structure is further consolidated; conversely, if the agent takes a random action to explore the environment, the Q-function structure is slightly broken down and reconstructed. Agent learning can therefore be decomposed into two processes: the former is the memory-reinforcement updating process (mR-process) and the latter is the trial-and-error, or exploration, updating process (tE-process). In particular, when the tE-process occurs with a small probability ε, it can be regarded as a perturbation of the mR-process.
To further understand the dynamical behavior of the RLMG system, we tracked the learning process of a randomly selected AI agent as it adapts to the environment. The belief mode of an AI agent is defined from its Q-function and its current state s as the action corresponding to the largest element in the row of Q associated with s. Therefore, for a fixed state set S and a fixed action set A, the belief mode set consists of all state-action pairs, and the belief mode of an agent is time-dependent. The belief mode that an agent belongs to indicates a preference for a certain resource; that is, the AI agent's belief in state s at the current moment is the corresponding preferred action a. To quantify the strength of an AI agent's preference for resources, we define the belief-mode strength as

G_s(t) = (1/(|A| − 1)) Σ_{a ≠ a*} [ Q_{s,a*}(t) − Q_{s,a}(t) ],  with a* = argmax_{a} Q_{s,a}(t),        (9)

i.e., the average gap between the largest element and the other elements in the row of state s at time t. The larger this gap, the more robust the belief mode, which implies a more persistent preference of the agent for action a*. From Eq. (9) it follows that the number of belief modes is determined by the dimensions of the Q-function: a larger Q-function admits more belief modes. For example, a 3 × 3 Q-function yields nine belief modes, and a 2 × 4 (or 4 × 2) one yields eight.
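A minimal sketch of this definition, assuming the per-agent Q-matrices of the simulation sketch above, reads the belief mode and its strength directly from one row of Q.

```python
import numpy as np

def belief_mode_and_strength(Q_agent, s):
    """Belief mode of one agent: the preferred action in its current state s is the
    column with the largest Q value in row s; the strength is the average gap between
    that element and the other elements of the row, following Eq. (9)."""
    row = np.asarray(Q_agent[s], dtype=float)
    a_star = int(np.argmax(row))
    others = np.delete(row, a_star)
    strength = float(np.mean(row[a_star] - others))
    return (s, a_star), strength

# example with the arrays from the simulation sketch:
# mode_i, strength_i = belief_mode_and_strength(Q[i], state[i])
```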
In our system, only two resources are set, R_1 and R_2, so the state set is S = {R_1, R_2} and the action set is A = {R_1, R_2}. Therefore, each AI agent in this two-resource RLMG system has four possible belief modes, one for each state-action pair. For the mR-process and the tE-process, the transformations between belief modes can be fully represented by a doubly coupled directed network in which the belief modes are the nodes and the directed edges are the transformation directions between modes; the coupling between the two layers reflects the interaction of the two processes, as shown in Fig.6. Specific transitions between modes occur as the agent learns and adapts to the environment. The belief mode of an AI agent follows fixed evolution paths through two processes: the mR-process (solid arrows, top layer in Fig.6) and the tE-process (dashed arrows, bottom layer in Fig.6). Each transformation between belief modes, including the self-transformation of a mode into itself, is accompanied by an update of the corresponding matrix elements (at the end of the arrow). Under the mR-process alone, an AI agent's belief mode follows an ordered closed loop of transformations. The occurrence of the tE-process, however, destroys the orderliness of this closed loop and leads to jump transformations between modes, because the agent then adopts the action opposite to the one suggested by its Q-function.
The belief mode of each AI agent is thus determined by its own Q-matrix and its current state. For a given game dynamics, the essence of agent learning is the mutual conversion of belief modes so as to maximize returns. In our RLMG system, two of the belief modes flow predominantly toward the other two until the system reaches a dynamic equilibrium for a fixed exploration rate, as shown in Fig. A1 of the Appendix; that is, a self-organizing condensation of belief modes appears in the system. There is a strong nonlinear correlation between the mR-process and the tE-process during the agents' adaptive reinforcement learning. Unlike the noise perturbations of classical nonlinear dynamical systems, the tE-process not only perturbs the agent's current decision but also perturbs, or unlocks, its belief mode, which may be responsible for the oscillatory convergence of the system; it then indirectly affects the agent's subsequent decisions through the mR-process, which acts with the larger probability. The evolutionary behavior of the belief modes is discussed in more detail, with visualizations, in the appendix.
4.3 Sensitivity to initial conditions at low exploration rates
In this subsection, we focus on the dynamical behavior when the exploration rate ε approaches 0. An interesting phenomenon is that the RLMG system allocates resources more efficiently than a random system when ε is close to 0, for almost all possible parameter combinations (see Fig.7). In other words, an optimal allocation of resources gradually emerges under weak noise in the RLMG system. The solid red lines in Fig.7 represent the resource-load fluctuation of a completely random system, which is a constant independent of ε. A large number of numerical results show that the RLMG system has an optimal exploration rate, i.e., the exploration rate at which σ²/N is minimal; at this optimum the deviation of the resource load is independent of the system size N, which is a very significant finding. Under the constraint of the Q-functions of the reinforcement learning algorithm at small noise, the system spontaneously develops a collective mode that effectively suppresses herding behavior.
In addition to this self-organized optimization, the collective behavior of the system is sensitive to the initial conditions at low exploration rates. Both the initial occupation ratio of a resource and the initial distribution of the elements of the agents' Q-functions affect the future evolution of σ²/N. As shown in Fig.7(a), three different initialization structures of the Q-functions, one of which initializes all elements at random, lead to different positions of the optimal exploration rate even though they share the same initial occupation ratio. In Fig.7(b) and (c), the initialization structure of the agents' Q-functions is the same but the initial occupation ratios are different; the system then has the same optimal position for a given initialization, while the curves of σ²/N are distinct. This phenomenon is similar to the glass transition observed in some theoretical models of glasses: the system undergoes a glass transition when the temperature, which corresponds to the exploration rate in our system, drops rapidly, and it passes from an ergodic state to an ergodicity-broken state. An interesting finding is that the optimal position of the system's resource configuration is uniquely determined by the initialization of the Q-table and is independent of the initial proportion of the resource load; the relative values of the initial Q-function elements play the major role in determining the position of the optimum.
4.4 The period-two oscillation mode breaking at ϵ = 1
As mentioned in Section 4.1, the period-two oscillation mode gradually loses stability and can no longer be identified when the exploration rate approaches 1 in the multi-agent AI system. An important question is whether this regular oscillation pattern (i.e., the PTOM) is destroyed at a finite exploration rate or whether its apparent disappearance is a finite-size effect. To investigate this problem, the KL divergence is introduced to quantify the destruction of the collective oscillation:

D_KL(P_ε || P_1) = Σ_{n_1} P_ε(n_1) ln [ P_ε(n_1) / P_1(n_1) ].

Specifically, the KL divergence measures the difference between two distributions. Here the two distributions are P_ε(n_1), the distribution of the number of agents occupying resource R_1 at exploration rate ε, and P_1(n_1), the corresponding distribution at ε = 1. For a period-two oscillation, P_ε(n_1) is a typical bimodal distribution, while as ε approaches 1 it degenerates into the Gaussian distribution P_1(n_1).
Fig.8(a) demonstrates the bimodal probability density function (PDF) of the resource occupation for two exploration rates in the PTOM region (green and red symbols). The emergence of the bimodal distribution indicates that the multi-agent system is shaped by the reinforcement learning algorithm and thus deviates from a random system. Moreover, the bimodal structure further implies that a period-two oscillation mode emerges in the system through a dynamical bifurcation. The gap between the two peaks increases as the exploration rate decreases, i.e., the PTOM becomes more distinct. Different empty symbols represent different system sizes, and interestingly, the PDFs for various system sizes can be collapsed onto a single curve by a simple rescaling. This is consistent with the fact that σ²/N is independent of the system size in the period-two oscillation region, as shown in Fig.2(d).
Fig.8(b) shows the KL divergence as a function of the exploration rate in logarithmic coordinates; for different system sizes, it follows a power-law scaling with ε. This power-law behavior shows that the period-two oscillatory collective behavior does not appear suddenly at some particular ε, but is present throughout the interval between the transition point and ε = 1. The situation is analogous to a phase transition with critical value ε = 1: throughout that interval only one phase exists, namely the PTOM. Therefore, the apparent loss of order at large ε occurs simply because the strong noise drowns out the period-two oscillation pattern. In addition, we find that the KL divergence is not only independent of the system size, as shown by the different symbol shapes of a given color, but also independent of the learning-parameter combination, as shown by the different colors. For clarity, the gaps between curves of different colors in Fig.8(b), which correspond to different learning-parameter combinations, are caused by an artificial vertical translation.
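In practice, the KL divergence between the occupation distribution at exploration rate ε and the near-Gaussian reference at ε = 1 can be estimated from two simulated time series of n_1(t); the histogram binning and the small regularizer below are illustrative choices.

```python
import numpy as np

def kl_divergence_to_reference(n1_eps, n1_ref, bins=60):
    """Estimate D_KL(P_eps || P_1) from two time series of n_1(t): one at exploration
    rate eps and one at eps = 1 (the reference, nearly Gaussian). Both series are
    binned on a common grid before the divergence is computed."""
    n1_eps = np.asarray(n1_eps, dtype=float)
    n1_ref = np.asarray(n1_ref, dtype=float)
    lo = min(n1_eps.min(), n1_ref.min())
    hi = max(n1_eps.max(), n1_ref.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(n1_eps, bins=edges)
    q, _ = np.histogram(n1_ref, bins=edges)
    p = p.astype(float) + 1e-12               # regularize empty bins
    q = q.astype(float) + 1e-12
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```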
5 Discussion and conclusion
The emergence of collective behavior in complex systems consisting of a large number of interacting components is universal in ecosystems and in social and economic systems. Collective behavior can be positive, such as efficient teamwork, bird foraging, and animal migration, or negative, such as investor herding in the stock market, stampedes, and traffic jams. Herding is one of the typical collective behaviors of complex resource allocation systems and can spread quickly through a system, causing it to collapse. The best performance of such a system is usually measured by the sustainable and maximal use of resources by all individuals in the system. One of the core objectives in managing a resource allocation system is to avoid herd-like collective behavior through reasonable mechanism regulation. The machine learning paradigm provides a new perspective for studying the emergence of collective behavior in complex systems. Moreover, artificial intelligence, empowered by various machine learning paradigms, will increasingly penetrate all aspects of human society, such as fleets of self-driving cars and drone swarms. It is therefore of great practical significance to explore the emergence of collective behavior and its mechanisms through the marriage of AI and complex systems, especially the adaptive regulation of collective behavior in complex systems consisting of a large number of AI agents. More generally, the main goal of our research is to understand how machine learning with immediate rewards in a system of AI agents affects the collective dynamics of complex systems.
In this paper, we combined the minority game, a typical model of resource allocation systems, with reinforcement learning (RL) in the form of the Q-learning algorithm, by building simplified individual AI models. In our complex system of AI agents, the individuals are intelligent to some extent and are capable of reinforcement learning. The intelligent population can self-organize and evolve toward a predetermined goal based on the feedback rewards from the environment and the simplified Q-learning algorithm. As the exploration rate increases (playing a role similar to temperature in thermodynamic systems), an interesting transition phenomenon is found in this AI complex system, namely a first-order phase transition. Further study shows that two phases can be observed: a period-two oscillatory evolution mode and a non-periodic oscillatory evolution mode. This means that the system is regulated by the exploration rate of the reinforcement learning algorithm, and the multi-agent system can exhibit two kinds of collective behavior (regular stable oscillation and irregular oscillation). As in traditional thermodynamic systems, an exponential slowing-down is also found near the transition point. In short, our system differs from existing complex multi-agent game systems with reinforcement learning in three respects: first, each agent has a simple memory matrix (Q-function) with low computational complexity; second, each agent's decision is independent, and the interaction between agents arises only through the feedback reward of the environment; third, based on this simple model, the intelligent game population can realize self-organized optimization and switch between behavior patterns for given parameter combinations.
In addition, an important phenomenon is that when the trial-and-error rate of the agents is very low, the system self-organizes into a state of optimal resource allocation, without any external intervention, at the optimal exploration rate, performing far better than the random system. This indicates that large-scale complex systems composed of AI individuals can realize self-organization and collaboration based on individual interactions and reinforcement learning, and can thus exhibit positive collective behaviors for certain learning parameters. The definition of the belief modes and their transformation paths provides the emergent paradigm and evolutionary mechanism of the collective behavior of multi-agent systems from a more microscopic perspective. For example, for a specific parameter combination, as the learning process advances, two of the belief modes gradually dominate while the other two are gradually eroded, although they never disappear. The belief-mode approach provides a clearer physical picture for understanding the evolution of the RLMG system's dynamical behavior. We give the belief-mode transformation path of an AI individual as it interacts with the environment, and this path is universal, regardless of the learning parameters, for a given game dynamics. We then demonstrate the emergence of the period-two oscillation mode at high exploration rates and conclude that the collective behavior emerging in the AI complex system depends strongly and qualitatively on the exploration rate and on the learning parameters, which determine the position of the phase transition.
Finally, in our RLMG system there are four key exploration rates: the optimal exploration rate, the first-order transition point, the peak position of σ²/N, and the rate at which the period-two oscillation is lost to noise. The smallest of these corresponds to optimal resource allocation; the second marks the first-order phase transition; at the third the system enters a trade-off between the two driving forces, namely the Q-function and the noise; and at the largest one the period-two oscillation mode is gradually destroyed by noise and can no longer be clearly observed. Our theoretical results suggest that slightly different reward mechanisms can have a great impact on the evolution of collective behavior, in line with observations in real social systems [42]. For example, with the delayed reward used in our previous work [39], collective behavior with periodic explosions emerged and the system could self-organize to recover from external disturbances. With the immediate reward studied here, by contrast, the system exhibits a more varied and mildly oscillating behavior that optimizes resource allocation.
Our work provides a basic theoretical framework for the integration of machine learning and complex systems. The systematic study of complex game systems from the perspective of machine learning can further help us understand the emergence and evolution of diverse oscillatory collective behavior patterns and their mechanisms, because agents with a certain level of intelligence are closer to the activity patterns of biological groups. Based on our findings, many concrete and interesting questions remain open. Unlike agents in traditional game systems, an agent with reinforcement learning ability can form a preferred belief mode over the alternative actions through its Q-function; how the flow between belief modes drives the emergence of collective behavior in complex systems is a significant and interesting question. In addition, from a control point of view, it is of practical importance to make the system develop positive collective behavior, or restrain unfavorable collective behavior, by adjusting the learning parameters of the machine learning algorithm, the system size, or the game parameters.
6 Appendix: Self-organizing condensation of belief modes in the RLMG system
In this appendix, we focus on the evolutionary behavior of the belief modes as the agents become familiar with the environment. For simplicity, the resource allocation system again has two resources, R_1 and R_2, so the belief mode set contains only four elements, one for each state-action pair. In Section 4.2, the set of belief modes and the belief strength were defined for the AI agents in the RLMG game system [Eq. (9)]. One of our concerns is how the belief modes evolve as the AI system moves from a random initial state to a steady state. Here, a concrete example of the belief evolution is given for a fixed combination of the learning parameters α, γ, and ε.
We take snapshots of the system during training and observe the evolution of the four belief modes, as shown in Fig. A1. At the initial time, the occupancies of the four belief modes are nearly equal, about 1/4 each, as shown in panel (a), owing to the random initialization of the Q-matrices and of the agents' states. As the AI system evolves and the agents learn, the belief modes begin to flow, or transform, into one another, because the elements of the Q-functions are updated to adapt to the dynamic environment. The ratios of the belief modes therefore change rapidly, as seen in snapshot (b) at a later time: the occupancy of one mode has clearly grown while that of another has dropped, indicating that the inflow into the former exceeds its outflow, i.e., its net flow is positive. As training continues, for example up to the later times shown in panels (c) and (d), the net flow of the speculative modes becomes negative. In fact, two of the belief modes encode a speculative belief: they cause the agent to change its decision at adjacent time steps and thus trigger the emergence of oscillatory collective behavior. For the present reward mechanism, these oscillating belief modes are absorbed by the other belief modes so as to maximize the overall benefit of the RLMG system. As the evolution of the RLMG system approaches a steady state, the occupancy of the oscillatory modes decreases gradually, and large-scale condensation of agents occurs on the other two modes. The oscillatory modes therefore act more like a medium that can guide the system into highly ordered collective behavior for any learning parameters; nevertheless, the resource-load time series remains a disordered, non-periodic oscillation to the left of the transition point.
We summarize the evolutionary condensation of the belief modes into four stages: (i) At the initial time, the belief modes of all agents are occupied uniformly and randomly. (ii) As the AI system learns in a self-organizing way, two of the belief modes show large asymmetric fluctuations; the system oscillates strongly because these are the main occupation modes of the agents at this stage. (iii) In the third stage of the evolution, the proportions of the belief modes begin to change relatively smoothly; the oscillatory modes gradually flow into the other two modes, so the resource preference settles into a stable oscillation whose amplitude gradually decreases. (iv) The system converges to a relatively stable state and forms a dynamic equilibrium, with a large number of agents condensed onto two of the belief modes.
In a word, as the RLMG system is trained, a large number of agents gradually self-organize onto two of the belief modes, which reduces the fluctuation amplitude of the system. Interestingly, this condensation occurs for any given exploration rate ε, even though the resource load oscillates non-periodically to the left of the phase transition point.
References
[1]
D. J. Sumpter. Collective Animal Behavior. Princeton University Press, 2010
[2]
A. Procaccini, A. Orlandi, A. Cavagna, I. Giardina, F. Zoratto, D. Santucci, F. Chiarotti, C. K. Hemelrijk, E. Alleva, G. Parisi, C. Carere. Propagating waves in starling, Sturnus vulgaris, flocks under predation. Anim. Behav., 2011, 82(4): 759
[3]
H. King, S. Ocko, L. Mahadevan. Termite mounds harness diurnal temperature oscillations for ventilation. Proc. Natl. Acad. Sci. USA, 2015, 112(37): 11589
[4]
C. R. Reid, T. Latty. Collective behaviour and swarm intelligence in slime moulds. FEMS Microbiol. Rev., 2016, 40(6): 798
[5]
Y. T. Lin, X. P. Han, B. K. Chen, J. Zhou, B. H. Wang. Evolution of innovative behaviors on scale-free networks. Front. Phys., 2018, 13(4): 130308
[6]
L. M. Ying, J. Zhou, M. Tang, S. G. Guan, Y. Zou. Mean-field approximations of fixation time distributions of evolutionary game dynamics on graphs. Front. Phys., 2018, 13(1): 130201
[7]
N. T. Ouellette. A physics perspective on collective animal behavior. Phys. Biol., 2022, 19(2): 021004
[8]
H. Murakami, M. S. Abe, Y. Nishiyama. Toward comparative collective behavior to discover fundamental mechanisms underlying behavior in human crowds and nonhuman animal groups. J. Robot. Mechatron., 2023, 35(4): 922
[9]
I. B. Muratore, S. Garnier. Ontogeny of collective behaviour. Philos. Trans. R. Soc. Lond. B, 2023, 378(1874): 20220065
[10]
Y. Liang, J. P. Huang. Robustness of critical points in a complex adaptive system: Effects of hedge behavior. Front. Phys., 2013, 8(4): 461
[11]
W. B. Arthur. Inductive reasoning and bounded rationality. Am. Econ. Rev., 1994, 84(2): 406
[12]
D. Challet, Y. Zhang. Emergence of cooperation and organization in an evolutionary game. Physica A, 1997, 246(3‒4): 407
[13]
T. Zhou, B. H. Wang, P. L. Zhou, C. X. Yang, J. Liu. Self-organized Boolean game on networks. Phys. Rev. E, 2005, 72(4): 046139
[14]
Z. G. Huang, J. Q. Zhang, J. Q. Dong, L. Huang, Y. C. Lai. Emergence of grouping in multi-resource minority game dynamics. Sci. Rep., 2012, 2(1): 703
[15]
J. Q. Zhang, Z. G. Huang, J. Q. Dong, L. Huang, Y. C. Lai. Controlling collective dynamics in complex minority-game resource-allocation systems. Phys. Rev. E, 2013, 87(5): 052808
[16]
J. Q. Dong, Z. G. Huang, L. Huang, Y. C. Lai. Triple grouping and period-three oscillations in minority-game dynamics. Phys. Rev. E, 2014, 90(6): 062917
[17]
A. Cuesta, O. Abreu, D. Alvear. Methods for measuring collective behaviour in evacuees. Saf. Sci., 2016, 88: 54
[18]
X. H. Li, G. Yang, J. P. Huang. Chaotic−periodic transition in a two-sided minority game. Front. Phys., 2016, 11(4): 118901
[19]
L. Chen. Complex network minority game model for the financial market modeling and simulation. Complexity, 2020, 2020: 8877886
[20]
S. Biswas, A. K. Mandal. Parallel Minority Game and its application in movement optimization during an epidemic. Physica A, 2021, 561: 125271
[21]
T. Ritmeester, H. Meyer-Ortmanns. Minority games played by arbitrageurs on the energy market. Physica A, 2021, 573: 125927
[22]
B. Majumder, T. G. Venkatesh. Mobile data offloading based on minority game theoretic framework. Wirel. Netw., 2022, 28(7): 2967
[23]
J. Linde, D. Gietl, J. Sonnemans, J. Tuinstra. The effect of quantity and quality of information in strategy tournaments. J. Econ. Behav. Organ., 2023, 211: 305
[24]
D. Carlucci, P. Renna, S. Materi, G. Schiuma. Intelligent decision-making model based on minority game for resource allocation in cloud manufacturing. Manage. Decis., 2020, 58(11): 2305
[25]
A. Swain, W. E. Fagan. Group size and decision making: experimental evidence for minority games in fish behaviour. Anim. Behav., 2019, 155: 9
[26]
T. Ritmeester, H. Meyer-Ortmanns. The cavity method for minority games between arbitrageurs on financial markets. J. Stat. Mech., 2022, 2022(4): 043403
[27]
Y. Deng, F. Bao, Y. Kong, Z. Ren, Q. Dai. Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans. Neural Netw. Learn. Syst., 2017, 28(3): 653
[28]
Z. Jiang, D. Xu, J. Liang. A deep reinforcement learning framework for the financial portfolio management problem. arXiv: 1706.10059 (2017)
[29]
H. Yang, X. Y. Liu, S. Zhong, A. Walid. In: Proceedings of the First ACM International Conference on AI in Finance (ICAIF'20), Association for Computing Machinery, New York, NY, USA, 2021
[30]
J. A. Cruz, D. S. Wishart. Applications of machine learning in cancer prediction and prognosis. Cancer Inform., 2007, 2: 59
[31]
J. J. Tompson, A. Jain, Y. LeCun, C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. Proc. 27th Int. Conf. Neural Inf. Process. Syst., 2014, 1: 1799
[32]
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): 484
[33]
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529
[34]
H. Huang, Y. Cai, H. Xu, H. Yu. A multiagent minority-game-based demand-response management of smart buildings toward peak load reduction. IEEE Trans. Comput. Aided Des. Integrated Circ. Syst., 2017, 36(4): 573
[35]
M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, D. Silver. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32 (2018)
[36]
S. P. Zhang, J. Q. Zhang, L. Chen, X. D. Liu. Oscillatory evolution of collective behavior in evolutionary games played with reinforcement learning. Nonlinear Dyn., 2020, 99(4): 3301
[37]
L. Wang, D. Jia, L. Zhang, P. Zhu, M. Perc, L. Shi, Z. Wang. Lévy noise promotes cooperation in the prisoner’s dilemma game with reinforcement learning. Nonlinear Dyn., 2022, 108(2): 1837
[38]
J. Xu, L. Wang, Y. Liu, H. Xue. Event-triggered optimal containment control for multi-agent systems subject to state constraints via reinforcement learning. Nonlinear Dyn., 2022, 109(3): 1651
[39]
S. P. Zhang, J. Q. Dong, L. Liu, Z. G. Huang, L. Huang, Y. C. Lai. Reinforcement learning meets minority game: Toward optimal resource allocation. Phys. Rev. E, 2019, 99(3): 032302
[40]
S. P. Zhang, J. Q. Zhang, Z. G. Huang, B. H. Guo, Z. X. Wu, J. Wang. Collective behavior of artificial intelligence population: Transition from optimization to game. Nonlinear Dyn., 2019, 95(2): 1627
[41]
S. P. Zhang, J. Q. Zhang, L. Chen, X. D. Liu. Oscillatory evolution of collective behavior in evolutionary games played with reinforcement learning. Nonlinear Dyn., 2020, 99(4): 3301
[42]
A. V. Banerjee, E. Duflo. Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty. PublicAffairs, 2012
[43]
C. J. Watkins, P. Dayan. Q-learning. Mach. Learn., 1992, 8: 279
[44]
M. Cao, A. S. Morse, B. D. Anderson. Coordination of an asynchronous multi-agent system via averaging. IFAC Proceedings Volumes, 2005, 38(1): 17
[45]
H. L. Zeng, M. Alava, E. Aurell, J. Hertz, Y. Roudi. Maximum likelihood reconstruction for Ising models with asynchronous updates. Phys. Rev. Lett., 2013, 110(21): 210601
[46]
J. Q. Zhang, Z. G. Huang, Z. X. Wu, R. Su, Y. C. Lai. Controlling herding in minority game systems. Sci. Rep., 2016, 6(1): 20925
[47]
K. Binder. Theory of first-order phase transitions. Rep. Prog. Phys., 1987, 50(7): 783
[48]
K. Binder. Applications of Monte Carlo methods to statistical physics. Rep. Prog. Phys., 1997, 60(5): 487
[49]
G. Grégoire, H. Chaté. Onset of collective and cohesive motion. Phys. Rev. Lett., 2004, 92(2): 025702
[50]
M. Nagy, I. Daruka, T. Vicsek. New aspects of the continuous phase transition in the scalar noise model (SNM) of collective motion. Physica A, 2007, 373: 445
[51]
J. M. Encinas, C. E. Fiore. Influence of distinct kinds of temporal disorder in discontinuous phase transitions. Phys. Rev. E, 2021, 103(3): 032124
[52]
A. D. Sokal. Course 16: Simulation of statistical mechanics models. Elsevier, 2006