1 Introduction
The emergence of collective behaviors [1–10] is an intriguing enigma in self-organizing systems. It is ubiquitous in living systems (e.g., starling flocks, ant colonies, fish schools, and microbial swarms), human social systems (e.g., the formation of social norms, political behavior, and collective decision-making), and computer simulation systems (e.g., the Game of Life). As typical self-organizing systems, resource allocation systems, which form a basis of the modern economy and society, exhibit diverse collective behaviors. In such systems, the minority-wins mechanism [11] often dominates individual behavior, and the patterns of collective behavior of individuals are critical to the efficiency and functional realization of the system.
To uncover the dynamical mechanism of collective behavior evolution in resource allocation systems, a large number of theoretical models involving minority games have been proposed [12] and developed [13–23] over the last 20 years, aiming to reveal various collective behavior patterns and their corresponding dynamic mechanisms. The idea of the minority game has also been widely used in the theoretical modeling and analysis of real resource allocation systems, such as resource allocation in cloud manufacturing [24], collective decision making in fish schools [25], and arbitrageurs' decision-making behavior in financial markets [26]. Another core problem is controlling the dynamical evolution of the system by external intervention to suppress herd behavior, e.g., pinning control [15], because suppressing the herd effect can not only effectively improve the efficiency of resource utilization but also maintain a long-term, stable, and sustainable resource supply.
In recent years, machine learning has advanced rapidly; in particular, reinforcement learning, one of its learning paradigms, plays an important role in artificial intelligence research. Reinforcement learning is an algorithm that makes optimal decisions through continuous trial-and-error interaction with the environment. So far, it has been widely and successfully applied in finance [27–29], disease prediction [30, 31], game decision systems [22, 32–35], and other research fields. It is exciting to note that reinforcement learning can also be applied to complex systems with many agents. For example, cooperative behavior can be promoted by Lévy noise, and interesting collective modes of oscillatory cooperation emerge in evolutionary game systems with reinforcement learning [36–38]. This provides a new perspective on the challenge of revealing the emergence mechanism of collective behavior. As shown in our previous work [39–41], resource allocation systems that combine reinforcement learning with delayed rewards and the minority game can exhibit self-restoring collective behavior that resists outside interference, and the system can evolve to the optimal resource allocation state without external intervention.
However, a series of social surveys [42] show that the reward mechanism, and in particular the delay of the reward, can make a big difference to social group behavior; the authors also emphasized that weaker performers need to be rewarded immediately, rather than in the distant future, to address their poverty. Inspired by this observation, we further investigate the emergence of collective behavior and decision-making in resource allocation systems with reinforcement learning under the condition of immediate rewards. Without loss of generality, the agents of our multi-agent AI system compete for limited resources to maximize their own payoff through the Q-learning algorithm; the only difference from our previous work [39] is the reward mechanism. Some typical questions are whether resource allocation can still be optimized under immediate rewards, whether herding can be effectively inhibited, and what kind of collective behavior can emerge. In this paper, we address these questions of how a multi-agent AI system can regulate the operation of the entire game system by suppressing adverse dynamic behavior. In addition, we analyze the effects of the learning parameters and the system size on the oscillatory evolution of collective behavior in the AI system. Addressing these issues is of paramount importance because such a framework furthers our understanding of the emergence of collective behavior and its mechanism in large-scale multi-agent complex game systems. It is also the first step toward designing human-machine systems, which are an inevitable trend in the future.
This paper is organized as follows. In Section 2, we introduce reinforcement learning with the immediate reward mechanism into minority game systems (RLMG systems). To show the evolution of the collective behavior of the AI system, the main numerical simulation results for multiple combinations of learning parameters are presented in Section 3. In Section 4.1, we reveal an interesting and important phenomenon: a first-order phase transition appears as the exploration rate varies, and two phases can be observed (a period-two oscillatory mode and a non-periodic oscillatory mode). To further understand the learning process of the agents, we analyze the conversion paths between belief modes in Section 4.2 and present the self-organizing condensation of belief modes in the Appendix. Moreover, Section 4.3 shows that ergodicity breaking occurs and that, for a small exploration rate, resource allocation performs better than in a random system without external interference. Then, in Section 4.4, a detection method for the emergence of the period-two oscillation pattern based on the Kullback−Leibler (KL) divergence is introduced, and the parameter region where the period-two oscillation appears is identified. Finally, a discussion and conclusion are provided in Section 5.
2 RLMG model
Without loss of generality, our system contains N agents competing for a set of limited resources. Each resource has a capacity, i.e., the maximum number of agents that it can hold; for simplicity, all resources share the same capacity. A time-dependent occupation vector reflects the agents' preference for the resources: each of its components is the proportion of agents occupying the corresponding resource at time t. If this proportion does not exceed the capacity of the resource, the individuals who chose that resource are winners in this round, because the number of agents does not exceed the maximum load of the resource. On the contrary, if the proportion exceeds the capacity, those agents are losers in this round due to resource overload.
Q-learning is a typical reinforcement learning algorithm [43], and it is combined with the minority game in our model. The goal of an agent empowered by Q-learning is to learn an optimal strategy through continuous interaction with the environment; that is, starting from the current state, it maximizes the expected value of the total reward over all subsequent steps. The relationship between the state of each agent, the available actions, and the obtained rewards is parameterized by the Q-function. Strategies that obtain higher rewards are strengthened by updating the Q-function in each round of the game. In seeking higher rewards, the agents continually try to improve their wisdom via the Q-function and trial and error during the course of the game. The updates of the agent states and Q-functions are equivalent to the synchronous updating of Monte Carlo simulations [44, 45]. Each agent is thus empowered by the Q-learning algorithm. The elements of the state set indicate the resource labels that each agent can occupy, and the elements of the action set represent the actions that each agent can choose in any state. In most typical Q-learning tasks, the elements of the state set and the action set are distinguished by their environment properties; in our minority game model, however, the action set and the state set are identical.
The Q-function is a time-dependent memory matrix whose rows represent the possible states of the agent and whose columns represent its actions. If the agent is in state s and takes action a at time step t, then, according to the Bellman equation, the corresponding element of this matrix is updated as
Q_{s,a}(t+1) = Q_{s,a}(t) + α[r(t) + γ·max_{a′} Q_{s′,a′}(t) − Q_{s,a}(t)],   (1)
where the subscripts s and a denote the current state of the agent and the action it takes, α is the learning rate, and r(t) is the reward obtained by the agent for taking action a at time step t. The parameter γ is the discount factor that measures the impact of future rewards: agents with γ = 0 are short-sighted and only care about current interests, while those with larger values of γ plan over a longer term. The term max_{a′} Q_{s′,a′}(t) is an optimal estimate of the future value at the new state s′, which is derived from the output of action a at the current time step. According to Eq. (1), the Q-function accumulates historical experience through the iterative update of the memory matrix elements and reflects the quantitative relationship between state, action, and reward. As the time spent exploring the environment increases, the Q-function performs better and better based on the reward feedback.
In our model, each agent has its own memory matrix. As an agent starts to interact with the environment, the adaptability of its Q-function increases rapidly. As in the basic Q-learning algorithm, although agents make decisions mostly based on their own memory matrices, they must also select actions with a certain randomness for trial and error. In summary, for a given set of learning parameters, the Q-learning protocol is shown in Fig. 1 and proceeds as follows:
Fig. 1 The flow chart of the protocol for Q-learning minority games. Green arrows indicate that the agent takes the action suggested by its Q-function. The exploration rate ϵ is an arbitrarily given small constant, 0 ≤ ϵ ≤ 1.
1) Randomly initialize each agent's memory matrix and its state.
2) In each round, each agent takes the greedy action suggested by its Q-function with probability 1 − ϵ, i.e., a(t) = argmax_{a′} Q_{s,a′}(t), or randomly selects an action with probability ϵ (the exploration rate). Meanwhile, if the action leads to the current winning (minority) state, the agent receives the winning reward from the environment; if it leads to the failed (majority) state, it receives the losing reward instead.
3) The agent updates the corresponding element of its memory matrix following Eq. (1), and its state is updated to the resource given by the chosen action.
4) Repeat steps 2)−3) until the system becomes stable or a preset termination condition is reached.
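For concreteness, the following minimal Python sketch implements the protocol above for two resources. The parameter values, the reward magnitudes (+1 for winners, −1 for losers), and the variable names are illustrative assumptions rather than the settings used in the paper.

```python
import numpy as np

# Minimal sketch of the RLMG protocol for two resources (indices 0 and 1).
# Parameter values and the +1/-1 reward magnitudes are assumptions.
N = 1000                       # number of agents
ALPHA, GAMMA = 0.5, 0.5        # learning rate, discount factor (assumed)
EPSILON = 0.1                  # exploration rate
STEPS = 5000

rng = np.random.default_rng(0)
Q = rng.random((N, 2, 2))      # per-agent memory matrix Q[agent, state, action]
state = rng.integers(0, 2, N)  # current resource of each agent

rho_1 = []                     # occupation ratio of resource 0 over time
for t in range(STEPS):
    # step 2): epsilon-greedy action selection
    greedy = Q[np.arange(N), state].argmax(axis=1)
    explore = rng.random(N) < EPSILON
    action = np.where(explore, rng.integers(0, 2, N), greedy)

    # immediate reward: agents whose chosen resource is not overloaded win
    counts = np.array([np.sum(action == 0), np.sum(action == 1)])
    reward = np.where(counts[action] <= N / 2, 1.0, -1.0)

    # step 3): Bellman update of Eq. (1); the next state equals the chosen action
    best_next = Q[np.arange(N), action].max(axis=1)
    idx = (np.arange(N), state, action)
    Q[idx] += ALPHA * (reward + GAMMA * best_next - Q[idx])

    state = action
    rho_1.append(counts[0] / N)
```

The time series rho_1 collected here is the quantity analyzed by the metrics introduced in the following sections.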
As stated in Ref. [42], differences in reward mechanisms can lead to very different dynamical behavior. In our previous work [39], where the feedback reward was delayed by one time step, an explosive oscillatory collective behavior that tends toward the optimal state emerged. In the present minority game system, by contrast, the agent empowered by reinforcement learning receives an immediate reward for each action while it learns about the environment. A remarkable feature is that the collective behavior then presents a gentle oscillation around the optimal state without the explosion phenomenon, and as the exploration rate ϵ increases, the system undergoes a phase transition. In addition, our minority game system combined with reinforcement learning differs from other studied game systems [15], because our system takes into account the complex coupled memory of the actions taken by each agent in the past, adjusted by the learning parameters and the reward, to optimize the allocation of resources.
3 Simulation results
In this paper, we focus on the case of two resources, the simplest minority game model with Q-learning agents. Thus, each agent can be in one of two resource states and can choose either resource as its action, so the Q-function of each agent is a 2 × 2 matrix indexed by the Cartesian product of the state set and the action set. The resource capacity is limited to half of the number of agents. At the beginning of the evolution, all elements of the agents' Q-functions are randomly initialized, and the agents' states are randomly assigned to one of the two resources.
Here, the occupation ratio of a given resource (say resource 1) at time t, denoted ρ₁(t), is defined as the fraction of agents whose state is that resource, i.e., ρ₁(t) = n₁(t)/N, where n₁(t) is the number of agents selecting resource 1, obtained by summing a Kronecker delta over the agents' states. Obviously, the optimal resource allocation scheme in the RLMG system is the equal split ρ₁ = 1/2. To measure the performance of the RLMG system, a commonly used metric is the long-time fluctuation of the resource load around its optimal value,
σ² = (1/T) Σ_{t=1}^{T} [n₁(t) − N/2]²,
which reflects the fluctuation of ρ₁(t) away from the optimal allocation of resources over a long period [15]. To eliminate the influence of the finite system size, the size-normalized form σ²/N is used.
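As a sketch, this normalized fluctuation can be estimated from a simulated occupation-ratio series as follows; the transient length and the exact normalization are assumptions of the sketch.

```python
import numpy as np

def normalized_variance(rho_1, n_agents, discard=1000):
    """Size-normalized fluctuation of the resource load around N/2.

    rho_1 is an occupation-ratio time series (e.g., from the sketch in
    Section 2); the first `discard` steps are dropped as transient.
    """
    n1 = np.asarray(rho_1[discard:]) * n_agents
    return np.mean((n1 - n_agents / 2) ** 2) / n_agents
```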
Fig. 2 shows σ²/N of the RLMG system as a function of the exploration rate ϵ. A striking result is that σ²/N jumps discontinuously at a specific exploration rate, denoted ϵ_c, which is widely observed for different parameter combinations, as shown in Fig. 2(a)−(c). This discontinuous jump indicates that the dynamical behavior of the system undergoes a first-order phase transition; the transition is discussed further in Section 4.1. In addition, for fixed discount factors, as shown in Fig. 2(a) and (b), the transition point ϵ_c decreases monotonically as the learning rate α increases. Fig. 2(c) shows that ϵ_c also depends on the discount factor γ for a fixed learning rate, but with no monotonic relation. For all combinations of learning parameters, σ²/N is independent of the system size except in the region where ϵ is close to 1, where the occupation behavior of the agents is almost random and breaks the collective mode, as shown in Fig. 2(d). Simulations with different learning parameters show that the qualitative behavior of σ²/N with respect to ϵ is identical except for the position of the transition point ϵ_c.
Fig. 2 Curves of σ²/N as a function of the exploration rate ϵ for different learning rates and discount factors; panels (a)−(d) use different parameter combinations and system sizes. The straight line in each panel represents the standard deviation of a binomial random system. In (d), the three lines of different colors represent different system sizes. The Q-function of each agent is initialized randomly before the simulations.
The discontinuous phase transition breaks the regular oscillation of the time series ρ₁(t). Taking one combination of learning parameters [the orange diamond symbol in Fig. 2(b)] as a concrete example, the evolutionary time series of ρ₁(t) near and far from the transition point ϵ_c are presented in Fig. 3. A stable period-two oscillation appears for ϵ > ϵ_c, while it is broken and turns into an irregular oscillation for ϵ < ϵ_c. In fact, stable period-two oscillation is a common collective behavior in minority game models [14, 15, 46]; it is a kind of herd behavior that generates inefficient resource allocation. To suppress herd behavior, an external control, such as pinning control [46], often needs to be applied. However, the RLMG system composed of AI agents can effectively suppress herding at very low exploration rates and organizes itself into an optimal resource allocation state by breaking the regular oscillation, as shown in Fig. 3(a). As ϵ increases, the amplitude of the irregular oscillation is enlarged rapidly, eventually leading to stronger herding behavior. Fig. 3(d)−(h) show temporal evolutions of ρ₁(t) to the right of the transition point ϵ_c. As ϵ increases, the amplitude of the regular oscillation first increases and then decreases, which corresponds to the non-monotonic behavior of σ²/N in Fig. 2; the position of the peak of σ²/N is denoted by ϵ_m. Further study shows that the amplitude of the regular oscillation grows with the learning parameters and that ϵ_m decreases with the learning rate, while no obvious dependence on the discount factor is found. For any given combination of learning parameters, the amplitude and ϵ_m are independent of the system size. As ϵ approaches 1, the amplitude of the periodic oscillation gradually decreases until it is dominated by noise, and ρ₁(t) completely degenerates into a random oscillation, since all AI agents then access the resources randomly.
Fig. 3 Evolutionary time series of ρ₁(t) for different exploration rates ϵ. The exploration rates in (a, b) are smaller than ϵ_c; the time series close to the transition point is shown in (c); those for ϵ > ϵ_c are shown in (d−h). The red dashed line indicates the average occupation ratio. The learning parameters and the number of agents are identical in all panels.
4 Analysis
4.1 The discontinuous transition
In this subsection, we investigate the jump of the resource load fluctuation σ²/N. A major question is whether this jump means that the intelligent system undergoes a first-order phase transition. To answer this question, the collective oscillation modes on the two sides of the transition point, shown in Fig. 4(a) and (b), have to be distinguished. On the right side of the transition point, ρ₁(t) exhibits a relatively stable period-two oscillation mode (PTOM), as in Fig. 4(a), and this mode maintains itself until the noise intensity overwhelms it near ϵ = 1. This is a robust dynamic phase of the RLMG system because the oscillation pattern cannot be destroyed by finite noise, as discussed in Section 4.4. It is worth mentioning that the amplitude of this oscillation mode varies non-monotonically as ϵ increases, while the period-two oscillation itself is maintained. The nature of this non-monotonic behavior of σ²/N (a single oscillation mode characterized only by its amplitude) is the result of competition between two driving forces, namely the AI agents' exploration (reflected by ϵ) and their pursuit of profit maximization (reflected by the memory matrix, i.e., the Q-function). The position where σ²/N reaches its maximum is denoted by ϵ_m. In the interval ϵ_c < ϵ < ϵ_m, the revenue of the agents dominates the evolution of the system, and the degree of herding is enhanced as ϵ increases. For ϵ > ϵ_m, the exploration of the agents plays the leading role, which causes the elements of the Q-function to be randomly selected and updated. The herd effect is therefore suppressed to some extent, but the optimization performance is not better than that of a random system.
Fig. 4 Examples of several oscillation modes of the resource load ρ₁(t) in the RLMG system at different exploration rates, for fixed learning parameters and system size. (a) A period-two oscillation mode. (b, c) Non-periodic oscillations with different amplitudes. (d) The switch between the period-two and other oscillation modes near ϵ_c. (e) The power spectrum of ρ₁(t) as a function of the period.
On the left side of the transition point ϵ_c, there are abundant oscillatory evolution modes in the RLMG system, manifested mainly by differences in oscillation frequency and amplitude. These oscillations are collectively called the non-periodic oscillation mode (non-POM). It should be emphasized that, for a given exploration rate ϵ, the evolution of ρ₁(t) switches flexibly between different oscillatory modes instead of following a fixed oscillatory pattern, as shown in Fig. 4(b) and (c). To characterize the oscillatory behavior, we investigate the power spectrum of ρ₁(t), defined as the squared modulus of the Fourier transform of the time series divided by its length. Fig. 4(e) shows two examples of how the power spectrum is distributed over the oscillation period on the left of ϵ_c. This is a special dynamic phase of minority game systems with reinforcement learning agents. As the exploration rate approaches the transition point, an obvious switching occurs between the two phases, i.e., the PTOM phase and the non-POM phase, as shown more intuitively in Fig. 4(d). One can see from Fig. 4 that self-organized collective behavior emerges in the system for ϵ < ϵ_c. In the interval of larger exploration rates ϵ > ϵ_c, the period-two oscillation mode persists while its amplitude changes; the key factor is that in this interval the Q-function self-organizes into a very robust structure. In contrast, for a small exploration rate, the matrix structure is fragile, so that a variety of structures emerge, differing even between small systems of different sizes.
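A simple periodogram estimate of this power spectrum can be computed as below; the windowing and averaging choices are assumptions of the sketch.

```python
import numpy as np

def power_spectrum(rho_1):
    """Periodogram of the occupation-ratio time series, returned versus period."""
    x = np.asarray(rho_1, dtype=float) - np.mean(rho_1)   # remove the mean level
    T = len(x)
    spec = np.abs(np.fft.rfft(x)) ** 2 / T                # squared Fourier amplitude / length
    freqs = np.fft.rfftfreq(T, d=1.0)                     # cycles per time step
    periods = np.divide(1.0, freqs,
                        out=np.full_like(freqs, np.inf), where=freqs > 0)
    return periods, spec
```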
To analyze the transition between PTOM and non-POM, an order parameter m is defined from the sign pattern of successive changes of ρ₁(t), averaged over a long time series, where sgn(x) denotes the sign function, equal to 1 for positive x and −1 for negative x. The order parameter lies within [−1, 1]. A value at its extreme indicates that the system oscillates with period two (PTOM), i.e., it is in the ordered phase; otherwise, the period-two oscillation is broken, which means the system has turned into non-POM. Taking one combination of learning parameters as an example, m changes suddenly at ϵ_c, as shown in the inset of Fig. 5(a), consistent with the jump point of σ²/N in Fig. 2(b). As ϵ increases further, the order parameter decreases gradually, reflecting the destructive effect of noise on the periodic oscillation, and ϵ_d denotes the exploration rate at which the period-two oscillation is completely broken by the randomness of exploration.
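One concrete way to compute such an order parameter from the time series, consistent with the sign-based description above, is sketched below; under this reading a value near −1 signals period-two oscillation and a value near 0 signals non-POM. The exact formula used in the paper may differ.

```python
import numpy as np

def period_two_order_parameter(rho_1, discard=1000):
    """Average sign of the product of successive increments of rho_1(t).

    A perfect period-two series alternates up/down and gives -1; an
    uncorrelated series gives roughly 0.  This is one plausible reading
    of the order parameter described in the text, not the paper's
    verbatim definition.
    """
    x = np.asarray(rho_1[discard:], dtype=float)
    d = np.diff(x)
    return float(np.mean(np.sign(d[1:] * d[:-1])))
```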
Fig. 5 Phenomena near the phase transition point, including critical slowing down, a hysteresis-like loop, and the Binder cumulant, for fixed learning parameters and system size. (a) Binder cumulant as a function of the exploration rate ϵ; the inset shows the order parameter versus ϵ. (b) The critical-state lifetime as an exponential function of the system size; the solid green line is the fit. (c) Hysteresis loop: the blue solid circles correspond to increasing ϵ and the yellow squares to decreasing ϵ. (d) The gap of the order parameter between the bistable states near ϵ_c as a function of the system size; the solid yellow line is the fit.
According to Binder's work on phase transitions [47, 48], a transition is first order if the Binder cumulant of the order parameter becomes negative at the transition point. For the transition breaking PTOM, the Binder cumulant of the order parameter m is defined in the standard way as B = 1 − ⟨m⁴⟩/(3⟨m²⟩²), and it is shown as a function of ϵ in the main panel of Fig. 5(a). Obviously, when ϵ reaches ϵ_c, where the discontinuous jump of σ²/N occurs, B suddenly changes to a negative value. This is significant evidence that the system undergoes a first-order phase transition at ϵ_c. However, we also notice that B decays to a negative value at very small ϵ. One possible reason is that additional complex oscillatory mode switching occurs at very low noise intensity; unfortunately, this switching is difficult to detect in more detail.
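The fourth-order Binder cumulant used here can be estimated from independent samples of the order parameter (e.g., from different runs or long time windows), as in this short sketch.

```python
import numpy as np

def binder_cumulant(m_samples):
    """Binder cumulant B = 1 - <m^4> / (3 <m^2>^2) of order-parameter samples."""
    m = np.asarray(m_samples, dtype=float)
    return float(1.0 - np.mean(m ** 4) / (3.0 * np.mean(m ** 2) ** 2))
```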
This transition also exhibits other general characteristics of a first-order phase transition. By evolving the system while slowly increasing (or decreasing) the exploration rate ϵ, a hysteresis loop of the order parameter near ϵ_c is observed, as shown in Fig. 5(c), in accordance with other discontinuous phase transitions [49–51]. Ref. [52] emphasized that a more severe slowing-down occurs at a first-order transition point, since free energy barriers separate the ordered and disordered phases. The dynamics of our system indeed slow down exponentially with the system size near the transition point. To detect the form of this slowing-down, the lifetime of the PTOM for a fixed system size is defined as the average number of time steps of a complete PTOM episode between two adjacent switchings of the oscillation mode, taken over all PTOM episodes in a sufficiently long time series of ρ₁(t). In our system, this lifetime is also found to slow down exponentially at the transition point; that is, the duration of PTOM (or non-POM) grows as an exponential function of the system size [see Fig. 5(b)]. This indicates that the RLMG system has a bistable structure around the transition point, which is a typical feature of a first-order transition. Fig. 5(d) demonstrates that the gap of the order parameter between PTOM and non-POM at the transition point does not vanish as the system size increases.
4.2 The formation and transformation of belief mode
At any time, each AI agent is in one of two states (the two resources), and for each state it can take one of two actions. Each AI agent must take an action to participate in the minority game in order to maximize its payoff based on feedback from the environment through the reinforcement learning algorithm, and it then updates its Q-function and state. As is well known, in typical Q-learning algorithms the AI agent constantly explores by trial and error with a certain probability ϵ in search of the optimal strategy, or takes the action suggested by its Q-function with probability 1 − ϵ (see Fig. 1). If an AI agent makes decisions based on its Q-function, its Q-function structure is further solidified. Conversely, if the AI agent takes a random action to explore the environment, the Q-function structure is slightly broken down and reconstructed. Therefore, agent learning can be decomposed into two processes: the former is called the memory reinforcement updating process (mR-process) and the latter the trial-and-error, or exploration, updating process (tE-process). Specifically, if the tE-process occurs with a small probability ϵ, it can be regarded as a perturbation of the mR-process.
To further understand the dynamic behavior of the RLMG system, we track the learning process of a randomly selected AI agent as it adapts to the environment. The belief mode of an AI agent is defined from its Q-function and current state as the (state, action) pair whose action corresponds to the largest element in the row of the current state. Therefore, for a fixed set of states and a fixed set of actions, the belief mode set consists of all (state, action) pairs, and the belief mode of an agent is time-dependent. The belief mode that an agent belongs to indicates a preference for a certain resource; that is, the AI agent's belief is the action it prefers in its current state. To quantify the strength of an AI agent's preference for resources, we define the belief strength [Eq. (9)] as the average gap between the largest element and the other elements in the row of the current state. The larger the gap, the more robust the belief mode, which implies a more persistent preference of the agent for that action. Based on Eq. (9), the number of belief modes depends on the dimension of the Q-function: a larger Q-function corresponds to a larger number of belief modes. For example, a 3 × 3 Q-function admits nine belief modes, and a 2 × 4 one admits eight.
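In terms of the Q-arrays of the simulation sketch in Section 2, the belief mode and belief strength of every agent can be extracted as follows; averaging the gap over the non-maximal elements is one reading of Eq. (9) and the exact normalization in the paper may differ.

```python
import numpy as np

def belief_mode_and_strength(Q, state):
    """Belief mode (preferred action in the current state) and belief strength.

    Q has shape (N, n_states, n_actions); `state` holds each agent's
    current state index.  The strength is the average gap between the
    largest element of the current-state row and its other elements.
    """
    rows = Q[np.arange(len(state)), state]              # (N, n_actions)
    mode = rows.argmax(axis=1)                           # preferred action per agent
    gaps = rows.max(axis=1, keepdims=True) - rows        # zero at the argmax itself
    strength = gaps.sum(axis=1) / (rows.shape[1] - 1)    # average gap to the others
    return mode, strength
```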
In our system, only two resources are set; accordingly, the state set is {s₁, s₂} and the action set is {a₁, a₂}, where s_r and a_r refer to occupying and choosing resource r, respectively. Therefore, each AI agent in this two-resource RLMG system has four belief modes: (s₁, a₁), (s₁, a₂), (s₂, a₁), and (s₂, a₂). For the mR-process and the tE-process, the transformation of belief modes can be fully represented by a doubly coupled directed network, in which belief modes are the nodes and directed edges are the transformation directions between modes; the coupling between the two layers reflects the interaction of the two processes, as shown in Fig. 6. Certain transitions occur between modes in the process of learning and adapting to the environment. The belief mode of an AI agent evolves along fixed paths belonging to the two processes: the mR-process (solid arrows, the top layer in Fig. 6) and the tE-process (dashed lines, the bottom layer in Fig. 6). Each transformation between belief modes, including self-transformation (a mode mapping to itself), is accompanied by an update of the corresponding matrix elements (at the end of the arrow). Under the mR-process alone, the agent's belief modes follow an ordered closed loop. The occurrence of the tE-process, however, destroys the order of this closed loop and leads to jump transformations between modes by adopting the strategy opposite to that of the Q-function.
Fig. 6 Evolution scheme of belief modes in the two-resource RLMG system. There are two types of evolution paths of belief modes: the mR-process and the tE-process. In the mR-process, the evolution paths are represented by solid lines (the top layer); in the tE-process, they are represented by dashed lines (the bottom layer). The arrows between the layers indicate that the AI agent's belief mode can be maintained through either process. At the end of each arrow, the corresponding element of the Q-function is updated when the mR-process or tE-process occurs.
In our formalization, the belief mode of each AI agent is determined by its own Q-matrix and its current state. For a given game dynamics, the essence of agent learning is the mutual conversion of belief modes to maximize returns. In our RLMG system, the switching modes (s₁, a₂) and (s₂, a₁) flow predominantly toward the staying modes (s₁, a₁) and (s₂, a₂) until the system reaches a dynamic equilibrium for a fixed exploration rate, as shown in Fig. A1 of the Appendix; that is, a self-organizing condensation of belief modes appears in the system. There is a strong nonlinear correlation between the mR-process and the tE-process during adaptive reinforcement learning. Unlike the noise disturbance of a classical nonlinear dynamical system, the tE-process not only perturbs the agent's current decision but also perturbs, or unlocks, its belief mode, which may be responsible for the oscillatory convergence of the system. It then indirectly affects the subsequent decision-making of the agent through the mR-process with greater probability. The evolutionary behavior of the belief modes is discussed in more detail and visualized in the Appendix.
4.3 Initial sensitivity for low exploration rate
In this subsection, we focus on the dynamic behavior when the exploration rate approaches zero. An interesting phenomenon is that the RLMG system allocates resources more efficiently than a random system when ϵ is close to zero, for almost all possible parameter combinations (see Fig. 7). In other words, an optimal allocation of resources gradually emerges in the RLMG system under the condition of small noise. The solid red lines in Fig. 7 represent the resource load fluctuation of a completely random system, which is a constant. A large number of numerical results show that the RLMG system has an optimal exploration rate ϵ_o, i.e., the exploration rate at which σ²/N reaches its minimum, where the fluctuation falls below that of the random system. This is a very significant finding, and it means the deviation of the resource load is independent of the system size. Under the constraint of the Q-function in the reinforcement learning algorithm with small noise, the system spontaneously develops a collective mode that effectively suppresses herding behavior.
Fig. 7 Initialization sensitivity and the self-organization of optimal resource allocation, for fixed learning parameters and system size. (a) The initial occupation ratio of the resource is the same for all curves, while the Q-function is initialized in three different ways (two fixed structures and a random initialization). (b−d) The Q-function initialization is fixed within each panel while the initial occupation ratios differ; in (b) the Q-function is initialized randomly, and (c, d) use the two fixed initialization structures.
In addition to the self-organizing optimization, the collective behavior of the system shows sensitivity to the initial conditions at low exploration rates. Both the initial occupation ratio of a resource and the initial distribution of the elements of the agents' Q-functions can affect the future evolution of the system. As shown in Fig. 7(a), for three different Q-function initialization structures (two fixed structures and a random initialization of all elements), the curves of σ²/N present different optimal positions ϵ_o, even though they all start from the same initial occupation ratio. In Fig. 7(b) and (c), the Q-function initialization structure of the AI agents is the same, but the initial occupation ratios differ; the system is then observed to have the same optimal position ϵ_o for a given initialization, although the curves of σ²/N are distinct. These phenomena are similar to the glass transition observed in some theoretical models of glasses: the system undergoes a glass transition when the temperature, which corresponds to the exploration rate in our system, drops rapidly, and it passes from an ergodic to an ergodicity-breaking state. An interesting finding is that the optimal position ϵ_o of the system's resource configuration is uniquely determined by the initialization of the Q-table, independent of the initial proportion of the resource load. The relative values of the initial Q-function elements play the major role in determining the position of ϵ_o.
4.4 The period-two oscillation mode breaking at ϵ = 1
As mentioned in Section 4.1, the period-two oscillation mode gradually loses stability and can no longer be identified when the exploration rate is close to 1 in the multi-AI-agent system. An important question is whether this regular oscillation pattern (i.e., PTOM) is destroyed at a finite exploration rate or whether its disappearance is a finite-size effect. To investigate this problem, the KL divergence,
D_KL(P‖Q) = Σ_x P(x) ln[P(x)/Q(x)],
is introduced to quantify the destruction of the collective oscillation. The KL divergence measures the difference between two distributions; here, P is the distribution of the number of agents occupying resource 1 at a given exploration rate ϵ, and Q is the corresponding distribution at ϵ = 1. For period-two oscillation, P is a typical bimodal distribution, whereas as ϵ approaches 1 it degenerates to a Gaussian distribution.
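A histogram-based estimate of this KL divergence, taking a Gaussian approximation of the ϵ = 1 (binomial) reference, is sketched below; the binning and the choice of reference are assumptions of the sketch.

```python
import numpy as np

def kl_to_random(n1_series, n_agents, bins=51):
    """KL divergence between the empirical distribution of n1(t) and a
    Gaussian reference N(N/2, N/4) approximating the fully random system."""
    edges = np.linspace(0, n_agents, bins + 1)
    width = edges[1] - edges[0]
    p, _ = np.histogram(n1_series, bins=edges, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mu, var = n_agents / 2.0, n_agents / 4.0
    q = np.exp(-(centers - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    p, q = p * width, q * width                    # densities -> bin probabilities
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-300))))
```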
Fig. 8(a) demonstrates the bimodal probability density function (PDF) of the resource load at two exploration rates (green and red). The emergence of the bimodal distribution indicates that the multi-agent system is shaped by the reinforcement learning algorithm and thus deviates from the random system. Moreover, the bimodal structure further implies that a period-two oscillation mode emerges in the system through a dynamical bifurcation. The gap between the two peaks increases as the exploration rate decreases, that is, the PTOM behavior becomes more distinct. Different shapes of empty symbols represent different system sizes. Interestingly, the PDFs for various system sizes can be rescaled onto a single curve by a simple size-dependent relation. This is consistent with σ²/N being independent of the system size in the period-two oscillation region, as shown in Fig. 2(d).
Fig. 8 As ϵ moves away from 1, the period-two oscillatory collective behavior of the system gradually becomes observable. (a) Statistical distribution of the resource load for different system sizes, based on the ensemble average of a long time series after the system reaches the steady state; curves of different colors correspond to different exploration rates ϵ, with the red and green empty symbols denoting two values of ϵ and the blue shade a completely random system. All curves share the same learning parameters. (b) KL divergence between the resource load distribution at a given exploration rate and that of a random system, for different combinations of learning parameters. The inset shows the trend of the KL divergence in linear coordinates without rescaling.
Fig. 8(b) shows the variation of the KL divergence with the exploration rate in logarithmic coordinates. The KL divergence follows a power law in the distance of ϵ from 1, with the same rescaled form for different system sizes. This power-law behavior shows that the period-two oscillatory collective behavior does not appear suddenly at some particular ϵ but is present throughout the interval ϵ < 1. This is analogous to a phase transition with critical value at ϵ = 1, below which only one phase exists, namely PTOM. The apparent order of the system begins to decrease at ϵ_d simply because strong noise drowns out the period-two oscillation pattern. In addition, we find that the KL divergence is not only independent of the system size, as shown by the symbols of different shapes of a given color, but also independent of the learning parameter combination, as shown by the symbols of different colors. For clarity, the gaps between the curves of different colors in Fig. 8(b) are caused by artificial translation.
5 Discussion and conclusion
The emergence of collective behavior in complex systems consisting of a large number of interacting components is universal in ecosystems and in social and economic systems. Collective behavior can be positive, such as efficient teamwork, bird foraging, and animal migration, or negative, such as investor herding in the stock market, stampedes, and traffic jams. Herding is one of the typical collective behaviors of complex resource allocation systems and can spread quickly through the system, causing it to collapse. The best performance of such a system is usually measured by the sustainable and maximal use of resources by all individuals in the system. One of the core objectives in managing a resource allocation system is to avoid herd-like collective behavior through reasonable mechanism regulation. The machine learning paradigm provides a new perspective for the study of collective behavior emergence in complex systems. In addition, artificial intelligence, empowered by various machine learning paradigms, will increasingly penetrate all aspects of human society, such as self-driving car clusters and drone clusters. Therefore, it is of great practical significance to explore the emergence of collective behavior and its mechanism through the marriage of AI and complex systems, especially the adaptive regulation of collective behavior evolution in complex systems consisting of a large number of AI agents. More generally, the main goal of our research is to understand how machine learning with immediate rewards in a system of AI agents affects collective dynamics in complex systems.
In this paper, we introduced reinforcement learning (RL), specifically the Q-learning algorithm, into the minority game, a typical model of resource allocation systems, by building simplified AI models of the individuals. In our complex system of AI agents, the individuals are intelligent to some extent and are capable of reinforcement learning. The intelligent group can self-organize and evolve toward a predetermined goal based on the feedback reward of the environment and the simplified Q-learning algorithm. As the exploration rate increases (playing a role similar to the temperature of a thermodynamic system), an interesting transition phenomenon is found in the AI complex system, namely a first-order phase transition. Further study shows that two phases can be observed in the AI system: a period-two oscillatory evolution mode and a non-periodic oscillatory evolution mode. This means that the system is regulated by the exploration rate of the reinforcement learning algorithm, and the multi-agent system can exhibit two kinds of collective behavior (regular stable oscillation and irregular oscillation). As in traditional thermodynamic systems, an exponential slowing-down is also observed near the transition point ϵ_c. In short, our system differs from existing complex multi-agent game systems with reinforcement learning in three respects: first, each agent has a simple memory matrix (Q-function) with low computational complexity; second, each agent's decisions are independent, and the interaction between agents comes only from the feedback reward of the environment; third, based on this simple model, intelligent game groups can achieve self-organizing optimization and switch behavior patterns for a given combination of parameters.
In addition, an important phenomenon is that when the trial-and-error rate of the agents is very low, the system self-organizes into a state of optimal resource allocation, without any external intervention, at ϵ_o, which is much better optimized than the random system. This indicates that large-scale complex systems composed of AI individuals can realize self-organization and collaboration based on individual interaction and reinforcement learning, and can thus exhibit positive collective behaviors for certain learning parameters. The definition of the belief modes and their transformation paths provides the emergent paradigm and evolutionary mechanism of the collective behavior of multi-agent systems from a more microscopic perspective. For example, for a specific parameter combination, as the learning process advances, the (s₁, a₁) and (s₂, a₂) modes gradually dominate, while the (s₁, a₂) and (s₂, a₁) modes are gradually eroded but do not disappear. The belief-mode approach presents a clearer physical picture for understanding the evolution of the dynamical behavior of the RLMG system. We give the belief transformation path of an AI individual as it interacts with the environment, and this transformation path is universal regardless of the learning parameters for a given game dynamics. We then demonstrated the emergence of the period-two oscillation mode at high exploration rates, and concluded that the emergent collective behavior of AI complex systems depends strongly and qualitatively on the exploration rate and on the learning parameters, which determine the position of the phase transition point ϵ_c.
Finally, in our RLMG system there are four key exploration rates: ϵ_o, ϵ_c, ϵ_m, and ϵ_d. The smallest, ϵ_o, implies an optimal resource allocation; ϵ_c is the transition point of the first-order phase transition; ϵ_m marks the trade-off between the two driving forces, namely the Q-function and the noise effect; and the largest, ϵ_d, indicates where the period-two oscillation mode is gradually destroyed by noise and can no longer be clearly observed. Our theoretical results support the view that slightly different reward mechanisms have a great impact on the evolution of collective behavior in real social systems [42]: with the delayed reward used in our previous work [39], the collective behavior exhibited periodic explosions and could self-organize to recover from external disturbances, whereas with the immediate reward setting considered here, the system exhibits a more varied and mildly oscillating behavior that optimizes resource allocation.
Our work provides a basic theoretical framework for the integration of machine learning and complex systems. The systematic study of complex game systems from the perspective of machine learning can further help us understand the emergence and evolution of diverse oscillatory collective behavior patterns and their mechanisms, because agents with a certain level of intelligence are closer to the activity patterns of biological groups. Based on our findings, many concrete and interesting questions remain open. Unlike agents in traditional game systems, an agent with reinforcement learning ability forms a preference (belief mode) over the alternative actions through its Q-function; how the flow between belief modes drives the emergence of collective behavior in complex systems is a significant and interesting question. In addition, from the control point of view, it is of practical importance to make the system exhibit positive collective behavior, or restrain unfavorable collective behavior, by adjusting the learning parameters of the machine learning algorithm, the system size, or the game parameters.
6 Appendix: Self-organizing condensation of belief mode in RLMG system
In this appendix, we focus on the evolutionary behavior of the belief modes as the agents become familiar with the environment. For simplicity, the resource configuration system still contains two resources, so the set of belief modes has only four elements: (s₁, a₁), (s₁, a₂), (s₂, a₁), and (s₂, a₂). The set of belief modes and the belief strength were defined for the AI agents of the RLMG system in Section 4.2 [Eq. (9)]. One of our concerns is the evolution of the belief modes as the system moves from a random initial state to a steady state. Here, a concrete example of the belief evolution in the AI system is given for a fixed combination of learning parameters.
We take snapshots of the system during training and observe the evolution of the four belief modes, as shown in Fig. A1. At the initial time, the occupancy of the four belief modes is nearly equal, as shown in column (a), owing to the random initialization of the Q-matrices and agent states. As the AI system evolves and the agents learn, the belief modes begin to flow and transform into one another, because the elements of the Q-functions are updated to adapt to the dynamic environment. Therefore, the proportions of the belief modes change rapidly, as in the snapshot of column (b): the occupancy of the switching modes grows, indicating that their inflow is greater than their outflow and the net flow is positive. As training continues, for example to the later times in columns (c) and (d), the net flow of these modes becomes negative. In fact, the property of the switching modes (s₁, a₂) and (s₂, a₁), which encode a speculative belief, is to make the agent change its decision at adjacent moments and thus trigger the emergence of oscillatory collective behavior. For certain reward mechanisms, the oscillating belief modes are absorbed by the other belief modes to maximize the overall benefit of the RLMG system. As the evolution of the RLMG system approaches a steady state, the occupancy of the oscillatory modes (s₁, a₂) and (s₂, a₁) decreases gradually, and large-scale condensation of agents occurs on the (s₁, a₁) and (s₂, a₂) modes. These two modes therefore act more like a medium that guides the system into a highly ordered collective behavior for any learning parameters, although the resource load time series remains disordered (non-periodic oscillation) to the left of the transition point ϵ_c.
Fig. A1 Snapshots of the agents' occupancy of the four belief modes in the RLMG system at four successive time steps (columns a−d), for fixed learning parameters and system size. All agents are placed on the nodes of a regular grid; if an agent currently holds a given belief mode, its node is highlighted in the grid of that mode, and the color indicates the belief strength [Eq. (9)]. The four rows (a1−d1) to (a4−d4) correspond to the four belief modes, and the number at the top of each subplot gives the occupancy density of that belief mode at the corresponding moment.
We summarize the evolutionary condensation of belief modes into the following stages with four properties: (i) At the initial time, the belief modes of all agents are occupied uniformly and randomly. (ii) With the self-organizing learning of the AI system, the switching modes (s₁, a₂) and (s₂, a₁) show large asymmetric fluctuations; the system undergoes large oscillations, because the main occupation states of the agents at this stage are these two modes. (iii) In the third stage, the proportions of the belief modes start to change relatively smoothly: the oscillation modes (s₁, a₂) and (s₂, a₁) gradually flow to (s₁, a₁) and (s₂, a₂). The resource preference is then in a stable oscillation state, and the oscillation amplitude is gradually reduced. (iv) The system converges to a relatively stable state and forms a dynamic equilibrium, with a large number of agents condensed on the (s₁, a₁) and (s₂, a₂) belief modes.
In a word, as the RLMG system is trained, a large number of agents gradually self-organize onto the (s₁, a₁) and (s₂, a₂) belief modes, which reduces the fluctuation amplitude of the system. Interestingly, this condensation occurs for any given exploration rate ϵ, even though the resource load shows non-periodic oscillation to the left of the phase transition point.