Open and real-world human-AI coordination by heterogeneous training with communication

Cong GUAN, Ke XUE, Chunpeng FAN, Feng CHEN, Lichao ZHANG, Lei YUAN, Chao QIAN, Yang YU

Front. Comput. Sci. ›› 2025, Vol. 19 ›› Issue (4) : 194314. DOI: 10.1007/s11704-024-3797-6
Artificial Intelligence
RESEARCH ARTICLE

Abstract

Human-AI coordination aims to develop AI agents capable of effectively coordinating with human partners, making it a crucial aspect of cooperative multi-agent reinforcement learning (MARL). Achieving satisfactory performance of AI agents poses a long-standing challenge. Recently, ad-hoc teamwork and zero-shot coordination have shown promising advancements in open-world settings, requiring agents to coordinate efficiently with a range of unseen human partners. However, these methods usually rely on an overly idealistic scenario that assumes homogeneity between the agent and the partner, which deviates from real-world conditions. To facilitate the practical deployment and application of human-AI coordination in open and real-world environments, we propose the first benchmark for open and real-world human-AI coordination (ORC), called ORCBench. ORCBench includes widely used human-AI coordination environments. Notably, within the context of real-world scenarios, ORCBench considers heterogeneity between AI agents and partners, encompassing variations in capabilities and observations, which aligns more closely with real-world applications. Furthermore, we introduce a framework known as Heterogeneous training with Communication (HeteC) for ORC. HeteC builds upon a heterogeneous training framework and enhances partner population diversity by using mixed partner training and frozen historical partners. Additionally, HeteC incorporates a communication module that enables human partners to communicate with AI agents, mitigating the adverse effects of partially observable environments. Through a series of experiments, we demonstrate the effectiveness of HeteC in improving coordination performance. Our contribution serves as an initial but important step towards addressing the challenges of ORC.

Keywords

human-AI coordination / multi-agent reinforcement learning / communication / open-environment coordination / real-world coordination

Cite this article

Cong GUAN, Ke XUE, Chunpeng FAN, Feng CHEN, Lichao ZHANG, Lei YUAN, Chao QIAN, Yang YU. Open and real-world human-AI coordination by heterogeneous training with communication. Front. Comput. Sci., 2025, 19(4): 194314 https://doi.org/10.1007/s11704-024-3797-6

1 Introduction

Developing Artificial Intelligence (AI) agents that can coordinate with human partners (i.e., Human-AI Coordination, HAC) is a longstanding challenge [1,2]. Cooperative multi-agent reinforcement learning (MARL) [3–5] has been seen as a promising solution to many multi-agent decision-making problems. Current state-of-the-art MARL methods can build highly competent agents in cooperative environments [6–10]. However, those agents are often over-fitted to their training partners and cannot coordinate with unseen agents effectively [11–13], which is a fundamental challenge when applied to HAC [2]. Besides, an increasing number of practical tasks, particularly those involving open-world scenarios where important factors are subject to change, are emerging [14–17]. In such a setting, a trained HAC agent is expected to coordinate effectively with novel human partners who were not encountered during its training.
To tackle this challenge, recent works, e.g., ad-hoc teamwork (AHT) [18] and zero-shot coordination (ZSC) [11], have proposed some promising solutions [19]. One key insight of these methods is to expose the agent to diverse partners during the training process. A typical approach is two-stage [20–24], which first develops a diverse policy pool consisting of multiple partners, possibly covering various types of unseen partners. Then, an adaptive agent policy is trained against this policy pool. These approaches have achieved visible success, representing an important step toward addressing the challenges of open-world HAC.
However, current work is primarily based on idealized assumptions and pays little attention to real-world challenges. Humans and AI agents in many real-world tasks are naturally heterogeneous [25,26]. That is, agents and partners have different observations and capabilities (including advantages, i.e., how efficiently something can be done, and abilities, i.e., whether something can be done at all). For example, in many scenarios the observations of AI agents are limited (e.g., due to damaged sensors) compared with human partners (who can actively obtain more useful information), making it difficult for the agents to complete tasks based solely on their own observations and thus hard to coordinate effectively with human partners. Thus, taking the heterogeneous nature of humans and agents into consideration when training AI agents is necessary to accomplish open and real-world HAC (ORC) tasks. However, there are only a few preliminary attempts considering the heterogeneous setting (e.g., MAZE [27]). Besides, to the best of our knowledge, there is no work that simultaneously considers the heterogeneity in observations and capabilities of both AI agents and human partners, which is, however, a widespread and important problem in real-world applications. Training AI agents with open and real-world coordination abilities is still a challenging problem [28,29].
In this paper, we first formulate the problems of open-world coordination and real-world coordination, respectively. Then, motivated by the importance of targeted benchmarks for HAC [30], we propose an Open and Real-world Coordination Benchmark (ORCBench) for this important but challenging task. ORCBench includes popular environments that are commonly used in recent works. ORCBench considers not only the open challenge, i.e., different unseen partners with different unknown skill levels, but also the real challenge, i.e., agents and partners that are heterogeneous in both observations and capabilities. Furthermore, we propose a basic framework, Heterogeneous training with Communication (HeteC), for solving the ORC problem. HeteC is built upon a heterogeneous coordination training framework and enhances partner diversity through mixed partner training and a frozen archive. Additionally, HeteC introduces a communication module to facilitate coordination between human partners and AI agents in situations with limited information. As a versatile framework, HeteC can be applied to solve the ORC problem, and other related works in the field, such as ZSC and communication techniques, can be easily integrated into HeteC to further improve its effectiveness.
To validate the effectiveness of HeteC, we conduct experiments on ORCBench and compare it with existing methods [20–23,27]. Firstly, we demonstrate that the ORC environment poses challenges for state-of-the-art HAC methods. Next, we evaluate the effectiveness of our proposed method on different layouts and with various masks (i.e., different limited observation spaces of agents). Our contributions are three-fold:
1. We formulate the ORC problem, and our aim is to train an agent that can effectively coordinate with unseen and heterogeneous human partners in this setting.
2. We introduce ORCBench, a benchmark specifically designed for open and real-world human-AI coordination. ORCBench includes various environments and algorithms, where the agent and the human partner have different abilities.
3. We present an efficient algorithm, HeteC, for solving ORC problems. Experimental results across different environments in ORCBench demonstrate the superior performance of HeteC. Moreover, our ablation studies confirm the effectiveness of each module in HeteC. The proposed ORCBench and HeteC serve as an initial but important step towards addressing the challenges of ORC, and we hope it could attract people’s attention to this setting.

2 Related work

2.1 Multi-agent reinforcement learning

MARL Many real-world problems involve multiple interacting agents, which can typically be formalized as a Multi-Agent Reinforcement Learning (MARL) problem [31,32]. Moreover, when these agents share a common objective, it falls under the category of cooperative MARL [5]. Cooperative MARL necessitates agent coordination, and it has witnessed significant progress across diverse domains, such as path-finding [33], active voltage control [34], and dynamic algorithm configuration [35]. Classic methods can mainly be categorized into two groups: policy-based methods such as MADDPG [6] and MAPPO [9], and value-based methods like VDN [7] and QMIX [8]. Recently, some works that utilize the transformer architecture [36] explore new approaches to cooperative MARL, demonstrating remarkable coordination abilities across a wide range of tasks, including the StarCraft Multi-Agent Challenge (SMAC) [37], the Hanabi challenge [38], and Google Research Football (GRF) [9]. In addition to these works that consider general settings, other more specific problem settings have been proposed to further advance the investigation of cooperative MARL, such as efficient communication [39], offline policy deployment [40], model learning in MARL [41], and policy robustness in the presence of perturbations [42,43].
Communication in MARL Communication is essential in MARL because it can allow agents to share their information (e.g., observations, experiences) with other partners to enhance coordination [39], which can significantly mitigate the challenges posed by partial observability inherent in the environment. Existing research on multi-agent communication primarily attempts to learn effective communication protocols from the following three aspects [39]: when to communicate, what to communicate, and whom to communicate with. Early related works consider simply broadcasting messages at each timestep to promote team coordination and adopt an end-to-end training scheme to generate useful messages [44,45]. Later on, some works employ techniques like gate mechanisms [46,47] to determine whom to communicate with, thus reducing the communication cost. MAIC [48] and some other works [49,50] answer the question of what to communicate by utilizing techniques like teammate modeling to generate more succinct and efficient messages. Besides, when receiving multiple messages, a few works [51,52] propose to adopt an attention mechanism to extract the most valuable part for decision-making. In terms of when to communicate, VBC [49] and TMC [50] utilize a fixed threshold to control the timing of communication. Furthermore, some works about MARL communication focus on offline setting [53], communication robustness for policy deployment [54,55], etc.

2.2 Human-AI coordination

Human-AI Coordination (HAC) considers training AI agents that can effectively coordinate with humans [12], which is particularly important for many real-world applications, such as self-driving vehicles [56] and assistant robots [57]. Studying this problem brings us closer to the ultimate goal of building AI systems that can assist humans, augment our capabilities [58,59], and ultimately enhance productivity in human society.
Recently, some research has focused on open-world HAC [17], which requires agents to collaborate with unseen partners. Ad-hoc teamwork [18,19] assumes that a population of teams is given and trains agents to do well when substituted in as a member of these teams. The Zero-Shot Coordination (ZSC) task aims to train agents (representing AI) that can coordinate well with novel partners (representing humans). The ZSC problem does not assume that a population of teams is provided; instead, the ZSC algorithm itself must train a population [11,60], with the expectation that the trained coordination policy is compatible with unseen partners [61].
Current ZSC methods tend to assume that human partners and agents are homogeneous, which is, however, not aligned with the complexities of many real-world scenarios. Recently, some works have made initial explorations into heterogeneous coordination. For example, Hidden-Utility Self-Play (HSP) [23] assumes that humans are not perfectly rational but biased, and explicitly models the human biases as hidden reward functions. MAZE [27] considers the heterogeneity of the task and coevolves two populations, one of agents and one of partners.
However, other real-world challenges have been overlooked. For instance, agents may face constraints in their observational capabilities and frequently observe within a distinct context compared to human partners. Nevertheless, much of the existing research operates under the assumption that agents and humans possess identical observation capabilities [20,21,23,62], which is overly idealistic and inconsistent with real-world scenarios.

3 Problem formulation

In this section, we introduce the problem formulation of open and real-world human-AI coordination (ORC). We first give the basic definition of HAC. Then, we introduce the open-world HAC and real-world HAC in Section 3.1 and Section 3.2, respectively. Finally, we introduce our ORCBench, the first benchmark for ORC in Section 3.3.
A general formulation for fully cooperative MARL is the Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [63,64], which is denoted as a tuple $\langle I, S, A, P, \Omega, O, R, \gamma \rangle$, where $I=\{1,\ldots,N\}$ indicates the set of $N$ controlled agents, $S$ is the state space, $A=A_1\times\cdots\times A_N$ is the joint action space, $P: S\times A \to S$ is the transition function, $R: S\times A \to \mathbb{R}$ is a shared global reward function, and $\gamma\in[0,1)$ is the discount factor. At each time-step, each agent $i$ observes a local observation $o_i\in\Omega$, which is a projection of the true state $s\in S$ given by the observation function $o_i=O(s,i)$. Each agent selects an action $a_i\in A_i$ to execute, and all individual actions form a joint action $a\in A$, which leads to the next state $s'\sim P(\cdot\mid s,a)$ and a shared reward $r=R(s,a)$. The formal objective of the agents is to maximize the expected cumulative discounted reward $\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t,a_t)\right]$.
In this work, we specifically model human-AI coordination as a Human-AI-Coordination Dec-POMDP (HAC-Dec-POMDP), denoted by $\langle I_{\{A,H\}}, S, A, P, \Omega, O, R, \gamma \rangle$, where $I_{\{A,H\}}$ indicates the set of $N_A+N_H$ players, and $A=\{1,\ldots,N_A\}$ and $H=\{1,\ldots,N_H\}$ denote the AI set and the human set, respectively. Besides, the agent and the human have their own observations and actions, and we denote their corresponding spaces with superscripts. In this paper, we consider the scenario where there is one controlled agent in each set, i.e., $N_A=N_H=1$. When there is more than one controlled agent in the agent and human sets, without loss of generality, we use $\pi^A: S \to A^A$ and $\pi^H: S \to A^H$ to denote the joint policies of the agents and the humans, respectively. Then, we can define the expected discounted return as $J(\pi^A,\pi^H)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t, a_t^{(A)}, a_t^{(H)})\right]$, where $a_t^{(A)}\sim\pi^A(\cdot\mid s_t)$, $a_t^{(H)}\sim\pi^H(\cdot\mid s_t)$, and $s_{t+1}\sim P(\cdot\mid s_t, \{a_t^{(A)}, a_t^{(H)}\})$. The goal is to specify $\pi^A$ and $\pi^H$ to achieve the highest $J(\pi^A,\pi^H)$. Let $\Pi^H$ denote the unknown distribution of human policies. The final objective of HAC is to obtain the best AI agent policy $\pi^{A*}$ that maximizes the expected return under the true partner distribution, i.e., $\pi^{A*}=\arg\max_{\pi^A}\mathbb{E}_{\pi^H\sim\Pi^H}[J(\pi^A,\pi^H)]$.
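As a concrete illustration of the HAC objective above, the following minimal sketch estimates $J(\pi^A,\pi^H)$ by Monte-Carlo rollouts. The two-player environment interface (`env.reset`, `env.step`) and the policy objects are hypothetical stand-ins, not part of ORCBench's actual API.

```python
import numpy as np

def estimate_return(env, agent_policy, human_policy, gamma=0.99, episodes=32):
    """Monte-Carlo estimate of J(pi^A, pi^H): average discounted return when
    the agent and the human partner act jointly. All interfaces are assumed."""
    returns = []
    for _ in range(episodes):
        obs_a, obs_h = env.reset()                 # each player gets its own observation
        done, t, g = False, 0, 0.0
        while not done:
            a_agent = agent_policy.act(obs_a)      # a_t^(A) ~ pi^A(.|o_t)
            a_human = human_policy.act(obs_h)      # a_t^(H) ~ pi^H(.|o_t)
            (obs_a, obs_h), reward, done = env.step(a_agent, a_human)
            g += (gamma ** t) * reward             # accumulate discounted shared reward
            t += 1
        returns.append(g)
    return float(np.mean(returns))
```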

3.1 Open-world coordination

Open-world human-AI coordination (OC) aims to obtain an agent that can coordinate well with unseen partners, which is a fundamental challenge in HAC and cooperative MARL. In our considered two-player setting, there are two roles, denoted as 1 and 2 using superscripts, which can be taken by either the agent or the human. One simple approach for OC is Self-Play (SP) [65,66], with the following objective:
$J_{\mathrm{SP}}(\pi^{A}) := J(\pi^{A,1}, \pi^{A,2}) + J(\pi^{A,2}, \pi^{A,1}).$
However, the SP agent is only paired with a copy of itself, making it hard to generalize well to novel partners [11,12]. Recent approaches for OC expose the agent to diverse partners during the training process [20–22,27,67,68]. OC methods first construct a human partner population $\hat{\Pi}^H$ to approximate the true distribution $\Pi^H$. To achieve this, one successful technique is improving diversity in the human partner population [20–22]. After obtaining the human partner population, OC methods train a best-response agent against it and use this agent as the solution to OC.
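The two-stage recipe just described can be summarized schematically as follows. The callables `train_partner` and `train_best_response` are user-supplied placeholders standing in for any concrete population-building and best-response method; this is a structural sketch, not any cited algorithm's implementation.

```python
def two_stage_oc(train_partner, train_best_response, pool_size=8):
    """Schematic two-stage open-world coordination recipe:
    (1) build a diverse partner pool, (2) train a best response to it.
    `train_partner(seed, pool)` should encourage diversity with respect to the
    partners already in the pool; both helpers are assumed, not real APIs."""
    pool = []
    for seed in range(pool_size):
        pool.append(train_partner(seed, pool))   # stage 1: grow a diverse pool
    agent = train_best_response(pool)            # stage 2: best response to the pool
    return agent, pool
```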
To evaluate different OC methods, the Cross-play (XP) metric [11] is usually considered, i.e.,
$J_{\mathrm{XP}}(\pi^{A},\pi^{H}) := J(\pi^{A,1}, \pi^{H,2}) + J(\pi^{H,1}, \pi^{A,2}).$
In a two-player cooperative task, the XP metric requires both the agent and the human to perform well in roles 1 and 2 simultaneously, which is, however, not realistic. In the real world, the roles and responsibilities of the agent and the human are often naturally different and fixed. In other words, the agent and the human are heterogeneous, and different policies should be employed to control different roles. In heterogeneous scenarios, conventional OC methods tend to be less efficient [27] because it is challenging or even impossible to train a diverse set of partners in isolation; instead, the agent and the partners must be trained simultaneously and in a coordinated manner. This challenge reflects the nature of real-world coordination, which we discuss in Section 3.2.

3.2 Real-world coordination

We begin with two illustrative aspects of real-world human-AI coordination (RC). In a factory production task, the robotic arm is capable of lifting heavy objects that human workers are unable to handle, while human workers can perform more flexible operations than the robotic arm. This highlights the distinct tasks performed by the agent and the human, stemming from their inherent heterogeneity. Furthermore, in many real-world scenarios, the agent and the human have distinct observation conditions due to differences in how they receive information. Humans possess superior information aggregation capabilities and have access to multiple channels for gathering information. It is essential for humans to transmit relevant and useful information to the agent in order to achieve better coordination.
Thus, RC has two significant features: 1) Due to the heterogeneous action space, human partners and AI agents should have natural and fixed roles. 2) Due to the heterogeneous observation space, they need to communicate with each other to mitigate the effects of partial observability. Therefore, in the formulation of RC, it is crucial not only to explicitly define the differing abilities of different roles but also to introduce communication mechanisms to enhance coordination capabilities.
As discussed above, we assume that the agent and the human take on roles 1 and 2, respectively, in real-world coordination tasks. Besides, the two players are usually heterogeneous, i.e., their observation spaces $O^A$ and $O^H$, and action spaces $A^A$ and $A^H$, are different [26,27,69,70]. Then, we have the following ORC metric, which takes both open coordination and real coordination into account:
$J_{\mathrm{ORC}}(\pi^{A},\pi^{H}) := J(\pi^{A,1}, \pi^{H,2}),$
and the objective of ORC algorithms is to obtain the optimal agent policy $\pi^{A*}=\arg\max_{\pi^A}\mathbb{E}_{\pi^H\sim\Pi^H}[J_{\mathrm{ORC}}(\pi^A,\pi^H)]$. When a message set $M$ is introduced to model the communication in RC, the HAC-Dec-POMDP can be transformed into a HAC-Dec-POMDP with Communication (Dec-POMDP-Comm) [63,64]. Specifically, an AI agent makes decisions based on an individual communication policy $\pi^C(a_i\mid\tau_i,m_i)$, where $\tau_i$ represents the history $(o_i^1,a_i^1,\ldots,o_i^{t-1},a_i^{t-1},o_i^t)$ and $m_i\in M$ is the message generated by the human partner.
In our paper, we assume fixed roles where the agent serves as the recipient of information, while the human acts as the sender of information. This is a common scenario in the real world, where humans often possess a richer set of observational information compared to robots. By providing the necessary information to the robot, it is possible to achieve better coordination.

3.3 ORC benchmark

To facilitate further ORC research, we propose the first benchmark for ORC, i.e., ORCBench. In this section, we will first introduce the environments and interfaces. This includes how to make the environment open (unseen pre-trained partners) and how to make the environment real (heterogeneous agents and partners). Besides, ORCBench can also be used to assess the performance of communication algorithms in human-machine collaboration. Existing communication algorithms rarely assume that partners may change, and we hope that our provided benchmark can help researchers address this issue.
Environments Overcooked [12] is a two-player common-payoff coordination cooking environment, which is one of the most popular benchmarks in human-AI coordination [20,23,27,62,71,72]. As shown in Fig.1, two players are placed into a grid-world kitchen as chefs and tasked with delivering as many cooked dishes of onion soup as possible within a limited time budget. This task involves a series of sequential high-level actions to which both players can contribute: collecting onions, depositing them into cooking pots, letting the onions cook into soup, collecting a dish, getting the soup, and delivering it. Both players are rewarded equally when delivering a soup successfully. As shown in Fig.2, we include the following layouts with different degrees of heterogeneity in our benchmark:
Fig.1 Illustration of the Overcooked game

Fig.2 Illustration of different layouts on Overcooked. (a) CR and H-CR; (b) AA; (c) AA-2; (d) FC

● Cramped Room (CR) is a simple and homogeneous layout, presenting a low-level coordination challenge.
● Heterogeneous Cramped Room (H-CR) is modified from CR by constraining the skills of the agent and partner, forcing them to complete the task cooperatively. Specifically, player 1 can only collect onions and deposit them into cooking pots, but cannot collect a dish, get the soup, or deliver it at the delivery location.
● The agent and partner in asymmetric advantages (AA) have the same skills but with different advantages.
● Asymmetric advantages-2 (AA-2) is modified from AA by making the different advantages more obvious. The distance between the onions and the cooking pots in AA-2 is longer compared to AA. Thus, player 1 will take more time to get the onions and deposit them into cooking pots.
● In Forced Coordination (FC), player 1 on the left can only collect and deliver the onions and dishes, while player 2 on the right can only receive the onions, put them into the cooking pots, and then use a dish to deliver the soup. To deliver the soup successfully, they have to collaborate with each other.
To further validate the effectiveness and scalability of ORC methods and enrich ORCBench, we also design an environment named Emergency Rescue, a cooperative common-payoff grid-world environment. As shown in Fig.3, human doctors and AI robots must collaborate effectively to search for and locate injured persons in a forest for medical treatment. When a certain number of treatments are provided (i.e., when some humans or AI robots reach the vicinity of the injured person), the episode is considered a success, and both humans and agents receive a high reward. Unlike Overcooked, Emergency Rescue allows for multiple agents, making it an appropriate environment to test the scalability of various ORC algorithms. Additionally, we consider different masks (i.e., scales of sight) of the environments, where smaller sights bring more challenges for the coordination between humans and AI robots.
Fig.3 Illustration of the emergency rescue environment

Observation
● The observations of the agent and the partner in Overcooked are tensors with 10 channels. Specifically, the first five channels correspond to the basic information of the layout: the location of the pots, the location of the counters, the location of the onions, the location of the dishes, and the location for serving; the last five channels correspond to useful information for better cooking: the quantity of onions in the pot, the cooking time for the onions, the location of the onion soup, the number of available dishes, and the number of available onions. We obtain different POMDP environments by masking some specific channels of the agent's observation. Detailed observation information and the masks are provided in Tab.1, and a minimal masking sketch is given after this list.
● In Emergency Rescue, the observation sight of the AI robots is restricted due to limitations in sensor performance and energy consumption. However, humans have the advantage of carrying superior sensing devices, thereby enabling them to have a broader observation range.
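To make the masking mechanism concrete, the following minimal sketch zeroes out selected channels of an Overcooked-style observation tensor. The `mask_observation` helper and the shapes assumed here are illustrative; the actual ORCBench API may differ.

```python
import numpy as np

def mask_observation(obs, masked_channels):
    """Zero out selected channels of a (10, H, W) Overcooked-style observation
    tensor to create a partially observable view for the AI agent.
    Which channels to mask is an experimental choice (Mask 1/2/3 in Tab.1)."""
    obs = obs.copy()
    obs[list(masked_channels)] = 0.0
    return obs

# Example: hide channel 5 (number of onions in the pot) and channel 7
# (location of onion soup), which are masked under all three masks in Tab.1.
dummy_obs = np.random.rand(10, 5, 4)
partial_obs = mask_observation(dummy_obs, masked_channels=(5, 7))
```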
Capabilities One significant difference between humans and AI is capability.
● In Overcooked, the capabilities of AI and humans differ on some layouts. For example, players in different rooms (i.e., left and right) of layout AA (Fig.2(b)) have different advantages. Player 1 on the left is good at collecting dishes and delivering the soup, while player 2 on the right is good at collecting onions and depositing them into cooking pots. If they coordinate well, high scores can be achieved.
● In Emergency Rescue, human doctors move more slowly in the forest, whereas the AI robots (e.g., drones and quadruped robots) exhibit faster mobility. In the MDP (Markov Decision Process) modeling, this is captured by different action frequencies [73] for humans and AI robots. Specifically, in our experiments, AI robots are capable of taking three steps while humans are limited to taking only one step; a minimal sketch of this stepping scheme follows this list.
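A minimal sketch of the action-frequency asymmetry described above: the AI robot takes three environment steps for each human step. The environment interface (`step_robot`, `step_human`) and the policy objects are hypothetical stand-ins, not ORCBench's actual API.

```python
ROBOT_STEPS_PER_HUMAN_STEP = 3  # AI robots act three times per human action

def joint_step(env, robot_policy, doctor_policy, obs_robot, obs_doctor):
    """Advance one human-time-step: the robot acts three times, then the
    human doctor acts once. Returns new observations, summed reward, and done."""
    total_reward = 0.0
    for _ in range(ROBOT_STEPS_PER_HUMAN_STEP):
        obs_robot, reward, done = env.step_robot(robot_policy.act(obs_robot))
        total_reward += reward
        if done:
            return obs_robot, obs_doctor, total_reward, done
    obs_doctor, reward, done = env.step_human(doctor_policy.act(obs_doctor))
    return obs_robot, obs_doctor, total_reward + reward, done
```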
In ORCBench, in addition to the standard MARL APIs, we also provide the following interfaces to make the heterogeneous environments easier to use. Note that the pre-trained partners and pre-trained communication modules are obtained by running our HeteC from scratch; they are seen as “pre-trained” from the perspective of the users of ORCBench.
Pre-trained partners To facilitate users in quickly testing their own ORC algorithms, we provide a series of pre-trained partners, which are categorized based on their performance levels, ranging from low (i.e., SP partners), medium (i.e., MEP partners) to high (i.e., MAZE partners).
Pre-trained communication modules Communication plays a crucial role in MARL, especially under the POMDP setting, leading to a better understanding of other agents and environmental information, thus better coordination abilities. To facilitate user convenience, we provide several pre-trained communication modules with different input information for each environment, which can be easily integrated into any existing ZSC algorithm.

4 Method

In this section, we introduce our method for open real-world coordination, Heterogeneous training with Communication (HeteC). We first introduce the general framework. Then, we introduce the detailed heterogeneous training framework and communication module in Section 4.1 and Section 4.2, respectively.
We consider the two-player ORC problem stated in Section 3, where the trained agent should coordinate well with unseen partners, and the agent and partner are heterogeneous, i.e., they have different observations and capabilities in different environments. HeteC utilizes a heterogeneous training framework [27], which not only exposes the agent to diverse partners during training, thereby improving its ability to coordinate with unseen partners [20–22] and handling the open-world coordination challenge, but also trains the agent population and partner population simultaneously from scratch, handling the real-world coordination challenge. To further enhance partner diversity, HeteC adopts different approaches to update the partners, i.e., mixed partner training, and fixes some partners during the training process, i.e., a frozen archive. Additionally, HeteC introduces a communication module to facilitate better coordination between human partners and AI agents in situations where there is an information disparity.
The detailed procedure of HeteC is presented in Algorithm 1. HeteC first randomly initializes two populations of agents and partners (line 1) and coevolves them by iterative pairing, updating, and selection (lines 3–13). In each generation, the agents and partners from the two populations are first paired to obtain agent-partner pairs (lines 4–5). Then, each agent-partner pair interacts with the environment to collect trajectories (line 7), which are used to update the agent and partner (line 8). In the selection process (lines 12–13), the updated agents directly form the agent population of the next generation, while the updated partners are added to an archive that contains the diverse partners generated so far. The next partner population is generated by selecting diverse partners from the archive. A high-level workflow of HeteC is provided in Fig.4 for clarity.
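The following minimal sketch mirrors the coevolution loop just described. All callables (`init_agent`, `init_partner`, `pair`, `rollout`, `update`, `select`) are user-supplied placeholders standing in for the concrete components of Algorithm 1 (e.g., MAPPO updates, clustering-based selection); it is a structural outline under those assumptions, not the authors' implementation.

```python
def hetec_loop(init_agent, init_partner, pair, rollout, update, select,
               pop_size=5, generations=50):
    """Schematic HeteC-style coevolution: pair the two populations, collect
    trajectories, update both sides, grow a partner archive, and select a
    diverse partner population for the next generation."""
    agents = [init_agent() for _ in range(pop_size)]
    partners = [init_partner() for _ in range(pop_size)]
    archive = list(partners)                               # partners generated so far
    for _ in range(generations):
        new_agents, new_partners = [], []
        for agent, partner in pair(agents, partners):      # pairing step
            traj = rollout(agent, partner)                 # interact with the environment
            agent, partner = update(agent, partner, traj)  # e.g., MAPPO with a diversity term
            new_agents.append(agent)
            new_partners.append(partner)
        agents = new_agents                                # updated agents form the next population
        archive.extend(new_partners)                       # updated partners enter the archive
        partners = select(archive, pop_size)               # diverse selection from the archive
    return agents, archive
```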
Fig.4 Illustration of the high-level workflow of HeteC

4.1 Heterogeneous training framework

One challenge of ORC is to train an agent that can coordinate well with unseen heterogeneous human partners. To address this challenge, HeteC utilizes the heterogeneous training framework from MAZE [27] based on evolutionary algorithms [74,75], which has shown excellent performance in handling heterogeneous challenges by maintaining two separate agent and partner populations and coevolving them. In the updating process, HeteC uses the Jensen-Shannon Divergence (JSD) as a diversity term to enhance the objective function, similar to the approach in [20,27]. Then, HeteC updates agents and partners by MAPPO [9], and updates the agent population and partner archive following the approach in MAZE. In the selection process, HeteC uses clustering-based selection [76] to select a set of high-performing but diverse partners to form a subset of the next partner population. After the training process finishes, HeteC returns the AI agent by ensembling the agent population through majority voting, which selects the action voted for by the most agents given a state.
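A minimal sketch of the majority-voting ensemble mentioned above. Each agent is assumed to expose a deterministic `act(observation)` method returning a hashable action; the tie-breaking rule here (first action to reach the top count) is an illustrative choice, not specified by the paper.

```python
from collections import Counter

def ensemble_act(agent_population, observation):
    """Return the action chosen by the most agents in the population for the
    given observation; ties go to the action that first reached the top count."""
    votes = [agent.act(observation) for agent in agent_population]
    return Counter(votes).most_common(1)[0][0]
```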
However, relying solely on update and selection mechanisms is insufficient to promote partner diversity. In the case of MAZE, it quickly converges during training, resulting in limited ability to collaborate with diverse partners and thus limiting its overall coordination abilities. In the following part, we will introduce the mechanisms in HeteC that are designed to further improve the performance.
Mixed partners training The key to achieving ORC is to provide a high-quality population of partners. Existing research has shown that different algorithms yield partners that reach different implicit conventions and converge to different Nash equilibria in cooperative games. HeteC therefore allows other methods to be used to update the partners, enhancing their diversity. Specifically, before the regular update by MAPPO, the partners first update themselves via SP, which further increases the differences between them.
Frozen archive The heterogeneous training framework considers the heterogeneity between agents and partners, so the algorithm converges quickly (as shown in Fig.3 of [27]), resulting in a lack of diversity in the partner population. To address this, we periodically and randomly select and freeze some partners and store them in a special archive called the frozen archive. Once a partner is added to the frozen archive, it is only used for pairing with the agent and will not be updated anymore, i.e., it is directly returned to the frozen archive without updating. During the selection process, a partner is chosen from the frozen archive with probability α to form part of the next generation's partner population. We analyze the sensitivity of this hyperparameter in the ablation studies in Section 5.4.
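A minimal sketch of how the next partner population could be assembled with the frozen-archive mechanism, following the N·α / N·(1−α) split described in Section 5.4. The uniform random picks stand in for HeteC's clustering-based selection, and the function signature is illustrative.

```python
import random

def next_partner_population(archive, frozen_archive, pop_size, alpha=0.6, seed=0):
    """Select roughly N*alpha partners from the frozen archive (kept fixed,
    never updated again) and the remaining partners from the regular archive.
    Uniform sampling replaces clustering-based selection for brevity."""
    rng = random.Random(seed)
    n_frozen = min(int(round(alpha * pop_size)), len(frozen_archive))
    frozen_part = rng.sample(frozen_archive, n_frozen) if n_frozen > 0 else []
    regular_part = rng.sample(archive, pop_size - n_frozen)
    return frozen_part + regular_part
```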

4.2 Communication module

To address the challenges arising from the disparity in observations between the AI agent and human partners, HeteC integrates a communication module into the framework, which enables human partners to transmit the information they perceive to the AI agent. As shown in Fig.5, we utilize a simple communication module by directly concatenating the information to the second-to-last embedding of the agent policy. Although it is transferable, the communication module is considered part of the human partner's policy $\pi^H$ and is updated end-to-end accordingly. Note that other advanced communication methods can also be employed to enhance the performance. Furthermore, we examine the impact of various input information on the communication module and the transferability of the communication module in Section 5.4.
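A minimal PyTorch sketch of this concatenation scheme, assuming flat observation vectors: the message produced from the human partner's observation is concatenated to the agent's second-to-last embedding before the action head. The layer sizes and two-layer structure are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CommAugmentedPolicy(nn.Module):
    """Agent policy whose action head consumes [agent embedding ; message],
    where the message is produced from the human partner's observation."""
    def __init__(self, obs_dim, partner_obs_dim, msg_dim, hidden_dim, n_actions):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        # communication module: compresses the partner's observation into a message
        self.comm = nn.Sequential(nn.Linear(partner_obs_dim, msg_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim + msg_dim, n_actions)

    def forward(self, agent_obs, partner_obs):
        h = self.encoder(agent_obs)          # agent's own embedding
        m = self.comm(partner_obs)           # message m from the human side
        logits = self.head(torch.cat([h, m], dim=-1))
        return torch.distributions.Categorical(logits=logits)

# Example: sample an action from random observations (sizes are arbitrary)
policy = CommAugmentedPolicy(obs_dim=20, partner_obs_dim=30, msg_dim=8,
                             hidden_dim=64, n_actions=6)
dist = policy(torch.randn(1, 20), torch.randn(1, 30))
action = dist.sample()
```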
Fig.5 Illustration of the policy network and communication module

Finally, we provide an example of HeteC in action on the FC layout of Overcooked. As depicted in Fig.6, the human and the AI are situated in separate rooms, each with distinct capabilities. For instance, the human on the left can only collect and deliver onions and dishes, while the AI on the right can only receive the onions, place them in cooking pots, and deliver the soup using the dish. Effective coordination is essential if they wish to deliver a complete soup. Furthermore, their observations differ: the human has access to all game information, while the AI lacks knowledge of the cooking pots' location, making it challenging to fulfill cooking tasks. Thanks to the communication module in HeteC, the human can transmit valuable information ($m$) to enhance the AI agent's observation, such as the position of the cooking pots. The effectiveness of HeteC is demonstrated through various experiments in Section 5.
Fig.6 Illustration of the HeteC on FC layout of Overcooked

5 Experiments

In this section, we design experiments on several environments from ORCBench to evaluate the proposed method. We investigate the following research questions (RQs): (1) Is the ORC challenging and why (Section 5.2)? (2) What is the performance of HeteC on the ORC environments (Section 5.3)? (3) Are the mechanisms of HeteC important (Section 5.4)? (4) Can HeteC agent coordinate well with real humans (Section 5.5)?

5.1 Experimental settings

We use five layouts of Overcooked, i.e., CR, H-CR, AA, AA-2, and FC, as our test beds. As stated in Section 3.3, the observation in Overcooked has a total of 10 channels, where different channels include different information. In the experiments, we create different partially observable environments by masking different channels. Note that there are multiple ways to mask channels, and we randomly choose the three masks shown in Tab.1. For Emergency Rescue, we consider three layouts, i.e., 1-1, 1-2, and 1-3, where x-y denotes that there are x humans and y agents in the environment. We also use masks to set up these environments, where Mask 1, Mask 2, and Mask 3 give the agent progressively smaller sight, making the environment increasingly challenging.
Tab.1 Observation channels and Masks in Overcooked environment
Channel Type Information Mask 1 Mask 2 Mask 3
0 Basic Information Location of the Cooking Pots
1 Location of the counters ×
2 Location of the onions ×
3 Location of the dishes × ×
4 Location for delivery
5 Advanced Information Number of onions in the pot × × ×
6 Cooking time for the onions × ×
7 Location of onion soup × × ×
8 Number of available dishes ×
9 Number of available onions × ×
We compare the proposed method with the following methods:
● Population play (PP) [67] maintains a population of agents that interact with each other. During each PP iteration, pairs of agents are drawn from the population and trained for several timesteps.
● Fictitious Co-Play (FCP) [21] is a two-stage approach to learning to collaborate with humans without human data. In the first stage, it builds a pool of partners (and their checkpoints) that represent different conventions; in the second stage, it trains a best-response agent against this diverse partner pool.
● Maximum Entropy Population-based training (MEP) [22] also follows a two-stage framework, while it proposes to learn a diverse partner population through maximizing one centralized population entropy objective.
● Hidden-utility Self-Play (HSP) [23] explicitly models the human biases as hidden reward functions. On this basis, it augments the policy pool with biased policies and afterward trains an adaptive policy.
● Multi-agent zero-shot coordination by coevolution (MAZE) [27] separately maintains agent and partner population and coevolves them, demonstrating excellent performance in heterogeneous environments.
We use the following partners to evaluate the compared methods, simulating humans with skill levels ranging from low to high: 1) SP partners, trained by SP [65,66] and used to simulate poorly performing humans; 2) MEP partners [22], the partners obtained in the first stage of MEP, used to simulate moderately performing humans; 3) MAZE partners [27], obtained by exchanging the roles of agent and partner and re-training in another separate run, used to simulate highly performing humans.
We implement all the methods using MAPPO [9] as the basic RL algorithm. For a fair comparison, HeteC uses the same settings as the above-mentioned methods for shared mechanisms, such as the hyper-parameter settings of MAPPO and the number of training episodes. We report the average results across three seeds, identical for all algorithms and tasks. Detailed settings of the different methods are provided in Appendix A.1.

5.2 RQ1: Challenge of ORC

In order to verify the impact of ORC settings on the performance of different algorithms, we first compare a series of methods under MDP and POMDP settings and test them with different partners. We randomly select Mask 2 as the ORC POMDP environment. Each algorithm is trained on each type of environment. We further compute the ranking of each algorithm under each setting as in [77], and the averaged rankings are reported in Tab.2.
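One plausible way to compute such an average ranking is sketched below: rank the algorithms within each setting by reward (rank 1 is best), then average the ranks over settings. The exact procedure of [77] is not reproduced here; the example only uses the three CR (MDP) rows of Tab.2 for illustration.

```python
import numpy as np
from scipy.stats import rankdata

def average_ranking(reward_table):
    """reward_table has shape (n_settings, n_algorithms); returns the mean
    rank of each algorithm, where rank 1 corresponds to the highest reward."""
    rewards = np.asarray(reward_table, dtype=float)
    ranks = np.vstack([rankdata(-row) for row in rewards])  # negate: higher reward -> lower rank
    return ranks.mean(axis=0)

# Mean rewards from the CR (MDP) rows of Tab.2, columns: PP, FCP, MEP, HSP, MAZE
cr_mdp = [[139.5, 174.5, 153.625, 161.25, 178.25],
          [163.0, 183.375, 184.25, 176.875, 186.75],
          [215.875, 213.5, 205.375, 192.375, 217.0]]
print(average_ranking(cr_mdp))
```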
Tab.2 The reward (mean±std.) achieved by the compared algorithms when testing with different partners on CR and AA layouts. For each combination of layout and partner, the largest reward is bolded
Environment Type Partner PP FCP MEP HSP MAZE
CR MDP SP 139.5±12.98 174.5±19.37 153.625±27.33 161.25±35.71 178.25±17.80
MEP 163.0±4.79 183.375±10.61 184.25±14.41 176.875±12.89 186.75±13.54
MAZE 215.875±13.71 213.5±17.50 205.375±12.13 192.375±34.57 217.0±10.89
AA SP 159.75±28.51 184.0±29.57 143.875±21.47 121.75±20.64 183.375±23.73
MEP 243.5±15.98 264.15±15.87 250.375±18.39 207.75±20.15 269.625±22.31
MAZE 285.25±10.36 331.375±11.61 318.0±10.41 314.625±9.59 334.125±19.29
Average ranking 4.0 2.17 3.3 4.33 1.17
CR POMDP SP 81.375±33.75 58.125±13.09 89.125±41.98 47.875±4.85 112.5±26.83
MEP 55.0±38.03 65.0±26.07 72.25±14.28 32.875±12.84 76.0±15.02
MAZE 44.625±14.34 82.25±18.26 62.25±17.88 54.275±14.16 88.25±21.27
AA SP 93.75±10.76 43.875±6.26 67.75±21.03 95.375±23.76 113.51±31.52
MEP 111.125±16.01 51.51±6.56 82.01±26.09 102.625±40.95 144.875±18.23
MAZE 96.5±15.64 54.125±5.95 93.75±15.44 115.5±12.18 169.375±10.62
Average ranking 3.33 4.0 3.17 3.5 1.0
All the methods show decreased performance in POMDP environments, indicating the challenges posed by ORC. MAZE performs the best, achieving the optimal average ranking in both types of environments, i.e., 1.17 and 1.0, respectively. Considering the heterogeneous nature of agent and partner, the advantage of MAZE becomes more prominent in POMDP (average ranking improves from 1.17 to 1.0). Thus, MAZE will serve as an important baseline in the subsequent experiments.

5.3 RQ2: Performance of HeteC

To investigate the performance of HeteC, we first compare HeteC with MAZE on all five layouts of Overcooked with different masks. As shown in Tab.3, there is not much difference between MAZE and HeteC in fully observable environments, with HeteC performing slightly better, especially in the more challenging environments except for CR. In the three masked environments, HeteC demonstrates significant advantages over MAZE and achieves the best performance across all of them. We also compare HeteC with MAZE on the 1-1, 1-2, and 1-3 layouts of Emergency Rescue, as shown in Tab.4. In the simpler settings (i.e., fully observable and Mask 1), MAZE and HeteC exhibit similar performance. However, as the agent's sight decreases, HeteC demonstrates a more pronounced advantage. As the number of controlled agents increases, the environment becomes more challenging, but our algorithm still performs well, demonstrating its scalability. Note that the higher reward in the 1-3 environment compared to 1-2 is because both environments require at least two agents to reach the goal; since 1-3 has a higher total number of agents, it is relatively easier in comparison.
Tab.3 The reward (mean±std.) achieved by the compared algorithms when testing with different partners on CR, H-CR, AA, AA-2, and FC layouts of Overcooked. For each combination of layout and partner, the largest reward is bolded
Environment Partner Fully observable Mask 1 Mask 2 Mask 3
MAZE HeteC MAZE HeteC MAZE HeteC MAZE HeteC
CR SP 178.25±17.80 160.375±16.01 66.5±12.72 150.5±21.01 112.5±26.83 116.625±11.31 64.5±30.68 130.5±9.06
MEP 186.75±13.54 173.125±10.71 80.5±34.47 182.5±16.37 76.0±15.02 156.99±29.68 42.75±2.11 170.125±29.03
MAZE 217.0±10.89 201.5±17.02 78.25±29.66 205.875±8.45 88.25±21.27 206.375±17.63 71.875±25.59 196.375±13.24
H-CR SP 127.25±16.41 140.75±17.25 30.625±21.01 175.375±41.31 142.75±58.03 156.375±23.91 14.7±7.92 139.125±35.62
MEP 211.75±8.97 208.625±9.73 6.0±3.45 193.6±48.16 213.625±13.18 221.75±4.95 10.275±5.77 209.25±12.43
MAZE 213.925±14.61 219.875±10.02 5.5±2.94 218.75±12.09 214.375±5.74 227.5±6.92 54.23±46.21 205.5±10.01
AA SP 183.375±23.73 160.375±17.37 18.675±29.15 141.5±29.26 113.51±31.52 169.5±32.78 38.0±57.31 148.625±26.38
MEP 269.625±22.31 257.5±29.61 31.375±40.05 248.375±44.05 144.875±18.23 271.85±15.55 76.125±74.53 235.75±17.94
MAZE 334.125±19.29 365.0±6.09 26.3±8.01 360.25±7.89 169.375±10.62 356.625±8.83 157.625±30.29 347.5±10.07
AA-2 SP 128.875±28.83 117.25±37.81 65.375±41.07 84.3±53.83 90.875±6.39 97.125±57.22 91.125±32.65 111.0±16.32
MEP 190.5±13.66 199.75±49.32 23.5±7.23 217.0±29.81 97.375±17.11 231.625±19.97 142.25±20.94 200.6±14.36
MAZE 243.75±16.72 272.0±16.0 26.375±4.95 245.625±9.77 93.5±52.84 252.375±8.83 225.625±11.42 260.0±2.21
FC SP 99.0±16.25 101.625±15.87 4.625±5.65 99.5±9.44 5.2±6.31 110.4±8.69 8.025±5.24 108.5±20.68
MEP 116.0±12.54 129.3±33.01 5.225±3.43 108.125±32.265 5.625±5.28 110.75±29.97 3.475±2.19 107.375±18.18
MAZE 173.0±13.62 185.875±5.45 3.7±2.31 120.375±35.85 5.0±2.59 185.5±9.15 24.775±29.38 181.375±2.21
Average Ranking 1.6 1.4 2 1 2 1 2 1
Tab.4 The reward (mean±std.) achieved by the compared algorithms when testing with different partners on 1-1, 1-2, 1-3 layouts of Emergency Rescue. For each combination of layout and partner, the largest reward is bolded
Environment Partner Fully observable Mask 1 Mask 2 Mask 3
MAZE HeteC MAZE HeteC MAZE HeteC MAZE HeteC
1-1 SP 0.70±0.04 0.72±0.01 0.73±0.01 0.73±0.02 0.69±0.02 0.70±0.03 0.51±0.01 0.73±0.01
MEP 0.71±0.05 0.73±0.01 0.75±0.02 0.74±0.02 0.69±0.01 0.70±0.03 0.53±0.03 0.74±0.02
MAZE 0.76±0.04 0.77±0.02 0.76±0.02 0.76±0.02 0.72±0.02 0.74±0.03 0.58±0.02 0.78±0.01
1-2 SP 0.49±0.05 0.50±0.01 0.48±0.01 0.47±0.05 0.33±0.09 0.48±0.03 0.31±0.05 0.47±0.04
MEP 0.52±0.05 0.52±0.02 0.51±0.03 0.49±0.06 0.42±0.08 0.56±0.03 0.31±0.04 0.48±0.03
MAZE 0.72±0.04 0.72±0.01 0.70±0.02 0.68±0.05 0.57±0.09 0.70±0.06 0.48±0.03 0.63±0.02
1-3 SP 0.64±0.02 0.63±0.01 0.56±0.03 0.60±0.03 0.38±0.07 0.52±0.08 0.42±0.04 0.60±0.07
MEP 0.65±0.02 0.64±0.01 0.57±0.03 0.60±0.03 0.37±0.07 0.51±0.08 0.44±0.05 0.61±0.07
MAZE 0.71±0.02 0.69±0.01 0.64±0.03 0.67±0.03 0.46±0.09 0.60±0.08 0.50±0.06 0.66±0.07
Average ranking 1.44 1.33 1.33 1.44 2 1 2 1
Then, we plot the training curves of MAZE and HeteC on the AA layout of Overcooked and the 1-2 layout of Emergency Rescue under different settings, as shown in Fig.7. MAZE shows a significant performance decline in the masked environments, whereas the masked environments have a much smaller impact on HeteC. A similar conclusion can be drawn on the CR layout, as shown in Appendix A.2.
Fig.7 Training curves of MAZE and HeteC on fully observable and different masks of the AA and 1-2 layouts. (a) MAZE-AA; (b) HeteC-AA; (c) MAZE-1-2; (d) HeteC-1-2

Furthermore, we compare the performance variation of all algorithms between the fully observable and Mask 2 environments on the CR and AA layouts in Fig.8. It can be observed that HeteC exhibits significant advantages in both cases.
Fig.8 Performance comparison of different methods on CR and AA layouts when testing with different partners. The gray shaded bar denotes the degree of performance variation, which equals the highest performance across the two settings minus the performance in the fully observable environment. (a) CR; (b) AA

5.4 RQ3: Further studies

Ablation studies In order to verify the impact of different components of our algorithm on its performance, we conduct ablation experiments on different masks of the AA layout. We consider the following ablation methods: MAZE, MAZE with a communication module (i.e., MAZE + Comm.), HeteC without mixed partner training (i.e., HeteC w/o MPT), HeteC without the frozen archive (i.e., HeteC w/o FA), HeteC without clustering-based selection, using random selection instead (i.e., HeteC w/o CS), and HeteC.
As shown in Tab.5, HeteC has the best average ranking, i.e., 1.50, among all the compared methods. The use of the communication module in MAZE (i.e., MAZE + Comm.) significantly improves the performance in multiple POMDP environments. HeteC w/o MPT and HeteC w/o FA show similar overall performance, i.e., 3.12 and 3.00, respectively. HeteC w/o CS shows a significant performance decline. Clustering-based selection is a key basic mechanism in diversity-based optimization, whose effectiveness has also been verified in previous studies [76]. The experimental results demonstrate the effectiveness of the different components of HeteC: communication significantly enhances performance in POMDP environments, and the proposed mixed partners training and frozen archive mechanisms further improve performance in ORC scenarios.
Tab.5 The reward (mean ± std.) achieved by the compared algorithms when testing with different partners on the AA layout. The POMDP environment is Mask 2. For each combination of layout and partner, the largest reward is bolded
Environment Partner MAZE MAZE + Comm. HeteC w/o MPT HeteC w/o FA HeteC w/o CS HeteC
Fully observable SP 139.375±12.64 143.0±30.61 144.875±39.63 135.75±6.81 134.625±20.51 140.375±20.61
MEP 256.625±30.41 247.525±37.61 242.875±33.05 242.125±25.70 256.25±18.24 257.5±29.61
MAZE 335.0±11.83 350.125±16.39 358.625±10.29 343.375±2.21 304.5±13.08 365.0±6.09
Mask 1 SP 3.6±2.36 129.125±32.86 133.25±32.83 145.5±37.17 134.625±26.92 141.5±29.26
MEP 5.725±2.31 244.0±48.67 244.625±25.53 244.375±37.29 233.125±36.11 248.375±44.05
MAZE 6.94±1.05 329.9±6.87 353.125±10.26 357.0±8.86 328.875±17.63 360.25±7.89
Mask 2 SP 99.625±25.07 136.0±18.06 131.1±13.85 147.1±27.33 135.02±30.41 149.375±33.73
MEP 93.5±15.91 244.25±23.93 256.875±29.65 248.5±16.57 274.5±46.99 271.85±15.55
MAZE 85.75±15.81 341.875±12.04 348.375±14.23 337.375±19.76 353.125±12.19 356.375±8.83
Mask 3 SP 82.5±52.17 137.125±32.01 141.25±20.43 143.75±11.36 139.75±28.67 148.625±26.38
MEP 129.0±28.06 226.75±40.99 229.875±31.31 250.375±25.44 207.5±59.36 235.75±17.94
MAZE 132.25±53.58 341.375±13.33 341.25±8.12 350.5±9.95 326.73±35.55 347.5±10.07
Average ranking 5.42 3.83 3.12 3.00 4.08 1.50
Sensitivity analysis of frozen ratio α. We investigate the influence of the frozen ratio α for sampling partners from the frozen archive. In each generation, HeteC selects $N\alpha$ partners from the frozen archive and the remaining $N(1-\alpha)$ partners from the regular archive. As shown in Fig.9, α values that are too high or too low have a negative impact on algorithm performance. We select a moderate value of 0.6.
Fig.9 Sensitivity analysis on frozen ratio α on AA layout using Mask 2. 0.6 is used in our main experiments

Influence of the input information of communication module In our experiments, we typically utilize the full observation information of the human partner as the default input for the communication module. However, we want to investigate the impact of using only specific masked channels as the input, which we refer to as HeteC-specific.
As shown in Tab.6, using full information has the best average ranking, which is slightly better than using specific information. Note that the results of HeteC-specific on Mask 3 are better than HeteC’s. This may be because the missing information from Mask 3 has a significant impact on the results. Using specific information allows the agent to pay more attention to this information, resulting in less interference compared to using all the information.
Tab.6 The reward (mean±std.) achieved by the compared algorithms when testing with different partners on the AA layout. For each combination of layout and partner, the largest reward is bolded
Environment Partner MAZE MAZE + Comm. HeteC HeteC-specific
Mask 1 SP 3.6±2.36 129.125±32.86 141.5±29.26 135.0±16.28
MEP 5.725±2.31 244.0±48.67 248.375±44.05 246.125±13.31
MAZE 6.94±1.05 329.9±6.87 360.25±7.89 349.125±10.47
Mask 2 SP 99.625±25.07 136.0±18.06 169.5±32.78 164.875±37.88
MEP 93.5±15.91 244.25±23.93 271.85±15.55 255.875±36.22
MAZE 85.75±15.81 341.875±12.04 356.375±8.83 353.0±3.41
Mask 3 SP 82.5±52.17 137.125±32.01 148.625±26.38 154.25±29.71
MEP 129.0±28.06 226.75±40.99 235.75±17.94 243.875±24.75
MAZE 132.25±53.58 341.375±13.33 347.5±10.07 355.625±14.23
Average ranking 4.0 3.0 1.33 1.67
Generalization ability of the communication modules In practical deployment, we aim for a robust communication module that can generalize to different masked environments. To verify the generalization ability of our learned communication modules, we transfer the trained communication modules (both full information and specific information) to other environments and directly equip them to the agents and partners trained in those corresponding environments.
Specifically, we equip the 6 communication modules (full information and specific information on the 3 masks) to the 9 partners on the 3 masks and generate a 9×6 heat-map, as shown in Fig.10. It is evident that generalization does incur some performance decline, but the differences for the full-information modules are relatively small, indicating a certain level of generalization capability. As expected, the specific-channel modules show a significant decline, especially when the module trained with Mask 2's specific channels is used in the Mask 1 and Mask 3 environments. This is because both Mask 1 and Mask 3 obscure important information that the specific channels of Mask 2 cannot provide. In the future, achieving better communication modules will require improved information-extraction approaches. Additionally, exposure to a wider range of POMDP settings during training is crucial for training a robust communication module.
Fig.10 Generalization ability of communication modules

5.5 RQ4: Coordination with real humans

Finally, we conduct experiments with real human participants to investigate the ability of different methods to coordinate with real humans. We recruit a total of 16 participants to evaluate five algorithms (FCP, MEP, MAZE, HeteC, HeteC-specific) in two different environments (CR and AA). The participants were randomly divided into two groups, with 8 participants assigned to each environment. During the evaluation process, each participant experienced all five algorithms in a randomized order to mitigate potential experimental biases. After completing these experiments, each participant provided subjective ratings for their five partners, expressing their preference for coordinating with each agent. For the convenience of the experiment, we utilize the pre-trained communication module in HeteC to communicate with the agent, so the real human participant only needs to focus on operating the player. Our ethical statement is provided in Appendix A.3.
Main results We first show the environment reward achieved by the five algorithms when coordinating with real human participants, as shown in Tab.7. Among these algorithms, HeteC and HeteC-specific demonstrate significant advantages over the compared baselines, achieving average rankings of 1.67 and 2.0, respectively. MAZE is the best of the three baseline methods, which is consistent with the conclusions in Sections 5.2 and 5.3.
Tab.7 The reward (mean±std.) achieved by the compared algorithms when testing with real human participants on the AA and CR layouts. For each combination of layout and mask type, the largest reward is bolded
Environment Type FCP MEP MAZE HeteC HeteC-specific
CR Fully Observable 217.5±25.37 207.5±26.33 222.5±21.06 215.0±19.36 212.5±22.22
Mask 2 105.0±21.79 92.5±24.36 137.5±30.72 210.0±26.45 202.5±25.37
Mask 3 95.0±23.97 100.0±17.32 132.5±26.33 200.0±17.32 205.0±16.58
AA Fully Observable 337.5±29.04 322.5±30.72 342.5±33.81 352.5±26.33 347.5±31.52
Mask 2 132.5±19.84 147.5±13.91 207.5±37.33 342.5±21.06 335.0±27.83
Mask 3 112.5±34.55 137.5±25.37 197.5±27.27 325.0±39.68 327.5±34.55
Average ranking 4.17 4.5 2.67 1.67 2.0
Human preference To further consider the humans' subjective feedback, we additionally include a metric, “Human preference” [21,23]. The human preference for method A over method B is calculated as follows. Let $N$ be the total number of human players participating in the experiment, $N_A$ be the number of human players who rank A over B, and $N_B$ be the number of those who rank B over A. Then, the human preference for method A over method B is computed as $\frac{N_A}{N}-\frac{N_B}{N}$. Each participant provides subjective ratings after coordinating with all five agents, e.g., A>C>D>B>E. The overall human preference value is obtained by averaging the values over the two layouts. As shown in Fig.11, HeteC and HeteC-specific are better than the three baselines, which is consistent with the reward comparisons in Tab.7.
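A minimal sketch of this preference computation, operating on per-participant orderings (best first). The method names in the example are placeholders.

```python
def human_preference(rankings, a, b):
    """Preference of method `a` over `b`: (N_A - N_B) / N, where N_A counts
    participants ranking `a` above `b`, N_B counts the reverse, and N is the
    total number of participants. Each ranking lists methods best-first."""
    n = len(rankings)
    n_a = sum(r.index(a) < r.index(b) for r in rankings)
    n_b = sum(r.index(b) < r.index(a) for r in rankings)
    return (n_a - n_b) / n

# Example with three participants and placeholder method names:
orders = [["HeteC", "MAZE", "MEP"],
          ["MAZE", "HeteC", "MEP"],
          ["HeteC", "MEP", "MAZE"]]
print(human_preference(orders, "HeteC", "MAZE"))  # (2 - 1) / 3 ≈ 0.33
```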
Fig.11 Human preferences of different methods on CR and AA layouts

6 Final remarks

In conclusion, this paper makes contributions to the field of ORC through the introduction of the problem formulation, ORCBench benchmark, and the HeteC framework. The ORCBench benchmark provides a standardized evaluation environment that incorporates heterogeneity and realistic conditions, enabling the comparison and assessment of coordination algorithms. The HeteC framework enhances coordination performance by incorporating mixed partner training algorithms, frozen archive, and a communication module, addressing the challenges posed by variations in capabilities, and limited observations in open and real-world environments. Through a series of experiments, the effectiveness of HeteC in improving coordination is demonstrated, serving as a crucial step towards the practical deployment and application of human-AI coordination in cooperative MARL.
There are several future directions to explore in the field of ORC. Firstly, expanding the scope beyond the current experiments and modeling other cooperative multi-agent environments as ORC problems would provide a broader set of tasks. Additionally, addressing more complex and challenging scenarios, such as noisy environments and external attacks, would further enhance the robustness and adaptability of coordination strategies. One limitation of this work is that we only use a basic communication method. Incorporating improved communication methods, such as leveraging natural language processing techniques and large language models, could enhance the ability of AI agents to understand and generate human language, enabling more effective information exchange and coordination with human partners. These future research directions hold the potential to advance the field of ORC, enabling more sophisticated and efficient coordination in diverse and dynamic environments.

Cong Guan received the BSc degree and MSc degree from School of Mechanical Engineering and Automation, Northeastern University, China. He is currently pursuing the PhD degree with the Department of Computer Science and Technology, Nanjing University, China. His current research interests mainly include machine learning, reinforcement learning, and multi-agent reinforcement learning

Ke Xue received the BSc degree in Mathematics and Applied Mathematics from the School of Mathematics, Sun Yat-sen University, China in 2019. He is currently pursuing the PhD degree with the School of Artificial Intelligence, Nanjing University, China. His current research interests mainly include machine learning and black-box optimization.

Chunpeng Fan received the MSc degree in communication engineering from Liaoning University of Technology, China in 2017. He is currently working at Polixir Technologies. His research interests include multi-agent reinforcement learning and multi-agent systems.

Feng Chen received the BSc degree from the School of Artificial Intelligence, Nanjing University, China in 2022. He is currently pursuing the MSc degree with the School of Artificial Intelligence, Nanjing University, China. His research interests include multi-agent reinforcement learning and multi-agent systems.

Lichao Zhang received the MSc degree in Agricultural Electrification and Automation from Shihezi University, China in 2018. He is currently working at Polixir Technologies. His research interests include multi-agent reinforcement learning and multi-agent systems.

Lei Yuan received the BSc degree from the Department of Electronic Engineering, Tsinghua University, China in 2016, and the MSc degree from the Chinese Aeronautical Establishment, China in 2019. He is currently pursuing the PhD degree with the Department of Computer Science and Technology, Nanjing University, China. His current research interests mainly include machine learning, reinforcement learning, and multi-agent reinforcement learning.

Chao Qian received the PhD degree from the Department of Computer Science and Technology, Nanjing University, China in 2015, and is currently an associate professor at the School of Artificial Intelligence, Nanjing University, China. His research interests mainly include theoretical analysis of evolutionary algorithms, design of safe and efficient EAs, and evolutionary learning. He is an associate editor of IEEE Transactions on Evolutionary Computation and SCIENCE CHINA Information Sciences. He has regularly given tutorials and co-chaired special sessions at leading evolutionary computation conferences (CEC, GECCO, PPSN), and was invited to give an Early Career Spotlight Talk, “Towards Theoretically Grounded Evolutionary Learning”, at IJCAI 2022.

Yang Yu received the PhD degree from the Department of Computer Science and Technology, Nanjing University, China in 2011, and is currently a professor at the School of Artificial Intelligence, Nanjing University, China. His research interests include machine learning, mainly reinforcement learning and derivative-free optimization for learning. Prof. Yu was granted the CCF-IEEE CS Young Scientist Award in 2020, was recognized as one of the AI’s 10 to Watch by IEEE Intelligent Systems, and received the PAKDD Early Career Award in 2018. His teams won the championships of the 2018 OpenAI Retro Contest on transfer reinforcement learning and the 2021 ICAPS Learning to Run a Power Network Challenge with Trust. He has served as an Area Chair for NeurIPS, ICML, IJCAI, AAAI, etc.

References

[1]
Klein G, Woods D D, Bradshaw J M, Hoffman R R, Feltovich P J . Ten challenges for making automation a “team player” in joint human-agent activity. IEEE Intelligent Systems, 2004, 19( 6): 91–95
[2]
Dafoe A, Bachrach Y, Hadfield G, Horvitz E, Larson K, Graepel T . Cooperative AI: machines must learn to find common ground. Nature, 2021, 593( 7857): 33–36
[3]
Hernandez-Leal P, Kartal B, Taylor M E . A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems, 2019, 33( 6): 750–797
[4]
Du W, Ding S F . A survey on multi-agent deep reinforcement learning: From the perspective of challenges and applications. Artificial Intelligence Review, 2021, 54( 5): 3215–3238
[5]
Oroojlooy A, Hajinezhad D . A review of cooperative multi-agent deep reinforcement learning. Applied Intelligence, 2023, 53( 11): 13677–13722
[6]
Lowe R, Wu Y, Tamar A, Harb J, Abbeel P, Mordatch I. Multi-agent actor-critic for mixed cooperative-competitive environments. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6382−6393
[7]
Sunehag P, Lever G, Gruslys A, Czarnecki W M, Zambaldi V, Jaderberg M, Lanctot M, Sonnerat N, Leibo J Z, Tuyls K, Graepel T. Value-decomposition networks for cooperative multi-agent learning based on team reward. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. 2018, 2085−2087
[8]
Rashid T, Samvelyan M, Schroeder C, Farquhar G, Foerster J, Whiteson S. QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In: Proceedings of the 35th International Conference on Machine Learning. 2018, 4295−4304
[9]
Yu C, Velu A, Vinitsky E, Gao J, Wang Y, Bayen A M, Wu Y. The surprising effectiveness of PPO in cooperative multi-agent games. In: Proceedings of the 36th Conference on Neural Information Processing Systems Datasets and Benchmarks Track. 2022, 24611−24624
[10]
Gorsane R, Mahjoub O, De Kock R J, Dubb R, Singh S, Pretorius A. Towards a standardised performance evaluation protocol for cooperative MARL. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022, 5510−5521
[11]
Hu H, Lerer A, Peysakhovich A, Foerster J. “Other-play” for zero-shot coordination. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 409
[12]
Carroll M, Shah R, Ho M K, Griffiths T, Seshia S A, Abbeel P, Dragan A. On the utility of learning about humans for human-AI coordination. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 465
[13]
Yuan L, Li L, Zhang Z, Chen F, Zhang T, Guan C, Yu Y, Zhou Z H. Learning to coordinate with anyone. In: Proceedings of the 5th International Conference on Distributed Artificial Intelligence, 2023, 4
[14]
Zhou Z H . Open-environment machine learning. National Science Review, 2022, 9( 8): nwac123
[15]
Liu X, Liang J, Liu D Y, Chen R, Yuan S M . Weapon-target assignment in unreliable peer-to-peer architecture based on adapted artificial bee colony algorithm. Frontiers of Computer Science, 2022, 16( 1): 161103
[16]
Parmar J, Chouhan S, Raychoudhury V, Rathore S . Open-world machine learning: applications, challenges, and opportunities. ACM Computing Surveys, 2023, 55( 10): 205
[17]
Yuan L, Zhang Z, Li L, Guan C, Yu Y. A survey of progress on cooperative multi-agent reinforcement learning in open environment. 2023, arXiv preprint arXiv: 2312.01058
[18]
Stone P, Kaminka G A, Kraus S, Rosenschein J S. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In: Proceedings of the 24th AAAI Conference on Artificial Intelligence. 2010, 1504−1509
[19]
Mirsky R, Carlucho I, Rahman A, Fosong E, Macke W, Sridharan M, Stone P, Albrecht S V. A survey of ad hoc teamwork research. In: Proceedings of the 19th European Conference on Multi-Agent Systems. 2022, 275−293
[20]
Lupu A, Cui B, Hu H, Foerster J. Trajectory diversity for zero-shot coordination. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 7204−7213
[21]
Strouse D J, McKee K R, Botvinick M, Hughes E, Everett R. Collaborating with humans without human data. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 14502−14515
[22]
Zhao R, Song J, Yuan Y, Hu H, Gao Y, Wu Y, Sun Z, Yang W. Maximum entropy population-based training for zero-shot human-AI coordination. In: Proceedings of the 37th AAAI Conference on Artificial Intelligence. 2023, 689
[23]
Yu C, Gao J, Liu W, Xu B, Tang H, Yang J, Wang Y, Wu Y. Learning zero-shot cooperation with humans, assuming humans are biased. In: Proceedings of the 11th International Conference on Learning Representations. 2023
[24]
Wang X, Zhang S, Zhang W, Dong W, Chen J, Wen Y, Zhang W. Quantifying zero-shot coordination capability with behavior preferring partners. In: Proceedings of the 12th International Conference on Learning Representations. 2024
[25]
Kapetanakis S, Kudenko D. Reinforcement learning of coordination in heterogeneous cooperative multi-agent systems. In: Proceedings of the 3rd International Joint Conference on Autonomous Agents and Multiagent Systems. 2004, 1258−1259
[26]
Wang C, Pérez-D’Arpino C, Xu D, Li F F, Liu K, Savarese S. Co-GAIL: Learning diverse strategies for human-robot collaboration. In: Proceedings of the 5th Conference on Robot Learning. 2022, 1279−1290
[27]
Xue K, Wang Y, Guan C, Yuan L, Fu H, Fu Q, Qian C, Yu Y. Heterogeneous multi-agent zero-shot coordination by coevolution. 2022, arXiv preprint arXiv: 2208.04957
[28]
Cabrera C, Paleyes A, Thodoroff P, Lawrence N D. Real-world machine learning systems: a survey from a data-oriented architecture perspective. 2023, arXiv preprint arXiv: 2302.04810
[29]
Davenport T H, Ronanki R. Artificial intelligence for the real world. Harvard Business Review, 2018, 96(1): 108−116
[30]
Fontaine M C, Hsu Y C, Zhang Y, Tjanaka B, Nikolaidis S. On the importance of environments in human-robot coordination. In: Proceedings of the 17th Robotics: Science and Systems. 2021
[31]
Busoniu L, Babuska R, De Schutter B . A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2008, 38( 2): 156–172
[32]
Zhang K, Yang Z, Başar T. Multi-agent reinforcement learning: a selective overview of theories and algorithms. In: Vamvoudakis K G, Wan Y, Lewis F L, Cansever D, eds. Handbook of Reinforcement Learning and Control. Cham: Springer, 2021, 321−384
[33]
Sartoretti G, Kerr J, Shi Y, Wagner G, Kumar T K S, Koenig S, Choset H . PRIMAL: pathfinding via reinforcement and imitation multi-agent learning. IEEE Robotics and Automation Letters, 2019, 4( 3): 2378–2385
[34]
Wang J, Xu W, Gu Y, Song W, Green T C. Multi-agent reinforcement learning for active voltage control on power distribution networks. In: Proceedings of the 35th Conference on Advances in Neural Information Processing Systems. 2021, 3271−3284
[35]
Xue K, Xu J, Yuan L, Li M, Qian C, Zhang Z, Yu Y. Multi-agent dynamic algorithm configuration. In: Proceedings of the 36th Conference on Advances in Neural Information Processing Systems. 2022, 20147−20161
[36]
Wen M, Kuba J G, Lin R, Zhang W, Wen Y, Wang J, Yang Y. Multi-agent reinforcement learning is a sequence modeling problem. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022, 16509−16521
[37]
Samvelyan M, Rashid T, De Witt C S, Farquhar G, Nardelli N, Rudner T G J, Hung C, Torr P H S, Foerster J N, Whiteson S. The starcraft multi-agent challenge. In: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. 2019, 2186−2188
[38]
Bard N, Foerster J N, Chandar S, Burch N, Lanctot M, Song H F, Parisotto E, Dumoulin V, Moitra S, Hughes E, Dunning I, Mourad S, Larochelle H, Bellemare M G, Bowling M . The hanabi challenge: A new frontier for AI research. Artificial Intelligence, 2020, 280: 103216
[39]
Zhu C, Dastani M, Wang S. A survey of multi-agent reinforcement learning with communication. 2022, arXiv preprint arXiv: 2203.08975
[40]
Zhang F, Jia C, Li Y C, Yuan L, Yu Y, Zhang Z. Discovering generalizable multi-agent coordination skills from multi-task offline data. In: Proceedings of the 11th International Conference on Learning Representations. 2023
[41]
Wang X, Zhang Z, Zhang W. Model-based multi-agent reinforcement learning: Recent progress and prospects. 2022, arXiv preprint arXiv: 2203.10603
[42]
Guo J, Chen Y, Hao Y, Yin Z, Yu Y, Li S. Towards comprehensive testing on the robustness of cooperative multi-agent reinforcement learning. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2022
[43]
Yuan L, Zhang Z, Xue K, Yin H, Chen F, Guan C, Li L, Qian C, Yu Y. Robust multi-agent coordination via evolutionary generation of auxiliary adversarial attackers. In: Proceedings of the 37th AAAI Conference on Artificial Intelligence. 2023, 1319
[44]
Foerster J N, Assael Y M, De Freitas N, Whiteson S. Learning to communicate with deep multi-agent reinforcement learning. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016, 2145−2153
[45]
Sukhbaatar S, Szlam A, Fergus R. Learning multiagent communication with backpropagation. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016, 2252−2260
[46]
Ding Z, Huang T, Lu Z. Learning individually inferred communication for multi-agent cooperation. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 1851
[47]
Mao H, Zhang Z, Xiao Z, Gong Z, Ni Y. Learning agent communication under limited bandwidth by message pruning. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 5142−5149
[48]
Yuan L, Wang J, Zhang F, Wang C, Zhang Z, Yu Y, Zhang C. Multi-agent incentive communication via decentralized teammate modeling. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence. 2022, 9466−9474
[49]
Zhang S Q, Zhang Q, Lin J. Efficient communication in multi-agent reinforcement learning via variance based control. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 291
[50]
Zhang S Q, Zhang Q, Lin J. Succinct and robust multi-agent communication with temporal message control. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 1449
[51]
Guan C, Chen F, Yuan L, Wang C, Yin H, Zhang Z, Yu Y. Efficient multi-agent communication via self-supervised information aggregation. In: Proceedings of the 36th Conference on Neural Information Processing Systems. 2022, 1020−1033
[52]
Das A, Gervet T, Romoff J, Batra D, Parikh D, Rabbat M, Pineau J. TarMAC: Targeted multi-agent communication. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 1538−1546
[53]
Guan C, Chen F, Yuan L, Zhang Z, Yu Y. Efficient communication via self-supervised information aggregation for online and offline multi-agent reinforcement learning. 2023, arXiv preprint arXiv: 2302.09605
[54]
Yuan L, Jiang T, Li L, Chen F, Zhang Z, Yu Y. Robust multi-agent communication via multi-view message certification. 2023, arXiv preprint arXiv: 2305.13936
[55]
Yuan L, Chen F, Zhang Z, Yu Y . Communication-robust multi-agent learning by adaptable auxiliary multi-agent adversary generation. Frontiers of Computer Science, 2024, 18( 6): 186331
[56]
Gwak J, Jung J, Oh R, Park M, Rakhimov M A K, Ahn J . A review of intelligent self-driving vehicle software research. KSII Transactions on Internet and Information Systems (TIIS), 2019, 13( 11): 5299–5320
[57]
Andrychowicz O M, Baker B, Chociej M, Józefowicz R, McGrew B, Pachocki J, Petron A, Plappert M, Powell G, Ray A, Schneider J, Sidor S, Tobin J, Welinder P, Weng L L, Zaremba W . Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 2020, 39( 1): 3–20
[58]
Engelbart D C. Augmenting human intellect: a conceptual framework. Stanford Research Institute, 1962
[59]
Carter S, Nielsen M . Using artificial intelligence to augment human intelligence. Distill, 2017, 2( 12): e9
[60]
Hu H, Lerer A, Cui B, Pineda L, Brown N, Foerster J N. Off-belief learning. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 4369−4379
[61]
Treutlein J, Dennis M, Oesterheld C, Foerster J. A new formalism, method and open issues for zero-shot coordination. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 10413−10423
[62]
Li Y, Zhang S, Sun J, Du Y, Wen Y, Wang X, Pan W. Cooperative open-ended learning framework for zero-shot coordination. In: Proceedings of the 40th International Conference on Machine Learning. 2023, 844
[63]
Oliehoek F A, Amato C. A Concise Introduction to Decentralized POMDPs. Cham: Springer, 2016
[64]
Xue W, Qiu W, An B, Rabinovich Z, Obraztsova S, Yeo C K. Mis-spoke or mis-lead: Achieving robustness in multi-agent communicative reinforcement learning. In: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems. 2022, 1418−1426
[65]
Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, Sifre L, Kumaran D, Graepel T, Lillicrap T, Simonyan K, Hassabis D. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. 2017, arXiv preprint arXiv: 1712.01815
[66]
Tesauro G . TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 1994, 6( 2): 215–219
[67]
Jaderberg M, Dalibard V, Osindero S, Czarnecki W M, Donahue J, Razavi A, Vinyals O, Green T, Dunning I, Simonyan K, Fernando C, Kavukcuoglu K. Population based training of neural networks. 2017, arXiv preprint arXiv: 1711.09846
[68]
Lucas K, Allen R E. Any-play: an intrinsic augmentation for zero shot coordination. In: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems. 2022, 853–861
[69]
Mondal W U, Agarwal M, Aggarwal V, Ukkusuri S V . On the approximation of cooperative heterogeneous multi-agent reinforcement learning (MARL) using mean field control (MFC). Journal of Machine Learning Research, 2022, 23( 1): 129
[70]
Kuba J G, Feng X, Ding S, Dong H, Wang J, Yang Y. Heterogeneous-agent mirror learning: A continuum of solutions to cooperative MARL. 2022, arXiv preprint arXiv: 2208.01682
[71]
Charakorn R, Manoonpong P, Dilokthanakul N. Generating diverse cooperative agents by learning incompatible policies. In: Proceedings of the 11th International Conference on Learning Representations. 2023
[72]
Lou X, Guo J, Zhang J, Wang J, Huang K, Du Y. PECAN: leveraging policy ensemble for context-aware zero-shot human-AI coordination. In: Proceedings of the 22nd International Conference on Autonomous Agents and Multiagent Systems. 2023, 679−688
[73]
Zheng S, Trott A, Srinivasa S, Naik N, Gruesbeck M, Parkes D C, Socher R. The AI economist: improving equality and productivity with AI-driven tax policies. 2020, arXiv preprint arXiv: 2004.13332
[74]
Bäck T. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. New York: Oxford University Press, 1996
[75]
Hao H, Zhang X, Zhou A . Enhancing SAEAs with unevaluated solutions: A case study of relation model for expensive optimization. Science China Information Sciences, 2024, 67( 2): 120103
[76]
Wang Y, Xue K, Qian C. Evolutionary diversity optimization with clustering-based selection for reinforcement learning. In: Proceedings of the 10th International Conference on Learning Representations. 2022
[77]
Demšar J . Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 2006, 7: 1–30

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2020AAA0107200), the National Natural Science Foundation of China (Grant Nos. 61921006, 61876119, 62276126), the Natural Science Foundation of Jiangsu (BK20221442). We thank Lihe Li and Ziqian Zhang for their useful suggestions and discussions.

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.

RIGHTS & PERMISSIONS

2025 Higher Education Press

Supplementary files

FCS-23797-OF-CG_suppl_1 (343 KB)
