CCA: collaborative competitive agents for image editing

Tiankai HANG , Shuyang GU , Dong CHEN , Xin GENG , Baining GUO

Front. Comput. Sci., 2025, 19(11): 1911367  DOI: 10.1007/s11704-025-41244-0

Artificial Intelligence
RESEARCH ARTICLE

Abstract

This paper presents a novel generative model, Collaborative Competitive Agents (CCA), which leverages the capabilities of multiple Large Language Model (LLM) based agents to execute complex tasks. Drawing inspiration from Generative Adversarial Networks (GANs), the CCA system employs two equal-status generator agents and a discriminator agent. The generators independently process user instructions and generate results, while the discriminator evaluates the outputs and provides feedback that the generator agents use to reflect on and improve their results. Unlike previous generative models, our system exposes the intermediate steps of generation. This transparency allows each generator agent to learn from the other's successful executions, enabling a collaborative competition that enhances the quality and robustness of the system's results. The primary focus of this study is image editing, demonstrating the CCA's ability to handle intricate instructions robustly. The paper's main contributions include the introduction of a multi-agent-based generative model with controllable intermediate steps and iterative optimization, a detailed examination of agent relationships, and comprehensive experiments on image editing.

Keywords

image editing / agents / collaborative and competitive

Cite this article

Tiankai HANG, Shuyang GU, Dong CHEN, Xin GENG, Baining GUO. CCA: collaborative competitive agents for image editing. Front. Comput. Sci., 2025, 19(11): 1911367. DOI: 10.1007/s11704-025-41244-0

1 Introduction

The human endeavor to conceptualize Artificial Intelligence (AI) is fundamentally rooted in the aspiration to engineer intelligent entities. In the contemporary era, this pursuit has been significantly propelled by the evolution and advancement of Large Language Models (LLMs) [1–5]. The rapidly developing LLM-based agents [6–8] have outpaced their predecessors, demonstrating a higher degree of intelligence through a more sophisticated understanding of human intentions and a greater competency in assisting with complex tasks. This progression has instigated a technological revolution across a multitude of fields, encompassing software development [7], education [8,9], and sociology [10], among others.

When examining the realm of generative models, specifically Generative Adversarial Networks (GANs) [11–14] and diffusion models [15–22], we encounter two notable challenges. The first challenge is the models’ limited ability to process complex, compound tasks. To illustrate, consider a task that involves “colorizing an old photograph, replacing the depicted individual with the user’s image, and adding a hoe in the user’s hand”. Such a multifaceted task surpasses the capability of even the most advanced generative models. The second challenge arises in the update process of a generated result. This process is contingent upon the preservation of the compute graph. However, the sheer volume of results generated by diverse algorithms makes maintaining this compute graph a significant hurdle. Consequently, this creates a barrier to learning from other generative models, given their black-box nature.

In this paper, we introduce a novel generative model that harnesses the capabilities of multiple LLM-based agents, which effectively circumvents these two challenges. Leveraging the agents’ powerful task decomposition abilities, our model can efficiently manage highly complex tasks. Simultaneously, during the generation process, we can extract insights into how the agents comprehend, dissect, and execute the task, enabling us to modify internal steps and enhance the results. Crucially, the model’s transparency allows the agent to learn from successful executions by other agents, moving away from the black-box model paradigm. We underscore that this transparency is a pivotal factor contributing to the enhanced quality and robustness of the system.

Generative Adversarial Networks (GANs) [11] can be viewed as an early endeavor to incorporate a multi-agent system into generative models. GANs utilize two agents, namely, a generator and a discriminator. A cleverly designed optimization function allows these agents to learn from their adversarial interaction, ideally reaching a Nash equilibrium. Similarly, in our multi-agent system, we have discovered that the establishment of relationships between different agents is a critical determinant of success.

Drawing inspiration from GANs, our system employs two generators and one discriminator. The two generator agents, of equal status, independently process user instructions and generate results. The discriminator agent then evaluates these generated results, providing feedback to each generator and determining which result is superior. The generator agents have dual responsibilities. Firstly, they must reflect on the feedback from the discriminator. Secondly, they should consider the results produced by the other generator agent to enhance their generation process. This process is iteratively carried out until the discriminator deems the best result to have sufficiently met the user’s requirements. We underscore that through this collaborative competition, the two generators can continuously augment the quality and robustness of the system’s results. Consequently, we have named our system Collaborative Competitive Agents (CCA).

In this paper, we concentrate on image editing, although our CCA system is a versatile generative model. Conventional image editing methods [23–25] fall short when dealing with intricate instructions, resulting in less robust outcomes. Our proposed generative model can considerably enhance this situation through the collaborative competition of multiple agents.

In summary, our primary contributions are as follows:

(1) We introduce a new generative model based on multiple agents, which features controllable intermediate steps and can be iteratively optimized.

(2) We have meticulously examined the relationships among multiple agents, highlighting that reflection, cooperation, and competition are integral to the system’s quality and robustness.

(3) We have conducted comprehensive experiments on image editing, demonstrating for the first time the ability to robustly handle complex instructions.

2 Related work

2.1 Large language model-based agents

Agents are artificial entities capable of perceiving the environment, making decisions, and taking actions to accomplish specific goals [26–29]. Recent advancements in Large Language Models (LLMs) have demonstrated significant intelligence [30,31], offering promising avenues for the evolution of intelligent agents. LLM-based agents possess the ability to memorize, plan, and utilize tools. The “memory” feature allows these agents to store sequences of past observations, thoughts, and actions for future retrieval. Chain-of-Thought (CoT) prompting [32] enhances the LLM’s capacity to solve complex tasks by “thinking step by step”. Moreover, agents employ a reflection mechanism [33–35] to enhance their planning abilities. Furthermore, LLM-based agents can leverage various tools to interact with their environment [2,30,36], such as shopping [6] and coding [7]. Some studies [37] equip these agents with embodied actions to facilitate interaction with humans.

Similar to human society, a single skilled agent can handle specific tasks, while a multi-agent system can tackle more complex ones. To foster autonomous cooperation among agents, CAMEL [38] introduces a novel communicative agent framework called “role-playing”, which incorporates inception prompting. AgentVerse [39] introduces a multi-task-tested framework for group agent collaboration, designed to assemble a team of agents that can dynamically adjust to the intricacy of the task at hand. ChatDev [7] demonstrates significant potential in software development by integrating and unifying key processes through natural language communication. Concurrently, ChatEval [40] employs multiple agents as a referee team; these agents debate among themselves and ultimately determine a measure of the quality of the LLM generation. Pure collaboration means that agents work together, completing their respective parts to achieve a common goal. Competition, on the other hand, implies rivalry, where each agent’s objective is to pursue its own success, making its own plans and decisions based on feedback from both itself and other agents. Therefore, this is not a situation of complete cooperation. We refer to this type of competition as collaborative competition. In this paper, we explore a scenario where multiple agents engage in collaborative competition to achieve goals.

2.2 Image editing

Image editing has been a long-standing vision task and is widely used in real-world applications [23,24,41–43]. The primary objective of this task is to manipulate an image to align with user-specified requirements. Traditional methods mainly address specific tasks like style transfer [44–47], image translation [48,49], and object removal or replacement [50–53]. Later works [41,54] utilize text-to-image models to perform edits from user-provided text instructions.

From the generative model perspective, many early studies leveraged the outstanding disentangled properties in the latent space of GANs [13,14], altering image attributes via latent code manipulation [55–57]. Some studies [58] use CLIP [59] to facilitate text-driven image editing. Recent advancements in diffusion models [15–18,60] have demonstrated great success in image generation [61–64], and these pre-trained diffusion models are widely used in image editing. Meng et al. [25] proposed reversing and perturbing the SDE to conduct image manipulation, while EDICT [65] enhanced editing quality through more precise inversion. Prompt-to-Prompt [24] investigated the role of words in attention and edited images by modifying the cross-attention. Null-text Inversion [43] goes further by optimizing tokens to avoid artifacts from classifier-free guidance. DreamBooth [66] fine-tuned the pre-trained text-to-image diffusion model to perform subject-driven generation. More recent work, VisProg [67], leverages the in-context learning capabilities of LLMs to perform image editing; however, it relies heavily on the given examples when choosing the most suitable tool. In contrast, our proposed CCA is a multi-agent system that not only includes tools for planning and execution but also iteratively improves the outcomes based on feedback through collaboration and competition.

3 Method

Our framework, identified as CCA, incorporates two distinct types of agents: the Generator Agent and the Discriminator Agent.

3.1 Generator agent

The generator agent edits the image through the utilization of two core modules: the Planner, which is engaged in deciphering the user’s request and making plans, and the Tool Executor, responsible for systematically modifying the image in a step-by-step manner.

3.1.1 Planner

Existing image editing models often grapple with effectively managing complex user requirements. To address this, we employ a Planner agent, denoted as P, to decompose these requirements into several straightforward and clear subtasks. Typically, user requirements encompass an input image I and an associated editing goal G. To enable the large language model to comprehend the image, we utilize LLaVA-1.5 [68] to obtain the image caption C, which serves as a preliminary understanding of the input image. For simplicity, we denote these three elements collectively as the user requirements R.

The Planner agent P takes the requirements R as input, decomposing them into a sequence of subtasks, represented as $\{s_j\}_{j=1}^{n}$:

$\{s_1, s_2, \ldots, s_n\} = P(R, T).$

Each subtask $s_j$ contains a clear goal to be achieved and a selected tool from the toolset T to accomplish this goal. The toolset outlines which tools are available to the agent and their respective functions, which will be detailed further in Section 4.1.

For instance, the user request “Make the background a county fair and have the man a cowboy hat, 512pix” can be divided into several sequential steps: 1. Subtask $s_1$ involves loading and resizing the image to a resolution of 512 pixels using the Resize tool; 2. Subtask $s_2$ requires changing the background to a county fair using the editing tool EDICT [65]; 3. Lastly, subtask $s_3$ requires adding a cowboy hat to the man using the InstructDiffusion [41] tool.
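The plan can therefore be viewed as an ordered list of goal and tool pairs. The following is a minimal Python sketch of such a structure for the example above; the class and field names (Subtask, Plan, goal, tool) are illustrative assumptions rather than the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Subtask:
    goal: str   # natural-language sub-goal to achieve
    tool: str   # name of the tool selected from the toolset T

@dataclass
class Plan:
    subtasks: List[Subtask]

# Plan corresponding to the example request above.
example_plan = Plan(subtasks=[
    Subtask(goal="Load the image and resize its longer side to 512 pixels", tool="Resize"),
    Subtask(goal="Change the background to a county fair", tool="EDICT"),
    Subtask(goal="Add a cowboy hat to the man", tool="InstructDiffusion"),
])
```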

It’s plausible that multiple tools could handle the same sub-task, but each tool may have distinct advantages in different scenarios. Consequently, generating an optimal plan on the first attempt can be challenging. In response to this, we have introduced a strategy that enhances the planning process through multiple rounds of reflection. The reflection mechanism is designed to incrementally improve the plan to meet the editing requirements by utilizing feedback. The feedback F assesses the success of achieving a sub-goal with the chosen tool and determines whether modifications to the sub-goal are necessary. The feedback is generated by the discriminator agent, which will be further discussed in Section 3.2.1.

In summary, beyond its primary function, the Planner agent also serves to reflect upon and enhance plans based on the preceding plan:

$S^{m} = P(R, T, S^{m-1}, F^{m-1}),$

where the superscript m denotes the current round of planning. In the initial round, when m equals 1, neither feedback nor a previous plan is available; thus both $S^{0}$ and $F^{0}$ are set to $\varnothing$ (empty).
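A minimal sketch of this multi-round planning call, assuming a prompt-based planner; the call_llm helper and the prompt wording are hypothetical stand-ins rather than the paper's actual prompts.

```python
from typing import Optional

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the underlying LLM (e.g., GPT-4)."""
    raise NotImplementedError

def plan(requirements: str, tool_descriptions: str,
         prev_plan: Optional[str] = None, prev_feedback: Optional[str] = None) -> str:
    """S^m = P(R, T, S^{m-1}, F^{m-1}); S^0 and F^0 are empty in round 1."""
    prompt = (
        f"User requirements:\n{requirements}\n\n"
        f"Available tools (descriptions only):\n{tool_descriptions}\n"
    )
    if prev_plan and prev_feedback:
        prompt += (
            f"\nPrevious plan:\n{prev_plan}\n"
            f"Feedback on the previous result:\n{prev_feedback}\n"
            "Revise the plan to better satisfy the requirements.\n"
        )
    prompt += "Decompose the request into ordered subtasks, each with a selected tool."
    return call_llm(prompt)
```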

3.1.2 Executor

When the Planner agent generates a detailed plan that specifies which tool should be employed for each task, we engage another agent, the Executor E, to use the corresponding tools to sequentially execute the plan $\{s_1, s_2, \ldots, s_n\}$. For each individual subtask $s_j$, the Executor should meticulously explore how to optimally leverage the tool to accomplish it.

For its initial run, the Executor should carefully review the tool’s detailed instructions and appropriately format the input to engage the tool. In subsequent runs, the Executor may receive feedback on previous execution results; it should then adjust the hyperparameters according to these previous results to enhance future outcomes. The entire process can be formulated as follows:

$o_{j+1} = E(s_j, o_j, f_j), \quad j = 1, 2, \ldots, n.$

In this process, $o_j$ and $f_j$ represent the previously generated results and feedback, respectively. The system output is defined as $O = o_{n+1}$.

For instance, during the initial run, the Executor may not effectively “add a hat” due to the use of inappropriate classifier-free guidance. In response to the feedback signal “the hat has not been added”, the Executor may enhance the classifier-free guidance to improve performance.
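The following sketch illustrates this execution loop. In the paper the Executor is itself an LLM agent that reads the tool manual; here a simple hard-coded heuristic (the adjust_parameters function and the txt_cfg parameter name) stands in for that behavior, so everything in the block should be read as an illustrative assumption.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical registry mapping tool names to callables that edit an image.
TOOLS: Dict[str, Callable[..., Any]] = {}

def adjust_parameters(params: Dict[str, Any], feedback: Optional[str]) -> Dict[str, Any]:
    """Illustrative heuristic: strengthen text guidance if the edit was not applied."""
    if feedback and "has not been added" in feedback:
        params["txt_cfg"] = params.get("txt_cfg", 4.0) + 2.0
    return params

def execute(plan: List[Dict[str, Any]], image: Any,
            feedback_per_subtask: Optional[List[str]] = None) -> Any:
    """o_{j+1} = E(s_j, o_j, f_j): run subtasks sequentially, refining parameters."""
    output = image
    for j, subtask in enumerate(plan):
        tool = TOOLS[subtask["tool"]]
        params = subtask.get("params", {})
        fb = feedback_per_subtask[j] if feedback_per_subtask else None
        params = adjust_parameters(params, fb)
        output = tool(output, subtask["goal"], **params)
    return output
```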

3.2 Discriminator agent

Evaluating the results is a crucial step towards their improvement. Hence, we employ a Discriminator Agent which serves a dual purpose. Firstly, it is responsible for assessing the quality of the edited images and providing valuable feedback that contributes to the enhancement of these results. Secondly, it is tasked with selecting the best results to present to the user.

3.2.1 Generate feedback

Given the caption C of the input image and the user request R, we can design several questions to assess whether the generated image meets the stipulated requirements. The questions fall into two parts: one assessing the specific request items and one assessing the overall editing quality. The first part verifies whether the edits match the user requirements, typically prompting a binary “Yes/No” answer and an explanation. The second part rates the overall quality, considering naturalness and aesthetics.

The agent sequentially answers these evaluative questions. For questions concerning specific editing items, the agent enlists help from LLMs with visual question-answering capabilities, such as LLaVA [68] or GPT-4V [2]. These models take the edited image and question as input and output the answer. As for the comprehensive quality assessment, we not only rely on these types of LLMs, but we also incorporate the Aesthetic Predictor [69] to evaluate the naturalness and visual appeal of the output.

We compile the responses into a succinct feedback report. The feedback offers clear, actionable directions for editing enhancement. The entire process can be formalized as follows:

$F = \mathrm{FB}(O, R),$

where $\mathrm{FB}(\cdot)$ is the agent that generates the feedback F, and O denotes the output from the generator.

To enhance the generator’s ability to reflect and improve through feedback, we undertake two measures. Firstly, we transmit the feedback from both generator agents back to their respective origins, enabling them to learn from each other’s successful strategies or avoid redundant exploration. Secondly, we dissect the overall feedback for each subtask $s_j$. This process enables the planner to more effectively discern the appropriateness of the goal and tool for each subtask. Additionally, the specific feedback guides the executor in fine-tuning hyperparameters, even with unchanged goals.

Consider the first generator agent in Fig.1. In the first round, the feedback is “A rainbow is visible in the sky, there are no sunflowers in the field, and there is no indication of a wooden barn in the image. Unpleasant visual artifacts are present in the photo”. The decomposed feedback for each subtask is: 1. The rainbow is successfully added; 2. There are no sunflowers in the field, suggest changing the tool to GroundingDINOInpainting; 3. There is no wooden barn in the image.
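A sketch of the feedback step under these definitions. The vqa and aesthetic_score helpers are hypothetical stand-ins for LLaVA-1.5/GPT-4V and the Aesthetic Predictor, and the question wording is illustrative.

```python
from typing import Any, List, Tuple

def vqa(image: Any, question: str) -> str:
    """Hypothetical stand-in for a VQA-capable model such as LLaVA-1.5 or GPT-4V."""
    raise NotImplementedError

def aesthetic_score(image: Any) -> float:
    """Hypothetical stand-in for the Aesthetic Predictor."""
    raise NotImplementedError

def generate_feedback(edited_image: Any, yes_no_questions: List[str]) -> str:
    """Answer per-item Yes/No questions, assess overall quality, and compile a report."""
    item_reports: List[Tuple[str, str]] = []
    for q in yes_no_questions:
        item_reports.append((q, vqa(edited_image, q)))
    overall = vqa(edited_image, "Does the image look natural, without visual artifacts?")
    score = aesthetic_score(edited_image)
    lines = [f"- {q} -> {a}" for q, a in item_reports]
    lines.append(f"- Overall naturalness: {overall}; aesthetic score: {score:.2f}")
    return "\n".join(lines)
```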

3.2.2 Quality competitor

In order to achieve results of higher quality and robustness, we leverage two generator agents to generate results and engage a Quality Competitor agent QC to choose the superior one. For each edited image O, the competitor agent compares the corresponding feedback F and selects the one that best aligns with the user’s request.

In addition to the competition between the results generated by these two agents, competition can also occur across different rounds. Specifically, the agent maintains a memory bank M to update the current best result $I_{\text{best}}$. This process can be formalized as:

$I_{\text{best}}, F_{\text{best}} = QC(\{O, F\}, M),$

$M = \{M, (I_{\text{best}}, F_{\text{best}})\}.$

In this formula, $QC(\cdot)$ represents the quality competitor function that determines the best result, and $\{O, F\}$ is the set of the different generator agents’ outputs and feedback at the current run.

The competitor agent also plays a crucial role in deciding when to terminate the process. In each round, the agent checks if the current best result sufficiently meets the user’s requirements. It is not necessary to proceed through all the rounds if the image quality is already deemed satisfactory. In this way, the Quality Competitor plays an important role in obtaining the best result and facilitating early termination.
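A sketch of the Quality Competitor with its memory bank and early-stopping check; the judge and is_satisfactory helpers are hypothetical stand-ins for the LLM-based comparisons described above.

```python
from typing import Any, List, Tuple

MemoryEntry = Tuple[Any, str]   # (image, feedback)

def judge(candidates: List[MemoryEntry], requirements: str) -> int:
    """Hypothetical LLM-based judgment: return the index of the best candidate."""
    raise NotImplementedError

def is_satisfactory(feedback: str, requirements: str) -> bool:
    """Hypothetical LLM-based check of whether the best result already meets the request."""
    raise NotImplementedError

def quality_compete(outputs: List[MemoryEntry], memory: List[MemoryEntry],
                    requirements: str) -> Tuple[MemoryEntry, List[MemoryEntry], bool]:
    """Select the best result among current outputs and the memory bank, then update M."""
    candidates = outputs + memory
    best = candidates[judge(candidates, requirements)]
    memory = memory + [best]                  # M = {M, (I_best, F_best)}
    stop = is_satisfactory(best[1], requirements)
    return best, memory, stop
```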

3.3 Collaborative competitive agents

Our whole system contains two generative agents and one discriminator agent. The discriminator agent selects the best result for the user and provides feedback to the generative agents. Generative agents refine their strategies using both self-feedback and insights from peers. For example, suppose the first agent’s plan is to “perform subtask A first and then subtask B”, while the second agent reverses the order of the subtasks. If the discriminator agent recognizes that the second plan yields a better result, the first agent may learn from this feedback and change the order for better performance. This demonstrates the agents’ cooperation via shared feedback. In Section 4.3.2, we demonstrate that such collaboration will make the system more robust and achieve better results.

In addition to cooperation, the discriminator will also promote competition between the two generative agents. During the initial stage of plan generation, different agents generate various outcomes due to randomness. These variations result in distinct edited results and subsequent feedback. Agents that produce poor results use the feedback to improve, striving to outperform their counterparts. The discriminator agent benefits from this competitive mechanism as well. By learning from the edited images and feedback generated by various agents, the discriminator agent can provide more refined feedback and suggestions to the generator agent. If there is no discernible difference between the edited results, the discriminator agent will suggest selecting an alternative, more suitable tool to accomplish the sub-goal. The whole algorithm of the Collaborative Competitive Agents is shown in Algorithm 1.
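The following sketch outlines the overall loop in the spirit of Algorithm 1, with the generators, the feedback function, and the competitor passed in as callables. It is a structural sketch under the assumptions made above, not the paper's actual Algorithm 1.

```python
from typing import Any, Callable, List, Optional, Tuple

def cca_loop(image: Any, requirements: str,
             generators: List[Callable[[Any, str, Optional[str]], Tuple[Any, Any]]],
             feedback_fn: Callable[[Any, str], str],
             compete_fn: Callable[[List[Tuple[Any, str]], List[Any], str],
                                  Tuple[Any, List[Any], bool]],
             max_rounds: int = 5) -> Any:
    """Generators propose edits, the discriminator gives feedback,
    selects the best result, and decides whether to stop early."""
    memory: List[Any] = []
    shared_feedback: Optional[str] = None
    best = None
    for _ in range(max_rounds):
        outputs: List[Tuple[Any, str]] = []
        for generate in generators:
            edited, _plan = generate(image, requirements, shared_feedback)
            outputs.append((edited, feedback_fn(edited, requirements)))
        # Collaboration: both generators see both feedback reports in the next round.
        shared_feedback = "\n\n".join(fb for _, fb in outputs)
        best, memory, stop = compete_fn(outputs, memory, requirements)
        if stop:
            break
    return best
```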

3.4 Hierarchical tool configuration

It is a huge challenge for agents, especially generator agents, to understand and accurately utilize various tools. Thus we propose a hierarchical tool configuration. Each tool comprises a tool name, a description, and a user manual. The description succinctly articulates the tool’s functionality, typically in one or several sentences. The manual offers detailed usage instructions, parameter effects, and input/output specifications.

Given the tool diversity, the manuals are detailed. The Planner agent only reads the descriptions of all the tools, decomposes the user request into sub-goals, and chooses an appropriate tool for each sub-goal. The Tool Executor agent takes the user manual of the corresponding tool as input and designs the necessary parameters to use the tool.
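A minimal sketch of this two-level configuration; the ToolSpec fields and helper names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class ToolSpec:
    name: str
    description: str   # one or two sentences; all descriptions go to the Planner
    manual: str        # detailed usage, parameters, and I/O spec; only the selected
                       # tool's manual goes to the Executor

def planner_view(toolset: Dict[str, ToolSpec]) -> str:
    """Coarse level: names and short descriptions of every tool."""
    return "\n".join(f"{t.name}: {t.description}" for t in toolset.values())

def executor_view(toolset: Dict[str, ToolSpec], selected: str) -> str:
    """Fine level: the full manual of the single tool chosen for a subtask."""
    return toolset[selected].manual
```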

4 Experiments

In this section, we initially explore the challenges inherent in the construction of such a system, providing an in-depth analysis of several key components in the process. Subsequently, we perform an ablation study to evaluate the efficacy of our individual components. Lastly, we compare our method with other image editing techniques to highlight the advantages of our approach.

4.1 Implementation details

We aim to build an automatic system to complete image editing tasks. The two most important parts are what kind of “brain” is used to think about the problem, and what tools are used to complete the task. For the Planner and the feedback generation, we leverage GPT-4 to develop plans and to generate feedback and suggestions. For the Tool Executor and Quality Competitor, we adopt GPT-3.5-turbo for its speed.

The type and quality of the toolset directly determine the complexity of tasks that can be accomplished and the quality of their completion. We furnish our generator agents with a diverse set of 20 tools, which fall broadly into several categories: Image Preprocessing, Localization, Understanding, Conditional Generation, and General Editing. For the discriminator agent, we deploy the state-of-the-art, open-source, large multi-modal model (LMM), LLaVA-1.5 [68], which is designed to understand and evaluate the quality of the edited images. LLaVA-1.5 excels in detailed captioning and VQA. Additionally, we also utilize an Aesthetic Predictor [69] to evaluate the overall perceptual quality of the results. Further details are provided in the supplementary material.
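For reference, the role-to-model assignment described above can be summarized as a configuration; the dictionary keys and structure below are an illustrative assumption, not the paper's actual configuration format.

```python
# Illustrative role-to-model assignment as described in the text.
CCA_CONFIG = {
    "planner_llm": "gpt-4",              # planning, feedback, and suggestions
    "feedback_llm": "gpt-4",
    "executor_llm": "gpt-3.5-turbo",     # faster model for tool execution
    "competitor_llm": "gpt-3.5-turbo",
    "vision_evaluator": "llava-1.5",     # VQA over edited images
    "aesthetic_model": "aesthetic-predictor",
    "max_rounds": 5,
    "num_tools": 20,
}
```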

4.2 Step-by-step to build the framework

4.2.1 Yes/No questions rather than what

Feedback plays a pivotal role in enhancing the quality of editing. To glean information from the edited image, we design several questions and employ LLaVA [68] to answer these questions based on the edited image. Question design is key to eliciting precise feedback.

Initially, we attempted to ask questions like “What about the results generated in the figure?”. However, we found that it was challenging for LLaVA to assess the extent of the edited item, and it was also difficult to generate a suggestion according to the vague answer. Fig.2 showcases that ambiguous questions may lead to confusion about whether the object has been successfully edited. In contrast, “Yes/No” questions tend to be answered with greater accuracy. Consequently, we modified our approach to pose “Yes/No” questions according to the user requirements, which yielded more precise feedback and better suggestions.
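To make the contrast concrete, here is a sketch of how per-item Yes/No questions could be generated from the user request with a prompt; the call_llm helper and the prompt wording are hypothetical.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the question-writing LLM."""
    raise NotImplementedError

def make_yes_no_questions(user_request: str, caption: str) -> str:
    prompt = (
        f"Original image caption: {caption}\n"
        f"Editing request: {user_request}\n"
        "For each editing item in the request, write one question that can be answered "
        "with 'Yes' or 'No' followed by a short explanation, e.g. "
        "'Is there a cowboy hat on the man? Answer Yes or No and explain.'"
    )
    return call_llm(prompt)
```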

4.2.2 Tool diversity

While several studies [23,41] claim the ability to manage diverse types of editing tasks, such as object addition, removal, and replacement, these capabilities fall short of addressing the varied needs of practical applications. Conversely, even when multiple tools are employed for the same task, each may possess its own unique strength, potentially leading to synergistic enhancements. To illustrate this, we choose InstructDiffusion [41] as our baseline method, which can handle a wide range of image editing tasks following user instructions through the diffusion process. Meanwhile, we also incorporate EDICT [65] and GroundingDINO+Inpainting [19,70] to expand our toolset. As depicted in Fig.3, exclusive reliance on a single tool yields subpar results. In this example, the objective is to transform the house into a wooden one and replace the front stone with flowers. The single-tool setting with InstructDiffusion failed to locate the front stone, whereas the multi-tool setting, aided by GroundingDINO’s detection ability, achieved a more reasonable result.

4.2.3 Stopping criteria

In most practical scenarios, we have observed that setting the maximum round, M, to 5 is sufficient to meet user requirements. Yet, task complexity can greatly change the needed round count. From a resource conservation standpoint, it is crucial for the quality competitor to determine the appropriate moment to terminate the process. We benchmarked this with 20 user requirements, tracking the tool calls by generator agents.

We observed that without the implementation of stopping criteria, the generator agent necessitated an average of 20 tool calls over 5 rounds. However, when we employed the quality competitor’s judgment to determine early stoppage, the average number of tool calls reduced to 12 over approximately 3 rounds. Interestingly, we also found that the additional rounds and tool calls did not significantly enhance performance. This might be due to the feedback at this stage not providing explicit guidance on improving the results.

4.3 Ablation study

4.3.1 Coarse-to-fine tool usage

Given that current large language models such as GPT-3.5-turbo/GPT-4 [2,3] possess a finite context length, it is challenging for the planner to directly select proper parameters for calling tools while formulating the plan. We tackle this with a coarse-to-fine tool usage strategy. The planner takes the tool descriptions as input and selects tools, while the executor interprets each selected tool’s specific instructions. We find that a one-step plan carries an over 20% risk of producing an incorrect format. However, this risk diminishes to less than 10% when employing the hierarchical setting.

Taking the user request “Enrich wooden frames to the photo and adjust the longer side to 512” as an example, we compare different tool usages in Fig.4. Analysis reveals that both plans appear fitting, but InstructDiffusion in the first plan cannot manage size changes directly. In comparison, the hierarchical tool usage breaks the task into subtasks, producing a logical plan. This highlights the first approach’s inadequate understanding of tool usage.

Furthermore, we carried out an experiment to verify whether Tool Executor can adjust parameters under the hierarchical tool usage. For a clearer observation, the toolset is restricted to a single tool, InstructDiffusion. We observed that as the number of rounds increases, the Tool Executor gradually increases one of the key parameters txt-cfg (text classifier-free guidance), as depicted in Fig.5. It’s apparent that the initial value (txt-cfg=4) does not fulfill the requirements and it progressively grows to txt-cfg=8. This observation further underscores the pivotal role that feedback can play in hierarchical tool usage.

4.3.2 Effect of collaboration and competition

Our system relies on collaboration for efficient, high-quality planning, and on competition to enhance generated results and boost system robustness. We tested this across four settings on 100 instances: a) our model without both collaboration and competition, b) our model without collaboration, c) our model without competition, and d) our full model. We compare the first three settings against the final setting, prompting users to determine which results they perceived as superior. The results are demonstrated in Fig.6. Both strategies contribute to improved outcomes, with the best result achieved by incorporating both elements.

4.4 Comparison

We benchmark our CCA against techniques like InstructPix2Pix [23], MagicBrush [54], InstructDiffusion [41], and VisProg [67]. The comparison results, as depicted in Fig.7, indicate that all previous works struggle to manage such complex cases. Despite VisProg’s ability to dissect intricate tasks, it fails to execute edits in alignment with user requirements. This underscores the value of using agents cooperatively and competitively. Additionally, we also perform a quantitative comparison based on human preferences, with the results included in the supplementary material.

4.5 Discussion

Our goal is to propose a universal agent-based framework to address complex user requirements rather than focusing on a particular task. Image editing is our first significant achievement on the path toward this goal. The CCA framework can also be applied to other tasks such as text-to-image generation; we leave this application to the supplementary material. Within our proposed CCA framework, GPT-4 plays a significant role in planning and reflection. However, our method also performs well when utilizing GPT-3.5-turbo as an alternative, demonstrating that our method is robust to the choice of LLMs. Related results will also be included in the supplementary material.

5 Conclusion

This paper introduces a novel generative framework, Collaborative Competitive Agents (CCA), that employs multiple LLM-based agents to tackle intricate image editing challenges in practice. The key strength of the CCA system lies in its capacity to decompose complex tasks using multiple agents, resulting in a transparent generation process. This enables agents to learn from each other, fostering collaboration and competition to fulfill user requirements. The study’s primary contributions entail the proposition of a new multi-agent-based generative model, an examination of the relationships between multiple agents, and extensive experimentation in the area of image editing. This work represents a step forward in AI research, with the potential to influence various fields.

References

[1] OpenAI. Gpt-4v(ision) system card. See api.semanticscholar.org/CorpusID:263218031 website, 2023
[2] OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, et al. Gpt-4 technical report. 2023, arXiv preprint arXiv: 2303.08774
[3] Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C L, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano P, Leike J, Lowe R. Training language models to follow instructions with human feedback. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 2011
[4] Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M A, Lacroix T, Rozière B, Goyal N, Hambro E, Azhar F, Rodriguez A, Joulin A, Grave E, Lample G. LLaMA: open and efficient foundation language models. 2023, arXiv preprint arXiv: 2302.13971
[5] Touvron H, Martin L, Stone K, Albert P, Almahairi A, et al. Llama 2: open foundation and fine-tuned chat models. 2023, arXiv preprint arXiv: 2307.09288
[6] Yao S, Chen H, Yang J, Narasimhan K. WebShop: towards scalable real-world web interaction with grounded language agents. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1508
[7] Qian C, Cong X, Liu W, Yang C, Chen W, Su Y, Dang Y, Li J, Xu J, Li D, Liu Z, Sun M. Communicative agents for software development. 2023, arXiv preprint arXiv: 2307.07924
[8] Swan M, Kido T, Roland E, dos Santos R P. Math agents: computational infrastructure, mathematical embedding, and genomics. 2023, arXiv preprint arXiv: 2307.02502
[9] Kalvakurthi V, Varde A S, Jenq J. Hey Dona! Can you help me with student course registration? 2023, arXiv preprint arXiv: 2303.13548
[10] Park J S, O’Brien J, Cai C J, Morris M R, Liang P, Bernstein M S. Generative agents: interactive simulacra of human behavior. In: Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 2023, 2
[11] Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. 2014, 2672–2680
[12] Brock A, Donahue J, Simonyan K. Large scale GAN training for high fidelity natural image synthesis. In: Proceedings of the 7th International Conference on Learning Representations. 2019
[13] Karras T, Laine S, Aila T. A style-based generator architecture for generative adversarial networks. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 4401–4410
[14] Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J, Aila T. Analyzing and improving the image quality of StyleGAN. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020
[15] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020, 574
[16] Song Y, Sohl-Dickstein J, Kingma D P, Kumar A, Ermon S, Poole B. Score-based generative modeling through stochastic differential equations. In: Proceedings of the 9th International Conference on Learning Representations. 2021
[17] Dhariwal P, Nichol A. Diffusion models beat GANs on image synthesis. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. 2021, 672
[18] Karras T, Aittala M, Aila T, Laine S. Elucidating the design space of diffusion-based generative models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1926
[19] Podell D, English Z, Lacey K, Blattmann A, Dockhorn T, Müller J, Penna J, Rombach R. SDXL: improving latent diffusion models for high-resolution image synthesis. In: Proceedings of the 12th International Conference on Learning Representations. 2024
[20] Nichol A Q, Dhariwal P. Improved denoising diffusion probabilistic models. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 8162–8171
[21] Hang T, Gu S, Geng X, Guo B. Improved noise schedule for diffusion training. 2024, arXiv preprint arXiv: 2407.03297
[22] Wang T, Yang Q, Wang R, Sun D, Li J, Chen Y, Hu Y, Yang C, Kimura T, Kara D, Abdelzaher T F. Fine-grained control of generative data augmentation in IoT sensing. In: Proceedings of the 38th Annual Conference on Neural Information Processing Systems. 2024
[23] Brooks T, Holynski A, Efros A A. InstructPix2Pix: learning to follow image editing instructions. In: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 18392–18402
[24] Hertz A, Mokady R, Tenenbaum J, Aberman K, Pritch Y, Cohen-Or D. Prompt-to-prompt image editing with cross-attention control. In: Proceedings of the 11th International Conference on Learning Representations. 2023
[25] Meng C, He Y, Song Y, Song J, Wu J, Zhu J Y, Ermon S. SDEdit: guided image synthesis and editing with stochastic differential equations. In: Proceedings of the 10th International Conference on Learning Representations. 2022
[26] Sutton R S, Barto A G. Reinforcement Learning: An Introduction. 2nd ed. Cambridge: MIT Press, 2018
[27] Xi Z, Chen W, Guo X, He W, Ding Y, et al. The rise and potential of large language model based agents: a survey. Science China Information Sciences, 2025, 68(2): 121101
[28] Weng L. LLM powered autonomous agents. See Lilianweng.github.io website, 2023
[29] Deng Q, Yang Q, Yuan R, Huang Y, Wang Y, Liu X, Tian Z, Pan J, Zhang G, Lin H, Li Y, Ma Y, Fu J, Lin C, Benetos E, Wang W, Xia G, Xue W, Guo Y. ComposerX: multi-agent symbolic music composition with LLMs. In: Proceedings of the 25th International Society for Music Information Retrieval Conference. 2024
[30] Schick T, Dwivedi-Yu J, Dessì R, Raileanu R, Lomeli M, Hambro E, Zettlemoyer L, Cancedda N, Scialom T. Toolformer: language models can teach themselves to use tools. In: Proceedings of the 36th Annual Conference on Neural Information Processing Systems. 2023
[31] Wu Y, Yang X. A glance at in-context learning. Frontiers of Computer Science, 2024, 18(5): 185347
[32] Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi E H, Le Q V, Zhou D. Chain-of-thought prompting elicits reasoning in large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1800
[33] Yao S, Zhao J, Yu D, Du N, Shafran I, Narasimhan K R, Cao Y. ReAct: synergizing reasoning and acting in language models. In: Proceedings of the 11th International Conference on Learning Representations. 2023
[34] Madaan A, Tandon N, Gupta P, Hallinan S, Gao L, Wiegreffe S, Alon U, Dziri N, Prabhumoye S, Yang Y, Gupta S, Majumder B P, Hermann K, Welleck S, Yazdanbakhsh A, Clark P. SELF-REFINE: iterative refinement with self-feedback. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2019
[35] Yang Z, Wang J, Li L, Lin K, Lin C C, Liu Z, Wang L. Idea2img: iterative self-refinement with GPT-4V for automatic image design and generation. In: Proceedings of the 18th European Conference on Computer Vision. 2025
[36] Shen Y, Song K, Tan X, Li D, Lu W, Zhuang Y. HuggingGPT: solving AI tasks with ChatGPT and its friends in hugging face. In: Proceedings of the 37th Annual Conference on Neural Information Processing Systems. 2023
[37] Driess D, Xia F, Sajjadi M S M, Lynch C, Chowdhery A, Ichter B, Wahid A, Tompson J, Vuong Q, Yu T, Huang W, Chebotar Y, Sermanet P, Duckworth D, Levine S, Vanhoucke V, Hausman K, Toussaint M, Greff K, Zeng A, Mordatch I, Florence P. PaLM-E: an embodied multimodal language model. In: Proceedings of the 40th International Conference on Machine Learning. 2023, 340
[38] Li G, Hammoud H A A K, Itani H, Khizbullin D, Ghanem B. CAMEL: communicative agents for “mind” exploration of large language model society. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023, 2264
[39] Chen W, Su Y, Zuo J, Yang C, Yuan C, Chan C M, Qin Y, Lu Y, Hung Y H, Qian C, Qin Y, Cong X, Xie R, Liu Z, Sun M, Zhou J. AgentVerse: facilitating multi-agent collaboration and exploring emergent behaviors. 2023, arXiv preprint arXiv: 2308.10848
[40] Chan C M, Chen W, Su Y, Yu J, Xue W, Zhang S, Fu J, Liu Z. ChatEval: towards better LLM-based evaluators through multi-agent debate. In: Proceedings of the 12th International Conference on Learning Representations. 2024
[41] Geng Z, Yang B, Hang T, Li C, Gu S, Zhang T, Bao J, Zhang Z, Li H, Hu H, Chen D, Guo B. InstructDiffusion: a generalist modeling interface for vision tasks. In: Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024
[42] Hang T, Yang H, Liu B, Fu J, Geng X, Guo B. Language-guided face animation by recurrent StyleGAN-based generator. IEEE Transactions on Multimedia, 2023, 25: 9216–9227
[43] Mokady R, Hertz A, Aberman K, Pritch Y, Cohen-Or D. Null-text inversion for editing real images using guided diffusion models. In: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 6038–6047
[44] Johnson J, Alahi A, Fei-Fei L. Perceptual losses for real-time style transfer and super-resolution. In: Proceedings of the 14th European Conference on Computer Vision. 2016, 694–711
[45] Gatys L A, Ecker A S, Bethge M. Image style transfer using convolutional neural networks. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 2414–2423
[46] Gu S, Chen C, Liao J, Yuan L. Arbitrary style transfer with deep feature reshuffle. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 8222–8231
[47] Ding Z, Li P, Yang Q, Li S, Gong Q. Regional style and color transfer. In: Proceedings of the 5th International Conference on Computer Vision, Image and Deep Learning. 2024, 593–597
[48] Zhu J Y, Park T, Isola P, Efros A A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of 2017 IEEE International Conference on Computer Vision. 2017, 2223–2232
[49] Isola P, Zhu J Y, Zhou T, Efros A A. Image-to-image translation with conditional adversarial networks. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 1125–1134
[50] Bertalmio M, Sapiro G, Caselles V, Ballester C. Image inpainting. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques. 2000, 417–424
[51] Criminisi A, Perez P, Toyama K. Object removal by exemplar-based inpainting. In: Proceedings of 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2003
[52] Sun J, Yuan L, Jia J, Shum H Y. Image completion with structure propagation. In: Proceedings of the ACM SIGGRAPH 2005 Papers. 2005, 861–868
[53] Yang B, Gu S, Zhang B, Zhang T, Chen X, Sun X, Chen D, Wen F. Paint by example: exemplar-based image editing with diffusion models. In: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 18381–18391
[54] Zhang K, Mo L, Chen W, Sun H, Su Y. MagicBrush: a manually annotated dataset for instruction-guided image editing. In: Proceedings of the 37th Annual Conference on Neural Information Processing Systems. 2023
[55] Xia W, Zhang Y, Yang Y, Xue J H, Zhou B, Yang M H. GAN inversion: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3121–3138
[56] Shen Y, Gu J, Tang X, Zhou B. Interpreting the latent space of GANs for semantic face editing. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 9243–9252
[57] Zhu J, Shen Y, Zhao D, Zhou B. In-domain GAN inversion for real image editing. In: Proceedings of the 16th European Conference on Computer Vision. 2020, 592–608
[58] Patashnik O, Wu Z, Shechtman E, Cohen-Or D, Lischinski D. StyleCLIP: text-driven manipulation of StyleGAN imagery. In: Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. 2021, 2085–2094
[59] Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I. Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 8748–8763
[60] Gu S, Chen D, Bao J, Wen F, Zhang B, Chen D, Yuan L, Guo B. Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 10696–10706
[61] Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-resolution image synthesis with latent diffusion models. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, 10684–10695
[62] Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M. Hierarchical text-conditional image generation with CLIP latents. 2022, arXiv preprint arXiv: 2204.06125
[63] Saharia C, Chan W, Saxena S, Lit L, Whang J, Denton E L, Ghasemipour S K S, Ayan B K, Mahdavi S S, Gontijo-Lopes R, Salimans T, Ho J, Fleet D J, Norouzi M. Photorealistic text-to-image diffusion models with deep language understanding. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 2643
[64] Balaji Y, Nah S, Huang X, Vahdat A, Song J, Zhang Q, Kreis K, Aittala M, Aila T, Laine S, Catanzaro B, Karras T, Liu M Y. eDiff-I: text-to-image diffusion models with an ensemble of expert denoisers. 2022, arXiv preprint arXiv: 2211.01324
[65] Wallace B, Gokul A, Naik N. EDICT: exact diffusion inversion via coupled transformations. In: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 22532–22541
[66] Ruiz N, Li Y, Jampani V, Pritch Y, Rubinstein M, Aberman K. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 22500–22510
[67] Gupta T, Kembhavi A. Visual programming: compositional visual reasoning without training. In: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, 14953–14962
[68] Liu H, Li C, Li Y, Lee Y J. Improved baselines with visual instruction tuning. In: Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024
[69] Schuhmann C. Improved aesthetic predictor, 2022. GitHub repository
[70] Liu S, Zeng Z, Ren T, Li F, Zhang H, Yang J, Jiang Q, Li C, Yang J, Su H, Zhu J, Zhang L. Grounding DINO: marrying DINO with grounded pre-training for open-set object detection. In: Proceedings of the 18th European Conference on Computer Vision. 2025

RIGHTS & PERMISSIONS

Higher Education Press
