Quality control in open-ended crowdsourcing: a survey

Lei CHAI , Hailong SUN , Jing ZHANG

Front. Comput. Sci. ›› 2026, Vol. 20 ›› Issue (6) : 2006330  DOI: 10.1007/s11704-025-41081-1

Artificial Intelligence
REVIEW ARTICLE

Abstract

Crowdsourcing provides a flexible approach for leveraging human intelligence to solve large-scale problems, gaining widespread acceptance in domains like intelligent information processing, social decision-making, and crowd ideation. However, the uncertainty of participants significantly compromises the answer quality, sparking substantial research interest. Existing surveys predominantly concentrate on quality control in Boolean tasks, which are generally formulated as simple label classification, ranking, or numerical prediction. Ubiquitous open-ended tasks like question-answering, translation, and semantic segmentation have not been sufficiently discussed. These tasks usually have large to infinite answer spaces and non-unique acceptable answers, posing significant challenges for quality assurance. This survey focuses on quality control methods applicable to open-ended tasks in crowdsourcing. We propose a two-tiered framework to categorize related works. The first tier presents a comprehensive overview of the quality model, covering essential aspects including tasks, workers, answers, and the system. The second tier further refines this classification by breaking it down into more detailed categories: ‘quality dimensions’, ‘evaluation metrics’, and ‘design decisions’. This breakdown provides deeper insights into the internal structure of the quality control model for each aspect. We thoroughly investigate how these quality control methods are implemented in state-of-the-art works and discuss key challenges and potential future research directions.

Keywords

crowdsourcing / open-ended tasks / quality control

Cite this article

Lei CHAI, Hailong SUN, Jing ZHANG. Quality control in open-ended crowdsourcing: a survey. Front. Comput. Sci., 2026, 20(6): 2006330 DOI:10.1007/s11704-025-41081-1


1 Introduction

Posting questions online and receiving answers from other users has become much easier for internet users in the era of modern web technologies. In this context, internet participants are afforded considerable freedom to provide open-ended responses to a variety of open-ended tasks. The term ‘open-ended’ refers to the nature of not being confined to specific answers or solutions, allowing contributors to offer diverse and creative inputs. Specifically, in open-ended tasks or questions, there are no pre-set answer options that restrict participants’ responses. Contributors can freely express detailed, insightful, or innovative answers based on their understanding, experience, or creativity. As a result, open-ended crowdsourcing has been widely adopted for collecting data [1] for artificial intelligence models that require large-scale labeled datasets, such as text translation [2–4], image segmentation [5], image captioning [6], and object recognition [7–12]. Open-ended crowdsourcing also enables the solution of complex tasks that are not yet possible to solve through automatic methods. For instance, workers in open-ended crowdsourcing tasks collaborate to write novels [13], solve mysteries [14], and create taxonomies [15].

In addition to the potential in solving complex tasks, there are several reasons why open-ended crowdsourcing is of great significance. First, it allows for the acquisition of fine-grained information. Compared to Boolean crowdsourcing [16,17], open-ended crowdsourcing is an efficient way to introduce more detailed information to the computing system. For instance, image segmentation, a typical open-ended task, is crucial for scene understanding. Rather than assigning a single label to an entire image, the segmentation task essentially requires fine-grained annotations from crowd workers to identify the location and segmentation boundaries of sub-targets within an image through pixel-level labeling. Second, it is cost-effective. For example, given a set of 3D-point cloud data, employing a few experienced experts to annotate the target objects is costly and inefficient, while posting open-ended crowdsourcing tasks on crowdsourcing platforms is more effective in terms of time and cost.

As a result, both industry and academia have utilized open-ended crowdsourcing to tackle complex tasks due to its capabilities. For instance, in terms of advancing computer vision technology, crowd workers are employed to provide fine-grained image information by annotating the category of interrelated pixel points in the image, such as object boundaries. A notable accomplishment in this field is Visual Genome [18], which has become a critical benchmark for related research and applications (e.g., image segmentation [19] and Autopilot system [20]). In scientific research, citizen participants are generally asked to provide open-ended answers to scientific questions. For instance, in disease diagnosis tasks, diagnostic reports with different concerns and diverse language styles may be collaboratively generated by crowd workers [21]. Consequently, citizen science projects have addressed numerous challenges in areas such as astronomical observation [22], protein folding [23], biological population analysis [24], and disease diagnosis [21]. For knowledge production, over 80 million Wikipedia users have collaborated to write nearly 200 million entries [25] in various expressions, writing styles and even languages. Therefore, open-ended crowdsourcing is emerging as an indispensable approach for accomplishing large-scale complex tasks with low cost and high efficiency.

Despite these advantages, the core issue in open-ended crowdsourcing is how to enhance the quality of responses through statistical modeling or mechanism design. Given that the participants in open-ended crowdsourcing are dynamically recruited from the Internet, crowd workers have uncertain capabilities [26,27] and motivations [28,29]. Their motivation, ability level, and time commitment vary greatly and may be influenced by various factors such as time, environment, and psychological state, thus posing serious quality problems in answers [30,31].

1.1 Representative open-ended crowdsourcing tasks

Considering the prevalence and diversity of open-ended crowdsourcing tasks in real-world scenarios, we categorize the tasks of open-ended crowdsourcing into three types: ‘intelligent information processing’, ‘crowd social decision-making’, and ‘crowd ideation’. We will present some typical open-ended crowdsourcing tasks, which we use as examples in this paper.

1) Intelligent information processing. In this paper, we primarily concentrate on information-processing tasks that can be used for supervised learning. Open-ended crowdsourcing tasks for intelligent information processing can be divided into two categories: 1. Direct tasks that employ crowdsourced answers as solutions, such as optical character recognition [32] and animal audio classification [33]; 2. Indirect tasks that are associated with knowledge acquisition: these tasks indirectly provide solutions by extracting and organizing knowledge, such as knowledge mapping [34] and knowledge base construction [35]. Simple crowdsourcing tasks like numerical prediction, binary classification, and ranking provide labeled data for the training and deployment of AI models, thereby accelerating the implementation and application of AI algorithms in areas like face recognition [36], intelligent customer service bots [17], and pedestrian recognition [37]. As AI technology continues to evolve, emerging applications such as autonomous driving [38], smart cities [39], and the metaverse [40] have created new demand for labeled data. Some representative open-ended tasks of intelligent information processing include:

● Image segmentation. Image segmentation [5] is generally regarded as a pixel-level classification task that focuses on subdividing an image into multiple sub-regions (or collections of pixels) by identifying objects and boundaries (such as lines and curves) within the image.

● Open-ended question answering. The objective of open-ended question answering is to collect responses to specific questions from the crowd [41]. The peculiarity of such tasks lies in the fact that the answer can be subjective and heavily influenced by the worker’s capability.

● Translation. Translation tasks aim to request crowd workers to provide translations in the target language [2], based on the sentences in the source language.

2) Crowd social decision-making. Crowd social decision-making is an effective approach to solving specific social issues, such as drawing on the opinions and preferences of the crowds [42], finding social influencers [43], and acquiring specific knowledge [44].

3) Crowd ideation. The main challenge in crowd ideation is to assemble distributed groups of workers with minimal mutual interaction [45] while generating creative content, like stories [46,47], plans [48], and summaries [49–51].

To avoid redundancy, we have selected one representative work for each type of task (such as image segmentation and audio transcription) and summarized them in Tab.1.

From the perspective of data types, both open-ended crowdsourcing tasks and Boolean crowdsourcing primarily involve images, videos, audio, and numerical data. However, open-ended crowdsourcing often demands finer-grained responses, indicating a higher requirement for data processing. For instance, Boolean crowdsourcing usually requires only appropriate category labels for specific images, whereas open-ended tasks might necessitate drawing 3D bounding boxes around specific objects or describing scene content. Regarding audio data, Boolean crowdsourcing typically involves determining whether an audio clip contains malicious content, while open-ended tasks may require transcribing the audio content. In the case of text-based tasks, Boolean crowdsourcing may simply require assessing the sentiment of a given text, whereas open-ended tasks might demand workers to provide comprehensive event descriptions from textual fragments.

Consequently, open-ended crowdsourcing employs more refined and diverse evaluation metrics to assess the answer quality. For example, Boolean crowdsourcing frequently uses ‘hard metrics’ such as ‘accuracy’, whereas open-ended tasks tend to utilize ‘soft metrics’ centered on similarity, like GLEU [70], Jaccard Similarity [71], and Intersection over Union (IoU) [52]. Additionally, open-ended crowdsourcing incorporates subjective metrics, including peer ratings [72], expert feedback [73], and ranking [58], to evaluate answer quality.
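To make these ‘soft metrics’ concrete, the following minimal Python sketch (illustrative only; the sample answers and boxes are hypothetical) computes token-level Jaccard similarity between two free-text answers and IoU between two bounding boxes:

```python
def jaccard_similarity(answer_a: str, answer_b: str) -> float:
    """Token-set Jaccard similarity between two free-text answers."""
    tokens_a, tokens_b = set(answer_a.lower().split()), set(answer_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def bbox_iou(box_a, box_b) -> float:
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two workers' answers to the same captioning task, and two bounding boxes.
print(jaccard_similarity("a dog runs on the beach", "a dog is running on a beach"))  # 0.5
print(bbox_iou((10, 10, 60, 60), (20, 20, 70, 70)))  # ~0.47
```

Unlike exact-match accuracy, both scores reward partial overlap, which is precisely the property open-ended answers require.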

Beyond the probabilistic graphical models commonly used in Boolean crowdsourcing (such as the EM algorithm [55]), open-ended crowdsourcing involves a wider range of methods for directly optimizing answer quality, such as similarity-based methods (e.g., DICE Similarity Coefficient [74]), optimization-based strategies (e.g., multi-objective optimization [75]), and machine learning-based techniques (e.g., deep knowledge tracing [76]).

Given that the scale of the answer space significantly impacts the design of answer modeling and answer aggregation methods, we have organized and summarized the size of the answer space (the number of acceptable answers) for open-ended tasks and divided it into three categories: countable, large but countable, and large and uncountable.

We found that simple Boolean crowdsourcing is not suitable for most open-ended tasks. For instance, an emergency event cannot be adequately described with finite category labels or continuous values. Therefore, exploring modeling and optimization methods for open-ended crowdsourcing is both crucial and challenging.

1.2 Quality control in open-ended crowdsourcing

Most previous research on optimizing crowdsourcing has concentrated on Boolean tasks. However, according to AMT log data [77] and crowdsourcing industry user surveys [78], the demand for open-ended tasks in the crowdsourcing market has surpassed traditional simple crowdsourcing tasks and continues to rise.

Quality control methods (like D&S [79], ZenCrowd [80], BCC [81], and LFC [82]) for Boolean crowdsourcing have been widely explored over the past years. However, open-ended tasks have been overlooked due to their inherent characteristics, such as complex task structure, large to infinite answer spaces, and non-unique acceptable answers. For instance, the authors of ImageNet [83] argue that ‘computers still perform poorly on cognitive tasks such as image description and question-answering’, and propose an image dataset, Visual Genome [18], with complex annotations including attribute annotation, semantic segmentation, image description, and question-answering. While both ImageNet [83] and Visual Genome [18] aim to identify image content, the annotation tasks of Visual Genome impose more fine-grained requirements: for example, indicating the relationships between objects in an image, describing the content of the image in open-ended text, and designing question-answer pairs to explore the semantic information in the image. In the case of Visual Genome [18], crowd workers may need to provide complex statements drawn from an infinite answer space. In summary, there are fundamental differences between Boolean crowdsourcing and open-ended crowdsourcing.

Note that ‘quality’ has diverse connotations in different research areas such as data management, artificial intelligence, and software engineering. In this paper, we adopt the definition of quality proposed by Crosby [84] as a guide to identify the target and dimensions of the quality control issue, where ‘conformance to requirements’ is regarded as the core requirement. Therefore, ‘quality control’ primarily focuses on how to produce high-quality answers that meet the needs of the requester. Broader implications of quality control, such as security and reliability, are beyond the scope of this paper. In the realm of crowdsourcing quality control, requesters typically specify a series of objective quality metrics, such as accuracy, recall rate, and semantic similarity. The problem of quality control in open-ended crowdsourcing is defined as follows:

Definition 1 Quality control in open-ended crowdsourcing

Given a set of open-ended tasks T with corresponding quality metrics Q published by requesters, a group of publicly recruited crowd workers W capable of providing answers, and a set of answers A provided by these workers, quality control in open-ended crowdsourcing involves enhancing answer quality through a collection of statistical modeling and design methods Θ. The goal is to ensure that the final answer set meets the requesters’ quality requirements as specified by Q. These methods include, but are not limited to, optimizing the design of crowdsourcing tasks, enhancing the capabilities of crowd workers, and eliminating noise in crowdsourced answers.
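Read operationally, Definition 1 can be sketched as a constrained optimization. The formulation below is only one possible reading; in particular, pairing each metric in Q with an acceptance threshold is our assumption, not part of the definition.

```latex
% T: tasks, W: workers, A: collected answers, \Theta: quality control methods.
% Q = \{(q_k, \tau_k)\}: requester metrics with (assumed) acceptance thresholds.
\hat{A} = \Theta(T, W, A), \qquad
\text{subject to} \quad q_k(\hat{A}) \ge \tau_k \quad \text{for all } (q_k, \tau_k) \in Q.
```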

Specifically, the quality control issue in open-ended crowdsourcing focuses on optimization methods for tasks that lack exact evaluation metrics and a unique ground truth, implying that there may be more than one acceptable answer. The answers for these tasks generally come from a large to infinite answer space, making it uncommon for different workers to provide identical answers to the same task. In addition, these tasks may consist of a set of interdependent subtasks, so that context information and worker collaboration may significantly contribute to the improvement of answer quality.

1.3 Challenges to quality control in open-ended crowdsourcing

Quality control methods in Boolean crowdsourcing generally rely on assumptions [85] that do not hold in open-ended crowdsourcing: 1) the candidate answers are independent of each other, 2) answers come from a finite answer space, and 3) exact evaluation metrics exist, so a unique true answer exists for a specific task. Examples of Boolean and open-ended crowdsourcing tasks are depicted in Fig.1. In the object classification task depicted in Fig.1, crowd workers are asked to choose the most appropriate answer from a list of five options; conversely, for the image captioning task illustrated in Fig.1, workers provide an open-ended textual description of the visual content. As a result, most existing quality control methods are only applicable to Boolean crowdsourcing tasks, like rating [86] and classification [87], and are thus unsuitable for open-ended crowdsourcing tasks. To intuitively introduce the relationship between Boolean and open-ended crowdsourcing, we present the algorithm flow of crowdsourcing quality control, along with the similarities and differences between them, in Fig.2. It is worth noting that the boxes in blue denote challenges that do not exist in Boolean crowdsourcing, while boxes in gray indicate challenges that exist in both Boolean and open-ended crowdsourcing but differ in nature, such as ‘dynamic and multi-dimensional worker abilities’.

Without loss of generality, the challenges of quality control in open-ended crowdsourcing lie in the following aspects:

1) Dynamic and multidimensional worker abilities. Accurately estimating worker abilities is crucial for assigning the task to the appropriate worker, thereby improving the quality of answers. Unlike Boolean crowdsourcing tasks, the answer quality of specialized open-ended crowdsourcing tasks (such as question answering, story creation and translation) is usually influenced by the knowledge background of crowd workers and has a certain degree of subjectivity. Therefore, crowd workers’ abilities for open-ended crowdsourcing tasks are multidimensional. For instance, it is important to evaluate whether the crowd workers have the background knowledge to complete the complex open-ended task, and how well they have mastered the knowledge, rather than representing the worker’s ability with a number ranging from 0 to 1. Moreover, changes in workers’ abilities over the course of the task are not negligible: workers become more adept at completing specific open-ended tasks as the number of completed tasks increases.

2) Complex task structure. Unlike simple micro-tasks, open-ended tasks usually consist of a set of interdependent subtasks organized by complex workflows. For instance, in an intelligence analysis task, crowd workers usually have access to only a portion of the information. Therefore, decomposing the complete task into interrelated subtasks allows crowd workers to access context information, which is important for producing comprehensive, high-quality intelligence analysis results.

3) Multiple acceptable answers. Many open-ended tasks, such as opinion gathering, story writing, and event reporting, do not have unique acceptable answers. The variability of workers’ knowledge, skills, and subjective biases leads to the diversity of answers. Beyond that, for open-ended crowdsourcing tasks, most crowd workers usually struggle to provide complete answers, so the collected answers are diverse but often incomplete. Since the final output is generally a consistent result obtained by aggregating participants’ answers, the diversity of crowd contributions makes it more challenging to derive a ‘ground truth’ with high confidence.

4) Difficulty in evaluation and aggregation. In Boolean crowdsourcing, probabilistic graphical model (PGM) [88]-based algorithms (like D&S [79] and its extensions [89]) are widely applied for answer evaluation and aggregation. However, such methods cannot be directly applied to open-ended crowdsourcing for several reasons. First, diverse contributions: given that there may be several acceptable answers, it becomes challenging to accurately assess the contribution of each answer, thus posing challenges for answer aggregation. Second, uncertainty in answer representation: for automated evaluation, answers are generally mapped into numerical values or dense vectors. Different embedding methods (such as ‘Universal Sentence Encoder’ [90] and ‘BERT’ [91]) may focus on different aspects of a textual answer’s semantic information and derive different similarity relationships between answers and the truth, and thus impact the result of answer evaluation and aggregation (see the sketch after this list). Third, significant differences across open-ended tasks make it challenging to find uniform evaluation and embedding methods.

In summary, issue 1 indicates the difficulty of worker estimation and task assignment; issue 2 leads to challenges in task analysis and workflow design; issues 3 and 4 point out problems in answer modeling and aggregation.
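To illustrate the second point in issue 4 (uncertainty in answer representation), the sketch below embeds two hypothetical crowd answers and a reference caption and compares cosine similarities. It assumes the sentence-transformers library and a generic pre-trained model; swapping in a different encoder can change the similarity values and, in some cases, the ranking of the answers.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed to be installed


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical crowd answers to an image-captioning task and a reference caption.
answers = ["a man rides a horse on the beach",
           "someone on horseback near the ocean"]
reference = "a person riding a horse along the shore"

# The model name is an arbitrary choice; the point is that different encoders
# induce different similarity relationships between answers and the reference.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(answers + [reference])
for answer, vector in zip(answers, embeddings[:-1]):
    print(f"{cosine(vector, embeddings[-1]):.3f}  {answer}")
```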

1.4 Related survey

Numerous surveys on crowdsourcing have been published in recent years. Some provide a general overview of crowdsourcing [92], while others focus on specific aspects such as 1) certain steps in the crowdsourcing process (e.g., worker organization [93] and task design [94]); 2) specific types of tasks (e.g., writing tasks [95] and medical image analysis [96]); and 3) specific research fields (e.g., computer vision [97] and natural language processing [85]).

Recently, with the increased focus on quality control issues, surveys on crowdsourced quality control have begun to emerge. For instance, Li et al. [98] summarized 17 truth inference algorithms in crowdsourcing to assist users in understanding existing truth inference algorithms and conducting detailed comparison experiments, providing readers with guidance for selecting appropriate methods for different tasks. Jin et al. [99] offered a unified taxonomy of crowdsourcing quality control methods from the perspective of mechanism design and statistical methods. Daniel et al. [100] discussed this topic from the viewpoint of quality attributes, assessment techniques, and assurance actions. However, most of the quality control methods discussed in these surveys are only suitable for Boolean crowdsourcing tasks. As a result, quality control methods for open-ended crowdsourcing tasks urgently need to be reviewed and discussed.

1.5 Contributions

This survey aims to elaborate on quality control issues in open-ended crowdsourcing. It seeks to assist requesters, workers, and developers of crowdsourcing applications in understanding the roles of quality control methods in the process of open-ended crowdsourcing and how to enhance answer quality through appropriate methods. Specifically, it provides the following contributions:

● A definition of open-ended crowdsourcing and the scope of quality control methods.

● An introduction to representative open-ended crowdsourcing tasks and the challenges in quality control.

● A two-tiered framework for all the related works. Tier 1 provides a holistic view of the quality model encompassing key aspects: task, worker, answer, and system. Tier 2 introduces ‘quality dimensions’, ‘evaluation metrics’, ‘design decisions’, and their interactions in each aspect.

● A systematic review of all related works, which are organized by the two-tiered framework.

● An in-depth discussion of research status and identification of current research challenges along with respective opportunities for future research, especially in the era of Large Language Models (LLMs).

2 Quality control in open-ended crowdsourcing

Before we introduce the quality control framework in detail, it is important to identify the execution process and key aspects involved in open-ended crowdsourcing. According to the task execution process adopted by representative crowdsourcing platforms (such as AMT, Scale AI, and Appen), we have summarized the organization and execution process of crowdsourcing tasks, and highlighted the key parts and crucial topics in quality control issues, as shown in Fig.2.

Generally, open-ended crowdsourcing tasks are executed by the flow of tasks and answers.

Task flow refers to the key steps in the organization and flow of tasks in crowdsourcing. Initially, macro open-ended tasks are decomposed into sub-tasks that are suitable for crowd workers to complete. Subsequently, the micro-tasks are assigned to appropriate workers (e.g., with the matched knowledge background or task experience). After that, micro-tasks are solved by crowd workers according to the task requirements.

Answer flow refers to the key process of answer generation and optimization. First, crowd workers provide the initial answers according to the requirements of micro-tasks. Then, the answer reliability is evaluated based on relevant parameters, such as answer error and worker reliability. After that, the optimized answer (also known as aggregated answer or estimated truth) is derived based on the answer and its reliability parameters.

2.1 Key aspects of quality control in open-ended crowdsourcing

Current research on open-ended crowdsourcing quality control is dedicated to exploring the relationship between key aspects and answer quality [99]. Therefore we specify four critical aspects of open-ended crowdsourcing related to the issue of quality control before we conduct the systematic review on this topic.

1) Task. In crowdsourcing, tasks are usually decomposed into easy-to-complete micro-tasks to facilitate crowd workers in providing high-quality answers. Generally, a micro-task is the smallest unit that is distributed to workers [101] in crowdsourcing. Open-ended crowdsourcing tasks can be categorized into ‘single tasks’ and ‘contextual tasks’ based on the task structure. A single task mainly asks for independent open-ended answers from workers, for example, annotating an object with a 3D frame in a 3D point cloud image. In contrast, a contextual task can usually be considered as a set of interdependent sub-tasks, which may require considering the context information of other sub-tasks when completing the task, for example, captioning video content based on the content of certain frames. Furthermore, a task may be subjective, in which case there are multiple acceptable answers.

2) Crowd worker. A crowd worker (hereafter termed worker) is a user who answers questions with certain motivations and is typically recruited dynamically from internet platforms. As discussed in Section 1.3, workers need to be modeled dynamically and multidimensionally in open-ended crowdsourcing. Multidimensional modeling and analysis of workers aim to assign the most appropriate task to each worker to produce high-quality answers.

3) Answer (also known as response). The responses to a task provided by workers are referred to as answers. In quality control, the final answer is generally derived by aggregating all the answers to the same task and is referred to as the ‘estimated truth’. In open-ended crowdsourcing, answers are generally collected from a large to infinite solution space.

4) System. Quality control in open-ended crowdsourcing is a system-level problem that is typically discussed within the context of multiple quality dimensions. In this paper, it mainly consists of cross-aspect quality control methods and workflow design. We found that multiple quality dimensions are generally jointly considered while modeling and optimizing the answer quality. Workflow greatly aids in identifying the execution steps and organizing crowd workers in the process of solving open-ended tasks, especially for ‘contextual tasks’.

2.2 Holistic view of the quality model

Most existing open-ended crowdsourcing quality control studies aim to improve the answer quality through statistical modeling or design methods on one or more specific aspects (including task, worker, and answer) introduced above.

We propose a comprehensive view of quality control methods in open-ended crowdsourcing, as depicted in Fig.3. The quality model introduces the aspects, their related dimensions, and the design decisions considered in past research works on quality control methods for open-ended crowdsourcing. It is worth noting that a dimension denotes an important branch of the related aspect that has been jointly considered by numerous crowdsourcing quality control research works, while a design decision signifies a specific class of research ideas for addressing quality control problems.

2.3 Taxonomy in each aspect

Considering that the current fragmented literature lacks an analysis of the fine-grained categories of research works, we developed the taxonomy shown in Fig.4 to study and analyze the scope of the quality control issue and how quality is dealt with in each aspect. The proposed taxonomy aims to highlight the hierarchical relationship between the related quality control works, which greatly assists readers in identifying the attributes, evaluation metrics, and design decisions when developing quality control methods. Specifically, the taxonomy consists of three dimensions:

1) Quality model in each aspect refers to a collection of quality dimensions, which are used to describe key attributes that may impact the quality of crowdsourcing tasks. Quality dimensions are typically an abstract representation of a category of methods and issues related to an aspect, such as task mapping and task organization. For each aspect, understanding the quality dimensions can help readers find an appropriate entry point when exploring how to optimize crowdsourcing quality.

2) Quality evaluation aims to measure the impact of specific quality dimensions and optimization methods on the quality of crowdsourcing results. Accurate evaluation results are crucial for designing effective quality control methods [102]. Quality assessment is generally implemented by a series of evaluation metrics. For automatic metrics, ‘annotation distance’ [4] and ‘answer similarity’ [14] are often considered general metrics for answer assessment; other automated evaluation metrics, like ‘GLEU’ [70] and ‘Intersection over Union (IoU)’, are generally adopted for textual and image data. Beyond that, subjective evaluation is also a common evaluation method in open-ended crowdsourcing, such as peer grading [103], expert feedback [68], and human-AI collaborative approaches [104].

3) Quality optimization is the final step in open-ended crowdsourcing quality control, aimed at improving the final quality of crowd-sourced answers. Specifically, researchers optimize the quality of each stage of the crowdsourcing task by making design decisions that act on the quality dimensions defined by the quality model, such as designing incentive mechanisms to improve worker motivation and designing task decomposition and reconstruction methods to reduce task difficulty.

Specifically, the quality model defines the standards and criteria for quality. Quality evaluation measures and assesses outcomes against these standards, while quality optimization enhances processes and results to improve quality levels. Together, the quality model, quality evaluation, and quality optimization form the foundation of quality control methods within each aspect. We have organized the remainder of the research works according to the proposed quality model and taxonomy. Each section corresponds to one aspect in the quality model, with ‘quality dimensions’, ‘evaluation metrics’, and ‘design decisions’ reviewed.

2.4 Literature selection

We compiled a list of conferences and journals that publish research works on crowdsourcing and related topics to identify the papers to be considered in this survey. The conferences considered include AAAI, IJCAI, WWW, KDD, SIGIR, HCOMP, ICML, NIPS, ICLR, CHI, CSCW, VLDB, CIKM, WSDM, UBICOMP, UIST, CVPR, EMNLP, ACL, ICDE, ICIP, DASFAA, and NAACL. The journals include ACM CSUR, VLDBJ, JAIR, AIJ, IEEE TKDE, IJCV, TPAMI, IEEE TNNLS, TVCG, ACM TOCHI, ACM TOCS, ACM TWEB, ACM TIST, and Information Sciences. We conducted a search for papers published between 2012 and January 2025 with the following keywords in either the title, abstract or keywords: Complex Crowdsourcing, Open-ended Crowdsourcing, Crowd transcription, Crowd Translation, Crowd Image description, and Complex annotation. The search returned 451 papers and the authors read the title and abstract of each paper. Papers were excluded from further consideration if they focused on simple crowdsourcing tasks (like binary classification) or if they did not involve the topic of quality control, which resulted in a list of 153 papers for further consideration. Additional papers considered are derived from the authors’ prior knowledge.

3 Task model

Tasks, which are usually posted by requesters on public internet platforms, serve as the starting point of crowdsourcing. In open-ended crowdsourcing, tasks can be categorized into two types: objective tasks and subjective tasks. Objective tasks typically have clear evaluation criteria and a limited answer space. These tasks are usually more complex than Boolean crowdsourcing tasks, requiring more human cognitive effort and sophisticated task design to yield high-quality answers.

3.1 Quality dimensions

A significant number of researchers in crowdsourcing quality control focus on the topic of task modeling and organizing. The method of task construction and implementation directly influences the quality of the answers provided by crowd workers, and significantly impacts the quality of open-ended crowdsourcing. We identify the following quality dimensions:

1) Task design is the starting point in task modeling, and its results determine the content and interface of the task, which directly affects how difficult it is for crowd workers to complete tasks. Therefore, task design is a crucial step in task modeling. Considering that open-ended crowdsourcing tasks usually have complex task structures and annotation content, the process of task design mainly includes steps such as task decomposition and reconstruction [105], task pricing [106], communication mechanism design [107], and pipeline/workflow design [64].

2) User interface is where crowd workers complete tasks and submit answers. The interface can either be an HTML page integrated into the crowdsourcing platform (such as the AMT task dashboard [108]) or a separate annotation application (such as Labelme [109]). These user interfaces commonly integrate annotation tools and are connected by a series of processes. Therefore, the quality of the user interface determines how easily crowd workers can complete tasks [56,110].

3) Incentives. Within the context of task design, incentive mechanisms play a pivotal role [110]. Incentives provided by requesters through the platform—ranging from extrinsic rewards (e.g., monetary compensation) to intrinsic rewards (e.g., recognition and prestige)—serve as critical motivators for crowd workers. The architecture of these incentives directly influences worker engagement and diligence in task completion, thereby significantly impacting the quality and accuracy of the crowdsourced outcomes. Consequently, crafting an effective incentive framework during the task design phase is essential for optimizing task performance and ensuring high-quality results.

There are also quality dimensions related to task modeling that may necessitate interaction between aspects. For instance, the quality dimension ‘task allocation’ typically requires an estimation of the worker’s ability and knowledge background [111], in addition to modeling the content and difficulty of the task itself. We will provide a comprehensive introduction to these quality dimensions with cross-aspect interactions in Section 6.

3.2 Quality evaluation

The primary purpose of quality evaluation for task modeling is to evaluate the quality of task organization and design, with the aim of guiding the design of crowdsourcing tasks and thereby enhancing the quality of crowdsourced answers. Considering that numerous researchers explore measuring the quality of crowdsourcing task design through the quality of crowdsourced answers, we categorize the quality evaluation methods for task modeling into two types: direct and indirect. Moreover, considering the prevalence of subjective tasks in open-ended crowdsourcing and the subjective character of task quality, the evaluation of task design should consider both objective indicators (such as the number of operations [54]) and subjective metrics (such as crowd feedback).

1) Objective evaluation. A number of automated metrics are utilized to objectively evaluate the quality of task modeling. On one hand, some researchers strive to determine whether the designed crowdsourcing tasks are easy to complete by analyzing the cost required to complete them, such as monetary expenses [64,112], time [113,114], number of operations [54], and completion rate [63]. On the other hand, statistical metrics about quality, such as average error rate [69,115], number of operations for target accuracy [9], and engagement probability [116], are frequently used to indirectly evaluate the quality of task design. Furthermore, some automated indicators that reflect the participation of crowd workers are also used to measure the quality of task modeling indirectly. For instance, Rechkemmer and Yin [115] use a learning gain indicator to measure the extent to which the design of task instructions enhances workers’ ability to complete tasks. Additionally, mental demand scores [107] are utilized to identify the difficulty of the designed tasks by measuring the mental cost of crowd workers in completing specific tasks.

2) Subjective evaluation. Given that the answer to ‘the quality of task modeling’ often varies among users, many researchers subjectively evaluate the quality of task modeling by collecting feedback from crowd workers. For example, some researchers directly measure the quality of task design by asking workers to provide comments [47,105] on the dimensions of usability, helpfulness, clarity, and conciseness for task and interface designs. In addition, rating is also a common method to help workers provide subjective evaluations. Evaluation methods such as peer rating [103], expert rating [56,117], and self-rating [112] are generally adopted for the assessment of task design and answer quality.

3.3 Quality control methods

According to the quality dimensions and evaluation metrics introduced above, the quality control method of task modeling is shown in Fig.5, including the following three aspects:

3.3.1 Task mapping

Requesting high-quality responses directly from novice crowds for open-ended tasks is challenging and costly, primarily due to the significant cognitive burden and workload involved. An intuitive approach to task simplification is to transform complex crowdsourcing tasks into simple annotating tasks. For instance, complex scene recognition tasks are transformed into 2D annotation tasks [7] and numerical annotation tasks [52] with crowd-machine hybrid approaches. Additionally, some works explore transforming crowdsourcing tasks into binary annotation tasks through multi-step planning [48] and disambiguation [118]. Furthermore, meaning representation [62] and reconstruction [52] are also adopted to simplify complex open-ended crowdsourcing tasks.

3.3.2 Workflow design

Workflow design is one of the most critical parts of task modeling, and the logic of the task flow directly influences the efficiency of task execution and the efficiency of crowd workers. Specifically, the ideas of workflow design may originate from the crowd [117,119], generic workflow patterns [27], and budget constraints [64]. The tasks may be implemented according to the results of automatic workflow optimization [120–122], multi-step planning [48], instant execution [123], data programming [124], and visualized management [125]. Moreover, pricing is also an important factor in task and workflow design given that crowdsourcing is a profit-oriented computing paradigm. Compared to a quality-based pricing strategy [126], the proposed methods of the error-time curve [127] and dynamic financial incentive [128] are more helpful in adapting to changes in worker capabilities as complex tasks are performed.

3.3.3 Task organization

Appropriate methods of task organization significantly impact the overall quality and efficiency of the macro-crowdsourcing task, especially since open-ended tasks typically comprise a set of interdependent subtasks. For instance, in multi-step crowdsourcing tasks like intelligence analysis and event reporting, the sequencing and organization of the steps critically influence the final output. Numerous research works have discussed the organizational relationships between subtasks from both synchronous [129,130] and asynchronous [14,103,131] perspectives. Additionally, some researchers consider the problem of task organization as an optimization problem and explore the optimal method of task organization using state machine [132] and collaboration optimization [133].

4 Worker model

Modeling and analyzing crowd worker behavior is crucial for ensuring quality control in crowdsourcing. The strategy for evaluating and selecting workers directly impacts the quality of their contributions. Worker ability, a pivotal factor for task assignment and answer aggregation, significantly influences answers. Specifically, the worker accuracy probability $c^w_t$, frequently employed in prior research [80,87], quantifies worker $w$’s capability to accurately process task $t$; a higher $c^w_t$ signifies an increased likelihood of worker $w$ providing accurate answers for task $t$. Additionally, the confusion matrix [81,89], widely utilized in classification-based annotation tasks, models the probability of each worker generating a specific categorical annotation given the correct answer. For instance, in a $K$-option task $t$, the confusion matrix $\mathrm{con}^w$ for worker $w$ is a $K \times K$ matrix whose $j$th row ($1 \le j \le K$), $\mathrm{con}^w_j = [\mathrm{con}^w_{j,1}, \mathrm{con}^w_{j,2}, \ldots, \mathrm{con}^w_{j,K}]$, represents the distribution of probabilities over the options that worker $w$ selects when the true answer is the $j$th option. Each element $\mathrm{con}^w_{j,m}$ ($1 \le j \le K$, $1 \le m \le K$) indicates the probability that worker $w$ selects the $m$th choice when the true answer is the $j$th one, which can be formulated as follows:

$$\mathrm{con}^w_{j,m} = \Pr(a^w = m \mid a_{\mathrm{true}} = j).$$
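As a toy illustration (hypothetical data; plain counting with Laplace smoothing rather than any particular published estimator), such a confusion matrix can be estimated from tasks whose true option is known:

```python
import numpy as np


def estimate_confusion_matrix(gold, given, num_options, smoothing=1.0):
    """Estimate con^w from (true option, worker's option) pairs on gold tasks.

    gold and given are lists of 0-based option indices; row j of the result is
    the estimated distribution over the worker's choices when the true answer
    is option j.
    """
    counts = np.full((num_options, num_options), smoothing)
    for true_option, chosen_option in zip(gold, given):
        counts[true_option, chosen_option] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)


# A hypothetical worker's behaviour on six gold questions with K = 3 options.
gold_answers = [0, 0, 1, 1, 2, 2]
worker_answers = [0, 1, 1, 1, 2, 0]
print(estimate_confusion_matrix(gold_answers, worker_answers, num_options=3))
```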

In open-ended crowdsourcing, tasks usually possess a complex structure and necessitate certain professional knowledge. Completing such complex tasks requires not only a redundant processing mechanism for individual subtasks but also the collaboration of multiple participants. Therefore, enhancing the quality of open-ended crowdsourcing through worker-centered modeling and optimization method design primarily focuses on the following key issues: how to accurately describe multi-dimensional, dynamic worker capabilities; how to improve the expertise of crowd workers to obtain high-quality answers; and how to design incentive mechanisms to boost the subjective initiative of workers.

4.1 Quality dimensions

The quality dimensions of worker modeling primarily focus on the strategies employed for worker evaluation and selection, which directly influence the quality of responses and significantly impact the completion quality of open-ended crowdsourcing tasks. This section discusses quality control approaches in worker modeling, with an emphasis on expertise modeling, motivational design, and contribution analysis.

1) Expertise modeling. Expertise directly determines whether crowd workers can provide correct answers for specific tasks. Workers’ expertise is generally influenced by diverse factors including skills, educational background, task experience, age, gender, etc., and exhibits different characteristics under different motivations and work states. Modeling worker expertise not only helps assign crowdsourcing tasks to suitable workers but can also be used to assess the quality of answers and measure workers’ contributions. In general, expertise modeling helps optimize the quality of open-ended crowdsourcing at multiple stages of the crowdsourcing workflow.

2) Worker motivating. Self-Determination Theory (SDT) [134] indicates that human behavior is influenced by both intrinsic and extrinsic motivations. For crowdsourcing workers, intrinsic motivations include interest in the task and personal achievement, whereas extrinsic motivations may involve monetary rewards and reputation enhancement. Consequently, designing effective incentive mechanisms is crucial for obtaining high-quality crowdsourced responses. Workers with malicious intent or low motivation are more prone to delivering subpar or incorrect responses. Therefore, the design of robust incentive schemes is crucial for enhancing workers’ dedication and focus during task execution, which ultimately contributes to the integrity and reliability of the data produced.

3) Contribution analyzing. Unlike Boolean crowdsourcing, estimating the contribution of a crowd worker to the completion of a specific open-ended task is very challenging. This challenge primarily stems from two aspects: 1) open-ended crowdsourcing tasks may consist of a series of interrelated subtasks, and a crowd worker may complete only one or a few of them, so individual workers lack complete information about the task; 2) open-ended crowdsourcing may include subjective tasks, and the answers provided by workers could be partially correct or conditionally correct. Consequently, assessing the contribution of crowd workers is crucial for task allocation, response aggregation, and reward determination in open-ended crowdsourcing.

4.2 Quality evaluation

Existing research on worker evaluation can be broadly categorized into confidence analysis, capability modeling, and contribution estimation.

Due to the complexity of open-ended tasks, exact matching is usually insufficient for evaluating answer correctness. As a result, some researchers have proposed estimating the confidence of workers through partial agreement [60] and behavior analysis [27,135,136] in open-ended crowdsourcing tasks.

4.2.1 Expertise evaluation

Expertise evaluation (also referred to as capability modeling or skill modeling) is another significant topic in worker modeling. Considering that the state and ability of workers in the process of handling tasks are changing, researchers explored estimating worker capability at a fine-grained [137] and dynamic [27] level through multidimensional skill modeling. In particular, representative approaches of skill modeling methods include Single-Skill Modeling [138], Multi-Skill Modeling [139], Skill Ontology Modeling [140], and Cognitive-based Modeling [141].
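As a generic sketch of multi-skill modeling (not a reproduction of any cited method; all names and numbers below are made up), a worker can be represented by per-skill proficiency levels that are updated dynamically after each completed task and matched against a task's skill-requirement profile:

```python
class MultiSkillWorker:
    """Worker expertise as per-skill proficiency levels, updated online."""

    def __init__(self, skills, initial_level=0.5, learning_rate=0.2):
        self.level = {skill: initial_level for skill in skills}  # each in [0, 1]
        self.learning_rate = learning_rate

    def update(self, skill, observed_quality):
        """Move the skill level toward the observed answer quality (EMA update)."""
        old = self.level[skill]
        self.level[skill] = (1 - self.learning_rate) * old + self.learning_rate * observed_quality

    def match_score(self, requirement):
        """Weighted fit between worker skills and a task's requirement weights."""
        total_weight = sum(requirement.values()) or 1.0
        return sum(self.level[s] * w for s, w in requirement.items()) / total_weight


worker = MultiSkillWorker(["translation", "medical", "image_annotation"])
worker.update("translation", observed_quality=0.9)      # after a well-rated translation
task_requirement = {"translation": 0.7, "medical": 0.3}  # hypothetical task profile
print(round(worker.match_score(task_requirement), 3))    # 0.556
```

The dynamic update reflects the observation that workers become more adept as they complete more tasks; the cited approaches replace these scalar levels with richer structures such as skill ontologies or cognitive models.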

4.2.2 Contribution estimation

As discussed in Section 1.3, the consistent results are difficult to infer due to the diversity of crowd contributions in open-ended crowdsourcing. Existing research works on contribution assessment for open-ended crowdsourcing can be divided into two categories, crowd-based [142] methods and answer quality-based methods. The crowd-based methods evaluate the contribution of answers based on feedback from other novice crowds [143] and domain experts [105]. Other works mainly focus on assessing the answer contribution with answer similarity [3] or annotation distance [4].

4.3 Quality control methods

Numerous research topics in worker modeling, such as worker evaluation and teaching, are pertinent to both Boolean and open-ended crowdsourcing. However, the design decisions vary significantly due to differences in the dimensions of worker expertise and worker organization. As shown in Fig.6, we categorize the quality control methods for worker modeling into the following categories.

4.3.1 Performance enhancement

Compared to Boolean crowdsourcing tasks, obtaining high-quality answers directly from novice crowd workers for open-ended tasks is usually challenging due to the requirement of substantial cognitive effort and domain-specific knowledge. Consequently, numerous researchers have explored methods to improve the performance of crowd workers (especially novice crowds) to enhance the quality of answers for open-ended crowdsourcing tasks, including:

1) Teaching. Teaching serves as a method to enable novice workers to tackle complex crowdsourcing tasks. Rechkemmer et al. [29,115] proposed to set different goals (including performance goals, learning goals, and behavioral goals) for worker teaching to significantly improve the workers’ ability to handle complex crowdsourcing tasks. Furthermore, researchers have discovered that crowd workers can effectively learn from diverse types of feedback due to humans’ inherent perception and learning abilities, as evidenced by findings from the research area of machine teaching [26,144]. Therefore, feedback emerges as another crucial factor in enhancing worker performance when dealing with complex open-ended crowdsourcing tasks. A summary of related work indicates that feedback from multiple sources could have a teaching effect on novice crowd workers, such as multi-turn contextual argumentation [145], peer communication [72], gated instruction [146], and automatic conversational interface [56].

2) Knowledge supplementation. In addition to worker teaching, numerous researchers have proposed auxiliary methods to enhance the level of information exposure of workers. These methods include addressing workers’ task-related questions through online discussion [51,147] and question answering [148], displaying golden examples to workers [112], presenting multidimensional task information to workers through data visualization [149], and exposing workers to contextual and global information via distributed information synthesis and evaluation [49,150].

3) Motivating. For creative open-ended tasks like crowd writing and product design, it is crucial to develop mechanisms and tools that provide on-the-spot assistance when workers get stuck, in addition to monetary [128] incentives. For instance, Kaur et al. [58] proposed vocabulary-based planning to motivate workers in complex writing tasks; Wang et al. [93] proposed a reputation management-based approach to organize and motivate workers; and Huang et al. [47] proposed an in-situ ideation method for crowd writing tasks.

4.3.2 Worker collaboration

Collaboration plays a significant role in completing tasks with complex structures and heavy cognitive load. Numerous research works have explored the design of collaborative mechanisms to enable novice crowds to accomplish large-scale open-ended tasks. Collaboration methods can be categorized into worker-worker-based and worker-machine-based collaboration methods depending on different collaboration subjects.

1) Worker-worker collaboration. Worker collaboration helps to integrate diverse knowledge and skills from the Internet to accomplish open-ended crowdsourcing tasks. It also facilitates the communication of contextual information between interdependent sub-tasks. Consequently, researchers are exploring ways to enhance answer quality through worker collaboration. For instance, researchers have explored improving worker communication with the mechanism design of Advisor-to-Advisee [58] and context communication [107]; others have designed feedback mechanisms to achieve worker collaboration through Iteration Feedback [151], Feedback Prompting [73], Crowd Reviewing [103], and Self-Correction [152].

2) Worker-machine collaboration. Humans possess the ability to understand complex situations and exhibit creative thinking, while machines excel at efficient data processing and precise execution. Consequently, human-machine collaboration often significantly enhances the quality and efficiency of completing large-scale, complex tasks [153,154]. For example, computer algorithms methodically solve large-scale problems with well-defined iterative steps, while humans excel at solving problems requiring cognitive and reasoning skills [155]. In summary, three types of worker-machine collaboration approaches have been designed to enhance answer quality in open-ended crowdsourcing. First, worker explanation is used to improve the interpretability [156] of machine judgment. Additionally, researchers have explored aggregating the two types of answers directly [7,55,157,158] and iteratively [8,9,114] to improve the answer quality. Moreover, worker intelligence is injected into the intelligent system as expert knowledge through Rationale [159], Debug Information [160,161], and Predict Punishment [162].

5 Answer model

Modeling and optimizing answers are crucial for ensuring the quality of collected answers. The taxonomy of answer-centered quality control methods is shown in Fig.7. The key topic “answer aggregation” (also known as “truth inference”) is considered a core issue in crowdsourcing quality control and significantly impacts the quality of the final answer.

Unlike Boolean crowdsourcing, open-ended crowdsourcing presents several challenges for answer evaluation: 1) Answers may be partially correct (e.g., in sequence annotation tasks, some words in a sentence are annotated correctly). 2) There may be more than one answer that can be considered correct. 3) Some subjective evaluation tasks (such as preference collection) may have obvious biases and ambiguities in answers.

5.1 Quality dimensions

Considering that the evaluation results of answers can be used as evidence for deriving the truth in crowdsourcing, reasonable answer embedding and reliability estimation are the key dimensions in answer modeling.

5.1.1 Answer embedding

The core of answer embedding is to choose a feature space for the answer, which significantly impacts subsequent answer understanding and processing methods. For example, image annotation tasks generally ask crowd workers to detect objects with bounding boxes; primary content such as the vertex coordinates of the bounding boxes is generally adopted as the answer to identify the position, size, and interrelationships of the objects. Additionally, the pixel information within the bounding boxes is usually used as the answer embedding, in which case the semantic information in the bounding-box regions becomes the main content conveyed by the answer.

5.1.2 Answer reliability

The reliability of an answer refers to the likelihood of an answer being correct. Accurately estimating the answer quality plays an indispensable role in inferring ground truth and evaluating worker contributions. Considering the complex task structure and subjective characteristics that are commonly found in open-ended crowdsourcing, the corresponding answer evaluation methods may include both objective metrics and subjective methods (such as peer grading [163], group evaluation [143], and expert feedback [63]).

5.2 Quality evaluation

Given the complex structure and various evaluation criteria of answers in open-ended crowdsourcing, numerous researchers have attempted to explore a variety of evaluation methods to assess the quality of the answers as comprehensively as possible. ‘Answer Understanding’, ‘Feedback’, and ‘Automatic Metric’ are the representative research topics.

5.2.1 Answer understanding

An in-depth understanding of the semantic information in open-ended answers forms the basis for accurately evaluating answer quality. On one hand, many researchers have explored eliciting knowledge [61] and representing meaning with human rationales [62] from open-ended answers to fully capture the rich knowledge and semantic information they contain. On the other hand, some researchers have tried to enhance the quality of open-ended answers by eliminating the subjectivity [16], bias [75,164], and ambiguity [118,165] prevalent in open-ended answers to clarify their meaning.

5.2.2 Feedback

Feedback serves as another method to evaluate answer quality; it takes into account the cognitive differences between participants on a specific task and effectively avoids quality problems caused by subjectivity and bias. For instance, DEXA [143] provides expert feedback by presenting dynamic examples from experts to crowd workers; Xu et al. [105] aim to improve the quality of visual design by prompting crowd workers to generate structured feedback; Zhu et al. [103] explore hiring crowd workers to review answers and provide feedback.

5.2.3 Automatic metric

The use of automatic metrics is the most convenient way to assess answer quality in Boolean crowdsourcing. Several automatic metrics have also been proposed for evaluating the relationships between open-ended answers. For example, GLEU [70] and BLEU [166] are commonly used to evaluate the semantic similarity between sentences, and IoU [167] is frequently used to evaluate the accuracy of image bounding boxes. However, these automatic metrics are usually tied to specific application scenarios and answer types. Therefore, a series of automatic metrics have been proposed to evaluate various types of open-ended answers, like Soft Evaluation Metric [168], Similarity-based Metric [4,169], Hybrid Reliability [3], Multiple-Criteria Preference [59], Neural Network-based Score Function [170], and Distribution Estimation [171].
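For text answers, the short sketch below (assuming the NLTK library is installed; the sentences are invented) scores a crowd-provided translation against two acceptable references with sentence-level BLEU, a ‘soft’ metric that tolerates multiple acceptable answers:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Two acceptable reference translations and one crowd-provided candidate.
references = [
    "the meeting was postponed until next monday".split(),
    "the meeting has been delayed to next monday".split(),
]
candidate = "the meeting was delayed until next monday".split()

# Smoothing avoids zero scores when some higher-order n-grams are missing.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"sentence-level BLEU: {score:.3f}")
```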

5.3 Quality control method: answer aggregation

Answer aggregation is the most important topic in crowdsourcing quality control. As previously discussed, modeling the rich semantic information in open-ended answers is important for answer aggregation in open-ended crowdsourcing. Therefore, numerous researchers capture the agreement between answers through similarity-based evaluation metrics (as discussed in Section 5.2), following the idea of majority voting, and design agreement-based answer aggregation algorithms. The agreement may take different forms and require different modeling approaches for different tasks. For example, Li et al. [42,172] propose aggregating pairwise preferences with pairwise similarity; Braylan et al. [4] explore selecting the best answer as the one with the minimum error estimated from annotation distances; and Li et al. [2,3] evaluate answer reliability dynamically through iterative agreement estimation.
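A minimal sketch of this idea, assuming a pairwise similarity function is available (here a simple token-set Jaccard score for free-text answers): the aggregated answer is the candidate with the highest average similarity to its peers, a similarity-based analogue of majority voting rather than any specific published algorithm.

```python
import numpy as np

def jaccard(a, b):
    """Token-set Jaccard similarity between two free-text answers."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def aggregate_by_agreement(answers, similarity):
    """Similarity-based majority voting: return the candidate answer
    with the highest average similarity to the other answers."""
    n = len(answers)
    scores = [np.mean([similarity(answers[i], answers[j])
                       for j in range(n) if j != i])
              for i in range(n)]
    return answers[int(np.argmax(scores))]

answers = ["a red car parked by the road",
           "red car parked near the road",
           "two dogs playing in a park"]
print(aggregate_by_agreement(answers, jaccard))  # the outlier answer loses
```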

Additionally, Dumitrache et al. [168] have found that disagreement is prevalent in open-ended crowdsourcing and may arise from workers, tasks, or annotation schemes. Specifically, ambiguous or unclear [173] points in a task may lead to disagreement between answers. To address this challenge, Uma et al. propose a disagreement-based computational framework [53,168] to train high-quality models directly from answers with disagreement; Braylan et al. eliminate disagreement in answers through merging and matching [52]; Timmermans et al. mitigate the semantic gap between answers provided by crowd workers with context modeling [174]; and other researchers reduce the disagreement among answers by removing highly ambiguous samples through item filtering [175] and by label noise correction [169,176,177].

Moreover, in order to integrate fine-grained information about tasks and workers into the answer aggregation process [178], some researchers have tried to integrate specific knowledge by incorporating context information [42,179] to improve the quality of answers. For complex annotation tasks that require domain knowledge (e.g., medical image annotation), Zlabinger et al. [143] proposed incorporating expert knowledge to improve the quality of answers provided by novice crowds. In addition, some researchers have explored improving the quality of answer aggregation by integrating answer features, such as pixel proximity [180] and text features [164].

6 System

Quality dimensions, evaluation approaches, and quality control methods of different aspects have been separately introduced in previous sections to help readers gain a holistic view of the quality control issue for open-ended crowdsourcing. However, the issue of quality control in open-ended crowdsourcing is a system-level problem that is typically discussed within the context of multiple quality aspects. Multiple quality dimensions are generally considered jointly while modeling and optimizing the answer quality. For instance, estimating the reliability of answers usually requires a coordinated consideration of quality dimensions relevant to diverse aspects, such as task difficulty and worker expertise.

In this section, we will discuss how existing works improve the quality of open-ended crowdsourcing by exploring the interaction of aspects (especially the joint consideration of different quality dimensions) and designing joint optimization methods, in conjunction with the three core tasks in the crowdsourcing execution process: ‘task allocation’, ‘answer aggregation’, and ‘workflow design’.

6.1 Task assignment: joint consideration of task design and worker expertise

Task assignment [101,181] is a critical topic of quality control in open-ended crowdsourcing and is generally discussed in the context of 'task modeling' and 'worker expertise'. According to Hettiachchi et al. [181], the problem of task assignment is generally defined as follows:

Definition 2 Task assignment

Given a set of questions Q = {q1, q2, ..., qk} for a specific task t and a set of workers W = {w1, w2, ..., wm}, where |Q| = k and |W| = m, the process of task assignment aims to assign each question to the most suitable workers, which may significantly affect the answer quality and completion efficiency.

For Boolean crowdsourcing, parameters such as task categories, task difficulty, worker expertise, worker population, and worker practices are generally inferred to match tasks and workers. Given the complex cooperative relationships between crowd workers and the dependencies between subtasks in open-ended crowdsourcing, more fine-grained worker evaluation and task modeling are required for task assignment. For example, Cui et al. [137] proposed a PNRN-based multi-round allocation strategy that can dynamically select the allocation strategy and predict the trend of a worker's reputation. Additionally, He et al. [182] modeled the complex collaborative crowdsourcing task assignment problem as a combinatorial optimization problem based on maximum flow and computed the optimal task assignment with a Slide-Container Queue (SCP). Beyond that, similar to Boolean crowdsourcing, historical worker data [183], answer distributions [184,185], gold question tests [65], priority queue scheduling [186], and behavioral data [27] are also generally utilized in open-ended crowdsourcing for better worker estimation and task assignment.
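To illustrate the combinatorial-optimization view of assignment (not the PNRN or maximum-flow formulations of the cited works), the sketch below matches workers to questions with the Hungarian algorithm, using a hypothetical worker-question suitability matrix that would, in practice, be estimated from worker expertise and task modeling.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical suitability scores: rows = workers, columns = questions,
# e.g., estimated from historical accuracy on similar task categories.
suitability = np.array([[0.9, 0.4, 0.3],
                        [0.5, 0.8, 0.6],
                        [0.2, 0.7, 0.9]])

# The Hungarian algorithm minimizes total cost, so negate the scores.
workers, questions = linear_sum_assignment(-suitability)
for w, q in zip(workers, questions):
    print(f"worker {w} -> question {q} (suitability {suitability[w, q]:.1f})")
```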

6.2 Answer aggregation: Joint consideration of task difficulty, worker expertise, and answer quality

As discussed in Section 5, answer aggregation stands as a pivotal topic within crowdsourcing quality control. It typically involves the joint optimization of 'task difficulty', 'worker expertise', and 'answer quality'. The challenge of answer aggregation is generally defined as inferring the true answer v_i for each task t_i ∈ T given the workers' answers V. In general, the task difficulty is quantified by a real number [187], which has been considered highly relevant to the answer quality. For example, Whitehill et al. [188] suggest that the relationship between task difficulty, worker expertise, and answer quality can be depicted by the following equation:

$$\Pr(v_i^w = v_i \mid d_i, c_w) = \frac{1}{1 + e^{-c_w / d_i}},$$

where c_w ∈ (0, +∞) denotes the ability of worker w and d_i ∈ (0, +∞) denotes the difficulty of task t_i; a higher d_i represents a more difficult task. It can be inferred that, for a fixed worker ability c_w, an easier task (with a lower d_i) leads to a higher probability that the worker answers the task correctly. Furthermore, numerous researchers explore modeling the task information with K-dimensional vectors [185,189–191], which is helpful in deriving the level of task difficulty.
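For a concrete reading of this equation, the short sketch below evaluates the probability for a fixed worker ability under increasing task difficulty; it simply implements the formula above and is not the full inference procedure of [188].

```python
import math

def p_correct(c_w, d_i):
    """Probability that worker w answers task t_i correctly:
    Pr = 1 / (1 + exp(-c_w / d_i))."""
    return 1.0 / (1.0 + math.exp(-c_w / d_i))

c_w = 1.0
for d_i in (0.5, 1.0, 2.0, 4.0):
    print(f"difficulty {d_i}: P(correct) = {p_correct(c_w, d_i):.3f}")
# The probability decreases toward 0.5 as the task gets harder.
```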

As discussed in Section 5.3, most existing methods of answer aggregation (such as PGM and optimization-based methods) cannot be directly applied to open-ended crowdsourcing tasks due to the difficulty in measuring dynamic worker expertise and answer reliability. Researchers have investigated incorporating fine-grained information into the process of truth inference, aiming to accurately model the relationships among task difficulty, worker capability, and the quality of answers. A detailed presentation is provided in Section 5.3, and will not be reiterated in this section.

6.3 Workflow design: Joint consideration of task design, worker collaboration, and human mental model

Workflow, considered the dominant infrastructure in crowdsourcing today, is dedicated to decomposing complex macro-tasks into small independent tasks and coordinating worker resources. As discussed in Sections 3 and 4, numerous researchers propose improving the level of worker collaboration through workflow designs based on interactive argumentation [145] or action planning [58]; other researchers explore improving the efficiency of task organization with strategies such as multi-round allocation [137] and maximum-flow optimization [182]. In general, the workflow design methods in existing works are usually closely tied to the task type, and few works discuss the underlying theory. We explore workflow design in open-ended crowdsourcing, guided by the representative cognitive psychology theories of 'collaborative sensemaking' and 'cognitive process theory', as depicted in Fig.8.

1) Cognitive process theory of writing. The theory of 'cognitive process' is usually considered the fundamental theory describing cognitive changes in human writing, and it can inspire the design of open-ended text tasks, including machine translation, summarization, Q&A, knowledge production, novel creation, event reporting, etc. However, quality control methods used in Boolean crowdsourcing (such as MV and PGM) cannot be applied to crowd-writing tasks directly due to the challenges discussed in Section 1. Writing is thus a representative class of open-ended crowdsourcing tasks, and studying the workflow design theory of crowd-writing can greatly help in exploring research directions for quality control methods in open-ended crowdsourcing.

The cognitive process model [95,192] outlines the general process followed by crowd writers, from idea generation to final text completion, as a series of components, including 'Topic-related Design', 'Worker Identification', 'Long-Term Memory', 'Goal Setting', 'Growing Answer Management', and 'Reviewing and Monitoring'. Each component of the cognitive process model represents a step of the crowd-writing task, and the actions and decisions in each step directly affect the efficiency and quality of the writing output. Furthermore, Kraut et al. [193] found that the cognitive process model aligns with the process of collaborative writing, which forms the core of workflow design in crowd-writing.

2) Collaborative sensemaking. As discussed in Section 4.3.2, collaboration is a crucial method to integrate the diverse knowledge and skills of crowd workers, which is of great help in improving answer quality in open-ended crowdsourcing. However, there is limited discussion on the theoretical foundations for collaborative workflow design, which are vital for persuasiveness and interpretability.

Sensemaking is defined as the process of searching for a representation and encoding data in that representation to answer task-specific questions [194]. Pirolli et al. [195] introduce the sensemaking loop as a “broad brush description” of experts’ cognitive process of information transformation. Ericsson et al. [196] argued that the key to the expert performance of knowledge-driven open-ended tasks (such as intelligence analysis and crowd writing) is to develop domain-specific schemas from long-term memory and patterns around the key aspects of their tasks. Consequently, numerous researchers [14] have explored the use of sensemaking as the basic theory for designing workflows that enable distributed novice crowds to solve large-scale cognitive tasks. This topic of workflow design is also referred to as ‘Collaborative Sensemaking’.

For instance, Bradel et al. [197] proposed a co-located collaborative visual analytics tool for use on large, high-resolution vertical displays. Drieger et al. [198] proposed a semantic network-based method to support knowledge building, analytical reasoning and explorative analysis. These proposals have significantly improved crowdsourcing workflow design, with solutions such as biclusters-based semantic edge bundling [199] and knowledge-transfer graphs [200].

In addition, there exist other foundational theories that can be used to guide crowdsourcing workflow design, such as gamification theory [201–204] and incentive theory [205,206] (including monetary incentives [128], worker motivation [58], etc.).

7 Summary and discussion

In this survey, we begin by introducing representative open-ended crowdsourcing tasks in the context of AI applications. We then summarize and compare the dimensions of quality control research on Boolean crowdsourcing and open-ended crowdsourcing tasks, considering four important aspects: task, worker, answer, and workflow. We find that, compared to Boolean crowdsourcing, quality control research on open-ended crowdsourcing necessitates a more detailed discussion. The dimensions of quality control research on open-ended crowdsourcing significantly exceed those of Boolean crowdsourcing tasks (as shown in Fig.2, boxes in blue do not exist in Boolean crowdsourcing, while boxes in gray represent the distinct differences between Boolean crowdsourcing and complex open-ended crowdsourcing). Therefore, conducting a review of quality control methods for open-ended crowdsourcing is both necessary and challenging.

We propose a two-tiered framework that encompasses all research on open-ended crowdsourcing quality control. The first tier provides a holistic view of quality control methods, based on the execution process and components in crowdsourcing. This enables readers to understand the key aspects and main characteristics of open-ended crowdsourcing. The second tier analyzes the internal structures of quality control modeling in each aspect from the dimensions of quality models, quality assessment, and quality optimization. This helps readers identify the attributes, evaluation metrics, and design decisions when developing a quality control method, as illustrated in Fig.4.

7.1 Challenges and outlook

According to our survey, existing research on open-ended crowdsourcing quality control faces several challenges. First, most quality control methods are applicable to only a single type or a few types of tasks, and few studies have been conducted on generic quality control theories and methods. For instance, Lee et al. [19] proposed a quality control method for crowd image segmentation. However, this method cannot be applied to other open-ended crowdsourcing tasks, such as Q&A annotation, due to differences in task structure and prediction algorithms. To the best of our knowledge, only a few works discuss cross-task approaches to answer aggregation [4,52] and answer evaluation [207] for open-ended crowdsourcing.

Another important challenge is that human intelligence has not been fully utilized in all steps in the AI cycle, with most research focusing on improving the quality of labeled data. However, according to Yang et al. [208], human intelligence shows great potential to address the robustness and interpretability issues of AI systems, which may significantly improve the quality of AI applications.

Given the research trends and challenges demonstrated in related literature, we believe that future work related to the following topics will have a long-term impact on the research of open-ended crowdsourcing:

1) Domain-specific design. Current crowdsourcing platforms typically assume that requesters have sufficient domain knowledge to design reasonable crowdsourcing tasks, corresponding annotation tools, and quality control methods. This assumption can yield acceptable results for some Boolean crowdsourcing tasks. However, it often fails to produce satisfactory answers for complex open-ended tasks in specific domains, such as medical image recognition and complex 3D point cloud annotation, due to the lack of background knowledge or effective annotation tools. Therefore, domain-specific design for typical tasks in specific domains is a key approach to completing open-ended crowdsourcing tasks and obtaining high-quality answers. For instance, Appen launched an AI Data Platform (ADAP) that supports a variety of annotation tools for 2D images and 3D point clouds, to cater to the high-quality annotation data requirements of intelligent cockpits and autonomous driving. Additionally, Wang et al. [54] focused on designing specific quality control algorithms for image segmentation to improve data quality for autonomous driving systems.

2) General quality control framework. Considering that open-ended crowdsourcing generally involves tasks with different data types such as images, videos, and text, the quality control methods may vary significantly due to differences in task structure and answer types. Therefore, most proposed quality control methods are only applicable to a specific task or a few similar tasks. To the best of our knowledge, only a few works [4,52,207] have begun to explore quality control methods applicable to open-ended tasks across various data types. Considering the myriad of open-ended tasks in the real world, general quality control methods are both challenging and critically important.

3) Intelligent workflow design. Most current research attempts to optimize the completion quality of open-ended crowdsourcing at individual steps. In contrast, as the infrastructure of crowdsourcing, the crowdsourcing workflow can consider the quality issue from a macro perspective, coordinating and optimizing each stage of the task (such as demand analysis, answer collection, task assignment, and answer aggregation) to optimize the overall quality and efficiency. On the one hand, the cognitive process model [192] and collaborative sensemaking theory [195] can serve as theoretical foundations to explore what kind of workflow is more in line with the human mental model. On the other hand, artificial intelligence technology will be involved in every stage of the crowdsourcing task, bringing deep optimization to the output of open-ended crowdsourcing. For example, in the stage of demand analysis, LLM-based generative artificial intelligence applications (such as GPT-4 [209]) can transform the requester's colloquial requirements into specific annotation specifications that can be directly implemented in the process.

7.2 Quality control strategy of representative crowdsourcing platform for open-ended tasks

Quality control is paramount for crowdsourcing platforms. It ensures high-quality deliverables, fostering long-term client relationships; optimizes task allocation to minimize rework costs; and incentivizes crowd workers to enhance response quality and community service levels, thereby elevating the platform's reputation and competitive edge. Quality control strategies not only directly guarantee the quality of completed tasks but also support efficient platform operations and long-term growth from multiple perspectives. A range of general quality control strategies, including rating, voting, output agreement checks, performance-based bonuses, and task decomposition, are widely employed by representative crowdsourcing platforms [100]. Additionally, some platforms implement sophisticated quality control methods tailored for complex tasks, significantly enhancing their capability to handle open-ended tasks.

For example, MTurk integrates advanced machine learning technologies with a trained team of investigators to detect fraudulent and abusive activities on its platform. Consequently, 99% of flagged suspicious activities are reviewed by investigators within 18 hours, while urgent feedback for complex tasks is reviewed within one hour [108]. Appen enhances the performance of crowd workers on open-ended tasks through targeted worker training and performance metric monitoring. As a result, data annotation speed has increased by an average of 30%, while error rates have decreased by more than 40% [210]. Beyond that, numerous crowdsourcing platforms such as Appen, MobileWorks, and Crowdcrafting have adopted large-model-based automated quality control strategies. These approaches significantly enhance the quality of final answers while substantially reducing time and cost expenditures [211].

7.3 Open-ended crowdsourcing in the Era of LLMs

Large Language Models (LLMs) have taken the world by storm. LLMs are capable of achieving remarkable results in a range of downstream tasks without gathering extensive task-specific labeled data or tuning model parameters, relying merely on instructions and a few examples. Given the significant reliance of quality control methods on artificial intelligence algorithms in open-ended crowdsourcing, exploring the relationship between LLMs and open-ended crowdsourcing is of great help in clarifying the key issues and research directions of quality control in open-ended crowdsourcing. This topic will be discussed from the following three aspects.

1) Optimizing LLMs through crowd annotations. Researchers at OpenAI have incorporated human feedback into the training and optimization of large language models (such as InstructGPT [212]) to mitigate the dissemination of harmful or false outputs. This approach has significantly improved the reliability and safety of LLMs. In this process, crowd workers serve as evaluators, ranking model responses under various conditions to create a comparison dataset. This dataset is subsequently used to train a reward model that predicts which responses would be rated higher. The original LLM is then iteratively fine-tuned using reinforcement learning algorithms guided by the reward model. This method continues to be integral to ongoing research and development in LLMs (e.g., GPT-4 [209], Claude 3 [213], PaLM [214], Llama 3 [215], and DeepSeek [216]). Although optimization techniques represented by RLHF [217] have successfully introduced crowd intelligence into computing systems, this paradigm still has obvious limitations. For example, the quality of annotations is bounded by the expertise of the annotators, so the optimized model inherits the corresponding quality problems. Crowdsourcing, as an important paradigm for introducing human feedback into computing systems, is very likely to provide important insights for future quality optimization methods and development processes of LLMs.
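To make the reward-modeling step concrete, the sketch below shows the standard pairwise (Bradley-Terry style) training objective on crowd comparisons. The reward_model callable and the toy example are our own illustrative assumptions, not OpenAI's implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    """Pairwise loss on crowd comparisons: push the reward of the response
    that workers preferred above the reward of the rejected one.
    reward_model(prompt, response) is assumed to return a scalar tensor."""
    losses = []
    for p, good, bad in zip(prompts, chosen, rejected):
        r_good = reward_model(p, good)
        r_bad = reward_model(p, bad)
        # -log sigmoid(r_good - r_bad): Bradley-Terry negative log-likelihood
        losses.append(-F.logsigmoid(r_good - r_bad))
    return torch.stack(losses).mean()

# Toy stand-in reward model: score = word overlap with the prompt.
def toy_reward(prompt, response):
    overlap = len(set(prompt.split()) & set(response.split()))
    return torch.tensor(float(overlap), requires_grad=True)

loss = reward_model_loss(toy_reward,
                         ["how do plants make food"],
                         ["plants make food via photosynthesis"],
                         ["i do not know"])
print(loss.item())  # ≈ 0.127 for this toy comparison
```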

2) Inspiring the prompt strategies with design decisions in crowdsourcing. Wu et al. [218] discovered a high degree of similarity between LLM chains and crowdsourcing workflows. They share common motivations and face similar challenges, such as dealing with cascading errors that affect later stages [219] or synthesizing inconsistent contributions from crowd workers [220]. A series of recent studies have confirmed this phenomenon: some research has adopted strategies inspired by the crowdsourcing workflow for iteratively improving LLMs, such as self-consistency [221], self-inquiry [222], and self-reflection [223]. Furthermore, in balancing the quality and cost of data processing using LLMs, Parameswaran et al. [224] viewed LLMs as crowdsourcing workers and leveraged ideas from the literature on declarative crowdsourcing. This includes multiple prompting strategies, ensuring internal consistency, and exploring a mix of LLM and non-LLM methods, to make prompt engineering a more principled process and optimize the workflow of LLM-powered data processing.
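A minimal sketch of the self-consistency idea viewed through a crowdsourcing lens: sample several answers from the model as if they came from independent workers, then aggregate them by majority vote. The generate callable is a hypothetical stand-in for any LLM sampling interface.

```python
import random
from collections import Counter

def self_consistency(generate, prompt, n_samples=5):
    """Sample multiple answers for the same prompt and return the most
    frequent one, mirroring majority voting over crowd answers."""
    answers = [generate(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

def toy_generate(prompt):
    # Stand-in for a sampled (stochastic) LLM answer.
    return random.choice(["42", "42", "42", "41", "43"])

print(self_consistency(toy_generate, "What is 6 x 7?"))  # usually "42"
```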

3) Facilitating crowd workers with LLMs. Due to the powerful semantic understanding and language generation capabilities of LLMs, there is a noticeable trend of using LLMs to facilitate crowd workers in completing open-ended tasks. Specifically, Veselovsky et al. [225] found that 33%–46% of crowd workers used LLMs when completing crowdsourcing tasks. Additionally, Wu et al. [226,227] used LLMs as simulated human workers to construct crowdsourcing pipelines for addressing complex tasks, and further discussed the potential of training humans and LLMs with complementary skill sets. Beyond that, He et al. [228] proposed a two-step approach, 'explain-then-annotate', to enable LLMs to generate high-quality labels akin to those of outstanding crowdsourcing annotators by providing them with sufficient guidance and demonstrated examples. He et al. [229] indicate that GPT-4 outperforms online crowd workers in data labeling accuracy, and that labeling accuracy can increase further when the crowd's and GPT-4's labeling strengths are combined.

The aforementioned research indicates that there is a deep interrelationship between crowdsourcing and LLMs, which brings new demands and inspiration to the study of crowdsourcing quality control: 1) It is necessary to explore new quality control mechanisms to identify which answers are generated by LLMs, so that researchers can exclude them or eliminate the contained biases before conducting analysis. 2) The powerful language understanding and image recognition capabilities of LLMs can significantly assist human participants in completing complex annotation tasks. Consequently, designing novel collaborative strategies that integrate both human participants and LLMs may offer new possibilities for addressing large and complex tasks that have previously been challenging for crowdsourcing. This approach has the potential to substantially expand the boundaries of what is achievable through crowdsourcing. 3) Human feedback remains a key step in ensuring the usability of LLMs. Appropriate quality control methods can help the model understand human preferences more accurately, ultimately making the output of LLMs better aligned with human needs.

References

[1]

Guo B, Wang Z, Yu Z, Wang Y, Yen N Y, Huang R, Zhou X . Mobile crowd sensing and computing: the review of an emerging human-powered sensing paradigm. ACM Computing Surveys (CSUR), 2015, 48( 1): 7

[2]

Li J, Fukumoto F. A dataset of crowdsourced word sequences: collections and answer aggregation for ground truth creation. In: Proceedings of the 1st Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP. 2019, 24−28

[3]

Li J. Crowdsourced text sequence aggregation based on hybrid reliability and representation. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020, 1761−1764

[4]

Braylan A, Lease M. Modeling and aggregation of complex annotations via annotation distances. In: Proceedings of Web Conference 2020. 2020, 1807−1818

[5]

Cheng H D, Jiang X H, Sun Y, Wang J . Color image segmentation: advances and prospects. Pattern Recognition, 2001, 34( 12): 2259–2281

[6]

Chai L, Qi L, Sun H, Li J. Ra3: a human-in-the-loop framework for interpreting and improving image captioning with relation-aware attribution analysis. In: Proceedings of the 40th IEEE International Conference on Data Engineering. 2024, 330−341

[7]

Song J Y, Chung J J Y, Fouhey D F, Lasecki W S . C-reference: improving 2D to 3D object pose estimation accuracy via crowdsourced joint object estimation. Proceedings of the ACM on Human-Computer Interaction, 2020, 4( CSCW1): 51

[8]

Maninis K K, Caelles S, Pont-Tuset J, Van Gool L. Deep extreme cut: from extreme points to object segmentation. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 616−625

[9]

Sofiiuk K, Petrov I, Barinova O, Konushin A. F-BRS: rethinking backpropagating refinement for interactive segmentation. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 8620−8629

[10]

Jang W D, Kim C S. Interactive image segmentation via backpropagating refinement scheme. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 5292−5301

[11]

Kang X, Yu G, Li Q, Wang J, Li H, Domeniconi C. Semi-asynchronous online federated crowdsourcing. In: Proceedings of the 40th IEEE International Conference on Data Engineering. 2024, 4180−4193

[12]

Guan H, Song C, Zhang Z . GRAMO: geometric resampling augmentation for monocular 3D object detection. Frontiers of Computer Science, 2024, 18( 5): 185706

[13]

Kim J, Sterman S, Cohen A A B, Bernstein M S. Mechanical novel: crowdsourcing complex work through reflection and revision. In: Proceedings of 2017 ACM Conference on Computer Supported Cooperative Work and social Computing. 2017, 233−245

[14]

Li T, Luther K, North C . CrowdiA: solving mysteries with crowdsourced sensemaking. Proceedings of the ACM on Human-Computer Interaction, 2018, 2( CSCW): 105

[15]

Chilton L B, Little G, Edge D, Weld D S, Landay J A. Cascade: crowdsourcing taxonomy creation. In: Proceedings of SIGCHI Conference on Human Factors in Computing Systems. 2013, 1999−2008

[16]

de Alfaro L, Polychronopoulos V, Shavlovsky M. Reliable aggregation of Boolean crowdsourced tasks. In: Proceedings of the 3rd AAAI Conference on Human Computation and Crowdsourcing. 2015, 42−51

[17]

Chen C, Zhang X, Ju S, Fu C, Tang C, Zhou J, Li X. AntProphet: an intention mining system behind Alipay’s intelligent customer service bot. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence. 2019, 6497−6499

[18]

Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L J, Shamma D A, Bernstein M S, Fei-Fei L . Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017, 123( 1): 32–73

[19]

Lee D J L, Das Sarma A, Parameswaran A G. Aggregating crowdsourced image segmentations. In: Proceedings of HCOMP 2018 Works in Progress and Demonstration Papers Track of the 6th AAAI Conference on Human Computation and Crowdsourcing. 2018

[20]

Rzadca K, Findeisen P, Swiderski J, Zych P, Broniek P, Kusmierek J, Nowak P, Strack B, Witusowski P, Hand S, Wilkes J. Autopilot: workload autoscaling at google. In: Proceedings of the 15th European Conference on Computer Systems. 2020, 16

[21]

Han J, Brown C, Chauhan J, Grammenos A, Hasthanasombat A, Spathis D, Xia T, Cicuta P, Mascolo C. Exploring automatic COVID-19 diagnosis via voice and symptoms from crowdsourced data. In: Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. 2021, 8328−8332

[22]

Fortson L, Masters K, Nichol R, Borne K D, Edmondson E M, Lintott C, Raddick J, Schawinski K, Wallin J. Galaxy zoo. In: Way M J, Scargle J D, Ali K M, Srivastava A N, eds. Advances in Machine Learning and Data Mining for Astronomy. New York: CRC, 2012, 213−236

[23]

Jumper J, Evans R, Pritzel A, Green T, Figurnov M, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 2021, 596( 7873): 583–589

[24]

Khare R, Good B M, Leaman R, Su A I, Lu Z . Crowdsourcing in biomedicine: challenges and opportunities. Briefings in Bioinformatics, 2016, 17( 1): 23–32

[25]

Glott R, Schmidt P, Ghosh R. Wikipedia survey–overview of results. United Nations University, Collaborative Creativity Group, 2010, 1158−1178

[26]

Wang Z, Sun H. Teaching active human learners. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021, 5850−5857

[27]

Gadiraju U, Demartini G, Kawase R, Dietze S . Crowd anatomy beyond the good and bad: behavioral traces for crowd worker modeling and pre-selection. Computer Supported Cooperative Work (CSCW), 2019, 28( 5): 815–841

[28]

Chen P, Sun H, Yang Y, Chen Z. Adversarial learning from crowds. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence. 2022, 5304−5312

[29]

Rechkemmer A, Yin M. Motivating novice crowd workers through goal setting: an investigation into the effects on complex crowdsourcing task training. In: Proceedings of the 8th AAAI Conference on Human Computation and Crowdsourcing. 2020, 122−131

[30]

Zhou Z, Jin Y X, Li Y F . Rts: learning robustly from time series data with noisy label. Frontiers of Computer Science, 2024, 18( 6): 186332

[31]

Dedeoglu E, Kesgin H T, Amasyali M F . A robust optimization method for label noisy datasets based on adaptive threshold: adaptive-k. Frontiers of Computer Science, 2024, 18( 4): 184315

[32]

Mithe R, Indalkar S, Divekar N . Optical character recognition. International Journal of Recent Technology and Engineering (IJRTE), 2013, 2( 1): 72–75

[33]

Nanni L, Maguolo G, Paci M . Data augmentation approaches for improving animal audio classification. Ecological Informatics, 2020, 57: 101084

[34]

Wexler M N . The who, what and why of knowledge mapping. Journal of Knowledge Management, 2001, 5( 3): 249–264

[35]

Shin J, Wu S, Wang F, De Sa C, Zhang C, Ré C . Incremental knowledge base construction using DeepDive. Proceedings of the VLDB Endowment, 2015, 8( 11): 1310–1321

[36]

Zhao W, Chellappa R, Phillips P J, Rosenfeld A . Face recognition: a literature survey. ACM Computing Surveys (CSUR), 2003, 35( 4): 399–458

[37]

Gray D, Tao H. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In: Proceedings of the 10th European Conference on Computer Vision. 2008, 262−275

[38]

Levinson J, Askeland J, Becker J, Dolson J, Held D, Kammel S, Kolter J Z, Langer D, Pink O, Pratt V, Sokolsky M, Stanek G, Stavens D, Teichman A, Werling M, Thrun S. Towards fully autonomous driving: systems and algorithms. In: Proceedings of 2011 IEEE Intelligent Vehicles Symposium (IV). 2011, 163−168

[39]

Yaqoob I, Hashem I A T, Mehmood Y, Gani A, Mokhtar S, Guizani S . Enabling communication technologies for smart cities. IEEE Communications Magazine, 2017, 55( 1): 112–120

[40]

Lee L H, Braud T, Zhou P Y, Wang L, Xu D, Lin Z, Kumar A, Bermejo C, Hui P . All one needs to know about metaverse: a complete survey on technological singularity, virtual ecosystem, and research agenda. Foundations and Trends® in Human-Computer Interaction, 2024, 18( 2−3): 100–337

[41]

Chai L, Sun H, Wang Z . An error consistency based approach to answer aggregation in open-ended crowdsourcing. Information Sciences, 2022, 608: 1029–1044

[42]

Li J. Context-based collective preference aggregation for prioritizing crowd opinions in social decision-making. In: Proceedings of ACM Web Conference 2022. 2022, 2657−2667

[43]

Arous I, Yang J, Khayati M, Cudré-Mauroux P. Opencrowd: a human-AI collaborative approach for finding social influencers via open-ended answers aggregation. In: Proceedings of Web Conference 2020. 2020, 1851−1862

[44]

Han T, Sun H, Song Y, Fang Y, Liu X . Find truth in the hands of the few: acquiring specific knowledge with crowdsourcing. Frontiers of Computer Science, 2021, 15( 4): 154315

[45]

Schmidt G B, Jettinghoff W M . Using amazon mechanical Turk and other compensated crowdsourcing sites. Business Horizons, 2016, 59( 4): 391–400

[46]

Kim J, Monroy-Hernandez A. Storia: summarizing social media content based on narrative theory using crowdsourcing. In: Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing. 2016, 1018−1027

[47]

Huang C Y, Huang S H, Huang T H K. Heteroglossia: in-situ story ideation with the crowd. In: Proceedings of 2020 CHI Conference on Human Factors in Computing Systems. 2020, 1−12

[48]

Deng Z, Xiang Y . Multistep planning for crowdsourcing complex consensus tasks. Knowledge-Based Systems, 2021, 231: 107447

[49]

Verroios V, Bernstein M. Context trees: crowdsourcing global understanding from local views. In: Proceedings of the 2nd AAAI Conference on Human Computation and Crowdsourcing. 2014, 210−219

[50]

Wang N C, Hicks D, Luther K . Exploring trade-offs between learning and productivity in crowdsourced history. Proceedings of the ACM on Human-Computer Interaction, 2018, 2( CSCW): 178

[51]

Zhang A X, Verou L, Karger D. Wikum: bridging discussion forums and wikis using recursive summarization. In: Proceedings of 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. 2017, 2082−2096

[52]

Braylan A, Lease M. Aggregating complex annotations via merging and matching. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021, 86−94

[53]

Inel O, Khamkham K, Cristea T, Dumitrache A, Rutjes A, van der Ploeg J, Romaszko L, Aroyo L, Sips R J. CrowdTruth: machine-human computation framework for harnessing disagreement in gathering annotated data. In: Proceedings of the 13th International Semantic Web Conference. 2014, 486−504

[54]

Wang B, Wu V, Wu B, Keutzer K. LATTE: accelerating LiDAR point cloud annotation via sensor fusion, one-click annotation, and tracking. In: Proceedings of 2019 IEEE Intelligent Transportation Systems Conference. 2019, 265−272

[55]

Branson S, Van Horn G, Perona P. Lean crowdsourcing: combining humans and machines in an online system. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6109−6118

[56]

Abbas T, Khan V J, Gadiraju U, Markopoulos P. Trainbot: a conversational interface to train crowd workers for delivering on-demand therapy. In: Proceedings of the 8th AAAI Conference on Human Computation and Crowdsourcing. 2020, 3−12

[57]

Nguyen Q V H, Duong C T, Nguyen T T, Weidlich M, Aberer K, Yin H, Zhou X . Argument discovery via crowdsourcing. The VLDB Journal, 2017, 26( 4): 511–535

[58]

Kaur H, Williams A C, Thompson A L, Lasecki W S, Iqbal S T, Teevan J . Creating better action plans for writing tasks via vocabulary-based planning. Proceedings of the ACM on Human-Computer Interaction, 2018, 2( CSCW): 86

[59]

Baba Y, Li J, Kashima H. CrowDEA: multi-view idea prioritization with crowds. In: Proceedings of the 8th AAAI Conference on Human Computation and Crowdsourcing. 2020, 23−32

[60]

Hung N Q V, Viet H H, Tam N T, Weidlich M, Yin H, Zhou X . Computing crowd consensus with partial agreement. IEEE Transactions on Knowledge and Data Engineering, 2018, 30( 1): 1–14

[61]

Goncalves J, Hosio S, Kostakos V . Eliciting structured knowledge from situated crowd markets. ACM Transactions on Internet Technology (TOIT), 2017, 17( 2): 14

[62]

Michael J, Stanovsky G, He L, Dagan I, Zettlemoyer L. Crowdsourcing question-answer meaning representations. In: Proceedings of 2018 Conference of the North American Chapter of the Association for Computational Linguistics. 2018, 560–568

[63]

Schmitz H, Lykourentzou I . Online sequencing of non-decomposable macrotasks in expert crowdsourcing. ACM Transactions on Social Computing, 2018, 1( 1): 1

[64]

Tran-Thanh L, Huynh T D, Rosenfeld A, Ramchurn S D, Jennings N R. Crowdsourcing complex workflows under budget constraints. In: Proceedings of the 29th AAAI Conference on Artificial Intelligence. 2015, 1298−1304

[65]

Pavlichenko N, Stelmakh I, Ustalov D. CrowdSpeech and VoxDIY: benchmark datasets for crowdsourced audio transcription. 2021, arXiv preprint arXiv: 2107.01091

[66]

Lipping S, Drossos K, Virtanen T. Crowdsourcing a dataset of audio captions. In: Proceedings of Workshop on Detection and Classification of Acoustic Scenes and Events 2019. 2019, 139−143

[67]

Kaspar A, Patterson G, Kim C, Aksoy Y, Matusik W, Elgharib M. Crowd-guided ensembles: how can we choreograph crowd workers for video segmentation? In: Proceedings of 2018 CHI Conference on Human Factors in Computing Systems. 2018, 111

[68]

Deng D, Wu J, Wang J, Wu Y, Xie X, Zhou Z, Zhang H, Zhang X, Wu Y. Eventanchor: reducing human interactions in event annotation of racket sports videos. In: Proceedings of 2021 CHI Conference on Human Factors in Computing Systems. 2021, 73

[69]

Song J Y, Lemmer S J, Liu M X, Yan S, Kim J, Corso J J, Lasecki W S. Popup: reconstructing 3D video using particle filtering to aggregate crowd responses. In: Proceedings of the 24th International Conference on Intelligent User Interfaces. 2019, 558−569

[70]

Mutton A, Dras M, Wan S, Dale R. GLEU: automatic evaluation of sentence-level fluency. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. 2007, 344−351

[71]

Niwattanakul S, Singthongchai J, Naenudorn E, Wanapu S. Using of Jaccard coefficient for keywords similarity. In: Proceedings of International MultiConference of Engineers and Computer Scientists 2013. 2013, 380−384

[72]

Tang W, Yin M, Ho C J. Leveraging peer communication to enhance crowdsourcing. In: Proceedings of World Wide Web Conference. 2019, 1794–1805

[73]

Huang Y C, Huang J C, Wang H C, Hsu J. Supporting ESL writing by prompting crowdsourced structural feedback. In: Proceedings of the 5th AAAI Conference on Human Computation and Crowdsourcing. 2017, 71–78

[74]

Heim E, Seitel A, Andrulis J, Isensee F, Stock C, Ross T, Maier-Hein L . Clickstream analysis for crowd-based object segmentation with confidence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40( 12): 2814–2826

[75]

Barbosa N M, Chen M. Rehumanized crowdsourcing: a labeling framework addressing bias and ethics in machine learning. In: Proceedings of 2019 CHI Conference on Human Factors in Computing Systems. 2019, 543

[76]

Piech C, Bassen J, Huang J, Ganguli S, Sahami M, Guibas L, Sohl-Dickstein J. Deep knowledge tracing. In: Proceedings of the 29th International Conference on Neural Information Processing Systems. 2015, 505−513

[77]

Marcus A, Parameswaran A . Crowdsourced data management: industry and academic perspectives. Foundations and Trends® in Databases, 2015, 6( 1−2): 1–161

[78]

Difallah D E, Catasta M, Demartini G, Ipeirotis P G, Cudre-Mauroux P. The dynamics of micro-task crowdsourcing: the case of amazon MTurk. In: Proceedings of the 24th International Conference on World Wide Web. 2015, 238−247

[79]

Dawid A P, Skene A M . Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 1979, 28( 1): 20–28

[80]

Demartini G, Difallah D E, Cudré-Mauroux P. ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In: Proceedings of the 21st International Conference on World Wide Web. 2012, 469−478

[81]

Kim H C, Ghahramani Z. Bayesian classifier combination. In: Proceedings of the 15th International Conference on Artificial Intelligence and Statistics. 2012, 619−627

[82]

Raykar V C, Yu S, Zhao L H, Valadez G H, Florin C, Bogoni L, Moy L . Learning from crowds. The Journal of Machine Learning Research, 2010, 11( 4): 1297–1322

[83]

Deng J, Dong W, Socher R, Li L J, Li K, Fei-Fei L. ImageNet: a large-scale hierarchical image database. In: Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009, 248−255

[84]

Crosby P B. Quality is Free: the Art of Making Quality Certain. New York: McGraw-Hill, 1979

[85]

Paun S, Hovy D. Proceedings of the first workshop on aggregating and analysing crowdsourced annotations for NLP. In: Proceedings of the 1st Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP. 2019

[86]

Li C, Markl V, Li Q, Li Y, Gao J, Su L, Zhao B, Demirbas M, Fan W, Han J . A confidence-aware approach for truth discovery on long-tail data. Proceedings of the VLDB Endowment, 2014, 8( 4): 425–436

[87]

Aydin B I, Yilmaz Y S, Li Y, Li Q, Gao J, Demirbas M. Crowdsourcing for multiple-choice question answering. In: Proceedings of 28th the AAAI Conference on Artificial Intelligence. 2014, 2946−2953

[88]

Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. Cambridge: MIT Press, 2009

[89]

Venanzi M, Guiver J, Kazai G, Kohli P, Shokouhi M. Community-based Bayesian aggregation models for crowdsourcing. In: Proceedings of the 23rd International Conference on World Wide Web. 2014, 155−164

[90]

Cer D, Yang Y, Kong S Y, Hua N, Limtiaco N, John R S, Constant N, Guajardo-Cespedes M, Yuan S, Tar C, Sung Y H, Strope B, Kurzweil R. Universal sentence encoder. 2018, arXiv preprint arXiv: 1803.11175

[91]

Devlin J, Chang M W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019, 4171−4186

[92]

Vaughan J W . Making better use of the crowd: how crowdsourcing can advance machine learning research. The Journal of Machine Learning Research, 2017, 18( 1): 7026–7071

[93]

Wang Y, Papangelis K, Saker M, Lykourentzou I, Khan V J, Chamberlain A, Grudin J. An examination of the work practices of crowdfarms. In: Proceedings of 2021 CHI Conference on Human Factors in Computing Systems. 2021, 139

[94]

Dellermann D, Calma A, Lipusch N, Weber T, Weigel S, Ebel P. The future of human-AI collaboration: a taxonomy of design knowledge for hybrid intelligence systems. 2021, arXiv preprint arXiv: 2105.03354

[95]

Feldman M Q, McInnis B J . How we write with crowds. Proceedings of the ACM on Human-Computer Interaction, 2021, 4( CSCW3): 229

[96]

Ørting S N, Doyle A, van Hilten A, Hirth M, Inel O, Madan C R, Mavridis P, Spiers H, Cheplygina V . A survey of crowdsourcing in medical image analysis. Human Computation, 2020, 7( 1): 1–26

[97]

Kovashka A, Russakovsky O, Fei-Fei L, Grauman K . Crowdsourcing in computer vision. Foundations and Trends® in computer graphics and Vision, 2016, 10( 3): 177–243

[98]

Zheng Y, Li G, Li Y, Shan C, Cheng R . Truth inference in crowdsourcing: is the problem solved?. Proceedings of the VLDB Endowment, 2017, 10( 5): 541–552

[99]

Jin Y, Carman M, Zhu Y, Xiang Y . A technical survey on statistical modelling and design methods for crowdsourcing quality control. Artificial Intelligence, 2020, 287: 103351

[100]

Daniel F, Kucherbaev P, Cappiello C, Benatallah B, Allahbakhsh M . Quality control in crowdsourcing: a survey of quality attributes, assessment techniques, and assurance actions. ACM Computing Surveys (CSUR), 2018, 51( 1): 7

[101]

Wu G, Chen Z, Liu J, Han D, Qiao B . Task assignment for social-oriented crowdsourcing. Frontiers of Computer Science, 2021, 15( 2): 152316

[102]

Hu Z, Wu W, Luo J, Wang X, Li B . Quality assessment in competition-based software crowdsourcing. Frontiers of Computer Science, 2020, 14( 6): 146207

[103]

Zhu H, Dow S P, Kraut R E, Kittur A. Reviewing versus doing: learning and performance in crowd assessment. In: Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing. 2014, 1445−1455

[104]

Mesbah S, Arous I, Yang J, Bozzon A. HybridEval: a human-AI collaborative approach for evaluating design ideas at scale. In: Proceedings of 2023 ACM Web Conference. 2023, 3837−3848

[105]

Xu A, Huang S W, Bailey B. Voyant: generating structured feedback on visual designs using a crowd of non-experts. In: Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing. 2014, 1433−1444

[106]

Singer Y, Mittal M. Pricing mechanisms for crowdsourcing markets. In: Proceedings of the 22nd International Conference on World Wide Web. 2013, 1157−1166

[107]

Salehi N, Teevan J, Iqbal S, Kamar E. Communicating context to the crowd for complex writing tasks. In: Proceedings of 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. 2017, 1890−1901

[108]

Aguinis H, Villamor I, Ramani R S . MTurk research: review and recommendations. Journal of Management, 2021, 47( 4): 823–837

[109]

Russell B C, Torralba A, Murphy K P, Freeman W T . LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 2008, 77( 1): 157–173

[110]

Kang X, Yu G, Kong L, Domeniconi C, Zhang X, Li Q . FedTA: federated worthy task assignment for crowd workers. IEEE Transactions on Dependable and Secure Computing, 2024, 21( 4): 4098–4109

[111]

Lin X, Wei K, Li Z, Chen J, Pei T . Aggregation-based dual heterogeneous task allocation in spatial crowdsourcing. Frontiers of Computer Science, 2024, 18( 6): 186605

[112]

Doroudi S, Kamar E, Brunskill E, Horvitz E. Toward a learning science for complex crowdsourcing tasks. In: Proceedings of 2016 CHI Conference on Human Factors in Computing Systems. 2016, 2623−2634

[113]

Cychnerski J, Dziubich T. Segmentation quality refinement in large-scale medical image dataset with crowd-sourced annotations. In: Proceedings of the 25th European Conference on Advances in Databases and Information Systems. 2021, 205−216

[114]

Benenson R, Popov S, Ferrari V. Large-scale interactive object segmentation with human annotators. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 11692−11701

[115]

Rechkemmer A, Yin M. Exploring the effects of goal setting when training for complex crowdsourcing tasks (Extended Abstract). In: Proceedings of the 30th International Joint Conference on Artificial Intelligence. 2021, 4819–4823

[116]

Allahbakhsh M, Arbabi S, Shirazi M, Motahari-Nezhad H R. A task decomposition framework for surveying the crowd contextual insights. In: Proceedings of 2015 IEEE 8th International Conference on Service-Oriented Computing and Applications. 2015, 155−162

[117]

Bragg J, Mausam, Weld D S. Sprout: crowd-powered task design for crowdsourcing. In: Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology. 2018, 165−176

[118]

Dumitrache A, Aroyo L, Welty C. Capturing ambiguity in crowdsourcing frame disambiguation. In: Proceedings of the 6th AAAI Conference on Human Computation and Crowdsourcing. 2018, 12−20

[119]

Biemann C . Creating a system for lexical substitutions from scratch using crowdsourcing. Language Resources and Evaluation, 2013, 47( 1): 97–122

[120]

de Boer P M, Bernstein A . PPLib: toward the automated generation of crowd computing programs using process recombination and auto-experimentation. ACM Transactions on Intelligent Systems and Technology (TIST), 2016, 7( 4): 49

[121]

De Boer P M, Bernstein A. Efficiently identifying a well-performing crowd process for a given problem. In: Proceedings of 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. 2017, 1688−1699

[122]

Bhuiyan M, Zhang A X, Sehat C M, Mitra T . Investigating differences in crowdsourced news credibility assessment: raters, tasks, and expert criteria. Proceedings of the ACM on Human-Computer Interaction, 2020, 4( CSCW2): 93

[123]

Suzuki R, Sakaguchi T, Matsubara M, Kitagawa H, Morishima A. CrowdSheet: instant implementation and out-of-hand execution of complex crowdsourcing. In: Proceedings of the 34th IEEE International Conference on Data Engineering. 2018, 1633−1636

[124]

Dunnmon J A, Ratner A J, Saab K, Khandwala N, Markert M, Sagreiya H, Goldman R, Lee-Messer C, Lungren M P, Rubin D L, Ré C . Cross-modal data programming enables rapid medical machine learning. Patterns, 2020, 1( 2): 100019

[125]

Kittur A, Khamkar S, Andre P, Kraut R. CrowdWeaver: visually managing complex crowd work. In: Proceedings of 2012 ACM Conference on Computer Supported Cooperative Work. 2012, 1033−1036

[126]

Wang J, Ipeirotis P G, Provost F. Quality-based pricing for crowdsourced workers. NYU Working Paper, 2013

[127]

Cheng J, Teevan J, Bernstein M S. Measuring crowdsourcing effort with error-time curves. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. 2015, 1365–1374

[128]

Yin M, Chen Y, Sun Y A. The effects of performance-contingent financial incentives in online labor markets. In: Proceedings of the 27th AAAI Conference on Artificial Intelligence. 2013, 1191–1197

[129]

Agapie E, Teevan J, Monroy-Hernandez A. Crowdsourcing in the field: a case study using local crowds for event reporting. In: Proceedings of the 3rd AAAI Conference on Human Computation and Crowdsourcing. 2015, 2–11

[130]

Lasecki W S, Wesley R, Nichols J, Kulkarni A, Allen J F, Bigham J P. Chorus: a crowd-powered conversational assistant. In: Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology. 2013, 151−162

[131]

Drapeau R, Chilton L, Bragg J, Weld D. Microtalk: using argumentation to improve crowdsourcing accuracy. In: Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing. 2016, 32−41

[132]

Xiong T, Yu Y, Pan M, Yang J. SmartCrowd: a workflow framework for complex crowdsourcing tasks. In: Proceedings of the 16th International Conference on Business Process Management. 2018, 387−398

[133]

Rahman H, Roy S B, Thirumuruganathan S, Amer-Yahia S, Das G . Optimized group formation for solving collaborative tasks. The VLDB Journal, 2019, 28( 1): 1–23

[134]

Deci E L, Ryan R M. Self-determination theory. In: Van Lange P A M, Kruglanski A W, Higgins E T, eds. Handbook of Theories of Social Psychology. Washington: SAGE, 2012, 416−436

[135]

Pei W, Yang Z, Chen M, Yue C . Quality control in crowdsourcing based on fine-grained behavioral features. Proceedings of the ACM on Human-Computer Interaction, 2021, 5( CSCW2): 442

[136]

Ba Y, Mancenido M V, Chiou E K, Pan R. Data quality in crowdsourcing and spamming behavior detection. 2024, arXiv preprint arXiv: 2404.17582

[137]

Cui L, Zhao X, Liu L, Yu H, Miao Y . Complex crowdsourcing task allocation strategies employing supervised and reinforcement learning. International Journal of Crowd Science, 2017, 1( 2): 146–160

[138]

Tang F. Optimal complex task assignment in service crowdsourcing. In: Proceedings of the 29th International Conference on International Joint Conferences on Artificial Intelligence. 2021, 217

[139]

Mavridis P, Gross-Amblard D, Miklós Z. Using hierarchical skills for optimized task assignment in knowledge-intensive crowdsourcing. In: Proceedings of the 25th International Conference on World Wide Web. 2016, 843−853

[140]

Maarry K E, Balke W T, Cho H, Hwang S W, Baba Y. Skill ontology-based model for quality assurance in crowdsourcing. In: Proceedings of the 19th International Conference on Database Systems for Advanced Applications. 2014, 376−387

[141]


