Attribute augmentation-based label integration for crowdsourcing

Yao ZHANG, Liangxiao JIANG, Chaoqun LI

Front. Comput. Sci. ›› 2023, Vol. 17 ›› Issue (5) : 175331. DOI: 10.1007/s11704-022-2225-z
Artificial Intelligence
RESEARCH ARTICLE

Abstract

Crowdsourcing provides an effective and low-cost way to collect labels from crowd workers. Because crowd workers often lack professional knowledge, the quality of crowdsourced labels is relatively low. A common approach to addressing this issue is to collect multiple labels for each instance from different crowd workers and then use a label integration method to infer its true label. However, to our knowledge, almost all existing label integration methods merely make use of the original attribute information and pay no attention to the quality of the multiple noisy label set of each instance. To solve these issues, this paper proposes a novel three-stage label integration method called attribute augmentation-based label integration (AALI). In the first stage, we design an attribute augmentation method to enrich the original attribute space. In the second stage, we develop a filter to single out reliable instances with high-quality multiple noisy label sets. In the third stage, we use majority voting to initialize the integrated labels of the reliable instances and then use cross-validation to build multiple component classifiers on the reliable instances to predict the labels of all instances. Experimental results on simulated and real-world crowdsourced datasets demonstrate that AALI outperforms all the other state-of-the-art competitors.
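
The second and third stages summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the attribute augmentation method, the filtering criterion, or the component classifier, so the first stage is omitted and the rest rests on labeled assumptions. The agreement threshold in `filter_reliable`, the 1-nearest-neighbor classifier, the fold layout, and the toy data are all hypothetical stand-ins chosen only to keep the example self-contained.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def agreement(labels):
    """Fraction of labels in the set that match the majority label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def filter_reliable(label_sets, threshold=0.8):
    """ASSUMPTION: an instance counts as reliable when label agreement
    reaches the threshold. The paper's actual filter is not given here."""
    return [i for i, ls in enumerate(label_sets) if agreement(ls) >= threshold]


def nn_predict(train_x, train_y, x):
    """ASSUMPTION: 1-nearest-neighbor stands in for the component classifier."""
    dists = [sum((a - b) ** 2 for a, b in zip(tx, x)) for tx in train_x]
    return train_y[dists.index(min(dists))]


def integrate(features, label_sets, threshold=0.8, k=3):
    """Initialize labels of reliable instances by majority voting, train one
    component classifier per cross-validation split of the reliable
    instances, and let the components vote on every instance."""
    reliable = filter_reliable(label_sets, threshold)
    init = {i: majority_vote(label_sets[i]) for i in reliable}
    folds = [reliable[f::k] for f in range(k)]
    integrated = []
    for idx, x in enumerate(features):
        votes = []
        for held_out in folds:
            train = [i for i in reliable if i not in held_out]
            if train:
                train_x = [features[i] for i in train]
                train_y = [init[i] for i in train]
                votes.append(nn_predict(train_x, train_y, x))
        # Fall back to plain majority voting if no component could be built.
        integrated.append(majority_vote(votes or label_sets[idx]))
    return integrated


# Toy data (hypothetical): two clusters, five noisy labels per instance.
features = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0),
            (0.9, 1.1), (1.1, 0.9), (0.15, 0.1), (0.95, 1.0)]
label_sets = [[0, 0, 0, 0, 0], [0, 0, 0, 0, 1], [0, 0, 0, 0, 0],
              [1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [1, 1, 1, 1, 1],
              [1, 0, 1, 0, 1], [0, 1, 0, 0, 1]]

print(integrate(features, label_sets))
```

In this toy run the last two instances have low-agreement label sets, so they are excluded from training and their integrated labels come from classifiers built on the reliable instances rather than from their own noisy votes.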

Keywords

crowdsourcing / label integration / attribute augmentation / instance filtering

Cite this article

Yao ZHANG, Liangxiao JIANG, Chaoqun LI. Attribute augmentation-based label integration for crowdsourcing. Front. Comput. Sci., 2023, 17(5): 175331 https://doi.org/10.1007/s11704-022-2225-z

Yao Zhang is currently an MSc student at the School of Computer Science, China University of Geosciences, China. Her research interests mainly include machine learning and data mining (MLDM).

Liangxiao Jiang is currently a professor at the School of Computer Science, China University of Geosciences, China. His research interests mainly include machine learning and data mining (MLDM). In MLDM domains, he has published more than 90 papers.

Chaoqun Li is currently an associate professor at the School of Mathematics and Physics, China University of Geosciences, China. Her research interests mainly include machine learning and data mining (MLDM). In MLDM domains, she has published more than 50 papers.

Acknowledgements

The work was supported by the Science and Technology Project of Hubei Province-Unveiling System (2021BEC007) and the Industry-University-Research Innovation Funds for Chinese Universities (2020ITA05008).

RIGHTS & PERMISSIONS

2023 Higher Education Press