CA-MOO-Clean: Constraint-Aware Multi-Objective Optimization for Data Cleaning

Tianze Hu; Junyi Han; Lianpeng Qiao; Yujin Gao; Kaiyu Li; Yu-Ping Wang; Guoren Wang

Front. Comput. Sci. DOI: 10.1007/s11704-026-52134-4

REVIEW ARTICLE


Abstract

Automated data cleaning systems face an inherent tension: maximizing repair accuracy often requires aggressive modifications that violate the minimum change principle, while preserving data integrity risks missing critical errors. Existing methods optimize a single objective and therefore fail to capture the multi-dimensional nature of this trade-off. We propose CA-MOO-Clean, a framework that formalizes data cleaning as a constrained multi-objective optimization problem. Our approach integrates NSGA-II evolutionary search with LLM-assisted constraint discovery to automatically generate Pareto-optimal cleaning pipelines, revealing the complete accuracy-modification trade-off frontier. Extensive experiments demonstrate that CA-MOO-Clean achieves competitive or superior accuracy while providing multiple Pareto-optimal solutions per run: 93.85% Repair F1 on Hospital (5.91% modification), 86.73% on Adult (0.95% modification), and 93.92% on Credit (2.62% modification). Unlike single-solution baselines (HoloClean, NADEEF, AlphaClean), our framework achieves the highest accuracy on all three datasets and lets users select a trade-off point without re-optimization. The architectural separation of one-time constraint discovery (7.0–68.1 s) from rapid pipeline execution (0.30–50.20 s) enables efficient amortized processing. This work establishes multi-objective evolutionary optimization as a principled methodology for balancing accuracy and data preservation in automated cleaning systems.
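The idea of a Pareto-optimal frontier over accuracy and modification rate can be illustrated with a minimal sketch. This is not the authors' implementation and does not use NSGA-II itself; it only shows the dominance test that underlies non-dominated sorting, followed by picking a trade-off point from an already computed frontier. The candidate scores, the function names, and the modification budget are all hypothetical and chosen purely for illustration.

```python
from typing import List, Optional, Tuple

# A candidate pipeline is summarized as (repair_f1, modification_rate).
# Objective 1: maximize repair_f1. Objective 2: minimize modification_rate.
Candidate = Tuple[float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


def select_within_budget(front: List[Candidate], max_modification: float) -> Optional[Candidate]:
    """Pick the most accurate frontier point whose modification rate fits the budget."""
    feasible = [c for c in front if c[1] <= max_modification]
    return max(feasible, key=lambda c: c[0]) if feasible else None


# Hypothetical scores, not results from the paper.
candidates = [
    (0.90, 0.05),
    (0.85, 0.02),
    (0.80, 0.03),  # dominated by (0.85, 0.02)
    (0.92, 0.09),
    (0.70, 0.01),
    (0.88, 0.06),  # dominated by (0.90, 0.05)
]

front = sorted(pareto_front(candidates), key=lambda c: c[1])
choice = select_within_budget(front, max_modification=0.05)
print(front)
print(choice)
```

Because the frontier is computed once, changing the budget only re-runs the cheap selection step, which is the sense in which a trade-off can be chosen without re-optimization.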

Keywords

Data cleaning / Multi-objective optimization / Large language models / Automated data pipelines / Data quality constraints

Cite this article

Tianze Hu, Junyi Han, Lianpeng Qiao, Yujin Gao, Kaiyu Li, Yu-Ping Wang, Guoren Wang. CA-MOO-Clean: Constraint-Aware Multi-Objective Optimization for Data Cleaning. Front. Comput. Sci. DOI:10.1007/s11704-026-52134-4

RIGHTS & PERMISSIONS

Higher Education Press 2026

Just Accepted