CA-MOO-Clean: Constraint-Aware Multi-Objective Optimization for Data Cleaning
Tianze Hu, Junyi Han, Lianpeng Qiao, Yujin Gao, Kaiyu Li, Yu-Ping Wang, Guoren Wang
Automated data cleaning systems face an inherent tension: maximizing repair accuracy often requires aggressive modifications that violate the minimum change principle, while preserving data integrity risks missing critical errors. Existing methods optimize a single objective and therefore fail to capture the multi-dimensional nature of this trade-off. We propose CA-MOO-Clean, a framework that formalizes data cleaning as a Constrained Multi-Objective Optimization problem. Our approach integrates NSGA-II evolutionary search with LLM-assisted constraint discovery to automatically generate Pareto-optimal cleaning pipelines. Extensive experiments demonstrate that CA-MOO-Clean achieves competitive or superior accuracy while providing multiple Pareto-optimal solutions per run: 93.85% Repair F1 on Hospital (5.91% modification), 86.73% on Adult (0.95% modification), and 93.92% on Credit (2.62% modification). Unlike single-solution baselines (HoloClean, NADEEF, AlphaClean), our framework reveals the full accuracy-modification trade-off frontier, achieving the highest accuracy on all three datasets and letting users select a trade-off point without re-optimization. The architectural separation of one-time constraint discovery (7.0–68.1 s) from rapid execution (0.30–50.20 s) amortizes the discovery cost across repeated runs. This work establishes multi-objective evolutionary optimization as a principled methodology for balancing accuracy and data preservation in automated cleaning systems.
Data cleaning / Multi-objective optimization / Large language models / Automated data pipelines / Data quality constraints
Higher Education Press 2026
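The accuracy-modification trade-off described in the abstract can be illustrated with a small sketch of Pareto (non-dominated) filtering over candidate cleaning pipelines. This is not the authors' implementation and not NSGA-II itself; it only shows the dominance relation that NSGA-II's non-dominated sorting builds on. The `Pipeline` type, the helper names, and all numeric values are hypothetical and chosen for illustration only; they are not results from the paper.

```python
from typing import List, NamedTuple, Optional


class Pipeline(NamedTuple):
    """Hypothetical summary of one candidate cleaning pipeline."""
    name: str
    repair_f1: float      # objective 1: maximize
    modification: float   # objective 2: fraction of cells changed, minimize


def dominates(a: Pipeline, b: Pipeline) -> bool:
    """True if `a` is no worse than `b` on both objectives and better on one."""
    no_worse = a.repair_f1 >= b.repair_f1 and a.modification <= b.modification
    better = a.repair_f1 > b.repair_f1 or a.modification < b.modification
    return no_worse and better


def pareto_front(candidates: List[Pipeline]) -> List[Pipeline]:
    """Keep only candidates that no other candidate dominates."""
    front = [c for c in candidates
             if not any(dominates(other, c) for other in candidates)]
    return sorted(front, key=lambda p: p.modification)


def select_under_budget(front: List[Pipeline],
                        max_modification: float) -> Optional[Pipeline]:
    """Pick the most accurate front member within a modification budget."""
    feasible = [p for p in front if p.modification <= max_modification]
    return max(feasible, key=lambda p: p.repair_f1) if feasible else None


# Illustrative values only (assumed, not taken from the paper).
candidates = [
    Pipeline("conservative", 0.80, 0.01),
    Pipeline("balanced", 0.90, 0.03),
    Pipeline("aggressive", 0.94, 0.06),
    Pipeline("wasteful", 0.88, 0.05),  # dominated by "balanced"
    Pipeline("weak", 0.75, 0.02),      # dominated by "conservative"
]

front = pareto_front(candidates)
choice = select_under_budget(front, max_modification=0.04)
```

Once a front like this has been computed, choosing a different modification budget only requires another call to `select_under_budget`, which is one way to read the abstract's claim of trade-off selection without re-optimization.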