CA-MOO-Clean: Constraint-Aware Multi-Objective Optimization for Data Cleaning

Tianze Hu , Junyi Han , Lianpeng Qiao , Yujin Gao , Kaiyu Li , Yu-Ping Wang , Guoren Wang

Front. Comput. Sci. DOI: 10.1007/s11704-026-52134-4
REVIEW ARTICLE

Abstract

Automated data cleaning systems face an inherent tension: maximizing repair accuracy often requires aggressive modifications that violate the minimum change principle, while preserving data integrity risks missing critical errors. Existing methods optimize a single objective and therefore fail to capture the multi-dimensional nature of this trade-off. We propose CA-MOO-Clean, a framework that formalizes data cleaning as a Constrained Multi-Objective Optimization problem. Our approach integrates NSGA-II evolutionary search with LLM-assisted constraint discovery to automatically generate Pareto-optimal cleaning pipelines. Extensive experiments demonstrate that CA-MOO-Clean achieves competitive or superior accuracy while providing multiple Pareto-optimal solutions per run: 93.85% Repair F1 on Hospital (5.91% modification), 86.73% on Adult (0.95% modification), and 93.92% on Credit (2.62% modification). Unlike single-solution baselines (HoloClean, NADEEF, AlphaClean), our framework reveals the full accuracy-modification trade-off frontier, achieving the highest accuracy on all three datasets while letting users select a trade-off point without re-optimization. The architectural separation of one-time discovery (7.0–68.1s) from rapid execution (0.30–50.20s) enables efficient amortized processing. This work establishes multi-objective evolutionary optimization as a principled methodology for balancing accuracy and data preservation in automated cleaning systems.
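To make the notion of a Pareto-optimal cleaning pipeline concrete, the following minimal Python sketch filters a set of candidate pipelines down to its non-dominated front over two objectives: Repair F1 (to maximize) and modification rate (to minimize). It is an illustration only, not the authors' implementation: the candidate scores are made-up values, and the names `dominates`, `pareto_front`, and `select_under_budget` are hypothetical helpers introduced here. NSGA-II itself adds ranking, crowding distance, and evolutionary operators on top of this dominance test.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (repair_f1, modification_rate) per candidate pipeline.
Candidate = Tuple[float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one.

    Repair F1 is maximized; modification rate is minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


def select_under_budget(front: List[Candidate],
                        max_modification: float) -> Optional[Candidate]:
    """Pick the most accurate front member within a modification budget.

    Returns None when no front member fits the budget.
    """
    feasible = [c for c in front if c[1] <= max_modification]
    return max(feasible, key=lambda c: c[0]) if feasible else None


# Illustrative scores only; these are NOT results from the paper.
candidates = [(0.90, 0.03), (0.94, 0.06), (0.85, 0.01), (0.88, 0.05), (0.94, 0.08)]
front = sorted(pareto_front(candidates), key=lambda c: c[1])
choice = select_under_budget(front, max_modification=0.04)
print(front)
print(choice)
```

Because the front is computed once, choosing a different modification budget only requires another call to `select_under_budget`; this mirrors, in a simplified way, the idea of selecting a trade-off point without re-optimization.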

Keywords

Data cleaning / Multi-objective optimization / Large language models / Automated data pipelines / Data quality constraints

Cite this article

Tianze Hu, Junyi Han, Lianpeng Qiao, Yujin Gao, Kaiyu Li, Yu-Ping Wang, Guoren Wang. CA-MOO-Clean: Constraint-Aware Multi-Objective Optimization for Data Cleaning. Front. Comput. Sci. DOI:10.1007/s11704-026-52134-4

RIGHTS & PERMISSIONS

Higher Education Press 2026
