Synthesizing tables for supervised learning
Yaoyu ZHU , Guoliang LI , Jianhua FENG , Nan TANG
Front. Comput. Sci. ›› 2026, Vol. 20 ›› Issue (3) : 2003603
Table synthesis, which generates fake tables that resemble real tables, is important for supervised machine learning (ML) and has many practical applications, such as augmenting training data or publishing synthetic tables while preserving the privacy of the real tables. A fundamental question is: given a real table, can we synthesize a table such that ML models trained on the real table and on the synthetic table perform similarly on an unseen test table?
Existing table synthesis methods, mostly based on deep generative models (e.g., GAN or VAE), try to learn the density function (i.e., the true data distribution) of the real table from sampled real records, treating each record independently. In practice, the assumption that records are uncorrelated is often violated (e.g., different purchase records for the same product are correlated). Failing to capture such correlation across records yields a synthesized table that does not resemble the real table. Consequently, an ML model trained on the synthetic table performs very differently from one trained on the real table. In this paper, we explicitly model such record correlation as groups determined by user-specified categorical values, e.g., records with the same (Market, Product) values belong to the same group. Given such groups, we propose to use conditional GANs to simultaneously model the probability densities of the discrete (e.g., categorical) and continuous (e.g., numeric) values that may co-exist within a record, so that both the global data distribution (within a table) and the local data distributions (within groups of a table) are preserved in the synthetic table. Moreover, we extend the previous differentially private GAN (DPGAN), which only ensures that the discriminator of the GAN is differentially private, to also protect the privacy of the original data embeddings and sample frequencies. Experiments show that our approach significantly outperforms state-of-the-art table synthesis methods for supervised ML, and protects privacy well while providing high utility.
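To make the grouping idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes records are plain dictionaries, groups them by user-specified categorical columns such as (Market, Product), and derives a one-hot condition vector per group of the kind a conditional generator could take as input. The helper names (`group_records`, `one_hot_conditions`), the `Price` column, and the toy records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_records(records, group_cols):
    """Partition records by the tuple of values in the user-specified columns."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[c] for c in group_cols)
        groups[key].append(rec)
    return dict(groups)


def one_hot_conditions(groups):
    """Assign each group key a one-hot vector (hypothetical conditioning input)."""
    keys = sorted(groups)
    return {
        key: [1.0 if i == j else 0.0 for j in range(len(keys))]
        for i, key in enumerate(keys)
    }


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Toy table (illustrative data only, not taken from the paper).
records = [
    {"Market": "US", "Product": "A", "Price": 10.0},
    {"Market": "US", "Product": "A", "Price": 12.0},
    {"Market": "EU", "Product": "A", "Price": 20.0},
    {"Market": "EU", "Product": "B", "Price": 30.0},
]

groups = group_records(records, ["Market", "Product"])
conditions = one_hot_conditions(groups)

# Global statistic over the whole table versus local statistics per group.
global_mean = mean(r["Price"] for r in records)
local_means = {key: mean(r["Price"] for r in grp) for key, grp in groups.items()}
```

In this toy table the per-group means differ noticeably from the table-wide mean, which illustrates why a synthesizer that matches only the global distribution can still misrepresent the groups.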
generative adversarial networks / data synthesis / privacy
Higher Education Press