Synthesizing tables for supervised learning

Yaoyu ZHU; Guoliang LI; Jianhua FENG; Nan TANG

doi:10.1007/s11704-025-40424-2

Front. Comput. Sci., 2026, Vol. 20, Issue 3: 2003603. DOI: 10.1007/s11704-025-40424-2

Information Systems

RESEARCH ARTICLE



Abstract

Table synthesis, which generates fake tables that resemble real ones, is important for supervised machine learning (ML) and has many practical applications, such as generating more data for data augmentation or publishing synthetic tables while preserving the privacy of the real tables. A fundamental question is: given a real table, can we synthesize a table such that ML models trained on the real table and on the synthetic table perform similarly on an unseen test table?
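The question above can be made concrete with a small sketch: train the same model once on the real table and once on the synthetic table, then compare both on one held-out test table. Everything here is an illustrative assumption rather than the paper's setup: the toy one-feature data, the nearest-centroid classifier, the helper names (`make_table`, `fit_centroids`, `accuracy`, `utility_gap`), and the two stand-in "synthetic" tables, neither of which is produced by a GAN.

```python
import random

def make_table(rng, n, noise):
    """Toy labeled table (hypothetical): one numeric feature and a binary label."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        feature = rng.gauss(3.0 * label, noise)  # class 0 centred at 0, class 1 at 3
        rows.append((feature, label))
    return rows

def fit_centroids(table):
    """Nearest-centroid classifier: store the mean feature value per label."""
    sums, counts = {}, {}
    for feature, label in table:
        sums[label] = sums.get(label, 0.0) + feature
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def accuracy(centroids, table):
    """Fraction of rows whose nearest centroid carries the true label."""
    hits = 0
    for feature, label in table:
        predicted = min(centroids, key=lambda c: abs(feature - centroids[c]))
        hits += int(predicted == label)
    return hits / len(table)

def utility_gap(real, synthetic, test):
    """Absolute accuracy difference on the same unseen test table."""
    acc_real = accuracy(fit_centroids(real), test)
    acc_synthetic = accuracy(fit_centroids(synthetic), test)
    return abs(acc_real - acc_synthetic)

rng = random.Random(0)
real = make_table(rng, 500, 1.0)
test = make_table(rng, 500, 1.0)

# Stand-in for a faithful synthetic table: similar distribution, slightly noisier.
synthetic_good = make_table(rng, 500, 1.2)
# Stand-in for an unfaithful synthetic table: every label flipped.
synthetic_bad = [(feature, 1 - label) for feature, label in real]

gap_good = utility_gap(real, synthetic_good, test)
gap_bad = utility_gap(real, synthetic_bad, test)
print("gap (faithful stand-in):  ", gap_good)
print("gap (unfaithful stand-in):", gap_bad)
```

A small gap suggests the synthetic table is about as useful for training as the real one; a large gap indicates that the synthetic table has lost the feature-label relationship the model depends on.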

Existing table synthesis methods, mostly based on deep generative models (e.g., GANs or VAEs), try to learn the density function (i.e., the true data distribution) of the real table from sampled real records, treating each record independently. In practice, the assumption that records are uncorrelated is often violated (e.g., different purchase records for the same product are likely correlated). Failing to capture such correlation across records results in a synthesized table that does not resemble the real table. Consequently, an ML model trained on the synthetic table performs very differently from one trained on the real table. In this paper, we explicitly model such record correlation as groups determined by user-specified categorical values, e.g., records with the same (Market, Product) values belong to the same group. Given such groups, we propose to use conditional GANs to simultaneously model the probability densities of the discrete (e.g., categorical) and continuous (e.g., numeric) values that may co-exist within a record, so that both the global data distribution (within a table) and the local data distributions (within groups of a table) are preserved in the synthetic table. Moreover, we extend the previous differentially private GAN (DPGAN), which only ensures that the discriminator of the GAN is differentially private, to also protect the privacy of the original data embeddings and sample frequencies. Experiments show that our approach significantly outperforms state-of-the-art table synthesis methods for supervised ML, and protects privacy well while providing high utility.
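To illustrate what grouping by user-specified categorical values means, the sketch below partitions records by their (Market, Product) values and contrasts one global statistic with the per-group (local) statistics. It only shows the grouping idea, not the conditional GAN or the privacy mechanism; the `Price` column, the sample rows, and the helper names `group_records` and `local_and_global_means` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def group_records(records, group_columns):
    """Partition records by the values of the user-specified categorical columns."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in group_columns)
        groups[key].append(record)
    return dict(groups)

def local_and_global_means(records, group_columns, value_column):
    """Return the table-wide mean and the mean within each group."""
    groups = group_records(records, group_columns)
    global_mean = mean(r[value_column] for r in records)
    local_means = {
        key: mean(r[value_column] for r in rows) for key, rows in groups.items()
    }
    return global_mean, local_means

# Hypothetical purchase records; only Market and Product come from the abstract.
purchases = [
    {"Market": "US", "Product": "pen", "Price": 1.0},
    {"Market": "US", "Product": "pen", "Price": 1.2},
    {"Market": "EU", "Product": "pen", "Price": 2.0},
    {"Market": "EU", "Product": "ink", "Price": 5.0},
]

groups = group_records(purchases, ["Market", "Product"])
global_mean, local_means = local_and_global_means(
    purchases, ["Market", "Product"], "Price"
)
print("groups:", sorted(groups))
print("global mean:", global_mean)
print("local means:", local_means)
```

A synthetic table that matches only the global mean could still misplace prices across groups, which is why the abstract asks for both the global and the local distributions to be preserved.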


Keywords

generative adversarial networks / data synthesis / privacy

Cite this article


Yaoyu ZHU, Guoliang LI, Jianhua FENG, Nan TANG. Synthesizing tables for supervised learning. Front. Comput. Sci., 2026, 20(3): 2003603. DOI: 10.1007/s11704-025-40424-2



RIGHTS & PERMISSIONS

Higher Education Press
