Optimization in Machine Learning: a Distribution-Space Approach

Yongqiang Cai, Qianxiao Li, Zuowei Shen

Communications on Applied Mathematics and Computation, 2024, Vol. 6, Issue 2: 1217-1240. DOI: 10.1007/s42967-023-00322-5
Original Paper

Abstract

We present the viewpoint that optimization problems encountered in machine learning can often be interpreted as minimizing a convex functional over a function space, but with a non-convex constraint set introduced by model parameterization. This observation allows us to re-pose such problems, via a suitable relaxation, as convex optimization problems in the space of distributions over the training parameters. We derive some simple relationships between the distribution-space problem and the original problem, e.g., a distribution-space solution is at least as good as a solution in the original space. Moreover, we develop a numerical algorithm based on mixture distributions to perform approximate optimization directly in the distribution space. Consistency of this approximation is established, and the numerical efficacy of the proposed algorithm is illustrated in simple examples. In both theory and practice, this formulation provides an alternative approach to large-scale optimization in machine learning.
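
A minimal sketch of the distribution-space idea (Python). It assumes one common form of such a relaxation, in which the parameterized model is replaced by a mixture of fixed candidate models and the convex loss is minimized over the mixture weights on the probability simplex; the paper's precise formulation and algorithm are given in the full text. All names and values below (N_ATOMS, simplex_project, the toy data, the fixed scale c) are illustrative choices, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = sin(3x) + noise.
X = np.linspace(-1.0, 1.0, 200)
y = np.sin(3.0 * X) + 0.05 * rng.standard_normal(X.shape)

# Candidate parameters theta_k = (w_k, b_k): the atoms of the mixture distribution.
N_ATOMS = 100
W = rng.uniform(-5.0, 5.0, N_ATOMS)
B = rng.uniform(-5.0, 5.0, N_ATOMS)

# Features phi_k(x) = tanh(w_k * x + b_k). The relaxed model is the mixture
# f_pi(x) = c * sum_k pi_k * phi_k(x), with pi constrained to the probability simplex.
Phi = np.tanh(np.outer(X, W) + B)   # shape (n_samples, N_ATOMS)
c = 3.0                             # fixed output scale (illustrative)

def simplex_project(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    tau = css[rho] / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

# Projected gradient descent on the convex objective
#   L(pi) = (1/(2n)) * || c * Phi @ pi - y ||^2,  with pi in the simplex.
pi = np.full(N_ATOMS, 1.0 / N_ATOMS)
lr = 0.5 / (c**2 * np.linalg.norm(Phi, 2) ** 2 / len(X))   # step size ~ 1/(Lipschitz constant)
for _ in range(2000):
    resid = c * Phi @ pi - y
    grad = c * Phi.T @ resid / len(X)
    pi = simplex_project(pi - lr * grad)

print("final mean-squared error:", np.mean((c * Phi @ pi - y) ** 2))

Because the objective is convex in the mixture weights and the simplex is a convex set, projected gradient descent converges to a global minimizer of this relaxed problem; the non-convexity of the original parameterization has been pushed into the (fixed) choice of atoms.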

Keywords

Machine learning / Convex relaxation / Optimization / Distribution space

Cite this article

Yongqiang Cai, Qianxiao Li, Zuowei Shen. Optimization in Machine Learning: a Distribution-Space Approach. Communications on Applied Mathematics and Computation, 2024, 6(2): 1217‒1240. https://doi.org/10.1007/s42967-023-00322-5

Funding
National Natural Science Foundation of China (12201053); National Research Foundation Singapore (NRF-NRFF13-2021-0005)
