Optimization in Machine Learning: a Distribution-Space Approach
Yongqiang Cai, Qianxiao Li, Zuowei Shen
We present the viewpoint that optimization problems encountered in machine learning can often be interpreted as minimizing a convex functional over a function space, but with a non-convex constraint set introduced by model parameterization. This observation allows us to recast such problems, via a suitable relaxation, as convex optimization problems in the space of distributions over the training parameters. We derive some simple relationships between the distribution-space problem and the original problem; for example, a distribution-space solution is at least as good as a solution in the original space. Moreover, we develop a numerical algorithm based on mixture distributions to perform approximate optimization directly in the distribution space. Consistency of this approximation is established, and the numerical efficacy of the proposed algorithm is illustrated on simple examples. In both theory and practice, this formulation provides an alternative approach to large-scale optimization in machine learning.
Keywords: Machine learning / Convex relaxation / Optimization / Distribution space
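To make the relaxation concrete, the sketch below is a minimal illustration under assumed choices, not the paper's implementation. It fits a toy one-dimensional regression problem by approximating the distribution over hidden-unit parameters with a mixture of point masses at randomly sampled locations; with the locations fixed, only the mixture weights remain to be optimized, which is a convex problem over the probability simplex and is solved here by projected gradient descent. All data, model sizes, and helper names are assumptions made for this example.

```python
# Illustrative sketch (not the authors' code): relax a non-convex one-hidden-layer
# fitting problem to distribution space, then approximate the distribution by a
# mixture of point masses at fixed sampled parameters.  The remaining problem in
# the mixture weights p is convex over the probability simplex.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed purely for illustration): y = sin(3x) + noise.
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=200)

# N mixture components, each a fixed "particle" theta_i = (w_i, b_i, a_i);
# the relaxed model is f_mu(x) = sum_i p_i * a_i * tanh(w_i * x + b_i).
N = 256
W = rng.normal(size=(N, 1)) * 3.0
b = rng.normal(size=N)
a = rng.uniform(-4.0, 4.0, size=N)
Phi = a * np.tanh(X @ W.T + b)            # Phi[j, i] = a_i * tanh(w_i * x_j + b_i)

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - 1.0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

# Projected gradient descent on the convex loss (1/(2m)) * ||Phi p - y||^2, p in simplex.
m = len(y)
step = m / np.linalg.norm(Phi, 2) ** 2    # 1 / (Lipschitz constant of the gradient)
p = np.full(N, 1.0 / N)
for _ in range(3000):
    grad = Phi.T @ (Phi @ p - y) / m
    p = project_simplex(p - step * grad)

print("train MSE:", np.mean((Phi @ p - y) ** 2))
```

Because every Dirac measure, i.e., every single parameter setting, is itself a feasible mixture, the optimal value of this relaxed problem is no worse than that of the original parameter-space problem, consistent with the comparison stated in the abstract. The algorithm proposed in the paper works with richer mixture families than the fixed point masses used in this sketch.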