Strong Overall Error Analysis for the Training of Artificial Neural Networks Via Random Initializations

Arnulf Jentzen, Adrian Riekert

Communications in Mathematics and Statistics, 2023, Vol. 12, Issue 3: 385-434. DOI: 10.1007/s40304-022-00292-9
Article

Abstract

Although deep learning-based approximation algorithms have been applied very successfully to numerous problems, at the moment the reasons for their performance are not entirely understood from a mathematical point of view. Recently, estimates for the convergence of the overall error have been obtained in the setting of deep supervised learning, but with an extremely slow rate of convergence. In this note, we partially improve on these estimates. More specifically, we show that the depth of the neural network only needs to grow much more slowly in order to obtain the same rate of approximation. The results hold in the case of an arbitrary stochastic optimization algorithm with i.i.d. random initializations.
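The setting described above, a stochastic optimization method run from independent random initializations, can be illustrated with a small numerical sketch. The following Python snippet is not code from the paper; it is a minimal, hypothetical example in which plain minibatch SGD on a one-hidden-layer ReLU network is restarted from several i.i.d. random initializations and the candidate with the smallest empirical risk is kept. The target function, network width, step size, batch size, and number of restarts are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Supervised-learning data: noisy samples of an (illustrative) target function on [0, 1].
def target(x):
    return np.sin(2.0 * np.pi * x)

n = 256                                    # number of training samples
X = rng.uniform(0.0, 1.0, size=(n, 1))
Y = target(X) + 0.05 * rng.standard_normal((n, 1))

def init_params(width):
    # One i.i.d. random initialization of a one-hidden-layer ReLU network.
    return {
        "W1": rng.standard_normal((1, width)) / np.sqrt(width),
        "b1": np.zeros(width),
        "W2": rng.standard_normal((width, 1)) / np.sqrt(width),
        "b2": np.zeros(1),
    }

def forward(p, x):
    h = np.maximum(x @ p["W1"] + p["b1"], 0.0)      # ReLU hidden layer
    return h, h @ p["W2"] + p["b2"]

def empirical_risk(p, x, y):
    _, out = forward(p, x)
    return float(np.mean((out - y) ** 2))

def sgd_run(width=64, steps=2000, batch=32, lr=0.1):
    # One SGD trajectory started from a fresh random initialization.
    p = init_params(width)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)
        xb, yb = X[idx], Y[idx]
        h, out = forward(p, xb)
        grad_out = 2.0 * (out - yb) / batch          # gradient of the MSE loss
        grad_h = (grad_out @ p["W2"].T) * (h > 0.0)  # backprop through the ReLU
        p["W2"] -= lr * (h.T @ grad_out)
        p["b2"] -= lr * grad_out.sum(axis=0)
        p["W1"] -= lr * (xb.T @ grad_h)
        p["b1"] -= lr * grad_h.sum(axis=0)
    return p

# Selection step: run the optimizer from K i.i.d. random initializations and
# keep the realization with the smallest empirical risk on the training data.
K = 10
candidates = [sgd_run() for _ in range(K)]
best = min(candidates, key=lambda p: empirical_risk(p, X, Y))
print("smallest empirical risk over", K, "initializations:",
      round(empirical_risk(best, X, Y), 4))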

Keywords

Deep learning / Artificial intelligence / Empirical risk minimization / Optimization

Cite this article

Arnulf Jentzen, Adrian Riekert. Strong Overall Error Analysis for the Training of Artificial Neural Networks Via Random Initializations. Communications in Mathematics and Statistics, 2023, 12(3): 385-434. https://doi.org/10.1007/s40304-022-00292-9

Funding
Deutsche Forschungsgemeinschaft (EXC 2044-390685587)
