1. Department of Computer Science, Faculty of Mathematical Science, University of Kashan, Kashan 87317-53153, Iran
2. Department of Applied Mathematics, Shiraz University of Technology, Shiraz 13876-71557, Iran
3. Department of Mathematical Sciences, University of Kashan, Kashan 87317-53153, Iran
s.fathi@sutech.ac.ir
History: Received 2023-03-25; Accepted 2023-10-13; Revised 2023-10-18; Published 2025-01-15
Abstract
In this paper, we propose a novel warm restart technique using a new logarithmic step size for the stochastic gradient descent (SGD) approach. For smooth and non-convex functions, we establish an $\mathcal{O}(1/\sqrt{T})$ convergence rate for SGD. We conduct a comprehensive set of experiments to demonstrate the efficiency of the newly proposed step size on the FashionMNIST, CIFAR10, and CIFAR100 datasets. Moreover, we compare our results with nine other existing approaches and demonstrate that the new logarithmic step size improves test accuracy by 0.9% on the CIFAR100 dataset when we utilize a convolutional neural network (CNN) model.
1 Introduction
Stochastic gradient descent (SGD), which dates back to the work of Robbins and Monro [1], is widely used for training modern deep neural networks (DNNs), which in turn achieve state-of-the-art results in multiple problem domains such as image classification [2, 3], object detection [4], and automatic machine translation [5].
The value of the step size (or learning rate) is crucial for the convergence rate of SGD. Selecting an appropriate step size in each iteration ensures that the SGD iterations converge to an optimal solution. If the step size is too large, it may prevent the SGD iterations from reaching the optimal point; conversely, an excessively small step size can lead to slow convergence or cause the iterates to stall at a suboptimal local minimum [6]. To address these challenges, various schemes have been proposed. One popular approach is the Armijo line search, first adapted to SGD by Vaswani et al. [7], which provides theoretical results for strongly convex, convex, and non-convex objective functions. Gower et al. [8] proposed another approach that combines a constant step size with a decreasing one: their algorithm starts with a constant step size and switches to a decreasing step size after a number of iterations proportional to the problem's condition number. This strategy guarantees convergence for strongly convex functions, but it requires knowledge of the condition number and is not applicable to non-convex functions. Additionally, decaying step sizes are commonly used in practice for the non-convex problems encountered in DNN training [2, 9].
Smith [10] proposed an efficient method for setting the step size known as the cyclical learning rate. Training neural networks with cyclical learning rates instead of fixed ones can yield significant improvements in accuracy without manual tuning, often in fewer iterations. Loshchilov and Hutter [11] built on this strategy and proposed a warm restart technique for SGD that does not require gradient information to update the step size at each iteration. The key idea behind warm restarts is that at each restart the learning rate is initialized to a specified value and then scheduled to decrease. Moreover, it has been shown that warm-restarted SGD requires 2–4 times less training time than a traditional learning rate adjustment strategy [12]. In recent years, numerous step sizes with warm restarts have been introduced [6, 13]. Vrbančič and Podgorelec [12] extended the idea of the cosine step size and proposed three different step sizes with warm restarts.
The convergence rate, from a theoretical standpoint, is a crucial metric for assessing the effectiveness and efficiency of an algorithm. In the case of the SGD algorithm, the convergence rate depends on various factors, such as the type of objective function (e.g., strongly convex, convex, or non-convex) and the value of the step size used in each iteration. For convex and smooth functions with Lipschitz continuous gradients, it has been proven that the SGD algorithm can achieve a convergence rate of $\mathcal{O}(1/\sqrt{T})$ [14, 15]. The rate of convergence for strongly convex functions is $\mathcal{O}(1/T)$ [16, 17]. When the objective is smooth and non-convex, Ghadimi and Lan [15] provided an $\mathcal{O}(1/\sqrt{T})$ rate of convergence for SGD with a constant step size. In another work, Li et al. [18] established an $\mathcal{O}(1/\sqrt{T})$ rate of convergence for both the cosine and exponential step sizes, which were first considered in [11]. Moreover, their empirical results confirm that both step sizes achieve the best accuracy and training loss. Ge et al. [19] considered the step decay schedule for solving least squares problems. Wang et al. [20] took into account the influence of the probability distribution of the output iterate on both the implementation and the theoretical aspects of the algorithm. They demonstrated that, within this framework, SGD with the exponential step size achieves a near-optimal convergence rate. Notably, their findings reveal that assigning a higher probability to the final iterations through the step size leads to improvements in accuracy and in the loss function.
In the context of many decay step sizes, the convergence rate of the SGD algorithm is commonly analyzed by considering specific values for the parameters associated with the step size. Several studies, including [15, 18, 20], have derived convergence rates for the SGD algorithm by imposing constraints or assumptions on the initial step size or on the parameters related to the step size. These conditions play a crucial role in ensuring the convergence properties of the algorithm. Tab.1 presents the convergence rates of various decay step sizes; the second column of the table indicates the conditions that must be satisfied by the initial step size or the associated parameters in order to achieve the stated convergence properties. As presented in Tab.1, to achieve a convergence rate of order $\mathcal{O}(1/\sqrt{T})$, Li et al. [18] imposed assumptions that force the initial step size $\eta_0$ to be very small. It is evident that for large values of $T$, the initial step size becomes extremely tiny, potentially impacting the algorithm's efficiency.
1.1 Contribution
Motivated by the aforementioned studies, our work introduces a novel logarithmic step size for stochastic gradient descent (SGD) with the warm restarts technique. The main contributions of this paper can be summarized as follows:
● The newly proposed step size offers a significant advantage over the cosine step size [18] in terms of the probability distribution used to select the output iterate in Theorem 1. This distribution determines the likelihood of a specific iterate being chosen as the output. Fig.1(a) illustrates that the cosine step size assigns a higher selection probability to the initial iterations but reduces the probability for the final iterations almost to zero. In contrast, the new step size is more effective for the final iterations, which enjoy a higher probability of selection than under the cosine step size (see the sketch after this list). Consequently, the new step size outperforms the cosine step size in the final iterations, which benefit from an increased likelihood of being chosen as the solution.
● For the new step size, we establish convergence results for the SGD algorithm. Under a condition that allows the initial step size to be larger than the initial step size required in [18], we demonstrate a convergence rate of $\mathcal{O}(1/\sqrt{T})$ for smooth non-convex functions. This rate matches the best-known convergence rate for smooth non-convex functions.
● We conduct a comparative experimental analysis of the new logarithmic step size against nine popular step sizes: the Armijo line search [7], the cosine step size [18], Adam, the $\mathcal{O}(1/t)$ and $\mathcal{O}(1/\sqrt{t})$ step sizes, a constant step size, ReduceLROnPlateau, stagewise decay with one milestone, and stagewise decay with two milestones. We demonstrate the effectiveness of the newly proposed step size in improving the accuracy and training loss of SGD on the FashionMNIST, CIFAR10, and CIFAR100 datasets (e.g., see Fig.1(c)). Notably, we observe a significant accuracy improvement of 0.9% on the CIFAR100 dataset when a convolutional neural network model is used.
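To make the first contribution concrete, the following minimal sketch compares the probability of selecting each iterate, $p_t = \eta_t / \sum_s \eta_s$, under a cosine schedule and under a slowly decaying logarithmic schedule. The logarithmic form used here, $\eta_0/(1+\ln(1+t))$, is only an illustrative stand-in for Eq. (4), which is not reproduced in this excerpt; the point is the qualitative tail behaviour, not the exact formula.

```python
# Compare the selection probability p_t = eta_t / sum_s eta_s for a cosine
# schedule and an illustrative logarithmic schedule (placeholder for Eq. (4)).
import math

T = 1000      # iterations per cycle (assumed for illustration)
eta0 = 0.5    # initial step size (assumed for illustration)

cosine = [0.5 * eta0 * (1 + math.cos(t * math.pi / T)) for t in range(1, T + 1)]
logar = [eta0 / (1.0 + math.log(1.0 + t)) for t in range(1, T + 1)]

def selection_probs(etas):
    total = sum(etas)
    return [e / total for e in etas]

p_cos, p_log = selection_probs(cosine), selection_probs(logar)

# Probability mass placed on the last 5% of iterations by each schedule:
# the cosine schedule drives it towards zero, the logarithmic one does not.
tail = slice(int(0.95 * T), T)
print("cosine tail mass:", sum(p_cos[tail]))
print("log    tail mass:", sum(p_log[tail]))
```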
The remaining sections of this paper are structured as follows. In Section 2, we introduce the new logarithmic step size and discuss its properties. Section 3 is dedicated to the analysis of convergence rates for the proposed step size on smooth non-convex functions, where we establish that the SGD method achieves a convergence rate of $\mathcal{O}(1/\sqrt{T})$. In Section 4, we present and discuss the numerical results obtained using the new decay step size. Finally, in Section 5, we summarize our findings and draw conclusions based on our study.
In this paper, we use the following notational conventions:
● The Euclidean norm of a vector $x$ is denoted by $\|x\|$.
● The nonnegative orthant and the positive orthant of $\mathbb{R}^d$ are denoted by $\mathbb{R}^d_{+}$ and $\mathbb{R}^d_{++}$, respectively.
● We write $a_t = \mathcal{O}(b_t)$ to indicate that there exists a positive constant $c$ such that $a_t \leq c\, b_t$ for all $t$.
2 Problem set-up
We consider the following optimization problem:
$$\min_{x \in \mathbb{R}^d} \ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$
where $f_i$ is the loss function for the $i$th training sample over the variable $x \in \mathbb{R}^d$ and $n$ denotes the number of samples. This minimization problem is central in machine learning. Several iterative approaches for solving Eq. (1) are known [21], and SGD is particularly popular when the dimensionality $d$ is extremely large. At iteration $t$, SGD uses a randomly chosen training sample $i_t$ to update $x_t$ using the rule:
$$x_{t+1} = x_t - \eta_t \nabla f_{i_t}(x_t),$$
in which $\eta_t$ is the step size used in iteration $t$ and $\nabla f_{i_t}(x_t)$ is the (average) gradient of the loss function(s) [7].
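The update rule above is easy to state in code. The following is a minimal, self-contained sketch of plain SGD on a toy finite-sum problem; the quadratic losses and the $1/\sqrt{t}$ step size are illustrative choices, not the paper's method.

```python
# Minimal SGD sketch: at each iteration draw a random sample index and move
# against that sample's gradient, scaled by the current step size.
import random

def sgd(grad_fi, n, x0, step_size, T):
    """grad_fi(i, x): gradient of the i-th sample loss at x; step_size(t): eta_t."""
    x = list(x0)
    for t in range(1, T + 1):
        i = random.randrange(n)                      # random training sample i_t
        g = grad_fi(i, x)
        x = [xj - step_size(t) * gj for xj, gj in zip(x, g)]
    return x

# Toy problem: f_i(x) = 0.5 * ||x - a_i||^2, so grad f_i(x) = x - a_i.
a = [[float(i), float(-i)] for i in range(10)]
grad = lambda i, x: [xj - aj for xj, aj in zip(x, a[i])]
x_hat = sgd(grad, n=10, x0=[5.0, 5.0], step_size=lambda t: 0.1 / t ** 0.5, T=2000)
print(x_hat)  # should be close to the minimizer, the mean of the a_i: [4.5, -4.5]
```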
Our analysis and results in this paper are based on the following assumptions.
● Assumption 1 The function $f$ is differentiable and $L$-smooth, that is, there exists a constant $L > 0$ such that for all $x, y \in \mathbb{R}^d$:
$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|.$$
● Assumption 2 For any random sample index $i$ and all $x \in \mathbb{R}^d$, the stochastic gradient has bounded variance, i.e., $\mathbb{E}\big[\|\nabla f_{i}(x) - \nabla f(x)\|^2\big] \leq \sigma^2$ for some $\sigma > 0$.
● Assumption 3 The objective function $f$ is bounded below on $\mathbb{R}^d$.
2.1 The new step size
It has been observed that larger step size values provide the model with sufficient energy to escape critical points in the initial iterations, while smaller step sizes guide the model towards well-behaved local minima in the final iterations [6]. However, when a step size takes large values over many iterations and then rapidly tends to zero towards the end, it adversely affects the behavior of the selection probability distribution: under the conditions described in Theorem 1, the probability of selecting points from the final iterations decreases significantly compared to the initial iterations. To address these issues, we propose a step size whose values decrease gradually over the iterations. In this context, we introduce a logarithmic step size that converges to zero more slowly than many other step sizes, yet faster than the cosine step size. The new step size is defined as follows:
Definition 1 For a constant , the new logarithmic step size for SGD is given by
where $T$ represents the number of iterations in each inner loop of the algorithm and $t$ represents the number of epochs performed since the last restart. To use the warm restart policy, we follow the model proposed in [10]: the training process is divided into cycles (i.e., the outer loop of Algorithm 1), and the algorithm runs each cycle for $T$ epochs. Each cycle begins with the largest step size value $\eta_0$. During a cycle, the learning rate decreases according to Eq. (4), and at the end of the cycle the smallest step size value is used. The behaviour of the newly proposed step size for different values of $T$ is shown in Fig.1(b). It is clear that for small values of $T$ the algorithm performs more inner loops and the step size values quickly go to zero, whereas when $T$ covers all epochs the algorithm performs a single inner loop and the step size goes to zero very slowly.
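The following sketch illustrates how such a warm-restarted schedule is generated: within each cycle of $T$ epochs the step size decays, and at every restart it is reset to $\eta_0$. The decay function used here, $\eta_0/(1+\ln(1+t))$, is only an illustrative placeholder for Eq. (4), whose exact form is given in the paper.

```python
# Illustrative warm-restart schedule: n_cycles cycles of T epochs each,
# resetting to eta0 at the start of every cycle.
import math

def log_step_size(eta0, t):
    """Illustrative logarithmic decay within a cycle (placeholder for Eq. (4))."""
    return eta0 / (1.0 + math.log(1.0 + t))

def warm_restart_schedule(eta0, T, n_cycles):
    schedule = []
    for _ in range(n_cycles):            # outer loop: one pass per restart
        for t in range(1, T + 1):        # inner loop: T epochs per cycle
            schedule.append(log_step_size(eta0, t))
    return schedule

etas = warm_restart_schedule(eta0=0.5, T=30, n_cycles=3)
print(etas[:3])      # large values at the start of the first cycle
print(etas[28:32])   # decayed values at the end of cycle 1, then the reset at cycle 2
```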
The lemma that follows gives us some properties of the function . We will utilize them frequently in the rest of the paper.
Lemma 1 For the function , we have
●
● .
● , for all .
3 Algorithm and convergence
We use the warm restart Algorithm 1 under a condition which means that the algorithm runs the same number of epochs in each inner loop. Algorithm 1 begins with the given initial step size, the number of inner iterations, the number of outer epochs, and the starting point. The algorithm consists of two loops, an outer loop and an inner loop; in each inner loop, SGD with the logarithmic step size is performed.
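The skeleton below is a hedged reading of this two-loop structure, not a verbatim reproduction of Algorithm 1: an outer loop over restarts and an inner loop of $T$ epochs whose counter drives the step size. For brevity it performs one stochastic gradient step per inner epoch; a full implementation would iterate over all mini-batches within each epoch.

```python
# Two-loop warm-restart SGD skeleton (outer loop = restarts, inner loop = epochs).
import random

def warm_restart_sgd(grad_fi, n, x0, step_size, T, n_restarts):
    """step_size(t, T): learning rate at inner epoch t (1-based) of a cycle of length T."""
    x = list(x0)
    for _ in range(n_restarts):              # outer loop: each warm restart
        for t in range(1, T + 1):            # inner loop: step size recomputed from t,
            i = random.randrange(n)          # which resets to 1 at every restart
            g = grad_fi(i, x)
            x = [xj - step_size(t, T) * gj for xj, gj in zip(x, g)]
    return x
```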
3.1 Convergence results for smooth functions
We present the convergence bound for SGD with the logarithmic step size when the objective function is smooth and non-convex. The lemma that follows provides a lower bound for the sum of the step sizes, $\sum_{t=1}^{T} \eta_t$.
Lemma 2 For the new proposed step size given by Eq. (4), we have
The next lemma gives an upper bound for the sum of the squared step sizes, $\sum_{t=1}^{T} \eta_t^2$.
Lemma 3 For the step size given by Eq. (4), we have
Lemma 4 Suppose that Assumptions 1 and 2 hold and that the step size satisfies an appropriate condition. Then, Algorithm 1 with the new proposed step size guarantees:
The next theorem gives an upper bound on the expected gradient norm at the output iterate.
Theorem 1 Suppose that Assumptions 1 and 2 hold and that the initial step size satisfies an appropriate condition. Then SGD with the new proposed step size guarantees:
where $\bar{x}_T$ is a random iterate drawn from the sequence $\{x_t\}_{t=1}^{T}$ with probability $\mathbb{P}(\bar{x}_T = x_t) = \eta_t / \sum_{s=1}^{T} \eta_s$.
Proof Using the definition of $\bar{x}_T$ and Lemma 4, we have
where the third inequality is obtained by using Lemmas 2 and 3. □
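To make the output rule concrete, the sketch below draws the reported iterate at random with probability proportional to its step size, the standard randomized output used in non-convex SGD analyses; this weighting is assumed here, since the theorem's exact weights are reconstructed rather than quoted.

```python
# Sample the output iterate with probability p_t proportional to eta_t.
import random

def sample_output(iterates, step_sizes, seed=0):
    """Return one iterate, chosen with probability eta_t / sum_s eta_s."""
    rng = random.Random(seed)
    total = sum(step_sizes)
    weights = [e / total for e in step_sizes]
    return rng.choices(iterates, weights=weights, k=1)[0]

etas = [0.5 / (1 + i) ** 0.5 for i in range(100)]   # a slowly decaying schedule
xs = list(range(100))                               # stand-ins for the iterates x_1, ..., x_T
print(sample_output(xs, etas))                      # later iterates keep a non-trivial chance
```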
As mentioned earlier, selecting a smaller value for this parameter than the one used for the cosine step size in [18] enables us to achieve the optimal convergence rate for SGD based on the new logarithmic step size. Specifically, we have the following result:
Corollary 1 Based on Theorem 1, with an appropriate choice of the step size parameters, we have
Now, leveraging the results obtained from Theorem 1, we can compute the convergence rate for the warm restart SGD algorithm.
Corollary 2 (SGD with warm restarts) Under Assumptions 1, 2, and 3, and for suitable choices of the algorithm parameters, Algorithm 1 ensures the following convergence guarantee:
where for .
4 Numerical results
4.1 Experiments on FashionMNIST, CIFAR10, and CIFAR100
In this section, we evaluate the proposed algorithm's performance on image classification tasks. We compare the performance of the proposed approach with that of state-of-the-art methods on the FashionMNIST, CIFAR10, and CIFAR100 datasets [3]. FashionMNIST is a dataset consisting of a training set of 60,000 grayscale images and a test set of 10,000 examples; each image is 28×28 pixels. We use a convolutional neural network (CNN) model for the classification task on this dataset. This model has two convolutional layers, two max-pooling layers, and two fully connected layers, with rectified linear unit (ReLU) activations in the hidden nodes. To prevent overfitting, we use dropout in the hidden layer of the deep model. The number of input neurons is 784, and the number of output neurons is 10. To compare the performance of the various algorithms, we use the cross-entropy function as the loss function and accuracy as the evaluation metric.
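A hedged PyTorch sketch of a CNN of this shape is given below. The text above elides the exact kernel sizes, padding, pooling sizes, hidden width, and dropout rate, so the values used here (5×5 kernels, padding 2, 2×2 max pooling, 128 hidden units, dropout 0.5) are placeholders, not the authors' configuration.

```python
# Sketch of a two-conv, two-FC CNN for 28x28 grayscale FashionMNIST images.
import torch.nn as nn

class FashionCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(),   # placeholder sizes
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128), nn.ReLU(),                   # 28 -> 14 -> 7 after pooling
            nn.Dropout(p=0.5),                                       # dropout in the hidden layer
            nn.Linear(128, num_classes),                             # 10 output classes
        )

    def forward(self, x):                                            # x: (N, 1, 28, 28)
        return self.classifier(self.features(x))

criterion = nn.CrossEntropyLoss()                                    # loss used for comparison
```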
The CIFAR10 dataset consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. The batch size is 128, so each training epoch comprises roughly 391 iterations. The deep learning architecture used to evaluate the algorithms on this dataset is a residual neural network (ResNet) [22]. This model uses cross-entropy as the loss function.
The CIFAR100 dataset is just like CIFAR10, except that it has 100 classes containing 600 natural images each. There are 500 training images and 100 testing images per class, for a total of 50,000 training images and 10,000 testing images. Randomly cropped and flipped images are used for training. We conducted two series of experiments on this dataset based on two deep learning models. The first is the same CNN used on FashionMNIST, described above; this model was chosen because of limitations in the hardware available for running the experiments. The second is a DenseNet-BC model [9].
To optimize the hyperparameter of the newly proposed step decay on the CIFAR100 dataset, we employed a two-stage grid search on the validation set. Initially, we explored a rough grid and picked the hyperparameters that produced the best validation outcomes. Subsequently, we conducted a more refined search centered around the most effective hyperparameters identified in the first stage and chose the optimal set of hyperparameters as the final selection. For the starting step size $\eta_0$, we used a coarse search grid of {0.00001, 0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1} and a fine grid of {0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59}, from which the best value of $\eta_0$ was selected. We compared the performance of our proposed method with state-of-the-art methods whose hyperparameter values were fine-tuned in a previous study [18], and we adopted the same hyperparameter values in our own numerical studies. The values of the hyperparameters are given in Tab.2. In the experiments on the CIFAR100 dataset using the convolutional neural network (CNN), the initial step size is set similarly to the cosine step decay method. The momentum parameter is set to the same value across all methods and datasets, and weight decay is applied with one value for FashionMNIST and CIFAR10 and another for CIFAR100 in all mentioned methods. Furthermore, a batch size of 128 is used in all of the experiments. The parameter $T$ is the ratio of the number of training samples to the batch size. The experiments are repeated with 5 random seeds to eliminate the influence of stochasticity.
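The two-stage search can be sketched as below. The function `validation_accuracy(eta0)` is a hypothetical stand-in for training the model with that starting step size and measuring validation accuracy; the coarse grid matches the one listed above, and the fine grid mirrors the ±0.09 range around the best coarse value.

```python
# Two-stage (coarse, then fine) grid search over the starting step size eta0.
def two_stage_search(validation_accuracy):
    coarse = [1e-5, 1e-4, 1e-3, 0.01, 0.1, 0.2, 0.3, 0.4,
              0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    best_coarse = max(coarse, key=validation_accuracy)

    # Fine grid centred on the best coarse value (step 0.01, centre excluded),
    # mirroring the 0.41, ..., 0.59 grid around 0.5 used in the paper.
    fine = [round(best_coarse + 0.01 * k, 5) for k in range(-9, 10) if k != 0]
    fine = [v for v in fine if v > 0]
    return max(fine + [best_coarse], key=validation_accuracy)

# Usage: best_eta0 = two_stage_search(my_validation_fn)
```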
4.2 Methods
We consider SGD with the following step sizes:
● the SGD constant step size,
● the $\mathcal{O}(1/\sqrt{t})$ step size,
● the $\mathcal{O}(1/t)$ step size,
● the cosine step size,
● the new logarithmic step size given by Eq. (4).
The parameter $t$ represents the iteration number of the inner loop, and each outer iteration consists of $T$ iterations for training on mini-batches. Moreover, we compare the results of the newly proposed logarithmic step size with Adam [23], SGD+Armijo [7], PyTorch's ReduceLROnPlateau scheduler (abbreviated as ReduceLROnPlateau), and stagewise step sizes. Note that, similar to [18], the term "stagewise" refers to the Stagewise - 1 Milestone and Stagewise - 2 Milestone methods. (As a side note, since we use Nesterov momentum in all SGD variants, the stagewise step decay essentially covers the performance of multistage accelerated algorithms, e.g., [24].) In all experiments, we follow the settings proposed in [18].
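For reference, the decaying baselines named above can be written as simple functions of the inner-loop iteration $t$ (1-based) and the cycle length $T$; the cosine form is the one commonly used in [11, 18], and the new logarithmic step size of Eq. (4) is deliberately not reproduced here.

```python
# Baseline step size schedules as functions of (eta0, t, T).
import math

def constant(eta0, t, T):
    return eta0

def inv_sqrt_t(eta0, t, T):                # the O(1/sqrt(t)) step size
    return eta0 / math.sqrt(t)

def inv_t(eta0, t, T):                     # the O(1/t) step size
    return eta0 / t

def cosine(eta0, t, T):                    # cosine step size, commonly eta0/2 * (1 + cos(pi*t/T))
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / T))
```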
4.3 Results and discussion
For clarity, we compare the methods in two groups, as illustrated in Fig.2 and Fig.3. In Fig.2, on the FashionMNIST dataset, the newly proposed step size drives the training loss close to zero, matching the best-known method (SGD+Armijo) after epoch 40, while its test accuracy outperforms all other methods. On the CIFAR10 dataset, the new logarithmic step size is as good as the best previously studied step size, and the newly proposed approach outperforms all the other methods in this group in terms of both training loss and test accuracy. On the other hand, as illustrated in Fig.3, the new logarithmic step size achieves the best training loss on CIFAR10, but the stagewise step decay performs better in terms of test accuracy than the other step sizes.
As previously mentioned, we assessed the performance of the new proposed step size on the CIFAR100 dataset using two deep models for image classification. The results, depicted in Fig.2 and Fig.3, demonstrate that the new proposed step size outperforms all other methods in both groups when the deep model is a convolutional neural network (CNN). Overall, the use of the DenseNet-BC model results in better performance for all step decays, as shown in Fig.4 and Fig.5. Notably, the new proposed step size performs better than the state-of-the-art method (i.e., cosine step decay).
Tab.3–Tab.5 report the average final training loss and test accuracy obtained over 5 runs starting from different random seeds on the FashionMNIST, CIFAR10, and CIFAR100 datasets, respectively.
From Tab.3–Tab.5, we can conclude that:
● From Tab.3, the SGD+Armijo step size obtained the best training loss; however, the proposed step size improved the test accuracy of SGD.
● Tab.4 shows that the new proposed step size has the best training loss and the cosine step size has the best test accuracy.
● From Tab.5 we can conclude that the new proposed step size improves the test accuracy of SGD by 0.9%.
● Based on the information presented in Tab.6, the new proposed step size leads to improvements in both training loss and test accuracy compared to the cosine step decay method.
5 Conclusion
We introduced a new logarithmic step size with a warm restart technique for SGD. We showed that the new step size goes to zero more slowly than most existing step sizes, and that it achieves an $\mathcal{O}(1/\sqrt{T})$ convergence rate for SGD with a smooth non-convex objective function. Finally, to demonstrate the effectiveness of the newly proposed step size in practice, we conducted a wide range of experiments on three well-known datasets, namely FashionMNIST, CIFAR10, and CIFAR100, and compared the results with nine other step sizes. For the CIFAR100 dataset, the proposed step size improved the accuracy of the algorithm by 0.9% when we utilized a convolutional neural network model. For the two other datasets, the newly proposed step size also obtained good results.
Further investigation into the convergence rate of SGD based on the logarithmic step size is warranted for convex and strongly convex objective functions. Additionally, while this paper focuses on the convergence rate for smooth non-convex functions that do not satisfy the PL condition, it would be interesting to examine the convergence rate for smooth non-convex functions that satisfy the PL condition.
[1] Robbins H, Monro S. A stochastic approximation method. The Annals of Mathematical Statistics, 1951, 22(3): 400–407
[2] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017, 60(6): 84–90
[3] Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. Toronto: University of Toronto, Department of Computer Science, 2009
[4] Redmon J, Farhadi A. YOLO9000: better, faster, stronger. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6517–6525
[5] Zhang J, Zong C. Deep neural networks in machine translation: an overview. IEEE Intelligent Systems, 2015, 30(5): 16–25
[6] Mishra P, Sarawadekar K. Polynomial learning rate policy with warm restart for deep neural network. In: Proceedings of 2019 IEEE Region 10 Conference (TENCON). 2019, 2087–2092
[7] Vaswani S, Mishkin A, Laradji I, Schmidt M, Gidel G, Lacoste-Julien S. Painless stochastic gradient: interpolation, line-search, and convergence rates. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 335
[8] Gower R M, Loizou N, Qian X, Sailanbayev A, Shulgin E, Richtárik P. SGD: general analysis and improved rates. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 5200–5209
[9] Huang G, Liu Z, Van Der Maaten L, Weinberger K Q. Densely connected convolutional networks. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2261–2269
[10] Smith L N. Cyclical learning rates for training neural networks. In: Proceedings of 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). 2017, 464–472
[11] Loshchilov I, Hutter F. SGDR: stochastic gradient descent with warm restarts. In: Proceedings of the 5th International Conference on Learning Representations. 2017
[12] Vrbančič G, Podgorelec V. Efficient ensemble for image-based identification of pneumonia utilizing deep CNN and SGD with warm restarts. Expert Systems with Applications, 2022, 187: 115834
[13] Xu G, Cao H, Dong Y, Yue C, Zou Y. Stochastic gradient descent with step cosine warm restarts for pathological lymph node image classification via PET/CT images. In: Proceedings of the 5th IEEE International Conference on Signal and Image Processing (ICSIP). 2020, 490–493
[14] Nemirovski A, Juditsky A, Lan G, Shapiro A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009, 19(4): 1574–1609
[15] Ghadimi S, Lan G H. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 2013, 23(4): 2341–2368
[16] Bach F, Moulines E. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Proceedings of the 24th International Conference on Neural Information Processing Systems. 2011, 451–459
[17] Rakhlin A, Shamir O, Sridharan K. Making gradient descent optimal for strongly convex stochastic optimization. In: Proceedings of the 29th International Conference on Machine Learning. 2012
[18] Li X, Zhuang Z, Orabona F. A second look at exponential and cosine step sizes: simplicity, adaptivity, and performance. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 6553–6564
[19] Ge R, Kakade S M, Kidambi R, Netrapalli P. The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 1341
[20] Wang X, Magnússon S, Johansson M. On the convergence of step decay step-size for stochastic optimization. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 14226–14238
[21] Nocedal J, Wright S J. Numerical Optimization. New York: Springer, 1999
[22] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770–778
[23] Kingma D P, Ba J. Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations. 2015
[24] Aybat N S, Fallah A, Gurbuzbalaban M, Ozdaglar A. A universally optimal multistage accelerated stochastic gradient method. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 765
[25] Thomas G B, Finney R L, Weir M D, Giordano F R. Thomas' Calculus, Early Transcendentals. 10th ed. Boston: Addison Wesley, 2002