Vanishing Gradient and Exploding Gradient Problem

Ashwin Jain
4 min read · Jun 1, 2021


Photo by fabio on Unsplash

Introduction

In the early 1990s, researchers ran into serious trouble when training deep neural networks: they were unable to achieve good accuracy with the models they trained. The root cause was improper updates to the parameters, that is, the weights and biases. Weights and biases are updated over multiple epochs of forward and backward propagation while we train a deep learning model on the inputs and outputs provided by our dataset, with the goal of reducing the loss and reaching the global minimum.
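To make this update concrete, here is a minimal sketch in plain NumPy of one gradient-descent step on a single weight and bias, assuming a squared-error loss on a single (x, y) pair; all names and values are illustrative, not a real training setup.

```python
# Minimal sketch of gradient-descent updates for a single weight and bias,
# assuming a squared-error loss on one (x, y) pair; values are illustrative.
import numpy as np

def train_step(w, b, x, y, lr=0.1):
    y_hat = w * x + b                 # forward pass
    dL_dw = 2 * (y_hat - y) * x       # gradient of (y_hat - y)**2 w.r.t. w
    dL_db = 2 * (y_hat - y)           # gradient w.r.t. b
    w_new = w - lr * dL_dw            # weight update: w_new = w_old - lr * dL/dw
    b_new = b - lr * dL_db
    return w_new, b_new

w, b = 0.5, 0.0
for epoch in range(100):              # repeated forward + backward passes
    w, b = train_step(w, b, x=2.0, y=4.0)
print(w, b)                           # w * 2 + b approaches the target 4
```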

These changes in weights and biases are driven by derivatives, or gradients. The gradients play a crucial role in determining the accuracy of the model, so problems with the gradients also lead to poor accuracy. Gradient problems come in two types: the vanishing gradient problem and the exploding gradient problem.

Vanishing gradient problem

If we look at the weight update formula, W(new) = W(old) − η · ∂L/∂W, we can see that each weight changes in proportion to the derivative of the loss with respect to that weight. So what happens if these gradients become too small? This is where the vanishing gradient problem comes into the picture: as the depth of the neural network increases, the gradient terms become exponentially smaller.

This results in very minute (or no) changes to the weights, which are exactly the changes needed to reach the global minimum and get higher accuracy. Thus W(new) ≈ W(old), and we are unable to reach the global minimum. This problem arises due to:

  1. Deeper neural networks: With many layers, the chain rule multiplies many derivative terms together, which shrinks the gradient exponentially and leaves a very small value.
  2. Activation function: Choosing activation functions like sigmoid or tanh contributes to the vanishing gradient problem; their derivatives are much less than one (at most 0.25 for sigmoid), so every factor in the chain rule reduces the gradient further, as the sketch after this list shows.
  3. Learning rate: Keeping a very small learning rate (a hyperparameter) further shrinks each weight update, so the model suffers from the vanishing gradient problem.
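To see point 2 in action, here is a minimal NumPy sketch of how the chain rule shrinks the gradient with depth. It assumes a toy stack of sigmoid layers whose pre-activations all sit at zero, which is purely an illustrative case rather than a real network.

```python
# A minimal sketch of how the chain rule shrinks gradients with depth,
# assuming a toy network of `depth` sigmoid layers with pre-activation 0
# at every layer (illustrative numbers only).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # at most 0.25, reached at z = 0

for depth in (2, 5, 10, 20):
    # Backprop multiplies one derivative factor per layer (chain rule).
    grad = np.prod([sigmoid_derivative(0.0) for _ in range(depth)])
    print(depth, grad)                # 0.0625, ~9.8e-4, ~9.5e-7, ~9.1e-13
```

Even in this best case for the sigmoid (its derivative is largest at zero), twenty layers already push the gradient factor down to roughly 1e-12, which is why the early layers of a deep sigmoid network barely move.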

It is necessary for our model to avoid the vanishing gradient problem, since it leaves the model effectively stationary and we cannot reach the global minimum. Another reason to remove it is that training becomes computationally more expensive, because progress is far too slow.

Exploding Gradient Problem

The exploding gradient problem can be seen as the inverse of the vanishing gradient problem. Where the vanishing gradient problem makes the gradient term too small, in the exploding gradient problem the gradient term is too large for stable training. The weights overshoot the global minimum, so we are unable to get good accuracy from the model. Factors that result in the exploding gradient problem are:

  1. Improper weight initialization: If the weights are initialized with values that are too large, the gradients grow with depth and the updates overshoot the required minimum, as the sketch after this list shows.
  2. Learning rate: If we choose a learning rate that is too large, training becomes unstable; the loss keeps overshooting the minimum instead of converging to it.
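As a counterpart to the vanishing case, here is a minimal sketch of how large weights blow up the backpropagated gradient. It assumes a toy deep linear network where each layer contributes one factor of its weight; the numbers are illustrative only.

```python
# A minimal sketch of exploding gradients, assuming a toy deep linear network
# where each layer multiplies the backpropagated gradient by its weight
# (illustrative values only).
def backprop_factor(weight, depth):
    # With linear layers, the gradient picks up one factor of `weight` per layer.
    return weight ** depth

for weight in (0.5, 1.0, 1.5):
    print(weight, [backprop_factor(weight, d) for d in (5, 10, 20)])
# weight < 1 shrinks the gradient (vanishing), weight > 1 blows it up:
# 1.5 ** 20 is already about 3300, and a large learning rate amplifies
# the resulting overshoot even further.
```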

It is necessary to avoid this kind of problem, since it is computationally expensive: the model is unable to reach the minimum, so training can effectively take unbounded time. It also makes learning quite unstable, and stability is a must when it comes to model training.

Conclusion

We can conclude that both problems are bad for our model, so we need to identify them and try to remove them. Some popular tricks are to decrease the number of hidden layers in the neural network, do proper weight initialization, choose a suitable activation function for all layers, and pick hyperparameters that are neither so small that they cause the vanishing gradient problem nor so large that they cause the exploding gradient problem. A couple of these fixes are sketched below.
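As an illustration of two of these fixes, here is a minimal NumPy sketch of He-style weight initialization and gradient-norm clipping; the scale factor, clipping threshold, and layer sizes are assumptions made for the example, not tuned values.

```python
# A minimal sketch of two common mitigations: He-style weight initialization
# and gradient-norm clipping. Sizes and thresholds are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # Scale weights by sqrt(2 / fan_in) so activations and gradients keep a
    # sensible scale as depth grows (a common choice for ReLU layers).
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def clip_gradient(grad, max_norm=1.0):
    # Rescale the gradient if its norm exceeds max_norm, which bounds the
    # size of any single update and prevents overshooting the minimum.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

W = he_init(256, 128)
g = rng.normal(0.0, 10.0, size=W.shape)     # a deliberately large gradient
print(np.linalg.norm(g), np.linalg.norm(clip_gradient(g)))  # ~1800 -> 1.0
```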
