Back propagation
Prerequisites
Just to be clear: before moving on, make sure you really understand why we compute derivatives and how SGD uses them to update the parameters. These two videos will help.
After that, you need to understand how we find the derivatives of a complex function. I suggest you go through either of these two resources (the video or the notes from cs231n), preferably both.
Do not move ahead until you really understand backpropagation and gradient descent. Otherwise, you won't be able to focus on the vectorization part of this.
The notes will explain back propagation using simple computation graphs.
Representing forward prop as a simple computational graph
Let's look at the figure that explains the whole process once again. It looks like a simple computational graph with a few multiplications, additions, and sigmoid functions at each layer.

As mentioned in the cs231n article above, the sigmoid can be broken down further into primitive operations like exponentiation and division. So I want to convince you that the whole forward propagation is one big computational graph.
Here, instead of showing all those internal operations, we will abstract them away and show only the high-level operations we already know, like sigmoid.
Notation
At this point, you must have understood that we need to find the derivative of the loss function (L) with respect to variables like W3, B3, A2, ... so that we can perform the gradient update step.
The derivative of the loss with respect to any intermediate variable gets a `d` prefix, e.g., the derivative of the loss with respect to W3 is written as dW3, i.e., dW3 = ∂L/∂W3.
On the other hand, the derivative of one intermediate variable with respect to another is written normally, e.g., the derivative of A3 with respect to Z3 as ∂A3/∂Z3.
Back propagation
First, we will build an intuition for how to calculate the gradients for all the weight and bias matrices, starting from the final layer and going back to the first layer, without actually computing the derivatives. By the end, you will be able to write a general expression for the gradients at every layer, so that your implementation can be a simple for loop. Just to get familiar with the process, you can watch this video.
Then we will dive into actually calculating the gradients. We will explain how to calculate the gradient for a single weight entry, and then vectorize it for the entire weight matrix at that layer.
Output Layer

From this diagram, we can clearly see that L is a function of A3, and A3 is a function of Z3. It is worth mentioning why sigmoid is the activation function here: as this is the last layer, the outputs should be probabilities for each class. Since we are dealing with simple binary classification, sigmoid is sufficient.
If we have more than two classes, we should use softmax as the activation function instead (you get to implement that at the end). Whatever the activation function is, it is enough to say that A3 is a function of Z3. As you already know, the gradients of the loss w.r.t. these variables can be written using the chain rule as follows:
dZ3 = dA3 · ∂A3/∂Z3, dW3 = dZ3 · ∂Z3/∂W3, dB3 = dZ3 · ∂Z3/∂B3, dA2 = dZ3 · ∂Z3/∂A2
Take your time and understand the above derivatives, as this is the key to generalising them to all the layers.
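As a concrete sketch of these chain-rule steps in NumPy: the shapes, the binary cross-entropy loss, and the random values below are illustrative assumptions, not taken from the figures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed shapes for a single example: A2 is (2 x 1), W3 is (1 x 2), B3 is (1 x 1).
rng = np.random.default_rng(0)
A2 = rng.random((2, 1))
W3 = rng.random((1, 2))
B3 = rng.random((1, 1))
Y = np.array([[1.0]])                     # true label

Z3 = W3 @ A2 + B3                         # pre-activation
A3 = sigmoid(Z3)                          # prediction
L = -(Y * np.log(A3) + (1 - Y) * np.log(1 - A3))  # BCE loss (assumed)

dA3 = -(Y / A3) + (1 - Y) / (1 - A3)      # ∂L/∂A3, directly from the loss
dZ3 = dA3 * A3 * (1 - A3)                 # chain rule: dA3 · ∂A3/∂Z3
dW3 = dZ3 @ A2.T                          # ∂L/∂W3
dB3 = dZ3                                 # ∂L/∂B3
dA2 = W3.T @ dZ3                          # passed back to layer 2
```

For sigmoid with BCE, the product dA3 · A3(1 − A3) algebraically simplifies to A3 − Y, which is a handy sanity check on the code above.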
Layer - 2 (Second Hidden Layer)

And the gradients for the intermediate variables at this layer look as follows. As a little sneak peek: we can reuse the term dA2, since we already found it at the output layer.
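The hidden-layer pattern can be sketched the same way; the shapes and the incoming dA2 values below are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed shapes: A1 is (3 x 1), W2 is (2 x 3), B2 is (2 x 1).
rng = np.random.default_rng(4)
A1 = rng.random((3, 1))
W2 = rng.random((2, 3))
B2 = rng.random((2, 1))
dA2 = rng.random((2, 1))          # already computed at the output layer

Z2 = W2 @ A1 + B2
A2 = sigmoid(Z2)
dZ2 = dA2 * A2 * (1 - A2)         # ∂L/∂Z2 = dA2 ⊙ σ'(Z2)
dW2 = dZ2 @ A1.T                  # ∂L/∂W2
dB2 = dZ2                         # ∂L/∂B2
dA1 = W2.T @ dZ2                  # handed back to layer 1
```

The only difference from the output layer is where dA2 comes from: here it arrives from the next layer rather than from the loss.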
Layer - 1 (First Hidden Layer)

This is the first hidden layer's graph, and it is where the backward propagation ends. As you have already done this step for the previous layers, it is easy to write the equations for the gradients.
Weird! Why didn't we calculate the gradient for A0? Because A0 is the input: it is a constant and does not need to be learned during training.
By now, you should have a general idea of how to generalise the gradient computation at each layer, and you should be able to code it out.
Computing Gradients
From the above equations, it is clear that we need to find all the dZ terms. The computation is slightly different for the output layer versus the intermediate layers: at the final layer, Z3 connects (through A3) to the loss function, while at an intermediate layer, Z connects to the activation at that layer, whose gradient arrives from the layer after it. So the generalisation for the dZ gradients is the same for all layers except the output layer. This may sound confusing now; I mention it beforehand just as a heads up, and it will become clear as we go.
First, we will look at the derivatives in the first hidden layer instead of the output layer. This is better because it generalises to layers with multiple neurons in both the previous and the current layer; in our case, the output layer has only one neuron. So we go from left to right.
While coding it out, there are dependencies in the gradient computation, so it has to be done backwards. Going left to right here is just for the sake of explanation. Hope this is clear.
Layer - 1 (First Hidden Layer)

We still need to find dA1 and dZ1 in order to find dW1 and dB1, as dW1 = dZ1 · A0^T and dB1 = dZ1 (where dZ1 = dA1 ⊙ σ'(Z1)).

Layer - 2 (Second Hidden Layer)

We need to find dA2 and dZ2 in order to find dW2 and dB2, as dW2 = dZ2 · A1^T and dB2 = dZ2 (where dZ2 = dA2 ⊙ σ'(Z2)). As mentioned previously, dA2 is assumed to be pre-computed, ie., it would've been computed at the Output layer (the next layer in the sequence) itself.

dA1 will be calculated exactly as above. But I would strongly recommend deriving it again yourself to verify the same.
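One handy way to verify your derivation is a numerical gradient check. The sketch below compares dW2 = dZ2 · A1^T against central differences; the shapes, the incoming `dA2`, and the surrogate scalar loss are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
A1 = rng.random((3, 1))
W2 = rng.random((2, 3))
B2 = rng.random((2, 1))
dA2 = rng.random((2, 1))  # stand-in for the gradient arriving from the next layer

def surrogate_loss(W):
    # Scalar function whose gradient w.r.t. W is exactly dZ2 @ A1.T,
    # because the derivative of sum(dA2 * A2) w.r.t. A2 is the fixed dA2.
    A2 = sigmoid(W @ A1 + B2)
    return float((dA2 * A2).sum())

# analytic gradient via the backprop formulas
A2 = sigmoid(W2 @ A1 + B2)
dZ2 = dA2 * A2 * (1 - A2)
dW2 = dZ2 @ A1.T

# numerical gradient via central differences, one entry at a time
eps = 1e-6
numerical = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numerical[i, j] = (surrogate_loss(Wp) - surrogate_loss(Wm)) / (2 * eps)
```

If the two matrices agree to within a small tolerance, the analytic formula is almost certainly right.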
Lastly, we need to calculate a different kind of gradient at the final layer, ie., dZ3.
Output Layer

Unlike the hidden layers, the derivative for Z3 is a little different, because A3 is connected to the loss function directly: dA3 comes straight from the loss rather than from a next layer, so the pattern we found previously doesn't hold.
The gradients are in red and the inputs are in green. We are trying to compute dZ3, ie., the derivative of the loss w.r.t. Z3: we find the intermediate gradients (dA3 and ∂A3/∂Z3) and multiply them (chain rule of differentiation). The remaining derivatives, for W3, B3 and A2, follow the same pattern for this layer too and are mentioned below.

Pseudo code
class DNN():
    ...
    ...
    ...
    ...
    def forward(self, X):
        """
        X : input
        returns out (last layer), cache
        """

    def compute_gradients(self, out, cache):
        """
        out : output of the last layer (output layer), ie., A_3
        returns: dictionary that has the gradients for all
                 the variables
        """
        grads = {}
        # output layer
        cur_layer = 3
        grads[f'Z{cur_layer}'] = ...  # grads['Z3']
        # compute grads for 'W3', 'B3' and 'A2' as well

        # gradients for the other layers:
        # loop from the last hidden layer to the first hidden layer,
        # compute and store the gradients in the `grads` dictionary.
        # ex: for the second hidden layer, we need
        #   1. `A2` and `dA2` (computed in the prev loop) --> to calculate `dZ2`
        #   2. `dZ2` and `A1` --> to calculate `dW2`
        #   3. `dZ2` --> to calculate `dB2`
        #   4. `W2` and `dZ2` --> to calculate `dA1` (useful in the next loop)
        # so, during forward propagation, we need to cache A2 (output),
        # A1 (input), W2 and Z2.
        # for the first hidden layer, computing `dA0` (the gradient for
        # the input) is not required.
        return grads
    ...
    ...
    ...
    ...
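One way the skeleton could be fleshed out, as a sketch rather than a definitive implementation. Everything not fixed by the text is an assumption: sigmoid at every layer, binary cross-entropy loss, layer sizes 2 → 3 → 2 → 1, batch-averaged gradients, and storing the results under keys like `W1`/`B1` (the skeleton above stores the dZ terms under `Z3`-style keys instead).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DNN:
    def __init__(self, sizes=(2, 3, 2, 1), seed=0):
        rng = np.random.default_rng(seed)
        self.L = len(sizes) - 1                  # number of layers (3 here)
        self.params = {}
        for l in range(1, self.L + 1):
            self.params[f'W{l}'] = rng.standard_normal((sizes[l], sizes[l - 1]))
            self.params[f'B{l}'] = np.zeros((sizes[l], 1))

    def forward(self, X):
        cache = {'A0': X}                        # input is "A0"
        A = X
        for l in range(1, self.L + 1):
            Z = self.params[f'W{l}'] @ A + self.params[f'B{l}']
            A = sigmoid(Z)
            cache[f'Z{l}'], cache[f'A{l}'] = Z, A
        return A, cache

    def compute_gradients(self, Y, cache):
        grads = {}
        m = Y.shape[1]                           # batch size
        # output layer: sigmoid + BCE collapses to dZ3 = A3 - Y
        dZ = cache[f'A{self.L}'] - Y
        dA = None
        for l in range(self.L, 0, -1):
            if l != self.L:
                A = cache[f'A{l}']
                dZ = dA * A * (1 - A)            # dZl = dAl ⊙ σ'(Zl)
            grads[f'W{l}'] = (dZ @ cache[f'A{l - 1}'].T) / m
            grads[f'B{l}'] = dZ.sum(axis=1, keepdims=True) / m
            if l > 1:
                dA = self.params[f'W{l}'].T @ dZ  # dA(l-1) for the next pass
        return grads
```

Note that the loop runs backwards (output layer first), exactly because each dA depends on the layer after it.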
For batch input, the gradients will change as follows
dZ now has multiple columns instead of a single column; each column is the gradient for a single training example. If the input size is (2 x m), there are m columns. To get dB, we sum the columns and then divide by m to get the average gradient.
dW already has the cumulative gradients in each entry as it is calculated, because the product dZ · A^T (with A being the previous layer's activations) sums over the examples. The shape doesn't change, so we just need to divide by m to get the average gradient.
You will understand this if you work out an example with more than one input. In this case, the input size is (2 x m), where m is the number of input examples.
You need to write the program so that it works for batch input rather than single input. Even though I explained the changes in the formulas, it's better to work them out with an example before implementing.
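The two batch rules can be checked with a tiny sketch; the shapes and values below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 4                             # number of examples in the batch
dZ2 = rng.random((2, m))          # one column of gradients per example
A1 = rng.random((3, m))           # previous layer's activations, also m columns

dW2 = (dZ2 @ A1.T) / m                     # matmul already sums over examples
dB2 = dZ2.sum(axis=1, keepdims=True) / m   # sum the m columns, then average

# the same average computed example by example, for comparison
dW2_loop = sum(np.outer(dZ2[:, i], A1[:, i]) for i in range(m)) / m
```

The per-example loop and the single matrix product give identical results, which is exactly why the vectorized formula works.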
Task
- Take a batch of input of shape (2 x m), where m is the number of input samples, work out the forward and backward propagation, and change the gradient formulas wherever necessary.
- Observe all the gradient formulas, identify the matrices required at each layer to compute the gradients in that layer, and take note of them.
- Modify the forward propagation to store all the intermediate matrices you identified in the above step. (You can use dictionaries to store the variables, with keys of the form variable + layer_number. Ex: use `W1` to store the weight matrix at the first hidden layer, `A2` as the key to store the outputs from the second hidden layer, etc.)
- Write a function `gradients` that calculates the gradients for all the weight and bias matrices at all the layers. The cache that you stored during the forward propagation will come in handy here.