Just to be clear, I need to make sure that you really understand why we are finding derivatives and how SGD is used to update the parameters. These two videos would help.
After this, you need to understand how we find the derivatives in our complex function. I suggest you go through either of these two resources (video OR notes from cs231n), preferably both.
Do not move ahead until you really understand backpropagation and gradient descent. Otherwise, you can't really focus on the vectorization part of this.
The notes will explain back propagation using simple computation graphs.
Representing forward prop as a simple computational graph
Let's look at the figure that explains the whole process once again. It looks somewhat like a simple computational graph with a few multiplications, additions, and sigmoid functions at each layer.
As mentioned in the cs231n article above, sigmoid can be further divided into operations like exponentials, fractions, etc. So I want you to convince yourself that the whole forward propagation is one big, complex computational graph.
Here, instead of showing all those internal operations, we will abstract them away and show only the high-level operations we already know, like sigmoid.
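To make the graph concrete, here is a minimal numpy sketch of that forward pass. The hidden-layer sizes (3 neurons each) are just assumptions for illustration; only the 2-dimensional input and the single sigmoid output come from our setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 2 inputs, two hidden layers (sizes assumed), 1 output neuron.
rng = np.random.default_rng(0)
W1, B1 = rng.standard_normal((3, 2)), np.zeros((3, 1))
W2, B2 = rng.standard_normal((3, 3)), np.zeros((3, 1))
W3, B3 = rng.standard_normal((1, 3)), np.zeros((1, 1))

A0 = rng.standard_normal((2, 1))   # one input example

# Each pair of lines below is one group of nodes in the computational graph.
Z1 = W1 @ A0 + B1                  # first hidden layer
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + B2                  # second hidden layer
A2 = sigmoid(Z2)
Z3 = W3 @ A2 + B3                  # output layer
A3 = sigmoid(Z3)                   # probability for the positive class
```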
Notation
At this point, you must have understood that we need to find the derivative of the loss function (L) with respect to variables like W3, B3, A2, ... so that we can perform the gradient update step.
We will denote the derivative of the loss (L) with respect to any intermediate variable like Z3, i.e., ∂L/∂Z3, as dZ3, and similarly the derivative of the loss (L) with respect to A3, i.e., ∂L/∂A3, as dA3.
On the other hand, the derivative of any intermediate variable with respect to any other intermediate variable would be represented in the usual notation, e.g., the derivative of A3 with respect to Z3 as ∂A3/∂Z3.
Back propagation
First, we will build an intuition for how we can calculate the gradients for all the weight and bias matrices, starting from the final layer back to the first layer, without actually computing the derivatives. At the end, you will be able to write a general representation for the gradients at every layer, so that it fits into a simple for loop when you implement it. Just to get familiar with the process, you can watch this video.
Then we will deep dive into calculating the gradients. Here we will explain how to calculate the gradient for a single weight entry and then vectorize it for the entire weight matrix at that layer.
Output Layer
From this diagram, we can clearly see that Z3 is a function of W3, A2 and B3, and A3 is a function of Z3. It is worth mentioning why sigmoid is used as the activation function here. As this is the last layer, the outputs should be probabilities for each class. Here, we are dealing with a simple binary classification, and thus sigmoid is sufficient.
If we had more than two classes, we would use softmax as the activation function; at the end, you get to implement that. Whatever the activation function is, it is enough to say that A3 is a function of Z3. As you already know, the gradients of L w.r.t. W3, A2 and B3 can be written using the chain rule as follows.
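For reference, written out in our notation, those chain-rule expansions read:

dA3 = ∂L/∂A3
dZ3 = dA3 · ∂A3/∂Z3
dW3 = dZ3 · ∂Z3/∂W3
dB3 = dZ3 · ∂Z3/∂B3
dA2 = dZ3 · ∂Z3/∂A2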
Take your time and understand the above derivatives as this is the key to make it more generic to all the layers.
Layer - 2 (Second Hidden Layer)
And the gradients for the intermediate variables at this layer would look as follows. A little sneak peek: we can easily find the term dZ2, as we already found dA2 previously.
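In the same notation, the layer-2 gradients expand as:

dZ2 = dA2 · ∂A2/∂Z2
dW2 = dZ2 · ∂Z2/∂W2
dB2 = dZ2 · ∂Z2/∂B2
dA1 = dZ2 · ∂Z2/∂A1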
This is the first hidden layer's graph, and it is where the backward propagation to compute the gradients ends. As you have already done this step for the previous layers, it is easy to write the equations to compute the gradients.
Weird! Why didn't we calculate the gradients for A0? It is the input, which is a constant, and thus it need not be learned while training.
There are times when we might want to learn the input as well. For example, with word embeddings, we feed in a random vector for each word and learn those weights OR vectors for each word through back propagation. Just wanted you to know.
By now, you should have a general idea of how to generalise the gradient computation at each layer, and you should be able to code that out.
Computing Gradients
From the above equations, it is clear that we need to find all the dZl's, where l = layer. And this is quite different for the output layer and the intermediate layers: at the final layer, Z connects to the loss function, while at the intermediate layers, Z is connected to the activation at that layer. So the generalisation for the gradients of Z is the same for all the layers except the output layer. You might be confused by reading this; I just want to mention it beforehand to give you a heads up. Don't worry if you couldn't follow it yet, it will become clear as we go.
First, we will look into the derivatives in the first hidden layer instead of the output layer. This is better because it generalises to a layer that has multiple neurons in both the previous and the current layer; in our case, we have only one neuron in the output layer. So, we go from left to right.
While coding it out, there is a dependency in the gradient computation, so it has to be done backwards. Here, just for the sake of explaining things, we go from left to right. Hope this is clear enough.
Layer - 1 (First Hidden Layer)
In the above diagram, it is important to mention a few things.
dZ1 is what we have to find.
dA1 is assumed to have been computed previously, while computing the gradients in the second hidden layer.
∗ → element-wise multiplication.
We still need to find ∂Z1/∂W1 and ∂Z1/∂B1 in order to find dW1 and dB1, as dW1 = ∂Z1/∂W1 · dZ1 and dB1 = ∂Z1/∂B1 · dZ1 (a short sketch follows this list).
⋯ in the matrices represents broadcasting. It happens within numpy or tensorflow while doing the matrix multiplication.
The shapes provided assume only one input example. If you pass more than a single point, the shapes will change.
Even for a batch of inputs, the final vectorization remains the same, except that you need to divide the matrix multiplication by the size of the batch. To understand that, try to work out the forward propagation with a batch of inputs and the backward propagation with the same.
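Putting the single-example items above together in numpy (a minimal sketch, assuming a sigmoid activation so that ∂A1/∂Z1 = A1 ∗ (1 − A1); the function name is just for illustration):

```python
def layer1_gradients(dA1, A1, A0):
    """Layer-1 gradients for a single example (sketch, sigmoid activation assumed)."""
    dZ1 = dA1 * A1 * (1 - A1)   # element-wise: dA1 * ∂A1/∂Z1
    dW1 = dZ1 @ A0.T            # dZ1 * ∂Z1/∂W1 -> outer product with the input
    dB1 = dZ1                   # ∂Z1/∂B1 is the identity for a single example
    return dZ1, dW1, dB1
```

For a batch, the same lines pick up a division by the batch size, as noted in the last point above.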
Layer - 2 (Second Hidden Layer)
We need to find ∂Z2/∂W2 and ∂Z2/∂B2 in order to find dW2 and dB2, as dW2 = ∂Z2/∂W2 · dZ2 and dB2 = ∂Z2/∂B2 · dZ2. As mentioned previously, dA2 is assumed to be pre-computed, i.e., it would have been computed at the Output layer (the next layer in the sequence) itself.
dW2 and dB2 will be calculated exactly as above. But I would strongly recommend working it out again by yourself to verify the same.
Lastly, we need to calculate a different kind of gradient at the final layer, i.e., dZ3.
Output Layer
Unlike in the hidden layers, the derivative dZ3 is a little different, because A3 is connected to the loss function directly. So the formula that we found previously doesn't hold.
The gradients are in red and the inputs are in green. To compute dZ3, i.e., the derivative of the loss w.r.t. Z3, we find the intermediate gradients and multiply them (the chain rule of differentiation). The remaining derivatives for W3, A2, and B3 are the same for this layer too and are mentioned below.
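As a concrete example (assuming the loss is binary cross-entropy, which is the usual choice with a sigmoid output but is an assumption here), the two factors collapse neatly:

dA3 = ∂L/∂A3 = −( Y/A3 − (1 − Y)/(1 − A3) )
∂A3/∂Z3 = A3 · (1 − A3)
dZ3 = dA3 · ∂A3/∂Z3 = A3 − Y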
```python
class DNN():
    ...

    def forward(self, X):
        """
        X : input
        returns out (last layer), cache
        """
        ...

    def compute_gradients(self, out, cache):
        """
        out : output of the last layer (output layer), i.e., A_3
        returns: dictionary that has the gradients for all the variables
        """
        grads = {}

        # output layer
        cur_layer = 3
        grads[f'Z{cur_layer}'] = ...  # grads['Z3']
        # compute grads for 'W3', 'B3' and 'A2' as well

        # gradients for the other layers:
        # loop from the last hidden layer to the first hidden layer,
        # compute the gradients and store them in the `grads` dictionary.
        # ex: for the second hidden layer, we need
        #   1. `A2` and `dA2` (computed in the previous loop) --> to calculate `dZ2`
        #   2. `dZ2` and `A1`                                 --> to calculate `dW2`
        #   3. `dZ2`                                          --> to calculate `dB2`
        #   4. `W2` and `Z2`                                  --> to calculate `dA1` (useful in the next loop)
        # so, during forward propagation, we need to cache A2 (output),
        # A1 (input), W2 and Z2.
        # for the first hidden layer, the gradient w.r.t. `A0` (the input) is not required.
        return grads

    ...
```
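To tie the comments above together, here is one rough sketch of what the gradient computation could look like for a single example, assuming sigmoid activations in every layer, binary cross-entropy loss (so dZ3 = A3 − Y, as derived above), the labels Y being available, and a cache keyed by variable name plus layer number as in the Task section below. It is only a guide, not the final solution; the batch version needs the averaging described next.

```python
def compute_gradients_sketch(out, Y, cache):
    """Rough single-example sketch. `out` is A3, `Y` the label, and `cache`
    holds 'A0'..'A2' and 'W1'..'W3' from forward propagation (assumed keys)."""
    grads = {}

    # output layer: sigmoid + binary cross-entropy collapse to A3 - Y
    grads['Z3'] = out - Y
    grads['W3'] = grads['Z3'] @ cache['A2'].T
    grads['B3'] = grads['Z3']
    grads['A2'] = cache['W3'].T @ grads['Z3']

    # hidden layers, from the last hidden layer back to the first
    for l in (2, 1):
        A = cache[f'A{l}']
        grads[f'Z{l}'] = grads[f'A{l}'] * A * (1 - A)          # dA_l * sigmoid'(Z_l)
        grads[f'W{l}'] = grads[f'Z{l}'] @ cache[f'A{l - 1}'].T
        grads[f'B{l}'] = grads[f'Z{l}']
        if l > 1:                                              # dA0 is never needed
            grads[f'A{l - 1}'] = cache[f'W{l}'].T @ grads[f'Z{l}']

    return grads
```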
For batch input, the gradients will change as follows
dB3 has multiple columns instead of a single column. Each column represents the gradient for a single training example. If the input size is m, there will be m columns. In order to get the cumulative gradient, we need to sum them up and then divide by m to get the average gradient.
dW3 already accumulates the gradients in each entry during the calculation, and its shape doesn't change. So, we just need to divide by m to get the average gradient.
You will understand this if you work out an example with more than one input. In this case, the input size is 2 × m, where m is the number of input examples.
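For instance, a sketch of the batch version of the output-layer weight and bias gradients (assuming dZ3 now has shape (1, m) and A2 has one column per example; the function name is just for illustration):

```python
import numpy as np

def output_layer_batch_gradients(dZ3, A2):
    """Average output-layer gradients over a batch; dZ3: (1, m), A2: (n2, m)."""
    m = dZ3.shape[1]
    dW3 = dZ3 @ A2.T / m                           # matmul already sums over examples
    dB3 = np.sum(dZ3, axis=1, keepdims=True) / m   # sum the m columns, then average
    return dW3, dB3
```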
You need to write the program so that it works for batch input rather than a single input. Even though I explained the changes in the formulas, it's better to work them out with an example before implementing it.
Task
Take a batch of inputs of shape (2 x m), where m is the number of input samples, work out the forward and backward propagation, and change the gradient formulas wherever necessary.
Observe all the gradient formulas, identify the matrices required at each layer to compute the gradients in that layer, and take note of them.
Modify the forward propagation to store all the intermediate matrices which you have observed in the above step. (You can use a dictionary to store the variables with keys of the form variable + layer_number, e.g., use W1 as the key to store the weight matrix at the first hidden layer, A2 as the key to store the outputs from the second hidden layer, etc.)
Write a function gradients that will calculate the gradients for all the weight and bias matrices at all the layers. The cache that you have stored during the forward propagation will come in handy here.