Just to be clear, I need to make sure that you really understand why we are finding derivatives and how SGD is used to update the parameters. These two videos would help.
After this, you need to understand how we find the derivatives in our complex function. I suggest you go through either of these two resources (video OR notes from cs231n), preferably both.
Do not move ahead until you really understand backpropagation and gradient descent. Otherwise, you can't really focus on the vectorization part of this.
The notes will explain back propagation using simple computation graphs.
Representing forward prop as a simple computational graph
Let's look at the figure that explains the whole process once again. It looks somewhat like a simple computational graph with a few multiplications, additions, and sigmoid functions at each layer.
As mentioned in the cs231n article above, sigmoid can be further divided into operations like exponentials, fractions, etc. So I want you to convince yourself that the whole forward propagation is one big, complex computational graph.
Here, instead of showing all those internal operations, we will abstract them away and show only the high-level operations we already know, like sigmoid.
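To make the graph concrete, here is a minimal numpy sketch of that forward pass. The hidden-layer sizes (3 neurons each) are just assumptions for illustration; only the 2-dimensional input and the single sigmoid output come from our setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 2 inputs, two hidden layers (sizes assumed), 1 output neuron.
rng = np.random.default_rng(0)
W1, B1 = rng.standard_normal((3, 2)), np.zeros((3, 1))
W2, B2 = rng.standard_normal((3, 3)), np.zeros((3, 1))
W3, B3 = rng.standard_normal((1, 3)), np.zeros((1, 1))

A0 = rng.standard_normal((2, 1))   # one input example

# Each pair of lines below is one group of nodes in the computational graph.
Z1 = W1 @ A0 + B1                  # first hidden layer
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + B2                  # second hidden layer
A2 = sigmoid(Z2)
Z3 = W3 @ A2 + B3                  # output layer
A3 = sigmoid(Z3)                   # probability for the positive class
```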
Notation
At this point, you must have understood that we need to find the derivative of the loss function (L) with respect to variables like W3, B3, A2, ... so that we can perform the gradient update step.
We will denote the derivative of the loss (L) with respect to any intermediate variable like Z3, i.e., ∂L/∂Z3, as dZ3, and similarly the derivative of the loss (L) with respect to A3, i.e., ∂L/∂A3, as dA3.
On the other hand, the derivative of any intermediate variable with respect to any other intermediate variable would be represented in the usual notation, e.g., the derivative of A3 with respect to Z3 as ∂A3/∂Z3.
Back propagation
First, we will build an intuition for how we can calculate the gradients for all the weight and bias matrices, starting from the final layer back to the first layer, without actually computing the derivatives. At the end, you will be able to write a general representation for the gradients at every layer, so that it fits into a simple for loop when you implement it. Just to get familiar with the process, you can watch this video.
Then we will deep dive into calculating the gradients. Here we will explain how to calculate the gradient for a single weight entry and then vectorize it for the entire weight matrix at that layer.
Output Layer
From this diagram, we can clearly see that Z3 is a function of W3, A2 and B3, and A3 is a function of Z3. It is worth mentioning why sigmoid is used as the activation function here. As this is the last layer, the outputs should be probabilities for each class. Here, we are dealing with a simple binary classification, and thus sigmoid is sufficient.
If we had more than two classes, we would use softmax as the activation function; at the end, you get to implement that. Whatever the activation function is, it is enough to say that A3 is a function of Z3. As you already know, the gradients of L w.r.t. W3, A2 and B3 can be written using the chain rule as follows.
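For reference, written out in our notation, those chain-rule expansions read:

dA3 = ∂L/∂A3
dZ3 = dA3 · ∂A3/∂Z3
dW3 = dZ3 · ∂Z3/∂W3
dB3 = dZ3 · ∂Z3/∂B3
dA2 = dZ3 · ∂Z3/∂A2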
Take your time and understand the above derivatives as this is the key to make it more generic to all the layers.
Layer - 2 (Second Hidden Layer)
And the gradients for the intermediate variables at this layer would look as follows. A little sneak peek: we can easily find the term dZ2, as we already found dA2 previously.
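In the same notation, the layer-2 gradients expand as:

dZ2 = dA2 · ∂A2/∂Z2
dW2 = dZ2 · ∂Z2/∂W2
dB2 = dZ2 · ∂Z2/∂B2
dA1 = dZ2 · ∂Z2/∂A1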
This is the first hidden layer's graph, and it is where the backward propagation to compute the gradients ends. As you have already done this step for the previous layers, it is easy to write the equations to compute the gradients.
Weird! Why didn't we calculate the gradients for A0? It is the input, which is a constant, and thus it need not be learned while training.
There are times when we might want to learn the input as well. For example, with word embeddings, we feed in a random vector for each word and learn those weights OR vectors for each word through back propagation. Just wanted you to know.
By now, you should have a general idea of how to generalise the gradient computation at each layer, and you should be able to code that out.
Computing Gradients
From the above equations, it is clear that we need to find all the dZl's, where l = layer. And this is quite different for the output layer and the intermediate layers: at the final layer, Z connects to the loss function, while at the intermediate layers, Z is connected to the activation at that layer. So the generalisation for the gradients of Z is the same for all the layers except the output layer. You might be confused by reading this; I just want to mention it beforehand to give you a heads up. Don't worry if you couldn't follow it yet, it will become clear as we go.
First, we will look into the derivatives in the first hidden layer instead of the output layer. This is better because it generalises to a layer that has multiple neurons in both the previous and the current layer; in our case, we have only one neuron in the output layer. So, we go from left to right.
While coding it out, there is a dependency in the gradient computation, so it has to be done backwards. Here, just for the sake of explaining things, we go from left to right. Hope this is clear enough.
Layer - 1 (First Hidden Layer)
In the above diagram, it is important to mention a few things.
dZ1 is what we have to find.
dA1 is assumed to have been computed previously, while computing the gradients in the second hidden layer.
∗ → element-wise multiplication.
We still need to find ∂Z1/∂W1 and ∂Z1/∂B1 in order to find dW1 and dB1, as dW1 = ∂Z1/∂W1 · dZ1 and dB1 = ∂Z1/∂B1 · dZ1 (a short sketch follows this list).
⋯ in the matrices represents broadcasting. It happens within numpy or tensorflow while doing the matrix multiplication.
The shapes provided assume only one input example. If you pass more than a single point, the shapes will change.
Even for a batch of inputs, the final vectorization remains the same, except that you need to divide the matrix multiplication by the size of the batch. To understand that, try to work out the forward propagation with a batch of inputs and the backward propagation with the same.
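Putting the single-example items above together in numpy (a minimal sketch, assuming a sigmoid activation so that ∂A1/∂Z1 = A1 ∗ (1 − A1); the function name is just for illustration):

```python
def layer1_gradients(dA1, A1, A0):
    """Layer-1 gradients for a single example (sketch, sigmoid activation assumed)."""
    dZ1 = dA1 * A1 * (1 - A1)   # element-wise: dA1 * ∂A1/∂Z1
    dW1 = dZ1 @ A0.T            # dZ1 * ∂Z1/∂W1 -> outer product with the input
    dB1 = dZ1                   # ∂Z1/∂B1 is the identity for a single example
    return dZ1, dW1, dB1
```

For a batch, the same lines pick up a division by the batch size, as noted in the last point above.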
Layer - 2 (Second Hidden Layer)
We need to find ∂Z2/∂W2 and ∂Z2/∂B2 in order to find dW2 and dB2, as dW2 = ∂Z2/∂W2 · dZ2 and dB2 = ∂Z2/∂B2 · dZ2. As mentioned previously, dA2 is assumed to be pre-computed, i.e., it would have been computed at the Output layer (the next layer in the sequence) itself.
dW2 and dB2 will be calculated exactly as above. But I would strongly recommend working it out again by yourself to verify the same.
Lastly, we need to calculate a different kind of gradient at the final layer, i.e., dZ3.
Output Layer
Unlike in the hidden layers, the derivative dZ3 is a little different, because A3 is connected to the loss function directly. So the formula that we found previously doesn't hold.
The gradients are in red and the inputs are in green. To compute dZ3, i.e., the derivative of the loss w.r.t. Z3, we find the intermediate gradients and multiply them (the chain rule of differentiation). The remaining derivatives for W3, A2, and B3 are the same for this layer too and are mentioned below.
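As a concrete example (assuming the loss is binary cross-entropy, which is the usual choice with a sigmoid output but is an assumption here), the two factors collapse neatly:

dA3 = ∂L/∂A3 = −( Y/A3 − (1 − Y)/(1 − A3) )
∂A3/∂Z3 = A3 · (1 − A3)
dZ3 = dA3 · ∂A3/∂Z3 = A3 − Y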
```python
class DNN():
    ...

    def forward(self, X):
        """
        X : input
        returns out (last layer), cache
        """
        ...

    def compute_gradients(self, out, cache):
        """
        out : output of the last layer (output layer), i.e., A_3
        returns: dictionary that has the gradients for all the variables
        """
        grads = {}

        # output layer
        cur_layer = 3
        grads[f'Z{cur_layer}'] = ...  # grads['Z3']
        # compute grads for 'W3', 'B3' and 'A2' as well

        # gradients for the other layers:
        # loop from the last hidden layer to the first hidden layer,
        # compute the gradients and store them in the `grads` dictionary.
        # ex: for the second hidden layer, we need
        #   1. `A2` and `dA2` (computed in the previous loop) --> to calculate `dZ2`
        #   2. `dZ2` and `A1`                                 --> to calculate `dW2`
        #   3. `dZ2`                                          --> to calculate `dB2`
        #   4. `W2` and `Z2`                                  --> to calculate `dA1` (useful in the next loop)
        # so, during forward propagation, we need to cache A2 (output),
        # A1 (input), W2 and Z2.
        # for the first hidden layer, the gradient w.r.t. `A0` (the input) is not required.
        return grads

    ...
```
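To tie the comments above together, here is one rough sketch of what the gradient computation could look like for a single example, assuming sigmoid activations in every layer, binary cross-entropy loss (so dZ3 = A3 − Y, as derived above), the labels Y being available, and a cache keyed by variable name plus layer number as in the Task section below. It is only a guide, not the final solution; the batch version needs the averaging described next.

```python
def compute_gradients_sketch(out, Y, cache):
    """Rough single-example sketch. `out` is A3, `Y` the label, and `cache`
    holds 'A0'..'A2' and 'W1'..'W3' from forward propagation (assumed keys)."""
    grads = {}

    # output layer: sigmoid + binary cross-entropy collapse to A3 - Y
    grads['Z3'] = out - Y
    grads['W3'] = grads['Z3'] @ cache['A2'].T
    grads['B3'] = grads['Z3']
    grads['A2'] = cache['W3'].T @ grads['Z3']

    # hidden layers, from the last hidden layer back to the first
    for l in (2, 1):
        A = cache[f'A{l}']
        grads[f'Z{l}'] = grads[f'A{l}'] * A * (1 - A)          # dA_l * sigmoid'(Z_l)
        grads[f'W{l}'] = grads[f'Z{l}'] @ cache[f'A{l - 1}'].T
        grads[f'B{l}'] = grads[f'Z{l}']
        if l > 1:                                              # dA0 is never needed
            grads[f'A{l - 1}'] = cache[f'W{l}'].T @ grads[f'Z{l}']

    return grads
```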
For batch input, the gradients will change as follows
dB3 has multiple columns instead of a single column. Each column represents the gradient for a single training example. If the input size is m, there will be m columns. In order to get the cumulative gradient, we need to sum them up and then divide by m to get the average gradient.
dW3 already accumulates the gradients in each entry during the calculation, and its shape doesn't change. So, we just need to divide by m to get the average gradient.
You will understand this if you work out an example with more than one input. In this case, the input size is 2 × m, where m is the number of input examples.
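For instance, a sketch of the batch version of the output-layer weight and bias gradients (assuming dZ3 now has shape (1, m) and A2 has one column per example; the function name is just for illustration):

```python
import numpy as np

def output_layer_batch_gradients(dZ3, A2):
    """Average output-layer gradients over a batch; dZ3: (1, m), A2: (n2, m)."""
    m = dZ3.shape[1]
    dW3 = dZ3 @ A2.T / m                           # matmul already sums over examples
    dB3 = np.sum(dZ3, axis=1, keepdims=True) / m   # sum the m columns, then average
    return dW3, dB3
```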
You need to write the program so that it works for batch input rather than a single input. Even though I explained the changes in the formulas, it's better to work them out with an example before implementing it.
Task
Take a batch of inputs of shape (2 x m), where m is the number of input samples, work out the forward and backward propagation, and change the gradient formulas wherever necessary.
Observe all the gradient formulas, identify the matrices required at each layer to compute the gradients in that layer, and take note of them.
Modify the forward propagation to store all the intermediate matrices which you have observed in the above step. (You can use a dictionary to store the variables with keys of the form variable + layer_number, e.g., use W1 as the key to store the weight matrix at the first hidden layer, A2 as the key to store the outputs from the second hidden layer, etc.)
Write a function gradients that will calculate the gradients for all the weight and bias matrices at all the layers. The cache that you have stored during the forward propagation will come in handy here.