Back propagation

Prerequisites

Just to make it clear: before moving on, you need to really understand why we compute derivatives and how SGD uses them to update the parameters. These two videos will help.

Gradient Descent Intuition

After this, you need to understand how we find the derivatives of a complex function. I suggest you go through either of these two resources (the video OR the notes from cs231n), preferably both.

Do not move ahead until you really understand backpropagation and gradient descent. Otherwise you won't be able to focus on the vectorization part of this.

Back propagation

The notes will explain back propagation using simple computation graphs.

Representing forward prop as a simple computational graph

Let's look at the figure that explains the whole process once again. It looks like a simple computational graph with a few multiplications, additions, and sigmoid functions at each layer.

As mentioned in the cs231n article above, the sigmoid can be further divided into operations like exponentiation, division, etc. So I want you to convince yourself that the whole forward propagation is one big computational graph.

Here, instead of showing all those internal operations, we will abstract them away and show only the high-level operations we already know, like sigmoid.

Notation

At this point, you must have understood that we need to find the derivative of the loss function L with respect to variables like W_3, B_3, A_2, ... so that we can perform the gradient update step.

We write the derivative of the loss L with respect to any intermediate variable with a d prefix: the derivative of L with respect to Z_3, i.e., \frac{\partial L}{\partial Z_3}, is written as dZ_3, and the derivative of L with respect to A_3, i.e., \frac{\partial L}{\partial A_3}, is written as dA_3.

On the other hand, the derivative of any intermediate variable with respect to any other intermediate variable is written out in full. Ex: the derivative of A_3 with respect to Z_3 is written as \frac{\partial A_3}{\partial Z_3}. So dW_3 always means \frac{\partial L}{\partial W_3}, never \frac{\partial A_3}{\partial W_3}.

Back propagation

First, we will build an intuition for how we can calculate the gradients for all the weight and bias matrices, starting from the final layer and going back to the first layer, without actually computing the derivatives. At the end, you will be able to write a general expression for the gradients at every layer, so that you can implement it with a simple for loop. Just to get familiar with the process, you can watch this video.

Math behind back propagation

Then we will dive into calculating the gradients. We will show how to calculate the gradient for a single weight element and then vectorize it for the entire weight matrix at that layer.

Output Layer

From this diagram, we can clearly see that Z_3 is a function of W_3, A_2 and B_3, and A_3 is a function of Z_3. It is worth mentioning why we use sigmoid as the activation function here: as this is the last layer, the outputs should be probabilities for each class. Since we are dealing with simple binary classification, sigmoid is sufficient.

If we had more than two classes, we would use softmax as the activation function; at the end, you get to implement that. Whatever the activation function is, it is enough to say that A_3 is a function of Z_3. As you already know, the gradients of L w.r.t. W_3, A_2 and B_3 can be written using the chain rule as follows.

\gdef\deriv#1#2{\frac {\partial #1} {\partial #2}} \gdef\layer{3} \gdef\prevlayer{2} \begin{aligned} \deriv{L}{W_\layer} &= \deriv{Z_\layer}{W_\layer} \overbrace{\deriv{A_\layer}{Z_\layer} \deriv{L}{A_\layer}}^{\text {chain rule}} = \deriv{Z_\layer}{W_\layer} \deriv{L}{Z_\layer} = \deriv{Z_\layer}{W_\layer} dZ_\layer &\implies \boxed {dW_\layer = \deriv{Z_\layer}{W_\layer} dZ_\layer} \\\\ \deriv{L}{A_\prevlayer} &= \deriv{Z_\layer}{A_\prevlayer} \overbrace{\deriv{A_\layer}{Z_\layer} \deriv{L}{A_\layer}}^{\text {chain rule}} = \deriv{Z_\layer}{A_\prevlayer} \deriv{L}{Z_\layer} = \deriv{Z_\layer}{A_\prevlayer} dZ_\layer &\implies \boxed {dA_\prevlayer = \deriv{Z_\layer}{A_\prevlayer} dZ_\layer} \\\\ \deriv{L}{B_\layer} &= {\deriv{Z_\layer}{B_\layer} \overbrace{\deriv{A_\layer}{Z_\layer} \deriv{L}{A_\layer}}^{\text {chain rule}}} = \deriv{Z_\layer}{B_\layer} \deriv{L}{Z_\layer} = \deriv{Z_\layer}{B_\layer} dZ_\layer &\implies \boxed {dB_\layer= \deriv{Z_\layer}{B_\layer} dZ_\layer} \\ \end{aligned}

Take your time to understand the above derivatives, as they are the key to generalising the computation to all the layers.

Layer - 2 (Second Hidden Layer)

The gradients for the intermediate variables at this layer look as follows. A little sneak peek: we can easily find the term dZ_2 because we already found dA_2 in the previous step.

\gdef\deriv#1#2{\frac {\partial #1} {\partial #2}} \gdef\layer{2} \gdef\prevlayer{1} \begin{aligned} \deriv{L}{W_\layer} &= \deriv{Z_\layer}{W_\layer} \overbrace{\deriv{A_\layer}{Z_\layer} \deriv{L}{A_\layer}}^{\text {chain rule}} = \deriv{Z_\layer}{W_\layer} \deriv{L}{Z_\layer} = \deriv{Z_\layer}{W_\layer} dZ_\layer &\implies \boxed {dW_\layer = \deriv{Z_\layer}{W_\layer} dZ_\layer} \\\\ \deriv{L}{A_\prevlayer} &= \deriv{Z_\layer}{A_\prevlayer} \overbrace{\deriv{A_\layer}{Z_\layer} \deriv{L}{A_\layer}}^{\text {chain rule}} = \deriv{Z_\layer}{A_\prevlayer} \deriv{L}{Z_\layer} = \deriv{Z_\layer}{A_\prevlayer} dZ_\layer &\implies \boxed {dA_\prevlayer = \deriv{Z_\layer}{A_\prevlayer} dZ_\layer} \\\\ \deriv{L}{B_\layer} &= {\deriv{Z_\layer}{B_\layer} \overbrace{\deriv{A_\layer}{Z_\layer} \deriv{L}{A_\layer}}^{\text {chain rule}}} = \deriv{Z_\layer}{B_\layer} \deriv{L}{Z_\layer} = \deriv{Z_\layer}{B_\layer} dZ_\layer &\implies \boxed {dB_\layer= \deriv{Z_\layer}{B_\layer} dZ_\layer} \\ \end{aligned}

Layer - 1(First Hidden Layer)

This is the computational graph for the first hidden layer, and it is the last stop when propagating backward to compute the gradients. As you have already done this step for the previous layers, it is easy to write the equations for the gradients.

\gdef\deriv#1#2{\frac {\partial #1} {\partial #2}} \gdef\layer{1} \gdef\prevlayer{0} \begin{aligned} \deriv{L}{W_\layer} &= \deriv{Z_\layer}{W_\layer} \overbrace{\deriv{A_\layer}{Z_\layer} \deriv{L}{A_\layer}}^{\text {chain rule}} = \deriv{Z_\layer}{W_\layer} \deriv{L}{Z_\layer} = \deriv{Z_\layer}{W_\layer} dZ_\layer &\implies \boxed {dW_\layer = \deriv{Z_\layer}{W_\layer} dZ_\layer} \\\\ \deriv{L}{B_\layer} &= {\deriv{Z_\layer}{B_\layer} \overbrace{\deriv{A_\layer}{Z_\layer} \deriv{L}{A_\layer}}^{\text {chain rule}}} = \deriv{Z_\layer}{B_\layer} \deriv{L}{Z_\layer} = \deriv{Z_\layer}{B_\layer} dZ_\layer &\implies \boxed {dB_\layer= \deriv{Z_\layer}{B_\layer} dZ_\layer} \\ \end{aligned}

Weird! Why didn't we calculate a gradient for A_0? It is the input, which is a constant, so it does not need to be learned during training.

There are times when we might want to learn the input as well. For example, with word embeddings, we feed in a random vector for each word and learn those vectors through back propagation. Just wanted you to know.

By now, you should have a general idea of how to generalise the gradient computation at each layer, and you should be able to code it out.

Computing Gradients

From the above equations, it is clear that we need to find all the dZ_l's, where l is the layer index. This computation is different for the output layer and the intermediate layers: at the final layer, Z connects directly to the loss function, while at the intermediate layers, Z is connected to the activation at that layer. So the generalisation for the gradients of Z is the same for all the layers except the output layer. Don't worry if this is confusing right now; I just want to give you a heads up, and it will become clear as we go.

First, we will look at the derivatives in the first hidden layer instead of the output layer. This is better because it generalises to a layer that has multiple neurons in both the previous and the current layer; in our case, the output layer has only one neuron. So, here we go from left to right.

While coding it out, there is a dependency in the gradient computation, so it has to proceed backwards. Here, just for the sake of explanation, we go from left to right. Hope this is clear.
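As a preview of where we are heading (everything below is derived in the rest of this section, and it assumes sigmoid activations at every layer, as in this network), the per-layer pattern will turn out to be:

\begin{aligned} dZ_3 &= A_3 - Y &&\text{(output layer)} \\ dZ_l &= dA_l * A_l * (1 - A_l) &&\text{(hidden layers, element-wise)} \\ dW_l &= dZ_l\, (A_{l-1})^T, \quad dB_l = dZ_l, \quad dA_{l-1} = (W_l)^T\, dZ_l \end{aligned}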

Layer - 1(First Hidden Layer)

In the above diagram, there are a few important things to mention.

  • dZ_1 is what we have to find.

  • dA_1 is assumed to have been computed previously, while computing the gradients in the second hidden layer.

  • * → element-wise multiplication.

We still need to find \frac{\partial Z_1}{\partial W_1} and \frac{\partial Z_1}{\partial B_1} in order to find dW_1 and dB_1, since \boxed {dW_1 = \frac{\partial Z_1}{\partial W_1}dZ_1} and \boxed {dB_1 = \frac{\partial Z_1}{\partial B_1}dZ_1}.

\gdef\layer{1} \gdef\layersize{3} \gdef\prevlayer{0} \gdef\prevlayersize{2} \gdef\w#1#2#3{w^#1_{#2#3}} \gdef\a#1#2{a^#1_#2} \gdef\b#1#2{b^#1_#2} \gdef\z#1#2{z^#1_#2} % - derivative \gdef\d#1#2 {\frac {\partial#1} {\partial#2} } % - Matrix \gdef\bmat#1{\begin{bmatrix}#1\end{bmatrix}} \gdef\mat#1{\begin{matrix}#1\end{matrix}} % - colors \gdef\redhash{#D0021B} \gdef\red{\color{\redhash}} \gdef\brownhash{#8B572A} \gdef\brown{\color{\brownhash}} \gdef\greenhash{#7ED321} \gdef\green{\color{\greenhash}} \gdef\bluehash{#4A90E2} \gdef\blue{\color{\bluehash}} % - size \large % ------------------- \begin{aligned} Z_\layer &= \bmat{{\red \z11}\\\\{\green \z12}\\\\{\blue \z13}} = \bmat{{\red \w\layer11}{\brown x_1} + {\red \w\layer21}{\brown x_2} + {\red \b\layer1}\\\\ {\green\w\layer12}{\brown x_1} + {\green \w\layer22}{\brown x_2} + {\green \b\layer2}\\\\ {\blue \w\layer13}{\brown x_1} + {\blue \w\layer23}{\brown x_2} + {\blue \b\layer3}} \text{, When finding the derivatives of }\\\\ &\text{ weight components, try to focus on all the elements that it is }\\ &\text{dependent on, and then sum them up. here it is just one component }\\ &\text{of Z vector. But when you deal with batch input, you might want } \\ &\text{to remember this.} \\\\ dW_\layer &= \bmat{\red d\w\layer11 \hspace{1em} d\w\layer21 \\\\ \green d\w\layer12 \hspace{1em} d\w\layer22 \\\\ \blue d\w\layer13 \hspace{1em} d\w\layer23 } = \bmat{\red {\Large \d{\z\layer1}{\w\layer11}}d\z\layer1 \hspace{1em} {\Large \d{\z\layer1}{\w\layer21}} d\z\layer1 \\\\ \green {\Large \d{\z\layer2}{\w\layer12}}d\z\layer2 \hspace{1em} {\Large \d{\z\layer2}{\w\layer22}} d\z\layer2 \\\\ \blue {\Large \d{\z\layer3}{\w\layer13}}d\z\layer3 \hspace{1em} {\Large \d{\z\layer3}{\w\layer23}} d\z\layer3 } = \bmat{\brown x_1 \red d\z\layer1 \hspace{1em} \brown x_2 \red d\z\layer1 \\\\ \brown x_1 \green d\z\layer2 \hspace{1em} \brown x_2 \green d\z\layer2 \\\\ \brown x_1 \blue d\z\layer3 \hspace{1em} \brown x_2 \blue d\z\layer3 } \\\\ % ------------------- &=\bmat{\red d\z\layer1 \brown x_1 \hspace{1em} \red d\z\layer1 \brown x_2 \\\\ \green d\z\layer2 \brown x_1 \hspace{1em} \green d\z\layer2 \brown x_2 \\\\ \blue d\z\layer3 \brown x_1 \hspace{1em} \blue d\z\layer3 \brown x_2 } = \bmat{\red d\z\layer1 \hspace{1em} \cdots \hspace{1em} \\\\ \green d\z\layer2 \hspace{1em} \cdots \hspace{1em} \\\\ \blue d\z\layer3 \hspace{1em} \cdots \hspace{1em} } \bmat{\brown x_1 \hspace{1em} x_2 \\\\ \vdots \hspace{1em} \vdots \\\\ \vdots \hspace{1em} \vdots } = \bmat{\red d\z\layer1 \\\\ \green d\z\layer2 \\\\ \blue d\z\layer3}_{\layersize \times 1} \bmat{\brown x_1 \hspace{1em} x_2}_{1\times\prevlayersize} \\\\ &= {dZ_\layer}_{_{\ (\layersize \times 1)}} {(A_{\prevlayer _{\ (\prevlayersize \times 1)}})^T} % ------------------- % dB_1 derivative % ------------------- \\\\ dB_\layer &= \bmat{\red d\b\layer1 \\\\ \green d\b\layer2 \\\\ \blue d\b\layer3 } = \bmat{\red {\Large \d{\z\layer1}{\b\layer1}}d\z\layer1 \\\\ \green {\Large \d{\z\layer2}{\b\layer2}}d\z\layer2\\\\ \blue {\Large \d{\z\layer3}{\b\layer3}}d\z\layer3 } = \bmat{\red d\z\layer1 \\\\ \green d\z\layer2\\\\ \blue d\z\layer3 } = {dZ_\layer}_{_{\ (\layersize \times 1)}} \end{aligned} % -------------------
  • \cdots in the matrices represents matrix broadcasting; numpy or tensorflow handles it internally when doing the matrix multiplication.

  • The shapes given here assume only one input example. If you pass more than a single point, the shapes will change.

  • Even for a batch of inputs, the final vectorized form stays the same, except that you need to divide the matrix product by the batch size. To understand why, try working out the forward and backward propagation with a batch of inputs.
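As a quick sanity check on the single-example shapes above, here is a minimal NumPy sketch (variable names and random values are only placeholders for illustration):

import numpy as np

A0 = np.random.randn(2, 1)   # the input x, shape (2, 1)
dZ1 = np.random.randn(3, 1)  # assumed to be already computed, shape (3, 1)

dW1 = dZ1 @ A0.T             # (3, 1) @ (1, 2) -> (3, 2), same shape as W1
dB1 = dZ1                    # (3, 1), same shape as B1
print(dW1.shape, dB1.shape)  # (3, 2) (3, 1)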

Layer - 2 (Second Hidden Layer)

We need to find \frac{\partial Z_2}{\partial W_2} and \frac{\partial Z_2}{\partial B_2} in order to find dW_2 and dB_2, as \boxed {dW_2 = \frac{\partial Z_2}{\partial W_2}dZ_2} and \boxed {dB_2 = \frac{\partial Z_2}{\partial B_2}dZ_2}. As mentioned previously, dA_2 is assumed to be pre-computed, i.e., it would have been computed at the output layer (the next layer in the sequence).

dW_2 and dB_2 are calculated exactly as above, but I would strongly recommend working them out again yourself to verify.

\large \boxed {dW_2 = dZ_{2_{\ (3 \times 1)}}\ {(A_{1 _{\ (3 \times 1)}})^T} } \hspace{2em} \large \boxed {dB_2 = dZ_{2_{\ (3 \times 1)}}} \hspace{2em} \large \boxed {dA_1=\ ?}
\gdef\layer{2} \gdef\layersize{3} \gdef\prevlayersize{3} \gdef\prevlayer{1} \gdef\w#1#2{w^{\layer}_{#1#2}} \gdef\a#1{a^{\prevlayer}_#1} \gdef\b#1{b^{\layer}_#1} \gdef\z#1{z^{\layer}_#1} % - derivative \gdef\d#1#2 {\frac {\partial#1} {\partial#2} } % - Matrix \gdef\bmat#1{\begin{bmatrix}#1\end{bmatrix}} \gdef\mat#1{\begin{matrix}#1\end{matrix}} % - colors \gdef\redhash{#D0021B} \gdef\red{\color{\redhash}} \gdef\brownhash{#8B572A} \gdef\brown{\color{\brownhash}} \gdef\greenhash{#7ED321} \gdef\green{\color{\greenhash}} \gdef\bluehash{#4A90E2} \gdef\blue{\color{\bluehash}} % - size \large % ------------------- \begin{aligned} % ------------------- Z_\layer &= \bmat{{\red \z1}\\\\{\green \z2}\\\\{\blue \z3}} = \bmat{{\red \w11}{\brown \a1} + {\red \w21}{\brown \a2} + {\red \w31}{\brown \a3} + {\red \b1}\\\\ {\green\w12}{\brown \a1} + {\green \w22}{\brown \a2} + {\green\w32}{\brown \a3} + {\green \b2}\\\\ {\blue \w13}{\brown \a1} + {\blue \w23}{\brown \a2} + {\blue \w33}{\brown \a3} + {\blue \b3}} \\\\ &\text{We will find the derivative for each individual component of }A_1. \\ &\text{Several }z\text{'s depend on a single component, so we sum up their}\\ &\text{contributions. Ex: }{\brown \a1}\text{ appears in }{\red \z1}, {\green \z2}\text{ and }{\blue \z3}\text{ through }{\red \w11}, {\green \w12}\text{ and }{\blue \w13},\\ &\text{so its gradient is the sum of those three terms.} % ------------------- \\\\ % ------------------- dA_\prevlayer &= {\brown \bmat{d\a1 \\\\ d\a2 \\\\ d\a3}} = \bmat{ \brown {\Large \d{\z1}{\a1}} \red d\z1 + \brown{\Large \d{\z2}{\a1}} \green d\z2 + \brown{\Large \d{\z3}{\a1}} \blue d\z3 \\\\ \brown {\Large \d{\z1}{\a2}} \red d\z1 + \brown{\Large \d{\z2}{\a2}} \green d\z2 + \brown{\Large \d{\z3}{\a2}} \blue d\z3 \\\\ \brown {\Large \d{\z1}{\a3}} \red d\z1 + \brown{\Large \d{\z2}{\a3}} \green d\z2 + \brown{\Large \d{\z3}{\a3}} \blue d\z3 \\\\ } = \bmat{ {\red \w11 d\z1} + {\green \w12 d\z2} + {\blue \w13 d\z3}\\\\ {\red \w21 d\z1} + {\green \w22 d\z2} + {\blue \w23 d\z3}\\\\ {\red \w31 d\z1} + {\green \w32 d\z2} + {\blue \w33 d\z3}\\\\ } \\\\ &=\bmat{\red \w11 & \green \w12 & \blue \w13 \\\\ \red \w21 & \green \w22 & \blue \w23 \\\\ \red \w31 & \green \w32 & \blue \w33 } \bmat{\red d\z1 \\\\ \green d\z2 \\\\ \blue d\z3} = ({W_\layer}_{_{\ (\layersize \times \prevlayersize)}})^T {(dZ_\layer)_{_{(\layersize \times 1)}}} \end{aligned} % -------------------
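The same result as a minimal NumPy sketch (hypothetical names; W2 here is the (3 x 3) weight matrix used as Z2 = W2 @ A1 + B2): dA1 is a single matrix product, and combining it with the sigmoid derivative gives dZ1 for the next step backwards.

import numpy as np

W2 = np.random.randn(3, 3)   # layer-2 weights
dZ2 = np.random.randn(3, 1)  # assumed to be already computed
A1 = np.random.rand(3, 1)    # layer-1 activations (sigmoid outputs)

dA1 = W2.T @ dZ2             # (3, 3) @ (3, 1) -> (3, 1)
dZ1 = dA1 * A1 * (1 - A1)    # element-wise: dZ1 = dA1 * sigmoid'(Z1), sigmoid'(Z1) = A1 * (1 - A1)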

Lastly, we need to calculate a different kind of gradient at the final layer, i.e., dZ_3.

Output Layer

Unlike the hidden layers, the derivative dZ_3 is a little different, because A_3 is connected directly to the loss function, so the expression we found previously doesn't hold.

\gdef\redhash{#D0021B} \gdef\red{\color{\redhash}} \gdef\greenhash{#417505} \gdef\green{\color{\greenhash}} \gdef\d#1#2{\frac{\partial#1} {\partial#2}} \gdef\layer{1} \gdef\prevlayer{0} \gdef\black{\textcolor{black}} \begin{aligned} \red{dZ_3} &= \red {\d{A_3}{Z_3}}{\d{L}{A_3}} \\\\ &= {\red \d{A_3}{Z_3} \frac {\partial}{\partial A_3}} \begin{Bmatrix} -log({\green{A_3}}) \hspace{2.5em} \text{ if, } {Y = 1} \\ -log({\green{1-A_3}}) \hspace{1em} \text{ if, } {Y = 0} \end{Bmatrix} \\\\ &= {\red \d{A_3}{Z_3} } \begin{Bmatrix} -{\Large\green{\frac{1}{A_3}}} \hspace{2.5em} \text{ if, } {Y = 1} \\ {\Large\green{\frac{1}{1-A_3}}} \hspace{1.3em} \text{ if, } {Y = 0} \end{Bmatrix} \\\\ &= {\green {A_3(1-A_3)} } \begin{Bmatrix} -{\Large\green{\frac{1}{A_3}}} \hspace{2.5em} \text{ if, } {Y = 1} \\ {\Large\green{\frac{1}{1-A_3}}} \hspace{1.3em} \text{ if, } {Y = 0} \end{Bmatrix} \\\\ &= \begin{Bmatrix} {\green -(1-A_3)} \hspace{2em} \text{if, } Y=1\\ {\green A_3} \hspace{5.3em} \text{if, } Y=0 \end{Bmatrix} \\\\ &= \begin{Bmatrix} {\green A_3-1} \hspace{2em} \text{if, } Y=1\\ {\green A_3} \hspace{3.8em} \text{if, } Y=0 \end{Bmatrix} \\\\ &= \green A_3-Y \end{aligned}

The gradients are shown in red and the inputs in green. To compute dZ_3, i.e., the derivative of the loss w.r.t. Z_3, we find the intermediate gradients and multiply them (chain rule of differentiation). The remaining derivatives for W_3, A_2 and B_3 take the same form as in the other layers and are given below.

\large \boxed {dW_3 = dZ_{3_{\ (1 \times 1)}}\ {(A_{2 _{\ (3 \times 1)}})^T} } \hspace{0.5em} \large \boxed {dB_3 = dZ_{3_{\ (1 \times 1)}}} \hspace{0.5em} \large \boxed { dA_2=( {W_{3 _{\ (1 \times 3)}} )^T} dZ_{3_{\ (1 \times 1)}} } \hspace{0.5em}
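Putting the output-layer results together, a minimal NumPy sketch for a single example (hypothetical names and random placeholder values) looks like this:

import numpy as np

A2 = np.random.rand(3, 1)    # activations of the second hidden layer
W3 = np.random.randn(1, 3)   # output-layer weights
A3 = np.random.rand(1, 1)    # sigmoid output of the network
Y = np.array([[1.0]])        # true label for this example

dZ3 = A3 - Y                 # (1, 1)
dW3 = dZ3 @ A2.T             # (1, 1) @ (1, 3) -> (1, 3), same shape as W3
dB3 = dZ3                    # (1, 1)
dA2 = W3.T @ dZ3             # (3, 1), used by the second hidden layer's step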

Pseudo code

class DNN():
    ...
    ...
    ...
    ...

    def forward(self, X):
        """
        X : input
        returns out (last layer), cache
        """

    def compute_gradients(self, out, cache):
        """
        out : output of the last layer (output layer), i.e., A_3
        returns: dictionary that has the gradients for all
        the variables
        """
        grads = {}

        # output layer
        cur_layer = 3
        grads[f'Z{cur_layer}'] = ...  # grads['Z3']
        # compute grads for 'W3', 'B3' and 'A2' as well

        # gradients for the other layers:
        # loop from the last hidden layer to the first hidden layer and
        # compute and store the gradients in the `grads` dictionary.
        # ex: For the second hidden layer, we need
        #     1. `A2` and `dA2` (computed in the previous step) --> to calculate `dZ2`
        #     2. `dZ2` and `A1` --> to calculate `dW2`
        #     3. `dZ2` --> to calculate `dB2`
        #     4. `W2` and `dZ2` --> to calculate `dA1` (useful in the next step)
        # so, during forward propagation, we need to cache A2 (output),
        # A1 (input) and W2

        # for the first hidden layer, computing `dA0` (the gradient of the
        # input) is not required.
        return grads

    ...
    ...
    ...
    ...
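If you want a concrete picture of what one iteration of that loop could look like, here is a minimal sketch of a hypothetical helper. It assumes sigmoid activations, a single training example, and a cache keyed as described in the Task section below (e.g., 'A2', 'W2'); it is only an illustration of the recurrence, not the required implementation.

def hidden_layer_grads(grads, cache, layer):
    """Hypothetical helper: gradients for one hidden layer (sigmoid activation)."""
    A_cur = cache[f'A{layer}']        # activations of this layer
    A_prev = cache[f'A{layer - 1}']   # activations feeding into this layer
    dA = grads[f'A{layer}']           # computed while handling the next layer
    dZ = dA * A_cur * (1 - A_cur)     # element-wise sigmoid derivative
    grads[f'Z{layer}'] = dZ
    grads[f'W{layer}'] = dZ @ A_prev.T
    grads[f'B{layer}'] = dZ
    if layer > 1:                     # the input's gradient (dA0) is never needed
        grads[f'A{layer - 1}'] = cache[f'W{layer}'].T @ dZ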

For batch input, the gradients will change as follows

  • dB_3 has multiple columns instead of a single column. Each column is the gradient for a single training example. If the batch size is m, there will be m columns. To get the cumulative gradient, we sum the columns and then divide by m to get the average gradient.

  • dW_3 already accumulates the gradients over the examples in each entry (through the matrix multiplication). Its shape doesn't change, so we just need to divide by m to get the average gradient.

You will understand this if you work out an example with more than one input. In this case the input has shape 2 \times m, where m is the number of input examples.

You need to write a program that works for batch input rather than a single input. Even though I explained the changes in the formulas, it's better to work them out with an example before implementing it.
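For instance, assuming the conventions above and a batch stored column-wise, the layer-2 gradients for a batch of m examples could be averaged like this (a sketch with placeholder values, not the required implementation):

import numpy as np

m = 5                                         # batch size
A1 = np.random.rand(3, m)                     # layer-1 activations, one column per example
dZ2 = np.random.randn(3, m)                   # one column of gradients per example

dW2 = (dZ2 @ A1.T) / m                        # the matrix product already sums over examples
dB2 = np.sum(dZ2, axis=1, keepdims=True) / m  # sum the m columns, then average
print(dW2.shape, dB2.shape)                   # (3, 3) (3, 1)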

Task

  1. Take a batch of input of shape (2 x m), where m is the number of input samples, work out the forward and backward propagation, and change the gradient formulas wherever necessary.

  2. Observe all the gradient formulas, identify the matrices required at each layer to compute the gradients in that layer, and take a note of them.

  3. Modify the forward propagation to store all the intermediate matrices you identified in the above step. (You can use a dictionary with keys of the form variable + layer_number. Ex: use W1 as the key for the weight matrix of the first hidden layer, A2 as the key for the outputs of the second hidden layer, etc.)

  4. Write a function gradients that will calculate gradients for all those weight and bias matrices at all the layers. The cache that you have stored during the forward propagation will come in handy here.
