Before jumping into the implementation, I would suggest you watch this video to refresh your memory on how a neural network works. It is part of a playlist, and I will refer to those videos wherever necessary. For now, you can watch this one and move on to the article.
Assuming that you understood the video, let's take a closer look at the network once again, with annotated parameters for the input layer, the first hidden layer, and the weights connecting them.
Understanding Notations
In this notation, the superscript always denotes the layer number and the subscript always refers to a node within that particular layer.
From the above diagram,
Layer 1 (First Hidden Layer)
The pre-activations for the first layer can be computed as follows (assuming all the weights and biases shown in the diagram above):
Finally, we can replace the above equations with the vector notation below.
where,
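Spelling that out (a sketch, assuming two input features and three neurons in the first hidden layer, as the notation below suggests):

Z^1 = W^1 A^0 + b^1, with

Z^1 = \begin{bmatrix} z^1_1 \\ z^1_2 \\ z^1_3 \end{bmatrix}, \quad
W^1 = \begin{bmatrix} w^1_{11} & w^1_{21} \\ w^1_{12} & w^1_{22} \\ w^1_{13} & w^1_{23} \end{bmatrix}, \quad
A^0 = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad
b^1 = \begin{bmatrix} b^1_1 \\ b^1_2 \\ b^1_3 \end{bmatrix}

Each row of W^1 collects the weights feeding one neuron of the first hidden layer.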
Layer 2 (Second Hidden Layer)
We will now look at the second layer. Since we have already discussed how to vectorize the operations, we will write down the matrices directly.
where,
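Spelled out (a sketch, with Sigmoid again as the activation; here n_1 and n_2 denote the number of neurons in the first and second hidden layers, so W^2 has shape n_2 × n_1):

Z^2 = W^2 A^1 + b^2, \qquad A^2 = Sigmoid(Z^2)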
Output Layer
where,
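Spelled out (the output layer has a single neuron, so W^3 is a 1 × n_2 row vector and b^3, Z^3, A^3 are all of shape 1 × 1):

Z^3 = W^3 A^2 + b^3, \qquad A^3 = Sigmoid(Z^3)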
Summary
Is that Loss function correct?
People who know about log loss and cross-entropy might get confused here, since I've used cross-entropy even though we have only one output neuron. This is just to keep the notation consistent. All you have to do to make it work as cross-entropy is to create the probability for the other class by subtracting the output from 1.
The loss function also holds the same for both log loss and cross-entropy. Do a little research on this and you will see that, in the binary case, the two are the same.
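A one-line check: for a single Sigmoid output a = P(y = 1) and label y ∈ {0, 1}, the cross-entropy over the two classes (a, 1 − a) reduces to

-( y log(a) + (1 − y) log(1 − a) ),

which is exactly the binary log loss.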
Task:
Write a class that takes #input_dimensions, #hidden_layers, #hidden_layer_size (number of neurons in each hidden layer), and #output_dimensions (1 in this example, with Sigmoid activation).
Initialize all of them as class variables.
As you already know the dimensions of each weight and bias matrix, initialize them randomly.
Write a function call() that takes an input of any batch_size and returns the output.
You can check the shape of the output to cross-verify your code. (A rough sketch of one possible implementation follows below.)
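Below is a minimal sketch of one possible solution, not a reference implementation. The class and parameter names simply mirror the task description, NumPy is assumed, and Sigmoid is used everywhere as in the article. Note that, unlike the column-vector equations above (Z = W A), the code lays the batch out as rows of shape (batch_size, n), so each layer is a single matrix multiply per batch.

```python
import numpy as np

def sigmoid(z):
    # Element-wise Sigmoid activation.
    return 1.0 / (1.0 + np.exp(-z))

class FFNN:
    def __init__(self, input_dimensions, hidden_layers, hidden_layer_size, output_dimensions=1):
        self.input_dimensions = input_dimensions
        self.hidden_layers = hidden_layers
        self.hidden_layer_size = hidden_layer_size
        self.output_dimensions = output_dimensions

        # Layer sizes: input -> hidden_layers x hidden_layer_size -> output.
        sizes = [input_dimensions] + [hidden_layer_size] * hidden_layers + [output_dimensions]

        # W[l] has shape (n_{l-1}, n_l) so a batch A^{l-1} of shape
        # (batch_size, n_{l-1}) can be multiplied as A^{l-1} @ W[l];
        # b[l] has shape (1, n_l) and broadcasts over the batch.
        self.weights = [np.random.randn(sizes[l], sizes[l + 1]) for l in range(len(sizes) - 1)]
        self.biases = [np.random.randn(1, sizes[l + 1]) for l in range(len(sizes) - 1)]

    def call(self, X):
        # X: (batch_size, input_dimensions) -> output: (batch_size, output_dimensions)
        A = X  # A^0 is the input, following the article's notation
        for W, b in zip(self.weights, self.biases):
            Z = A @ W + b   # pre-activation Z^l
            A = sigmoid(Z)  # post-activation A^l
        return A

# Quick shape check with dummy data.
if __name__ == "__main__":
    net = FFNN(input_dimensions=4, hidden_layers=2, hidden_layer_size=3, output_dimensions=1)
    X = np.random.randn(8, 4)    # batch_size = 8
    print(net.call(X).shape)     # expected: (8, 1)
```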
We will consider H1 as our current layer; when we talk with respect to H1, the input layer I becomes the previous layer.
w^1_{11} → weight of the edge that connects the 1st neuron in the previous layer (l−1) to the 1st neuron in the present layer (1).
w^1_{21} → weight of the edge that connects the 2nd neuron in the previous layer (l−1) to the 1st neuron in the present layer (1).
w^1_{12} → weight of the edge that connects the 1st neuron in the previous layer (l−1) to the 2nd neuron in the present layer (1).
w^1_{22} → weight of the edge that connects the 2nd neuron in the previous layer (l−1) to the 2nd neuron in the present layer (1).
...
z^1_1, a^1_1 → pre-activation and post-activation of the 1st layer's 1st neuron
z^1_2, a^1_2 → pre-activation and post-activation of the 1st layer's 2nd neuron
z^1_3, a^1_3 → pre-activation and post-activation of the 1st layer's 3rd neuron
In general,
w^l_{ij} → weight between the (l−1)-th layer's i-th node and the l-th layer's j-th node.
b^l_i → bias for the l-th layer's i-th node.
z^l_i → pre-activation of the l-th layer's i-th node.
a^l_i → post-activation of the l-th layer's i-th node. This is also an input to the next layer.
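Putting these together, the per-node computation (with Sigmoid as the activation used throughout this article) is

z^l_j = b^l_j + \sum_i w^l_{ij} a^{l-1}_i, \qquad a^l_j = Sigmoid(z^l_j),

with the sum running over all nodes i of layer (l−1).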
We can replace the vector of z's with Z^1 (the superscript indicates the layer number, i.e., 1 in this case), the weight matrix with W^1, the input with A^0, and finally the bias vector for this layer with b^1.
Why A^0 instead of X for the input?
During forward propagation in the later layers, the input is the output from the previous layer, which is denoted A^{l−1}, where (l−1) is the previous layer's index. That is, A^1 is the input for the second hidden layer, the input for the 3rd hidden layer is A^2, and so on.
Just to keep the notation consistent across the network, we use A^0 instead of X.
Now, the post-activation A^1 is just a function of Z^1; that function can be any non-linear activation such as Sigmoid, Tanh, ReLU, or Leaky ReLU. In this case, I will stick with Sigmoid.
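For reference, Sigmoid is applied element-wise to the pre-activation vector:

A^1 = Sigmoid(Z^1), \qquad Sigmoid(z) = 1 / (1 + e^{-z})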
As this is the output layer, its output is nothing but the predicted value(s) for the given input example. Since we are dealing with binary classification, we have just one neuron with a Sigmoid activation function.
A^3 = Sigmoid(Z^3), where both A^3 and Z^3 have shape (1, 1).
The output is a single value between 0 and 1. We can use it as a probability score and predict the label with an appropriate threshold during inference (test time). While training, we calculate the cross-entropy for all the input examples.
But in the current example, we have only one input example, so m = 1. The whole process with a batch of multiple examples will be covered at the end; it doesn't change the equations much, apart from the shape of the input at each layer. (More on this later.)
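For reference, the usual averaged form of the binary cross-entropy over a batch of m examples, writing \hat{y}_k for the network output on the k-th example, is

L = -\frac{1}{m} \sum_{k=1}^{m} [ y_k log(\hat{y}_k) + (1 − y_k) log(1 − \hat{y}_k) ]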