Forward propagation

Before jumping into the implementation, I would suggest you watch this video, just to refresh your memory on how a neural network works. It is part of a playlist, and I will refer to those videos wherever necessary. For now, watch this one and move on to the article.

Basics of Neural networks and forward propagation.

Assuming that you understood the video, let's take a closer look at the network once again, with annotated parameters for the input layer, the first hidden layer, and the weights connecting them.

Understanding Notations

We will consider $\boldsymbol{H_1}$ as our current layer, and when we talk w.r.t. $\boldsymbol{H_1}$, then $\boldsymbol{I}$ would become the previous layer.

In this notation, the superscript always denotes the layer number, and the subscript always refers to a node within that layer.

From the above diagram,

  • $w^1_{11} \rightarrow$ weight of the edge that connects the $1^{st}$ neuron in the previous layer $(l-1)$ to the $1^{st}$ neuron in the present layer $(1)$.

  • $w^1_{21} \rightarrow$ weight of the edge that connects the $2^{nd}$ neuron in the previous layer $(l-1)$ to the $1^{st}$ neuron in the present layer $(1)$.

  • $w^1_{12} \rightarrow$ weight of the edge that connects the $1^{st}$ neuron in the previous layer $(l-1)$ to the $2^{nd}$ neuron in the present layer $(1)$.

  • $w^1_{22} \rightarrow$ weight of the edge that connects the $2^{nd}$ neuron in the previous layer $(l-1)$ to the $2^{nd}$ neuron in the present layer $(1)$.

  • ...

  • $z^1_1, a^1_1 \rightarrow$ pre-activation and post-activation of the $1^{st}$ layer's $1^{st}$ neuron.

  • $z^1_2, a^1_2 \rightarrow$ pre-activation and post-activation of the $1^{st}$ layer's $2^{nd}$ neuron.

  • $z^1_3, a^1_3 \rightarrow$ pre-activation and post-activation of the $1^{st}$ layer's $3^{rd}$ neuron.

In general,

  • $\boldsymbol{w^l_{i,j}} \rightarrow$ weight between the $(l-1)^{th}$ layer's $i^{th}$ node and the $l^{th}$ layer's $j^{th}$ node.

  • $\boldsymbol{b^l_i} \rightarrow$ bias for the $l^{th}$ layer's $i^{th}$ node.

  • $\boldsymbol{z^l_i} \rightarrow$ pre-activation of the $l^{th}$ layer's $i^{th}$ node.
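If you prefer to think in terms of arrays, here is a minimal sketch (my own assumed mapping, using NumPy, not something from the article) of how this notation translates into matrix shapes: the weight matrix of layer $l$ has one row per node $j$ of layer $l$ and one column per node $i$ of layer $l-1$, so entry $[j-1,\ i-1]$ plays the role of $w^l_{ij}$.

```python
import numpy as np

# Assumed mapping from the notation to array shapes (2 nodes in layer l-1,
# 3 nodes in layer l, matching the first hidden layer of this network):
n_prev, n_curr = 2, 3

W = np.random.randn(n_curr, n_prev)   # W[j-1, i-1] plays the role of w^l_{ij}
b = np.random.randn(n_curr, 1)        # b[i-1, 0]  plays the role of b^l_i

print(W.shape, b.shape)               # (3, 2) (3, 1)
```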

Layer 1 (First hidden layer)

Non-vectorized $\longrightarrow$ Vectorized:

The pre-activations for the first layer can be computed as follows.

$$
\begin{aligned}
w^1_{11} x_1 + w^1_{21} x_2 + b^1_1 &= z^1_1 \\
w^1_{12} x_1 + w^1_{22} x_2 + b^1_2 &= z^1_2 \\
w^1_{13} x_1 + w^1_{23} x_2 + b^1_3 &= z^1_3
\end{aligned}
\longrightarrow
\begin{pmatrix}
w^1_{11} & w^1_{21} \\
w^1_{12} & w^1_{22} \\
w^1_{13} & w^1_{23}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
+
\begin{pmatrix} b^1_1 \\ b^1_2 \\ b^1_3 \end{pmatrix}
=
\begin{pmatrix} z^1_1 \\ z^1_2 \\ z^1_3 \end{pmatrix}
$$

We can replace the vector of $z$'s with $\boldsymbol{Z_1}$, the weight matrix with $\boldsymbol{W_1}$, the input with $\boldsymbol{A_0}$, and finally the bias vector for this layer with $\boldsymbol{B_1}$. The subscript indicates the layer number, i.e., 1 in this case.

Why $\boldsymbol{A_0}$ instead of $\boldsymbol{X}$ for the input?

During forward propagation in further layers, the input is the output from the previous layer, denoted $\boldsymbol{A_{(l-1)}}$, where $(l-1)$ is the previous layer's index. i.e., $\boldsymbol{A_1}$ is the input for the second hidden layer, the input for the third hidden layer is $\boldsymbol{A_2}$, and so on.

Just to make the notation consistent across the network, we consider $\boldsymbol{A_0}$ instead of $\boldsymbol{X}$.

Finally, we can replace the above equations with the vector notation below.

$$
\begin{aligned}
W_{1\,(3,2)} \, A_{0\,(2,1)} + B_{1\,(3,1)} &= Z_{1\,(3,1)} \\
[W_1 A_0]_{(3,1)} + B_{1\,(3,1)} &= Z_{1\,(3,1)}
\end{aligned}
$$

where,

$$
W_1 = \begin{pmatrix} w^1_{11} & w^1_{21} \\ w^1_{12} & w^1_{22} \\ w^1_{13} & w^1_{23} \end{pmatrix},\quad
A_0 = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix},\quad
B_1 = \begin{pmatrix} b^1_1 \\ b^1_2 \\ b^1_3 \end{pmatrix},\quad
Z_1 = \begin{pmatrix} z^1_1 \\ z^1_2 \\ z^1_3 \end{pmatrix}
$$
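As a quick sanity check, here is a small NumPy sketch of this vectorized step. The numbers are random placeholders (not values from the article); only the shapes matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))   # weight matrix of the first hidden layer
A0 = rng.standard_normal((2, 1))   # input column vector (x1, x2)
B1 = rng.standard_normal((3, 1))   # bias vector of the first hidden layer

Z1 = W1 @ A0 + B1                  # (3, 2) @ (2, 1) + (3, 1) -> (3, 1)
print(Z1.shape)                    # (3, 1)
```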

Now, the post-activation $\boldsymbol{A_1}$ is just a function of $\boldsymbol{Z_1}$, which can be any non-linear function such as $\text{Sigmoid}$, $\text{Tanh}$, $\text{ReLU}$, or $\text{Leaky ReLU}$. In this case, I will stick with $\text{Sigmoid}$.

$$
\begin{aligned}
A_1 &= \text{Sigmoid}(Z_1) \\
&= \sigma(Z_1) \\
&= \frac{1}{1+e^{-Z_1}}
\end{aligned}
$$
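A one-line NumPy version of this activation, applied element-wise to the pre-activation vector (`sigmoid` is just a helper name I'm using here, and the values of $Z_1$ below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Element-wise logistic sigmoid: 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + np.exp(-z))

Z1 = np.array([[0.5], [-1.2], [2.0]])   # placeholder (3, 1) pre-activation vector
A1 = sigmoid(Z1)                        # shape is preserved
print(A1.shape)                         # (3, 1)
```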

Layer 2 (Second hidden layer)

We will now look at the second layer. Since we have already discussed how to vectorize the operations, we will write down the matrices directly.

Non-vectorized $\longrightarrow$ Vectorized:

$$
\begin{aligned}
w^2_{11} a^1_1 + w^2_{21} a^1_2 + w^2_{31} a^1_3 + b^2_1 &= z^2_1 \\
w^2_{12} a^1_1 + w^2_{22} a^1_2 + w^2_{32} a^1_3 + b^2_2 &= z^2_2 \\
w^2_{13} a^1_1 + w^2_{23} a^1_2 + w^2_{33} a^1_3 + b^2_3 &= z^2_3
\end{aligned}
\longrightarrow
\begin{pmatrix}
w^2_{11} & w^2_{21} & w^2_{31} \\
w^2_{12} & w^2_{22} & w^2_{32} \\
w^2_{13} & w^2_{23} & w^2_{33}
\end{pmatrix}
\begin{pmatrix} a^1_1 \\ a^1_2 \\ a^1_3 \end{pmatrix}
+
\begin{pmatrix} b^2_1 \\ b^2_2 \\ b^2_3 \end{pmatrix}
=
\begin{pmatrix} z^2_1 \\ z^2_2 \\ z^2_3 \end{pmatrix}
$$

For this layer, the inputs are $\boldsymbol{A_1}$ (the output from the previous layer), the weight matrix $\boldsymbol{W_2}$, and the bias vector $\boldsymbol{B_2}$. The equations are as follows.

$$
\begin{aligned}
Z_{2\,(3,1)} &= W_{2\,(3,3)} \, A_{1\,(3,1)} + B_{2\,(3,1)} \\
Z_{2\,(3,1)} &= [W_2 A_1]_{(3,1)} + B_{2\,(3,1)} \\
A_{2\,(3,1)} &= \text{Sigmoid}(Z_{2\,(3,1)})
\end{aligned}
$$

where,

$$
W_2 = \begin{bmatrix} w^2_{11} & w^2_{21} & w^2_{31} \\ w^2_{12} & w^2_{22} & w^2_{32} \\ w^2_{13} & w^2_{23} & w^2_{33} \end{bmatrix},\quad
A_1 = \begin{bmatrix} a^1_1 \\ a^1_2 \\ a^1_3 \end{bmatrix},\quad
B_2 = \begin{bmatrix} b^2_1 \\ b^2_2 \\ b^2_3 \end{bmatrix},\quad
Z_2 = \begin{bmatrix} z^2_1 \\ z^2_2 \\ z^2_3 \end{bmatrix}
$$
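Since every layer repeats the same pattern, we can wrap it in a small helper. This is only a sketch under my own naming (`dense_forward` is not from the article), shown here with the layer-2 shapes and random placeholder values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_forward(W, A_prev, B):
    # One forward-propagation step: Z = W A_prev + B, A = sigmoid(Z).
    Z = W @ A_prev + B
    return sigmoid(Z), Z

rng = np.random.default_rng(1)
W2 = rng.standard_normal((3, 3))   # 3 neurons, each fed by 3 previous activations
A1 = rng.standard_normal((3, 1))   # stand-in for the layer-1 output
B2 = rng.standard_normal((3, 1))

A2, Z2 = dense_forward(W2, A1, B2)
print(A2.shape, Z2.shape)          # (3, 1) (3, 1)
```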

Output Layer

Non-vectorized $\longrightarrow$ Vectorized:

$$
w^3_{11} a^2_1 + w^3_{21} a^2_2 + w^3_{31} a^2_3 + b^3_1 = z^3_1
\longrightarrow
\begin{bmatrix} w^3_{11} & w^3_{21} & w^3_{31} \end{bmatrix}
\begin{bmatrix} a^2_1 \\ a^2_2 \\ a^2_3 \end{bmatrix}
+
\begin{bmatrix} b^3_1 \end{bmatrix}
=
\begin{bmatrix} z^3_1 \end{bmatrix}
$$

For this layer, the inputs are $\boldsymbol{A_2}$ (the output from the previous layer), the weight matrix $\boldsymbol{W_3}$, and the bias vector $\boldsymbol{B_3}$. The equations are as follows.

$$
\begin{aligned}
Z_{3\,(1,1)} &= W_{3\,(1,3)} \, A_{2\,(3,1)} + B_{3\,(1,1)} \\
Z_{3\,(1,1)} &= [W_3 A_2]_{(1,1)} + B_{3\,(1,1)}
\end{aligned}
$$

where,

$$
W_3 = \begin{bmatrix} w^3_{11} & w^3_{21} & w^3_{31} \end{bmatrix},\quad
A_2 = \begin{bmatrix} a^2_1 \\ a^2_2 \\ a^2_3 \end{bmatrix},\quad
B_3 = \begin{bmatrix} b^3_1 \end{bmatrix},\quad
Z_3 = \begin{bmatrix} z^3_1 \end{bmatrix}
$$

As this is the output layer, its output is nothing but the predicted value(s) for the given input example. Since we are dealing with binary classification, we have just one neuron with a $\text{Sigmoid}$ activation function.

$$
A_{3\,(1,1)} = \text{Sigmoid}(Z_{3\,(1,1)})
$$

The output is a single value between 0 and 1. We can use it as a probability score and predict the label by applying an appropriate threshold during inference (test time). While training, we will calculate the $\text{Cross Entropy}$ loss over the input examples.
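For instance, a common (assumed) choice is a threshold of 0.5 at inference time; the output value 0.73 below is made up for illustration:

```python
def predict_label(a3, threshold=0.5):
    # Hypothetical inference-time rule: treat the sigmoid output as P(y = 1 | x)
    # and predict class 1 when it crosses the chosen threshold.
    return 1 if a3 >= threshold else 0

print(predict_label(0.73))   # -> 1
print(predict_label(0.20))   # -> 0
```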

Output layer with loss function
$$
\begin{aligned}
\text{Cross Entropy},\ L &= -\sum_{i=1}^{m} y_i \log{\hat{y_i}} \\
&= -\sum_{i=1}^{m} y_i \log{(a^3_1)}
\end{aligned}
$$

In the current example we have only one input example, so $m=1$. The whole process with a batch of multiple examples will be covered at the end; it doesn't change the equations much, only the shape of the input at each layer. (More on this later.)

Summary

Is that loss function correct?

People who know about log loss and cross-entropy might be confused here, since I've used cross-entropy even though we have only one output neuron. This is just to keep the notation consistent. To make it work as a proper two-class cross-entropy, all you have to do is form the probability of the other class by subtracting the output from 1.

The loss value also works out the same for log loss and (two-class) cross-entropy. Do a little research on this and you will find that, for binary classification, the two are the same.
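Here is a tiny numerical check of that claim for a single example (my own illustration, with a made-up prediction of 0.73 and true label 1): building the two-class distribution $(1-a,\ a)$ and applying the general cross-entropy gives exactly the binary log loss.

```python
import numpy as np

a, y = 0.73, 1.0                            # sigmoid output and true label (made up)

p = np.array([1.0 - a, a])                  # two-class distribution from one output
t = np.array([1.0 - y, y])                  # one-hot encoding of the label
cross_entropy = -np.sum(t * np.log(p))      # general cross-entropy over both classes

log_loss = -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))   # binary log loss
print(np.isclose(cross_entropy, log_loss))  # True: the two quantities coincide
```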

Task:

  • Write a class that takes #input_dimensions, #hidden_layers, #hidden_layer_size (the number of neurons in each hidden layer), and #output_dimensions (1 in this example, with sigmoid activation).

  • Initialize all of them as class variables.

  • As you already know the dimensions of each weight and bias matrix, initialize them randomly.

  • Write a function call() that takes an input of any batch_size and returns the output.

  • You can check the shape of the output to cross-verify your code (one possible skeleton is sketched after this list).
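For reference, here is one possible skeleton. It is only a sketch under my own naming and layout, not the only valid structure, and it assumes column-major batches, i.e. the input has shape `(input_dimensions, batch_size)`. Try the task yourself before looking at it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class FeedForwardNetwork:
    def __init__(self, input_dimensions, hidden_layers, hidden_layer_size, output_dimensions=1):
        # Store the architecture as class variables.
        self.input_dimensions = input_dimensions
        self.hidden_layers = hidden_layers
        self.hidden_layer_size = hidden_layer_size
        self.output_dimensions = output_dimensions

        # Layer sizes from input to output, e.g. [2, 3, 3, 1] for this article's network.
        sizes = [input_dimensions] + [hidden_layer_size] * hidden_layers + [output_dimensions]

        # Randomly initialize W_l with shape (n_l, n_{l-1}) and B_l with shape (n_l, 1).
        self.weights = [np.random.randn(n_curr, n_prev) for n_prev, n_curr in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.random.randn(n_curr, 1) for n_curr in sizes[1:]]

    def call(self, X):
        # X has shape (input_dimensions, batch_size); broadcasting handles the batch.
        A = X
        for W, B in zip(self.weights, self.biases):
            A = sigmoid(W @ A + B)
        return A

# Quick shape check with the article's 2-3-3-1 architecture and a batch of 5 examples:
net = FeedForwardNetwork(input_dimensions=2, hidden_layers=2, hidden_layer_size=3, output_dimensions=1)
out = net.call(np.random.randn(2, 5))
print(out.shape)   # (1, 5)
```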
