Before jumping into the implementation, I would suggest you watch this video to refresh your memory on how a neural network works. It is part of a playlist, and I will refer to those videos wherever necessary. For now, you can watch this one and move on to the article.
Assuming that you understood the video, let's take a closer look at the network once again, with annotated parameters for the input layer, the first hidden layer, and the weights connecting them.
Understanding Notations
In this notation, the superscript always denotes the layer number and the subscript always refers to a node within that particular layer.
From the above diagram,
Layer 1 (First Hidden Layer)
The pre-activations for the first layer can be computed as follows (assuming all the weights and biases shown in the diagram above):
Finally, we can replace the above equations with the vector notation below.
where,
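Spelling that out (a sketch, assuming two input features and three neurons in the first hidden layer, as the notation below suggests):

Z^1 = W^1 A^0 + b^1, with

Z^1 = \begin{bmatrix} z^1_1 \\ z^1_2 \\ z^1_3 \end{bmatrix}, \quad
W^1 = \begin{bmatrix} w^1_{11} & w^1_{21} \\ w^1_{12} & w^1_{22} \\ w^1_{13} & w^1_{23} \end{bmatrix}, \quad
A^0 = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad
b^1 = \begin{bmatrix} b^1_1 \\ b^1_2 \\ b^1_3 \end{bmatrix}

Each row of W^1 collects the weights feeding one neuron of the first hidden layer.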
Layer 2 (Second Hidden Layer)
We will now look at the second layer. Since we have already discussed how to vectorize the operations, we will write down the matrices directly.
where,
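Spelled out (a sketch, with Sigmoid again as the activation; here n_1 and n_2 denote the number of neurons in the first and second hidden layers, so W^2 has shape n_2 × n_1):

Z^2 = W^2 A^1 + b^2, \qquad A^2 = Sigmoid(Z^2)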
Output Layer
where,
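Spelled out (the output layer has a single neuron, so W^3 is a 1 × n_2 row vector and b^3, Z^3, A^3 are all of shape 1 × 1):

Z^3 = W^3 A^2 + b^3, \qquad A^3 = Sigmoid(Z^3)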
Summary
Is that Loss function correct?
People who know about log loss and cross-entropy might get confused here, since I've used cross-entropy even though we have only one output neuron. This is just to keep the notation consistent. All you have to do to make it work as cross-entropy is to create the probability for the other class by subtracting the output from 1.
The loss function also holds the same for both log loss and cross-entropy. Do a little research on this and you will see that, in the binary case, the two are the same.
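A one-line check: for a single Sigmoid output a = P(y = 1) and label y ∈ {0, 1}, the cross-entropy over the two classes (a, 1 − a) reduces to

-( y log(a) + (1 − y) log(1 − a) ),

which is exactly the binary log loss.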
Task:
Write a class that takes #input_dimensions, #hidden_layers, #hidden_layer_size (number of neurons in each hidden layer), and #output_dimensions (1 in this example, with Sigmoid activation).
Initialize all of them as class variables.
As you already know the dimensions of each weight and bias matrix, initialize them randomly.
Write a function call() that takes an input of any batch_size and returns the output.
You can check the shape of the output to cross-verify your code. (A rough sketch of one possible implementation follows below.)
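Below is a minimal sketch of one possible solution, not a reference implementation. The class and parameter names simply mirror the task description, NumPy is assumed, and Sigmoid is used everywhere as in the article. Note that, unlike the column-vector equations above (Z = W A), the code lays the batch out as rows of shape (batch_size, n), so each layer is a single matrix multiply per batch.

```python
import numpy as np

def sigmoid(z):
    # Element-wise Sigmoid activation.
    return 1.0 / (1.0 + np.exp(-z))

class FFNN:
    def __init__(self, input_dimensions, hidden_layers, hidden_layer_size, output_dimensions=1):
        self.input_dimensions = input_dimensions
        self.hidden_layers = hidden_layers
        self.hidden_layer_size = hidden_layer_size
        self.output_dimensions = output_dimensions

        # Layer sizes: input -> hidden_layers x hidden_layer_size -> output.
        sizes = [input_dimensions] + [hidden_layer_size] * hidden_layers + [output_dimensions]

        # W[l] has shape (n_{l-1}, n_l) so a batch A^{l-1} of shape
        # (batch_size, n_{l-1}) can be multiplied as A^{l-1} @ W[l];
        # b[l] has shape (1, n_l) and broadcasts over the batch.
        self.weights = [np.random.randn(sizes[l], sizes[l + 1]) for l in range(len(sizes) - 1)]
        self.biases = [np.random.randn(1, sizes[l + 1]) for l in range(len(sizes) - 1)]

    def call(self, X):
        # X: (batch_size, input_dimensions) -> output: (batch_size, output_dimensions)
        A = X  # A^0 is the input, following the article's notation
        for W, b in zip(self.weights, self.biases):
            Z = A @ W + b   # pre-activation Z^l
            A = sigmoid(Z)  # post-activation A^l
        return A

# Quick shape check with dummy data.
if __name__ == "__main__":
    net = FFNN(input_dimensions=4, hidden_layers=2, hidden_layer_size=3, output_dimensions=1)
    X = np.random.randn(8, 4)    # batch_size = 8
    print(net.call(X).shape)     # expected: (8, 1)
```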
We will consider H1 as our current layer; when we talk with respect to H1, the input layer I becomes the previous layer.
w^1_{11} → weight of the edge that connects the 1st neuron in the previous layer (l−1) to the 1st neuron in the present layer (1).
w^1_{21} → weight of the edge that connects the 2nd neuron in the previous layer (l−1) to the 1st neuron in the present layer (1).
w^1_{12} → weight of the edge that connects the 1st neuron in the previous layer (l−1) to the 2nd neuron in the present layer (1).
w^1_{22} → weight of the edge that connects the 2nd neuron in the previous layer (l−1) to the 2nd neuron in the present layer (1).
...
z^1_1, a^1_1 → pre-activation and post-activation of the 1st layer's 1st neuron
z^1_2, a^1_2 → pre-activation and post-activation of the 1st layer's 2nd neuron
z^1_3, a^1_3 → pre-activation and post-activation of the 1st layer's 3rd neuron
In general,
w^l_{ij} → weight between the (l−1)-th layer's i-th node and the l-th layer's j-th node.
b^l_i → bias for the l-th layer's i-th node.
z^l_i → pre-activation of the l-th layer's i-th node.
a^l_i → post-activation of the l-th layer's i-th node. This is also an input to the next layer.
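Putting these together, the per-node computation (with Sigmoid as the activation used throughout this article) is

z^l_j = b^l_j + \sum_i w^l_{ij} a^{l-1}_i, \qquad a^l_j = Sigmoid(z^l_j),

with the sum running over all nodes i of layer (l−1).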
We can replace the vector of z's with Z^1 (the superscript indicates the layer number, i.e., 1 in this case), the weight matrix with W^1, the input with A^0, and finally the bias vector for this layer with b^1.
Why A^0 instead of X for the input?
During forward propagation in the later layers, the input is the output from the previous layer, which is denoted A^{l−1}, where (l−1) is the previous layer's index. That is, A^1 is the input for the second hidden layer, the input for the 3rd hidden layer is A^2, and so on.
Just to keep the notation consistent across the network, we use A^0 instead of X.
Now, the post-activation A^1 is just a function of Z^1; that function can be any non-linear activation such as Sigmoid, Tanh, ReLU, or Leaky ReLU. In this case, I will stick with Sigmoid.
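For reference, Sigmoid is applied element-wise to the pre-activation vector:

A^1 = Sigmoid(Z^1), \qquad Sigmoid(z) = 1 / (1 + e^{-z})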
As this is the output layer, its output is nothing but the predicted value(s) for the given input example. Since we are dealing with binary classification, we have just one neuron with a Sigmoid activation function.
A^3 = Sigmoid(Z^3), where both A^3 and Z^3 have shape (1, 1).
The output is a single value between 0 and 1. We can use it as a probability score and predict the label with an appropriate threshold during inference (test time). While training, we calculate the cross-entropy for all the input examples.
But in the current example, we have only one input example, so m = 1. The whole process with a batch of multiple examples will be covered at the end; it doesn't change the equations much, apart from the shape of the input at each layer. (More on this later.)
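For reference, the usual averaged form of the binary cross-entropy over a batch of m examples, writing \hat{y}_k for the network output on the k-th example, is

L = -\frac{1}{m} \sum_{k=1}^{m} [ y_k log(\hat{y}_k) + (1 − y_k) log(1 − \hat{y}_k) ]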