Before jumping into the implementation, I suggest you watch this video, just to refresh your memory on how a neural network works. It is part of a playlist, and I will refer to the other videos wherever necessary. For now, you can watch this video and move on to the article.
Basics of Neural networks and forward propagation.
Assuming that you understood the video, let's take a closer look at the network once again, with annotated parameters for the input layer, the first hidden layer, and the weights connecting them.
Understanding Notations
We will consider $H_1$ as our current layer; when we talk with respect to $H_1$, the input layer $I$ becomes the previous layer.
In this notation, the superscript always denotes the layer number and the subscript always refers to the node within a particular layer.
From the above diagram,
$w^1_{11}$ → weight of the edge that connects the 1st neuron in the previous layer $(l-1)$ to the 1st neuron in the present layer $(1)$.
$w^1_{21}$ → weight of the edge that connects the 2nd neuron in the previous layer $(l-1)$ to the 1st neuron in the present layer $(1)$.
$w^1_{12}$ → weight of the edge that connects the 1st neuron in the previous layer $(l-1)$ to the 2nd neuron in the present layer $(1)$.
$w^1_{22}$ → weight of the edge that connects the 2nd neuron in the previous layer $(l-1)$ to the 2nd neuron in the present layer $(1)$.
...
$z^1_1, a^1_1$ → pre-activation and post-activation of the 1st layer's 1st neuron
$z^1_2, a^1_2$ → pre-activation and post-activation of the 1st layer's 2nd neuron
$z^1_3, a^1_3$ → pre-activation and post-activation of the 1st layer's 3rd neuron
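Putting the notation together, the pre-activation of the $k$-th neuron in layer 1 is a weighted sum of the previous layer's activations plus a bias, and the post-activation applies the activation function $g$ to it (a reconstruction of the per-neuron form implied above, where $b^1_k$ denotes the bias of the $k$-th neuron and $j$ runs over the neurons of the previous layer):

$$z^1_k = \sum_{j} w^1_{jk}\, a^0_j + b^1_k, \qquad a^1_k = g\big(z^1_k\big)$$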
We can replace the vector of $z$'s with $Z^1$ (the superscript indicates the layer number, i.e., 1 in this case), the weight matrix with $W^1$, the input with $A^0$, and finally the bias vector for this layer with $B^1$.
Why $A^0$ instead of $X$ for the input?
During forward propagation in the later layers, the input is the output from the previous layer, denoted $A^{l-1}$, where $l-1$ is the previous layer index. That is, $A^1$ is the input to the second hidden layer, the input to the 3rd hidden layer is $A^2$, and so on.
Just to keep the notation consistent across the network, we use $A^0$ instead of $X$.
Finally, we can replace the above equations with the vector notation below.
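Since the original figure with the vector form is not reproduced here, the following is a reconstruction from the definitions above (each row of $W^1$ collects the weights feeding one neuron of layer 1):

$$Z^1 = W^1 A^0 + B^1$$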
Now, the post-activation $A^1$ is just a function of $Z^1$, which can be any non-linear function such as Sigmoid, Tanh, ReLU, or Leaky ReLU. In this case, I will stick with Sigmoid.
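Concretely, with the Sigmoid applied element-wise:

$$A^1 = \sigma(Z^1), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$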
For the second hidden layer, the inputs are $A^1$ (the output from the previous layer), the weight matrix $W^2$, and the bias vector $B^2$. The equations would be as follows.
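As with the first hidden layer (again a reconstruction, since the original equation image is not shown here):

$$Z^2 = W^2 A^1 + B^2, \qquad A^2 = \sigma(Z^2)$$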
For the third layer, the inputs are $A^2$ (the output from the previous layer), the weight matrix $W^3$, and the bias vector $B^3$. The equations would be as follows.
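The pre-activation follows the same pattern (reconstructed from the preceding layers):

$$Z^3 = W^3 A^2 + B^3$$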
As it is the Output layer, the output from this layer is nothing but the predicted value(s) for the given input example. Since we are dealing with binary classification, we have just one neuron with a Sigmoid activation function.
$$A^3_{(1,1)} = \text{Sigmoid}\big(Z^3_{(1,1)}\big)$$
The output is a single value between 0 and 1. We can use it as a probability score and predict the label by applying an appropriate threshold during inference (test time). While training, we will calculate the Cross-Entropy over all the input examples.
In the current example we have only one input example, so m = 1. The whole process with a batch of multiple examples is covered at the end; it doesn't change the equations much, apart from the shape of the input at each layer. (More on this later.)
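For reference, the standard binary cross-entropy (log loss) over $m$ examples, with $\hat{y}^{(i)}$ the network's output ($A^3$) for the $i$-th example, is:

$$\mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \Big[\, y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big) \Big]$$

With $m = 1$, the sum reduces to the single term for our one example.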
Summary
Is that loss function correct?
People who know about log loss and cross-entropy might get confused here, because I've used the term cross-entropy even though we have only one neuron. This is just to keep the terminology consistent. All you have to do to recover the usual two-class cross-entropy is to create the probability of the other class by subtracting the output from 1.
The same loss function holds for both log loss and cross-entropy. Do a little research on this and you will see that, for two classes, they are the same.
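To see it concretely, treat the single Sigmoid output $\hat{y}$ as the probability of class 1 and $1 - \hat{y}$ as the probability of class 0; the two-class cross-entropy then expands to exactly the log-loss term used above:

$$-\sum_{c \in \{0,1\}} y_c \log p_c = -\big(\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \big)$$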
Task:
Write a class that takes #input_dimensions, #hidden_layers, #hidden_layer_size (the number of neurons in each hidden layer), and #output_dimensions (1 in this example, with a Sigmoid activation).
Initialize all of them as class variables.
As you already know the dimensions of each weight and bias matrix, initialize them randomly.
Write a function call() that takes an input of any batch_size and returns the output.
You can check the shape of the output to cross-verify your code.
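Here is one possible sketch of the task in NumPy. The class name, the row-major input layout (shape (batch_size, input_dimensions)), the equal size of every hidden layer, and the small random initialization scale are my own assumptions, not fixed by the article:

```python
import numpy as np

class SimpleNN:
    """Minimal fully-connected network for the exercise above.

    Assumptions (not fixed by the article): inputs are passed as rows,
    i.e. X has shape (batch_size, input_dimensions); every hidden layer
    has hidden_layer_size neurons; all layers use the Sigmoid activation.
    """

    def __init__(self, input_dimensions, hidden_layers, hidden_layer_size, output_dimensions=1):
        # Keep the constructor arguments as class variables.
        self.input_dimensions = input_dimensions
        self.hidden_layers = hidden_layers
        self.hidden_layer_size = hidden_layer_size
        self.output_dimensions = output_dimensions

        # Layer sizes: input -> hidden (x hidden_layers) -> output.
        sizes = [input_dimensions] + [hidden_layer_size] * hidden_layers + [output_dimensions]

        # Random weights and biases. W^l has shape (fan_in, fan_out) and
        # B^l has shape (1, fan_out) so they broadcast over the batch.
        self.weights = [np.random.randn(sizes[l], sizes[l + 1]) * 0.01
                        for l in range(len(sizes) - 1)]
        self.biases = [np.random.randn(1, sizes[l + 1]) * 0.01
                       for l in range(len(sizes) - 1)]

    @staticmethod
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def call(self, X):
        """Forward propagation for a batch of any size."""
        A = X  # A^0, shape (batch_size, input_dimensions)
        for W, B in zip(self.weights, self.biases):
            Z = A @ W + B        # pre-activation Z^l
            A = self.sigmoid(Z)  # post-activation A^l
        return A                 # shape (batch_size, output_dimensions)


# Quick shape check: 4 input features, 3 hidden layers of 5 neurons, 1 output.
net = SimpleNN(input_dimensions=4, hidden_layers=3, hidden_layer_size=5, output_dimensions=1)
out = net.call(np.random.randn(10, 4))  # a batch of 10 examples
print(out.shape)                        # expected: (10, 1)
```

Note that, because the batch is laid out in rows, each weight matrix here is stored transposed relative to the column-vector equations $Z^l = W^l A^{l-1} + B^l$ above; the loop still mirrors those equations layer by layer.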