Forward propagation
Before jumping into the implementation, I would suggest you watch this video, just to refresh your memory on how neural networks work. It is part of a playlist, and I will refer to those videos wherever necessary. For now, you can watch this one and move on to the article.
Assuming that you understood the video, let's take a closer look at the network once again, with annotated parameters for the input layer, the first hidden layer, and the weights connecting them.
In this notation, the superscript always denotes the layer number and the subscript always refers to a node within a particular layer.
From the above diagram, the pre-activations for the first layer can be computed as follows (assuming all the weights, biases, and inputs are annotated as in the diagram):

$$z^{1}_{1} = w^{1}_{11}x_{1} + w^{1}_{12}x_{2} + \dots + b^{1}_{1}$$

$$z^{1}_{2} = w^{1}_{21}x_{1} + w^{1}_{22}x_{2} + \dots + b^{1}_{2}$$

and, in general, $z^{1}_{i} = \sum_{j} w^{1}_{ij}\,x_{j} + b^{1}_{i}$.
Finally, we can replace the above equations with the vector notation below:

$$z^{1} = W^{1}x + b^{1}$$

where $z^{1}$ is the vector of first-layer pre-activations, $W^{1}$ is the weight matrix, $x$ is the input vector, and $b^{1}$ is the bias vector of the first layer.
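As a quick illustration, here is a minimal NumPy sketch of that first-layer computation. The shapes (`input_dim`, `hidden_size`) and the variable names `W1`, `b1` are my own choices for this example, not something fixed by the article.

```python
import numpy as np

input_dim = 4      # number of input features (illustrative)
hidden_size = 3    # number of neurons in the first hidden layer (illustrative)

# Randomly initialized parameters of the first layer
W1 = np.random.randn(hidden_size, input_dim) * 0.01   # weight matrix W^1
b1 = np.zeros((hidden_size, 1))                        # bias vector b^1

x = np.random.randn(input_dim, 1)                      # a single input example

# Vectorized pre-activation: z^1 = W^1 x + b^1
z1 = W1 @ x + b1
print(z1.shape)   # (3, 1), i.e. (hidden_size, 1)
```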
We will now look at the second layer. As we have already discussed how to vectorize the operations, we will write the matrices directly:

$$z^{2} = W^{2}a^{1} + b^{2}, \qquad a^{2} = \sigma\!\left(z^{2}\right)$$

where $W^{2}$ and $b^{2}$ are the weight matrix and bias vector of the second layer, and $a^{1}$ is the output of the first layer. The same pattern applies to the third (output) layer:

$$z^{3} = W^{3}a^{2} + b^{3}, \qquad \hat{y} = a^{3} = \sigma\!\left(z^{3}\right)$$

where $W^{3}$ and $b^{3}$ are the weight matrix and bias vector of the output layer, and $a^{2}$ is the output of the second layer.
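To see those steps together, here is a minimal NumPy sketch of the forward pass for this example (two hidden layers plus one sigmoid output); the function and variable names are my own choices, not the article's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, W3, b3):
    a1 = sigmoid(W1 @ x + b1)        # layer 1: z^1 = W^1 x   + b^1
    a2 = sigmoid(W2 @ a1 + b2)       # layer 2: z^2 = W^2 a^1 + b^2
    y_hat = sigmoid(W3 @ a2 + b3)    # layer 3: z^3 = W^3 a^2 + b^3
    return y_hat
```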
People who know about log loss and cross-entropy might get confused here, because I have used cross-entropy even though we have only one neuron. This is just to keep the notation consistent. All you have to do to make it work as a two-class cross-entropy is to create the probability for the other class by subtracting the output from 1. The same holds for the loss function: for binary classification, log loss and cross-entropy are the same thing. Please do a little research on this and you will see that both are the same.
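If it helps, here is a small numeric check of that claim. It computes the binary log loss directly and then the two-class cross-entropy over the pair $[1 - \hat{y},\ \hat{y}]$; the helper names are my own.

```python
import numpy as np

def log_loss(y, y_hat):
    """Binary log loss for a single example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cross_entropy(y_onehot, p):
    """General cross-entropy between a one-hot label and a probability vector."""
    return -np.sum(y_onehot * np.log(p))

y, y_hat = 1, 0.8                   # true label and the single sigmoid output
p = np.array([1 - y_hat, y_hat])    # probabilities for class 0 and class 1
y_onehot = np.array([1 - y, y])

print(log_loss(y, y_hat))           # 0.2231...
print(cross_entropy(y_onehot, p))   # 0.2231...  (same value)
```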
Write a class that takes #input_dimensions, #hidden_layers, #hidden_layer_size (the number of neurons in each hidden layer), and #output_dimensions (1 in this example, with sigmoid activation).
Initialize all of them as class variables.
As you already know the dimensions of each weight and bias matrices, initialize them randomly.
Write a function call() that takes an input of any batch_size and returns the output.
You can check the shape of the output to cross-verify your code.
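Here is one possible sketch of that exercise in NumPy. The class name, attribute names, and the random initialization scheme are my choices; treat it as a reference to check your own shapes against, not as the only correct implementation.

```python
import numpy as np

class FeedForwardNetwork:
    def __init__(self, input_dim, hidden_layers, hidden_layer_size, output_dim=1):
        # Store the architecture as class variables
        self.input_dim = input_dim
        self.hidden_layers = hidden_layers
        self.hidden_layer_size = hidden_layer_size
        self.output_dim = output_dim

        # Layer sizes: input -> hidden (repeated) -> output
        sizes = [input_dim] + [hidden_layer_size] * hidden_layers + [output_dim]

        # Randomly initialize one weight matrix and bias vector per layer
        self.weights = [np.random.randn(sizes[i + 1], sizes[i]) * 0.01
                        for i in range(len(sizes) - 1)]
        self.biases = [np.zeros((sizes[i + 1], 1)) for i in range(len(sizes) - 1)]

    @staticmethod
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def call(self, x):
        """Forward pass. x has shape (input_dim, batch_size)."""
        a = x
        for W, b in zip(self.weights, self.biases):
            a = self.sigmoid(W @ a + b)   # z^l = W^l a^(l-1) + b^l, a^l = sigmoid(z^l)
        return a                          # shape (output_dim, batch_size)

# Cross-verify the output shape for a batch of 5 examples
net = FeedForwardNetwork(input_dim=4, hidden_layers=2, hidden_layer_size=3)
out = net.call(np.random.randn(4, 5))
print(out.shape)   # (1, 5)
```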
We will consider $l$ as our current layer, and when we talk w.r.t. layer $l$, then $l-1$ would become the previous layer.
$w^{l}_{11} \rightarrow$ weight of the edge that connects the $1^{st}$ neuron in the previous layer $(l-1)$ to the $1^{st}$ neuron in the present layer $(l)$.
$w^{l}_{12} \rightarrow$ weight of the edge that connects the $2^{nd}$ neuron in the previous layer to the $1^{st}$ neuron in the present layer.
...
$w^{l}_{ij} \rightarrow$ weight of the edge that connects the $j^{th}$ neuron in the previous layer to the $i^{th}$ neuron in the current layer.
$z^{l}_{1},\ a^{l}_{1} \rightarrow$ pre-activation and post-activation of the $l^{th}$ layer's $1^{st}$ neuron.
...
$z^{l}_{i},\ a^{l}_{i} \rightarrow$ pre-activation and post-activation of the $l^{th}$ layer's $i^{th}$ neuron.

In general, $\boldsymbol{a^l_i} \rightarrow$ post-activation of the $\boldsymbol{(l)^{th}}$ layer's $\boldsymbol{(i)^{th}}$ node. This is also an input to the next layer.
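Putting this notation together, the forward step for any layer $l$ can be written compactly as follows (this is just the standard form implied by the definitions above, shown here for reference, with $\sigma$ denoting whichever activation function is chosen):

$$z^{l}_{i} = \sum_{j} w^{l}_{ij}\,a^{l-1}_{j} + b^{l}_{i}, \qquad a^{l}_{i} = \sigma\!\left(z^{l}_{i}\right)$$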
We can replace the vector of pre-activations with $z^{1}$ (the 1 in the superscript indicates the layer number, i.e., 1 in this case), the weight matrix with $W^{1}$, the input with $x$, and finally the bias vector for this layer with $b^{1}$.
During forward propagation in further layers, the input would be the output from the previous layer, denoted as $a^{l-1}$, where $l-1$ is the previous layer index, i.e., $a^{1}$ is the input for the second hidden layer, the input for the 3rd hidden layer is $a^{2}$, and so on.
Just to make the notation consistent across the network, we consider $a^{0}$ instead of $x$ for the input layer.
Now, the post-activation $a^{l}$ is just a function of $z^{l}$, which can be any non-linear function such as sigmoid, tanh, ReLU, or leaky ReLU. In this case, I will be sticking with sigmoid.
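For reference, the sigmoid function is

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

which maps any real-valued pre-activation to a value between 0 and 1.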
For the second layer, the inputs are $a^{1}$ (the output from the previous layer), the weight matrix $W^{2}$, and the bias vector $b^{2}$. The equations would be as follows:

$$z^{2} = W^{2}a^{1} + b^{2}, \qquad a^{2} = \sigma\!\left(z^{2}\right)$$

For the third (output) layer, the inputs are $a^{2}$ (the output from the previous layer), the weight matrix $W^{3}$, and the bias vector $b^{3}$. The equations would be as follows:

$$z^{3} = W^{3}a^{2} + b^{3}, \qquad a^{3} = \sigma\!\left(z^{3}\right)$$
As it is the last layer, the output from this layer is nothing but the predicted value(s) for the given input example. As we are dealing with binary classification, we have just one neuron with a sigmoid activation function.
The output value is a single value between 0 and 1. We can use this as a probability score to predict the label with an appropriate threshold during inference (test time). While training, we will calculate the loss for all the input examples.
But in the current example we have only one input example, so the batch size is 1. The whole process with a batch of multiple examples will be covered at the end, even though it doesn't change much of the equations here except the shape of the input at each layer. (More will be discussed later.)
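To give a rough idea of how batching and thresholding look in code, here is a small self-contained sketch for a single sigmoid output; the parameter shapes and the 0.5 threshold are illustrative choices, not something fixed by the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters for a single output neuron acting on 4 input features
W = np.random.randn(1, 4) * 0.01
b = np.zeros((1, 1))

# A batch of 8 examples: only the input shape changes, the equation stays the same
X = np.random.randn(4, 8)            # shape (input_dim, batch_size)
probs = sigmoid(W @ X + b)            # shape (1, batch_size), values in (0, 1)

# At inference, apply a threshold (0.5 here) to turn probabilities into class labels
preds = (probs >= 0.5).astype(int)
print(probs.shape)                    # (1, 8)
print(preds)                          # array of 0s and 1s
```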