Topic 22 | Neural Networks

 

In today's lecture we will continue with neural networks; in the last lecture we gave a basic idea of what to expect from a neural network. Today we will explain the hyperparameters and their components, and we will see the effect of each hyperparameter on the neural network. We will use the TensorFlow Playground and practice in a notebook. At the end of the lecture, we will implement a basic network using TensorFlow. Today we will go through the essentials, and in tomorrow's lecture we will discuss the notebook in detail: what the Dense construct means in TensorFlow, what the layers do, what a loss function is, and what the possibilities are for the hidden layers and the output layer.

After practicing in TensorFlow you will get a better idea about these things. So, we will implement a neural network in TensorFlow and run it to see the outcome.

Suppose we have a few balls and our target is to throw them into a basket placed at a distance of 10 meters. When we throw the first ball, it may fall next to the basket or short of it. Before throwing the second ball we change the trajectory: the angle is adjusted so that the ball may fall into the basket. By making small changes to the angle of the trajectory, the ball falls near the basket; if you increase the angle too much, the ball will fall far away from the basket. The value of the trajectory we keep changing is a parameter. The basket at a distance of 10 meters is a fixed parameter. The number of attempts you make is also a parameter which is not being changed. But the trajectory is a parameter which is being adjusted.

The parameters we adjust while training a neural network are called hyperparameters, and the size of the adjustment we make to the trajectory so that the ball falls into the basket is like the learning rate. The special thing about hyperparameters is that you cannot learn them from data. Parameters are the internal values of the neurons that are learnt from data during training, while hyperparameters are more a matter of common sense: they are learnt through trial and error rather than from data. As in the example of throwing the ball, the trajectory we decide on is not learnt from the available data; it is common sense that helps us decide the trajectory. In the same way, while designing a computer vision model, it is common sense and experience that help us decide which hyperparameters to select for the purpose. Similarly, different hyperparameters would be required to classify logs or to do Natural Language Processing (NLP). So, we can say that choosing hyperparameters is an art rather than a science.

For example, the residents of a desert or forest can, while travelling, sense the dangerous areas on their way and adjust their route and timing accordingly. This sense develops in them after spending time in that area and facing different situations again and again. People new to the area would not have that sense. In the same way, after building neural networks again and again, a sense develops which guides us as to which hyperparameters will produce better results. So, remember: parameters are learnt from data, and hyperparameters are learnt from trial and error and experience. The learning rate in the example quoted above is also a hyperparameter.

Following are the possible parameters in the basketball example:

  1. The force with which we throw the ball
  2. The height at which we throw the ball
  3. The distance between us and the basket
  4. The height of the basketball pole, which, although fixed, is also a parameter

In the first attempt, if we throw the ball with a certain force and height and the resulting trajectory does not produce the desired result, we gradually adjust our force and height again and again, based on our previous attempts, to achieve the target of throwing the ball into the basket. So, it depends upon what our ultimate goal is and which hyperparameters are required to achieve it.

Remember, don't confuse hyperparameters with the penalty we studied in reinforcement learning. The concept of fine-tuning (or a penalty) is applied to a model which has already been trained and with which we have limited room to play further. Such a model will learn easily when new data is given to it for training.

There is a difference between teaching history to a person who has a master's degree in English and teaching it to a person who is illiterate. To teach history to an illiterate person, he first has to learn English, and only then will he be able to gain the knowledge of history. So fine-tuning is like teaching a literate person, who will not take much time and effort to learn new skills, while training a neural network from scratch is like teaching a person who has no previous background; it takes much more time and effort to teach him the new skills.
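To make the contrast concrete, here is a minimal TensorFlow/Keras sketch of fine-tuning, assuming a pretrained MobileNetV2 base; the choice of base model, input shape, and head size are our own illustrative assumptions, not something fixed by the lecture:

```python
import tensorflow as tf

# Load a model that has already been trained (here, on ImageNet).
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False,
                                         weights="imagenet",
                                         pooling="avg")
base.trainable = False  # "limited access": the learned layers stay fixed

# Only this small new head is trained, so learning from new data is fast,
# like teaching a new subject to an already-literate person.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),  # e.g. cat vs dog
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Training from scratch would instead mean building the whole network with randomly initialized weights and learning everything from the data, which takes far more time and far more data.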

A few hyperparameters have been circled in the accompanying diagram. Remember, these are only some of the hyperparameters; the remaining ones will be taught in due course in the coming lectures.

 

The term Gradient Descent comprises two words: Gradient means slope, and Descent means moving downwards. For example, stepping down from a roof is an example of gradient descent.

Now let us discuss the relation between the learning rate and gradient descent, and explain it with the help of an example. Suppose you are standing on the peak of a mountain, blindfolded with a ribbon, and your goal is to reach the lowest point. The peak of the mountain is the point from where we start training our neural network, and it has the maximum value of loss. Our target is to move towards the point which has the minimum value of loss. However, we don't know how to reach there, as we can't see because of the ribbon over our eyes. One way would be to be a superman, so that you would not get hurt if you slipped on your way down. The other, practical, solution is to start moving and choose whichever direction feels like it is going down. However, you also have to decide the pace at which you move downwards: either you jump like a kangaroo, or you move slowly with small steps until you reach your desired destination. When you jump like a kangaroo, there is a possibility that you overshoot the desired destination and start moving up another mountain after reaching the ground. Your goal is to reach the ground, not to climb another mountain, so your loss in this case will start increasing.

If you climb up the other mountain, realize that you are moving upward, and jump back down, you may again miss the ground and land at another point on the previous mountain. However, when you complete the journey in small steps and at a slow pace, this will not happen: as soon as you step up after reaching the ground, you realize that you are gaining height, and you immediately step back to the desired point on the ground.

The size of the jump or step you take to reach your desired point on the ground is called the learning rate.

Let us explain it with another example. Suppose you have a rod of 3 feet to measure distance, and the total distance is 11 feet. Remember, your eyes are covered with a ribbon. What will happen? After taking four steps of 3 feet you will have covered a distance of 12 feet and will have crossed your desired point by one foot. However, suppose we instead cover the distance with the help of a rod that is 1 foot long. What happens now? You will stop after measuring 11 times and will not overshoot your desired point.

Moving toward the lowest point is called convergence; when we converge, it means we have reached the lowest point of the valley between the two mountains. When the step size is large (the long rod), there is a possibility that we miss the base point, the lowest point of the valley. On the other hand, when we take small steps, it takes more time to converge, but we avoid the chance of missing the lowest point, crossing it, and starting to move upward again. So, the step size is the learning rate. At convergence, the process of training the neural network is complete.
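To make this concrete, here is a minimal Python sketch of gradient descent on a toy one-dimensional loss; the function f(x) = x², the starting point, and the learning-rate values are our own illustrative choices, not from the lecture:

```python
# A minimal sketch of gradient descent on the toy loss f(x) = x**2.

def gradient(x):
    # Derivative (slope) of f(x) = x**2
    return 2 * x

def gradient_descent(learning_rate, start=5.0, steps=20):
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)  # step downhill along the slope
    return x

# A small learning rate converges slowly but surely toward the minimum at x = 0.
print(gradient_descent(learning_rate=0.1))   # roughly 0.06, close to 0

# A learning rate that is too large overshoots the minimum on every step,
# like the kangaroo jumping up the opposite mountain.
print(gradient_descent(learning_rate=1.1))   # magnitude keeps growing
```

With the small learning rate the value creeps toward the minimum at x = 0; with the large one it overshoots and swings ever further away, exactly like the kangaroo jumps described above.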

 
In the accompanying figure, the convergence side shows the journey towards the lowest point completed with small steps, while the divergence side shows the journey towards the lowest point with large steps. Here we can see that in divergence, when we take large steps, we miss the base point and move upward on the other side. When we try to come back, we again miss the base point and oscillate between these points repeatedly.

As discussed above, the learning rate is the size of the step or jump.

Suppose you are going to train a neural network and you give the model a huge amount of data. What happens at this stage? The computer used for the purpose may not have enough memory to hold the data for training and to utilize it to generate output. For example, if the class instructor suddenly announces a test covering all the chapters of the book, the brain of the student will be overwhelmed; similarly, the computer memory also gets overwhelmed.

In order to make the learning process easy, we divide the syllabus into small parts and test the students step by step. Similarly, instead of giving the model the huge amount of data or images at once, the model is given small chunks of data. This process of dividing the huge amount of data is called batching.

Batch size is also a hyperparameter. If the students are intelligent, a larger part of the syllabus can be given to them at once, and less time will be consumed. An epoch is like the whole book: we can divide it into chapters (batches) for training the model.

Instead of learning the book chapter by chapter only once, we can study the whole book again and again to gain complete knowledge of it. Similarly, the model can be trained on the whole dataset again and again. Suppose we first train the model with 1000 images, then again with the same 1000 images, and so on. The number of times we train the model on those 1000 images is called the number of epochs. One model may learn after 8 epochs (8 attempts) and another may learn after 20 epochs. Computer vision networks sometimes need 100 epochs to train. It is experience that works here. So, we call batch size one hyperparameter and the number of epochs another hyperparameter.

So far, we have discussed following hyper parameters:

  1. Learning rate
  2. Batch size
  3. Number of epochs
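
To see where these three hyperparameters appear in practice, here is a minimal TensorFlow/Keras sketch; the toy data shapes, layer sizes, and the particular values of the learning rate, batch size, and number of epochs are illustrative assumptions, not prescriptions:

```python
import numpy as np
import tensorflow as tf

# Illustrative toy data: 1000 samples with 20 features, binary labels.
x_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Hyperparameter 1: learning rate (the step size of gradient descent).
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer,
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Hyperparameter 2: batch size (the chunk of data per step).
# Hyperparameter 3: number of epochs (full passes over the data).
model.fit(x_train, y_train, batch_size=32, epochs=8)
```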


Look at the picture we already discussed. Every neuron in this picture makes a decision on the basis of the data given to it. The value of each decision taken by a neuron is passed to the neurons next to it. The next neuron, on the basis of the decision value provided to it, adds its own decision value and passes it to the neuron after it. So, the neurons in the first layers work on simple values, which become more complicated layer by layer. The neurons at the very end act as the super neurons that make the decision on the basis of the values contributed by all the previous neurons. Simply, we can say that they decide whether the given image is a cat or a dog.

Remember, the concepts of hidden layer and output layer must be clear in your mind. The decision-making neurons are called output neurons, the input-processing neurons are called hidden neurons, and the neurons supplying the input are called input neurons. So, in a neural network there are three types of layers.

We have to decide which type of activation should be used at each type of layer. Suppose we decide that at the hidden layers we will always use ReLU, the popular max function. If your output neurons have to make a decision on binary classification, we use sigmoid at the output neurons; we used this while discussing logistic regression.
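For reference, ReLU really is just the max function applied to each value; a one-line sketch in plain Python:

```python
def relu(x):
    # ReLU: returns x when it is positive, and 0 otherwise
    return max(0.0, x)
```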

After binary classification now move towards more than two types of inputs out of which the model has to identify. Suppose we have three shapes, square, triangle and circles as input out of which model has to identify which type of shape it is. Now we have to go beyond the binary classification of Yes and No done by the output layer so far. In order to identify the items of more than two types we will use SoftMax. We can say that SoftMax is the council of judges which assign values to each out put as per their knowledge. The maximum value giving by any judge will be our output. Number of Neuron in SoftMax will depend upon the number of classes in input. Suppose if there are three classes the number of SoftMax neurons would also be three.
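As a rough sketch, such a three-class network could look like the following in TensorFlow/Keras; the input shape and hidden-layer sizes are illustrative assumptions:

```python
import tensorflow as tf

# Hidden layers use ReLU; the output layer uses softmax with one
# neuron per class (square, triangle, circle => 3 neurons).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # the "council of judges"
])

# For integer class labels (0, 1, 2) we can use sparse categorical cross-entropy.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.summary()
```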

 






