AI Free Basic Course | Lecture 24 | Basics of Convolutional Neural Network (CNN)
In the last lecture, the importance and value of neural networks were
discussed. We are targeting high-value projects worth more than 30,000/-. You
are advised not to jump directly into these projects. You need to set goals and
move towards them systematically. You have to decide where you are going to
start; it can be prompt engineering or any other field. To achieve your goal,
allocate a specific portion of your day to working on it. People from
non-technical backgrounds asked how they should move forward. My response to
them is that you need not understand everything. We must, however, know the
difference between the capabilities of an Artificial Neural Network (ANN) and a
Convolutional Neural Network (CNN), and we must understand why tasks that could
not be performed well by an ANN can be performed by a CNN. To explain: when you
need to recognize an object in an image, the CNN is the best choice. Compared
with a fully connected ANN, a CNN is faster, consumes less computational power,
and is more effective and more accurate in certain cases. The CNN is a part of
deep learning.
Suppose a client asks us to identify an object, or to confirm by scanning CCTV
camera footage whether it is still in place or has been stolen.
The conventional way to respond to the client is to watch the footage on the
DVR and then inform the client about the situation. However, CCTV cameras are
now available on the market that send a live notification to the client while
an item is being stolen. How does such a camera work? It scans the face of the
unknown person and matches it against the stored images of family members. If
there is no match, a push notification is sent to the client through an app.
All this has become possible due to the Convolutional Neural Network (CNN).
So we will now study the limitations of the Artificial Neural Network (ANN)
that have been overcome by the Convolutional Neural Network (CNN).
In the last lecture, we studied the standard neural network, also called the
fully connected neural network. We gave it the MNIST data, which comprises
pictures of handwritten digits, and the accuracy of this network was 90%. So
the question arises: when such networks already produce an excellent accuracy
of 90%, why is there a need to opt for another form of neural network? Let us
discuss why we need it.
The MNIST data consists of small 28 x 28 pictures, so each picture is a vector
of 784 values. What if the input is a color image of size 1000 x 1000 (which is
not even a very big image)? As you know, a color image carries information
about red, green and blue, so the vector for the color image would have
1000 x 1000 x 3 = 3,000,000 values. So by increasing the resolution a little,
we get an input vector of 3 million values.
Suppose there are 1000 neurons. As we know, in a fully connected network all
these neurons are placed in the first layer, and each neuron is connected to
every input. An input of 3 million values given to a first layer of 1000
neurons therefore results in 3 million x 1000 = 3 billion weights. This is the
story of the first layer alone, and we know that a fully connected network can
have many layers. Although a fully connected neural network can still work on
an image of such high resolution, the computational cost would be very high. So
the fully connected network becomes very expensive.
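Let us check this count with a few lines of Python. This is only a sketch: the
function name is made up for illustration, and it counts weights only (one per
input per neuron), ignoring the bias of each neuron.

```python
# Count the weights in the first fully connected layer for two input sizes.
# The 1000-neuron layer comes from the lecture; biases are ignored here.

def first_layer_weights(height, width, channels, neurons):
    """Each neuron has one weight for every input value."""
    inputs = height * width * channels
    return inputs, inputs * neurons

mnist_inputs, mnist_weights = first_layer_weights(28, 28, 1, 1000)
color_inputs, color_weights = first_layer_weights(1000, 1000, 3, 1000)

print(f"28 x 28 x 1 image:     {mnist_inputs:,} inputs, {mnist_weights:,} weights")
print(f"1000 x 1000 x 3 image: {color_inputs:,} inputs, {color_weights:,} weights")
```

The small picture needs 784,000 weights in the first layer, while the color
image needs 3,000,000,000.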
So there is a need for a neural network that can process large images and
videos efficiently.
Now we move towards the structure of the Convolutional Neural Network (CNN) and
the idea behind it.
What is the first thing that helps us recognize the object in front of our
eyes? The design of neural networks took inspiration from the working mechanism
of the human brain. But remember that we can only make assumptions about this
mechanism; we do not know exactly what goes on inside the human brain.
As mentioned above, the neural network was designed by imagining how the human
brain works. In the brain there is a part called the visual cortex. The visual
cortex is the primary cortical region of the brain that receives, integrates,
and processes visual information relayed from the retinas.
The structure of the Convolutional Neural Network (CNN) was designed with the
working mechanism of the visual cortex in mind.
There is a proverb: “The eyes cannot see what the mind does not know”.
The eyes are useless when the mind is blind. This means that blind belief in
something, without logic, prevents our eyes from seeing the truth even when it
is blatantly obvious and right in front of us.
Suppose a person who has never seen a bottle before sees one for the first time
in his life. His mind will try to judge this bottle layer by layer: its
dimensions, its size, its color, its purpose and so on. Our mind works through
stages such as V1, V2, V3 and so on, and on the basis of the results of these
stages it reaches a final conclusion about the object's dimensions, size,
color, usage etc., and about the object itself.
The mind starts with a simple impression, makes it more complex by adding
different features to it, and finally reaches the conclusion of recognizing the
image.
This is the pattern on which the basic structure of the CNN has been developed.
So we apply filters over an image to recognize it. But the number of filters is
not fixed; it is a design choice and can vary from task to task.
Now let us discuss how a computer sees an image. For a computer, an image is a
matrix of numbers. A picture of a cat taken in daylight would have a different
matrix of numbers compared to a picture taken at night.
So before 2012 it was very difficult for a computer to recognize an image just
from numbers.
When we represent an image, we use three primary colors: Red, Green and Blue.
The screen of your computer is a matrix made up of separate matrices of Red,
Green and Blue. The intensity of each color can differ as required. A color
image is formed by blending these three primary colors, and our eye perceives
the blend as a color image. It is the ratio in which the three primary colors
are blended that produces different colors like yellow, pink, magenta, cyan
etc. Using the same technique, we store colors in a computer, where they are
represented by different ratios of numbers.
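To make this concrete, here is a tiny color image written out as numbers in
Python. It is only a sketch: the 2 x 2 image and the helper name are invented
for illustration, and the intensity range of 0 to 255 is the usual 8-bit
convention, assumed here.

```python
# A tiny 2 x 2 color image stored as rows of (red, green, blue) pixels.
# Assumption: each intensity runs from 0 (off) to 255 (full), the 8-bit range.
image = [
    [(255, 0, 0), (0, 255, 0)],      # a red pixel,  a green pixel
    [(0, 0, 255), (255, 255, 0)],    # a blue pixel, a yellow pixel (red + green)
]

def split_channels(img):
    """Return three matrices: one each for red, green and blue."""
    return [[[pixel[c] for pixel in row] for row in img] for c in range(3)]

red, green, blue = split_channels(image)
print("red  :", red)
print("green:", green)
print("blue :", blue)

# Flattening gives the long vector that a fully connected network would receive.
flat = [value for row in image for pixel in row for value in pixel]
print("values in the image:", len(flat))   # 2 x 2 x 3 = 12
```

Yellow appears here as full red plus full green with no blue, which is the
blending by ratio described above.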
So the question arises: how do computers recognize an image? As shown in the
image below, the human eye recognizes an image through its features. We have to
decide how to give these features to the computer so that it can recognize the
object.
The convolution in Convolutional Neural Network (CNN) is a mathematical
operation, like addition, subtraction, multiplication and division. If it is
just a math operation, why was the Convolutional Neural Network designed around
it rather than around any other math operation? Convolution must have some
important property that makes it work efficiently. In fact, before 2012, much
of classical image processing was done through the convolution operation.
Suppose we have a black and white image. There are no Red, Green and Blue
channels in a black and white image; there are only black, white and the shades
of gray between them. So only one matrix is required.
Now look at the two matrices in the above diagram. One is large and the other
is small. A yellow shape is sliding over the large matrix. The sliding shape is
called the kernel or filter. As the kernel slides, the numbers in the small
matrix appear accordingly. It looks as if someone is extracting numbers and
storing them in a basket.
The green matrix is the image, and the yellow window gliding over it is called
the kernel or filter. We will use the word filter for the rest of the lecture.
When we placed the filter on the top-left portion of the image, it produced the
number 4 in the top-left cell of the small matrix. As we glide the filter over
the green image, it performs a calculation at each position (it multiplies each
filter value by the image value underneath it and adds up the results) and
writes the answer into the corresponding cell of the small matrix. The process
continues, producing a new number in a cell each time we move the filter. At
the end we have a complete matrix. This operation is called convolution.
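The sliding calculation can be sketched in plain Python as below. The 5 x 5
image and the 3 x 3 filter are assumed numbers, chosen so that the top-left
answer is 4 as in the lecture; they are not necessarily the ones in the
diagram. The function moves the filter one cell at a time with no padding, and,
as is usual in CNNs, it does not flip the filter.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image and return the feature map."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            # Multiply each filter value by the image value under it, then add.
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Assumed example values (the green matrix and the yellow filter).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

A 5 x 5 image and a 3 x 3 filter give a 3 x 3 result, because the filter fits
in only three positions along each side.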
Now the question arises why it is so important. The numbers in the kernel are
very important, and people have done PhDs on finding these numbers.
Remember that the size of the kernel can vary; it is not always 3 x 3. It can
be 2 x 2, 4 x 4 etc., depending upon the situation.
The resulting magenta matrix is called the feature map. So the output produced
by applying the filter to the image is called the feature map.
Let us explain it with the effect of applying a filter to an image. What
happens when we apply a filter to an image in Photoshop? The original image may
become brighter, dimmer or sharper, its lines may become more dominant, it may
look engraved, and so on. The same is the case when applying a filter here.
The numbers in the filter or kernel are of much importance. Changing these
numbers changes the result the filter produces.
Look at the picture given above. What would happen if, instead of standing, the
dog changes its position and sits on the ground?
The answer is that if a filter can detect a feature, such as a face, the
location of that feature does not matter. It makes no difference whether it is
at the top left or the bottom right; the values of the face remain the same. So
the first point is that, as the filter is applied over the whole image, it will
detect the feature, and the second point is that the feature will show up on
the feature map irrespective of its position in the image.
For example, a poster has been pasted on your wall. The poster has multiple
scattered pictures of dogs, cats, birds and buildings on it. Your eyes will
detect these pictures irrespective of their position. So it does not matter
whether the dog is standing or sitting; the filter will detect it, although
there would be a little change in the number values.
Remember that we always start applying the filter from the top left.
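We can check the point about position with a small Python sketch. Everything in
it is assumed for illustration: a plus-shaped pattern stands in for the face,
the same pattern is used as the filter, and the helper names are made up. The
pattern is placed once at the top left and once at the bottom right of an empty
6 x 6 image.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (stride 1, no padding)."""
    k = len(kernel)
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(len(image[0]) - k + 1)]
            for r in range(len(image) - k + 1)]

def place(pattern, size, top, left):
    """Draw the pattern on an empty size x size image at (top, left)."""
    img = [[0] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            img[top + i][left + j] = value
    return img

def strongest_response(feature_map):
    """Return the largest value in the feature map and where it occurs."""
    best = max(max(row) for row in feature_map)
    for r, row in enumerate(feature_map):
        for c, value in enumerate(row):
            if value == best:
                return best, (r, c)

plus = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]

top_left = strongest_response(convolve2d(place(plus, 6, 0, 0), plus))
bottom_right = strongest_response(convolve2d(place(plus, 6, 3, 3), plus))
print("pattern at top left    :", top_left)       # (5, (0, 0))
print("pattern at bottom right:", bottom_right)   # (5, (3, 3))
```

The strongest response is 5 in both cases. Only its place on the feature map
changes, following the pattern wherever it moves.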
Another advantage of the filter is that its values are not multiplied with all
the values of the image at once, as was the case in the fully connected neural
network, where the first layer alone needed 3 billion weights. The same small
set of filter values is reused at every position of the image.
In the past, filters were named after the scientists who invented them after a
lot of research. For example, the Sobel filter is named after Irwin Sobel and
Gary M. Feldman.
The question arises here whether we also have to do a lot of research to
develop filters. The answer is no. The Convolutional Neural Network has changed
the situation: the filters are learned by the Convolutional Neural Network.
The convolutional network takes the image as input, produces feature maps as
output, and trains itself through backpropagation, adjusting the filter values
according to what each filter produces when it is applied. This is what the
convolutional neural network learns: it learns the filters. When these learned
filters are applied to images, the computer classifies the images by extracting
the useful information from them.
Here is a 6 x 6 image. We have a filter that is used to detect vertical lines.
When the filter is applied, the output shows two vertical grey lines: the
values are zero at the edges and 30 in the middle.
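The sketch below reproduces this kind of result in Python. The pixel values are
assumed: a 6 x 6 image whose left half is bright (10) and right half is dark
(0), and a common 3 x 3 vertical-edge filter. They were chosen because they
give zeros at the edges and 30 in the middle; the lecture's own figure may use
different numbers.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (stride 1, no padding)."""
    k = len(kernel)
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(len(image[0]) - k + 1)]
            for r in range(len(image) - k + 1)]

# Assumed image: bright left half, dark right half, so there is one vertical
# edge running down the middle.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

# Assumed filter: positive on the left, negative on the right. It responds
# where brightness changes from left to right.
vertical_filter = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = convolve2d(image, vertical_filter)
for row in feature_map:
    print(row)
# [0, 30, 30, 0]   (printed four times, once per row)
```

Where the filter sits on a flat area the positive and negative columns cancel
to zero. Where it straddles the edge they do not cancel, which gives the 30s.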
We apply minimum and maximum limits here: the minimum value is set to 0 and the
maximum to 255. The filter is applied to one part of the image at a time, and
after many training passes it is able to detect the pattern. At this stage our
kernel or filter has been trained. The output of training our CNN is the
trained kernel, and this is our main contribution.
We have given an input image of a cat to the convolutional network. The
convolutional network has to tell you whether it is a cat or not a cat.
The convolutional model convolves the image and turns it into a feature map,
which is passed to the next layer. But the feature map may still be large and
contain a lot of redundant information. So we reduce the size of the feature
map after the convolution stage: we cut each dimension in half, that is, we
perform dimensionality reduction. The process of reducing the size of the
feature map is called pooling or subsampling.
There are two main methods for applying dimensionality reduction; one is more
popular than the other. The popular method is called max pooling. Max pooling
means that we pick the largest number out of each group of four numbers.
TYPES OF POOLING
You can see we have reduced the 4 x 4 feature map to 2 x 2.
· Max pooling picks the largest of the four numbers. It is the most popular
method of dimensionality reduction.
· Average pooling calculates the average of the four numbers. It is the second
most preferred method after max pooling.
· Sum pooling adds up the four numbers.
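The three types of pooling can be sketched in Python as follows. The 4 x 4
feature map contains made-up numbers, and the function name is invented for
illustration. The window is 2 x 2 and the windows do not overlap, so each group
of four numbers becomes one number.

```python
def pool2x2(feature_map, reducer):
    """Apply reducer to each non-overlapping 2 x 2 block of the feature map."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            block = [feature_map[r][c], feature_map[r][c + 1],
                     feature_map[r + 1][c], feature_map[r + 1][c + 1]]
            row.append(reducer(block))
        pooled.append(row)
    return pooled

# Assumed 4 x 4 feature map.
feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [4, 6, 3, 5],
]

max_pooled = pool2x2(feature_map, max)
avg_pooled = pool2x2(feature_map, lambda block: sum(block) / len(block))
sum_pooled = pool2x2(feature_map, sum)

print("max    :", max_pooled)   # [[7, 8], [9, 5]]
print("average:", avg_pooled)   # [[4.0, 5.0], [5.25, 2.25]]
print("sum    :", sum_pooled)   # [[16, 20], [21, 9]]
```

In every case the 4 x 4 feature map shrinks to 2 x 2; only the rule for
combining the four numbers differs.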
Now here is the CNN in its full and final form:
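The stages covered in this lecture can also be chained together in code. The
following is only a rough sketch in plain Python under several assumptions: one
convolution filter, a ReLU activation, 2 x 2 max pooling, one fully connected
layer and a softmax at the end. The weights are hand-picked rather than
trained, and the image and the two class names are invented for illustration. A
real CNN would have many filters and would learn all of these numbers.

```python
import math

def convolve2d(image, kernel):
    """Slide a square kernel over the image (stride 1, no padding)."""
    k = len(kernel)
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(len(image[0]) - k + 1)]
            for r in range(len(image) - k + 1)]

def relu(feature_map):
    """Replace negative values with zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool2x2(feature_map):
    """Keep the largest value of each non-overlapping 2 x 2 block."""
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, len(feature_map[0]), 2)]
            for r in range(0, len(feature_map), 2)]

def dense(values, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output."""
    return [sum(v * w for v, w in zip(values, ws)) + b
            for ws, b in zip(weights, biases)]

def softmax(scores):
    """Turn scores into probabilities that add up to 1."""
    shift = max(scores)                      # keeps exp() from overflowing
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed input and hand-picked (untrained) numbers.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
vertical_filter = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
dense_weights = [[0.01] * 4, [-0.01] * 4]    # one row per class
dense_biases = [0.0, 0.0]
class_names = ["edge present", "no edge"]    # hypothetical labels

feature_map = relu(convolve2d(image, vertical_filter))   # 6 x 6 -> 4 x 4
pooled = max_pool2x2(feature_map)                        # 4 x 4 -> 2 x 2
flat = [v for row in pooled for v in row]                # 2 x 2 -> 4 values
probs = softmax(dense(flat, dense_weights, dense_biases))

for name, p in zip(class_names, probs):
    print(f"{name}: {p:.3f}")
```

The image shrinks at each stage, and the last step gives one probability per
class.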
The batch size is the number of samples processed before the model is updated.
The number of epochs is the number of complete passes through the training
dataset. The size of a batch must be at least one and at most the number of
samples in the training dataset.
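A small calculation shows how these two numbers work together. The dataset size
of 60,000, the batch size of 32 and the 10 epochs are assumed values for
illustration, and the function name is made up.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of model updates in one complete pass through the data."""
    if not 1 <= batch_size <= num_samples:
        raise ValueError("batch size must be between 1 and the number of samples")
    # The last batch may be smaller than the rest, so we round up.
    return math.ceil(num_samples / batch_size)

num_samples = 60000   # assumed size of the training dataset
batch_size = 32       # assumed
epochs = 10           # assumed

per_epoch = updates_per_epoch(num_samples, batch_size)
total_updates = per_epoch * epochs
print("updates in one epoch:", per_epoch)       # 1875
print("updates in total    :", total_updates)   # 18750
```

A smaller batch size means more updates in each epoch, and more epochs mean
more passes over the same data.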
The activation function decides whether a neuron should be activated or not.
The neuron calculates the weighted sum of its inputs and adds a bias to it, and
the activation function is applied to the result. The purpose of the activation
function is to introduce non-linearity into the output of a neuron.
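Here is a sketch of a single neuron in Python. The inputs, weights and bias are
made-up numbers. ReLU and sigmoid are used as two common examples of activation
functions; the lecture does not say which one is used.

```python
import math

def weighted_sum(inputs, weights, bias):
    """Multiply each input by its weight, add them up, then add the bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def relu(z):
    """Pass positive values through unchanged and turn negative values into 0."""
    return max(0.0, z)

def sigmoid(z):
    """Squash any value into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Assumed example values.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]
bias = 0.5

z = weighted_sum(inputs, weights, bias)
print("weighted sum plus bias:", z)        # -0.25
print("after ReLU            :", relu(z))  # 0.0, the neuron stays off
print("after sigmoid         :", round(sigmoid(z), 4))
```

Without the activation step the neuron would only ever produce a straight-line
(linear) function of its inputs.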
In order for gradient descent to work, we must set the learning rate to an
appropriate value. This parameter determines how fast or slow we move towards
the optimal weights. If the learning rate is very large, we will skip over the
optimal solution. If it is too small, we will need too many iterations to
converge.
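The effect of the learning rate can be seen on a toy problem. In this sketch we
assume a single weight w and a made-up loss (w - 3)^2, whose best value is
w = 3. The three learning rates are chosen only to show the three behaviors.

```python
def gradient(w):
    """Slope of the toy loss (w - 3)^2 at the point w."""
    return 2 * (w - 3)

def gradient_descent(learning_rate, steps=20, start=0.0):
    """Repeatedly step against the gradient and return the final weight."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

w_good = gradient_descent(0.1)     # moves steadily towards 3
w_small = gradient_descent(0.001)  # moves in the right direction, very slowly
w_large = gradient_descent(1.1)    # jumps over 3 and gets further away each step

print("learning rate 0.1  :", round(w_good, 3))
print("learning rate 0.001:", round(w_small, 3))
print("learning rate 1.1  :", round(w_large, 3))
```

After 20 steps the first run is close to 3, the second has barely moved from
its starting point, and the third has ended up far away from the optimal value.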
Softmax assigns decimal probabilities to each class in a multi-class problem.
Those decimal probabilities must add up to 1.0. This additional constraint
helps training converge more quickly than it otherwise would. Softmax is
implemented through a neural network layer just before the output layer.
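Softmax itself is a short calculation, sketched below in Python. The three
scores and the class names are made up for illustration. Subtracting the
largest score before taking the exponential is a common trick to avoid very
large numbers; it does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of scores into probabilities that add up to 1."""
    shift = max(scores)                          # for numerical safety
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed scores from the last layer for three hypothetical classes.
class_names = ["cat", "dog", "bird"]
scores = [2.0, 1.0, 0.1]

probs = softmax(scores)
for name, p in zip(class_names, probs):
    print(f"{name}: {p:.3f}")
print("total:", round(sum(probs), 6))
```

The class with the highest score gets the highest probability, and the
probabilities always add up to 1.0.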