AI Free Basic Course | Lecture 24 | Basics of Convolutional Neural Network (CNN)

In the last lecture, the importance and value of neural networks were discussed. We are targeting high-value projects worth more than 30,000/-. You are advised not to jump directly into these projects. You need to set goals and move systematically towards them. You have to decide where you are going to start. It can be prompt engineering or any other field. To achieve your goals, allocate a specific portion of your day to working on them. People from non-technical backgrounds have asked how to move forward. My response to them is that you need not understand everything.

We must know the difference between the functionality of an Artificial Neural Network (ANN) and a Convolutional Neural Network (CNN). We must understand why tasks that could not be performed by an ANN can be performed by a CNN. For example, when the task is to read an image or recognize an object in it, a CNN is the best choice. Compared with an ANN, a CNN is faster, consumes less computational power, and is more effective and more accurate in certain cases. CNN is a part of deep learning.

Suppose a client asks us to use a CCTV camera to identify an object, or to confirm whether it is still in place or has been stolen.

The conventional response is to watch the footage on the DVR and inform the client about the situation. However, CCTV cameras are now available on the market that send the client a live notification that an item is being stolen right now. How does this work? The camera scans the face of the unknown person and matches it against the stored images of family members. If it does not match, a push notification is sent to the client through an app. All this has become possible due to Convolutional Neural Network (CNN).

So the limitations of Artificial Neural Network (ANN) have been overcome by Convolutional Neural Network (CNN).

In the last lecture, we studied standard neural networks, which we called fully connected neural networks. We gave such a network the MNIST data, which consists of pictures, and its accuracy was 90%. So the question arises: when neural networks already produce an excellent accuracy of 90%, why is there a need to opt for another form of neural network? Let us discuss why we need it.

The MNIST data consists of small pictures of 28 x 28 pixels, that is, input vectors of 784 values. What if the input is a color image of size 1000 x 1000? (That is not even a very big image.) As you know, a color image carries information about red, green and blue. So the vector for the color image would have (1000 x 1000) x 3 = 3,000,000 values. By increasing the resolution a little, we already get an input of 3 million values.

Suppose there are 1000 neurons in the first layer. As we know, in a fully connected network every input is connected to every neuron of that layer. An input of 3 million values given to a first layer of 1000 neurons results in 3 billion weights. This is only the first layer, and we know that a fully connected network can have many layers. A fully connected neural network can still work on an image of such high resolution, but the computational cost would be very high. So the fully connected network becomes very expensive.
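
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and bias terms are ignored.

```python
# Count the first-layer weights of a fully connected network.
# Sizes are the ones used in the lecture; biases are ignored.

def input_size(height, width, channels):
    """Number of values in the flattened image vector."""
    return height * width * channels

def dense_weights(n_inputs, n_neurons):
    """Every input connects to every neuron, so the counts multiply."""
    return n_inputs * n_neurons

mnist_inputs = input_size(28, 28, 1)        # one grayscale MNIST picture
color_inputs = input_size(1000, 1000, 3)    # 1000 x 1000 color image

mnist_weights = dense_weights(mnist_inputs, 1000)
color_weights = dense_weights(color_inputs, 1000)

print("MNIST inputs:", mnist_inputs, "weights:", mnist_weights)
print("Color inputs:", color_inputs, "weights:", color_weights)
```

The jump from 784 inputs to 3 million inputs is what pushes the first layer to 3 billion weights.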

So there is a need for a neural network that can process large images and videos efficiently.

Now we move on to the structure of the Convolutional Neural Network (CNN) and the idea behind it.

What is the first thing that helps us recognize the object in front of our eyes? The design of neural networks took its inspiration from the working mechanism of the human brain. But remember, we can only make assumptions about how the human brain works; we cannot know exactly what goes on behind it.

As stated above, the neural network was designed by imagining the working of the human brain. In the brain there is a part called the visual cortex. The visual cortex is the primary cortical region of the brain that receives, integrates, and processes visual information relayed from the retinas.

The structure of the Convolutional Neural Network (CNN) was designed with the working mechanism of the visual cortex in mind.

There is a proverb: “The eyes cannot see what the mind does not know.”

“The eyes are useless when the mind is blind” means that blind belief in something, without logic, prevents our eyes from seeing the truth even when it is blatantly obvious and right in front of us.


Suppose a person who has never seen a bottle before sees one for the first time. The mind of that person will try to judge the bottle layer by layer: its dimensions, its size, its color, its purpose and so on. The mind works through folds (stages) such as V1, V2, V3 and so on, and on the basis of the results of these folds it reaches a final conclusion about the object itself.

The mind starts with a simple reflection, makes it more complex by adding different features to it, and finally reaches the conclusion that recognizes the image.

This is the pattern upon which the basic structure of CNN has been developed.

So we apply filters over an image to recognize it. The number of filters is not fixed and can vary from image to image.

Now let us discuss how a computer sees an image. For a computer, an image is a matrix of numbers. A picture of a cat taken in daylight will have a different matrix of numbers from a picture of the same cat taken at night.

So before 2012 it was very difficult for a computer to recognize an image just from numbers.

When we represent an image, we use three primary colors: red, green and blue. The screen of your computer is a matrix made up of separate matrices of red, green and blue. The intensity of each color can differ as required. A color image is formed by blending these three primary colors, and our eyes recognize it as a color image. It is the ratio in which the three primary colors are blended that produces different colors like yellow, pink, magenta, cyan, etc. We store colors in the computer using a similar technique: each color is represented by a different ratio of numbers.
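
As a small illustration of "an image is a matrix of numbers", here is a sketch in Python with NumPy. The 2 x 2 image and the 0 to 255 intensity range (the usual 8-bit convention) are assumptions chosen for the example, not values from the lecture.

```python
import numpy as np

# A tiny 2 x 2 color image: shape is (height, width, 3 color channels).
# Each number is an intensity from 0 (off) to 255 (full).
image = np.array([
    [[255,   0,   0], [  0, 255,   0]],   # red pixel,  green pixel
    [[  0,   0, 255], [255, 255,   0]],   # blue pixel, yellow = red + green
], dtype=np.uint8)

# One matrix per primary color.
red_channel = image[:, :, 0]
green_channel = image[:, :, 1]
blue_channel = image[:, :, 2]

# The same picture in dimmer light: every intensity is halved, so the
# numbers change even though the content of the picture does not.
darker = image // 2

print(image.shape)
print(red_channel)
print(darker[1, 1])
```

This is why a daylight picture and a night picture of the same cat look like different matrices to the computer.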

So the question arises: how do computers recognize an image? As shown in the image below, the human eye recognizes an image through its features. We have to decide how these features can be given to the computer so that it can recognize the object.

The convolution in Convolutional Neural Network (CNN) is a mathematical operation, like addition, subtraction, multiplication and division. If it is just a math operation, why did we design a Convolutional Neural Network instead of simply using the operation itself? The Convolutional Neural Network must have some important feature that makes it work efficiently. In fact, before 2012, all classical image processing was done through the convolution operation.

Suppose we have a black and white image. A black and white image has no separate red, green and blue layers. There are only two colors, black and white, so only one matrix is required.

Now look at the two matrices in the above diagram. One is large and the other is small. A yellow shape is sliding over the large matrix. The sliding shape is called a kernel or filter. As the kernel slides, the figures in the small matrix appear accordingly. It looks as if someone is deriving figures and storing them in a basket.

The green matrix is the image, and the yellow window gliding over it is called the kernel or filter. We will use the word filter for the rest of the lecture.

When we place the filter on the first portion of the image, it produces the figure 4 in the top left cell of the small matrix. As we glide the filter over the green image, it performs a calculation at each position and writes the result into a cell of the small matrix. The process continues, producing a new figure in a new cell each time the filter moves. At the end we have a complete matrix. This operation is called convolution.
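
The sliding calculation can be sketched in Python with NumPy. The diagram itself is not reproduced in this text, so the 5 x 5 image and 3 x 3 filter below are assumed values, chosen so that the top left cell of the result is 4 as described. The function name `convolve2d` is made up for this example. Like most CNN libraries, it slides the filter without flipping it.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, step of 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]           # part under the filter
            feature_map[i, j] = np.sum(window * kernel)  # multiply, then add up
    return feature_map

# Assumed example values (the original diagram is not available here).
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = convolve2d(image, kernel)
print(feature_map)
```

Each cell of the result is the sum of the element-by-element products of the filter and the part of the image it currently covers.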

Now the question arises: why is this so important? The figures in the kernel are very important, and people have done PhDs on finding these figures.

Remember that the size of the kernel can vary; it is not always 3 x 3. It can be 2 x 2, 4 x 4, etc., depending upon the situation.

The magenta matrix produced as a result is called the feature map. So the answer produced by applying the filter to the image is called the feature map.

Let us explain this through the effect of applying a filter to an image. What happens when we apply a filter to an image in Photoshop? The original image may become more prominent, dimmer or sharper, its lines may become dominant, it may look engraved, and so on. The same happens when we apply a filter here.

The numbers in the filter or kernel are of great importance. Changing the numbers in the filter produces a different result.

Look at the picture given above. What would happen if, instead of standing, the dog changed its position and sat on the ground?

The answer is that if a filter can detect a feature, the location of that feature does not matter. It makes no difference whether the face is at the top left or the bottom right; the values of the face remain the same. The first point is that, since the filter is applied over the whole image, it will detect the feature. The second point is that the feature will show up on the feature map irrespective of its position in the image.

For example, suppose a poster has been pasted on your wall. The poster has scattered pictures of dogs, cats, birds and buildings on it. Your eyes will detect these pictures irrespective of their position. So it does not matter whether the dog is standing or sitting; the filter will detect it, although there will be a small change in the number values.

Remember, we always start applying the filter from the top left.
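
The position argument can be sketched in Python with NumPy. The small pattern, the 6 x 6 images and the helper `convolve2d` are all made up for illustration. The sketch covers only a shift in position: the same pattern is placed in two corners, and the filter responds equally strongly in both. It does not cover a change of pose, which does change the values.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, step of 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical "object" and a filter tuned to the same pattern.
pattern = np.array([[1, 0, 1],
                    [0, 1, 0],
                    [1, 0, 1]])
kernel = pattern.copy()

# The same object placed in two different corners of an empty image.
top_left = np.zeros((6, 6), dtype=int)
top_left[0:3, 0:3] = pattern
bottom_right = np.zeros((6, 6), dtype=int)
bottom_right[3:6, 3:6] = pattern

map_a = convolve2d(top_left, kernel)
map_b = convolve2d(bottom_right, kernel)

# Where is the strongest response on each feature map?
pos_a = tuple(int(v) for v in np.unravel_index(np.argmax(map_a), map_a.shape))
pos_b = tuple(int(v) for v in np.unravel_index(np.argmax(map_b), map_b.shape))

print(map_a.max(), pos_a)
print(map_b.max(), pos_b)
```

The peak value is the same in both cases; only its location on the feature map moves with the object.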

Another advantage of the filter is that its values are not multiplied with all the values of the image at once, as happened in the fully connected neural network, where the first layer alone needed 3 billion weights.

In a convolutional network it does not matter whether the image is 50 x 50 or 1000 x 1000. At any given moment the processing covers only an area the size of the kernel, although gliding the kernel over a larger image takes longer. As a result the system is not overwhelmed. This is the advantage of the convolution operation.
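
A quick count makes the point concrete. This is a sketch assuming a 3 x 3 filter moved one pixel at a time with no padding; the function names are made up for the example.

```python
def filter_weights(kernel_h, kernel_w):
    """Number of values in one filter; it does not depend on the image."""
    return kernel_h * kernel_w

def filter_positions(img_h, img_w, kernel_h, kernel_w):
    """How many places the filter visits while gliding over the image."""
    return (img_h - kernel_h + 1) * (img_w - kernel_w + 1)

weights = filter_weights(3, 3)
small_positions = filter_positions(50, 50, 3, 3)
large_positions = filter_positions(1000, 1000, 3, 3)

print("weights in the filter:", weights)
print("positions on a 50 x 50 image:", small_positions)
print("positions on a 1000 x 1000 image:", large_positions)
```

The larger image needs more gliding steps, but the number of weights to store and learn stays at 9.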

In the past, filters were named after the scientists who invented them after a lot of research. For example, the Sobel filter is named after Irwin Sobel and Gary Feldman.

The question here is whether we also have to do a lot of research to develop filters. The answer is no. The Convolutional Neural Network has changed this. We learn the filters from the Convolutional Neural Network.

The convolutional network takes the image as input and the feature map as output, trains itself through backpropagation, and works out what is generated when a given filter is applied. This is how a convolutional neural network learns. Once the learned filters are applied to images, the computer can classify them by extracting the useful information from the image.
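
The idea of learning a filter from an input image and a desired feature map can be sketched with plain gradient descent in NumPy. This is a simplified illustration, not how a full CNN is trained: there is only one filter, one random training image and no other layers, and the names `extract_patches`, `true_kernel` and `learned` are made up for the example.

```python
import numpy as np

def extract_patches(image, k):
    """Return every k x k window of the image as one flattened row."""
    h, w = image.shape
    rows = [image[i:i + k, j:j + k].ravel()
            for i in range(h - k + 1)
            for j in range(w - k + 1)]
    return np.array(rows)

rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12))          # hypothetical training image
true_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])     # filter we pretend not to know

patches = extract_patches(image, 3)            # shape (100, 9)
target = patches @ true_kernel.ravel()         # desired feature map, flattened

learned = np.zeros(9)                          # start from an all-zero filter
learning_rate = 0.05
losses = []
for step in range(2000):
    error = patches @ learned - target         # prediction minus target
    losses.append(float(np.mean(error ** 2)))  # mean squared error
    gradient = 2 * patches.T @ error / len(target)
    learned -= learning_rate * gradient        # move against the gradient

learned_kernel = learned.reshape(3, 3)
print(np.round(learned_kernel, 3))
```

Starting from zeros, the filter values move step by step towards the ones that reproduce the target feature map, which is the same idea backpropagation applies across all the layers of a CNN.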

Here is a 6 x 6 image. We have a filter that is used to detect vertical lines. When the filter is applied, its output shows the detection as two vertical grey lines: zero on the edges and 30 in the middle.
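
A sketch of this example in Python with NumPy follows. The figure is not reproduced in this text, so the pixel values are assumptions: a 6 x 6 image whose left half is 10 and right half is 0, and a commonly used 3 x 3 vertical edge filter. They are chosen because they give zero on the edges and 30 in the middle, as described.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, step of 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Assumed image: bright left half, dark right half (a vertical edge).
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)

# Assumed vertical edge filter: left column minus right column.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

feature_map = convolve2d(image, kernel)
print(feature_map)
```

The filter gives 0 where both sides of the window look the same and 30 where the window straddles the change from bright to dark.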

We apply minimum and maximum functions here: the minimum is set to zero and the maximum to 255. The filter is applied to one part of the image at a time, and after multiple attempts it detects the picture. At this stage our kernel or filter has been trained. The output of our CNN is the trained kernel, and this is our main contribution.

Suppose we give an image of a cat as input to the convolutional network. The network has to tell you whether it is a cat or not a cat.

The convolutional model convolves the image and turns it into a feature map, which is passed to the next layer. But the feature map can still be large and carry a lot of color and redundant information. So we reduce the size of the feature map after the convolution stage: we cut it in half, that is, we perform dimensionality reduction. The process of reducing the size of the feature map is called pooling or subsampling.

There are a few methods of dimensionality reduction, some more popular than others. The most popular one is called max pooling. Max pooling means we pick the largest of four numbers.


You can see we have reduced the 4 x 4 feature map into 2 x 2.

· Max pooling picks the largest of the four numbers. It is the most popular method of dimensionality reduction.

· Average pooling calculates the average of the four numbers. It is the second most preferred method after max pooling.

· Sum pooling adds up the four numbers.
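
The three pooling methods can be sketched in Python with NumPy. The 4 x 4 feature map holds made-up values, and `pool2x2` is a hypothetical helper that assumes the height and width are even.

```python
import numpy as np

def pool2x2(feature_map, reducer):
    """Reduce each non-overlapping 2 x 2 block to one number."""
    h, w = feature_map.shape
    # Split into blocks: axes 1 and 3 index the cells inside each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return reducer(blocks, axis=(1, 3))

# Made-up 4 x 4 feature map.
feature_map = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
])

max_pooled = pool2x2(feature_map, np.max)    # largest of each four numbers
avg_pooled = pool2x2(feature_map, np.mean)   # average of each four numbers
sum_pooled = pool2x2(feature_map, np.sum)    # sum of each four numbers

print(max_pooled)
print(avg_pooled)
print(sum_pooled)
```

In every case the 4 x 4 feature map shrinks to 2 x 2; only the rule for combining the four numbers differs.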

Now here is the CNN in its full and final form:

The batch size is the number of samples processed before the model is updated. The number of epochs is the number of complete passes through the training dataset. The batch size must be greater than or equal to one and less than or equal to the number of samples in the training dataset.
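
The relation between batch size, epochs and the number of model updates can be sketched as follows. The figures (60,000 samples, batch size 32, 10 epochs) are hypothetical, and the function names are made up for the example.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """One update per batch; the last batch may be smaller."""
    if not 1 <= batch_size <= n_samples:
        raise ValueError("batch size must be between 1 and the number of samples")
    return math.ceil(n_samples / batch_size)

def total_updates(n_samples, batch_size, epochs):
    """Each epoch is one complete pass through all the samples."""
    return updates_per_epoch(n_samples, batch_size) * epochs

per_epoch = updates_per_epoch(60000, 32)
overall = total_updates(60000, 32, 10)

print("updates in one epoch:", per_epoch)
print("updates in 10 epochs:", overall)
```

A smaller batch size means more updates per epoch; a batch size equal to the whole dataset means exactly one update per epoch.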

The activation function decides whether a neuron should be activated or not by calculating the weighted sum and further adding bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron.
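
A small sketch of a single neuron in Python with NumPy: the weighted sum plus bias is computed first, then passed through an activation function. ReLU and sigmoid are used here as two common choices; the lecture text does not name a specific function, and the input, weight and bias values are made up.

```python
import numpy as np

def relu(z):
    """Pass positive values through, turn negative values into 0."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squash any value into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up inputs, weights and bias for a single neuron.
inputs = np.array([1.0, 2.0, 3.0])
weights = np.array([0.5, -1.0, 0.25])
bias = 0.5

z = float(inputs @ weights + bias)   # weighted sum plus bias

relu_out = float(relu(z))
sigmoid_out = float(sigmoid(z))

print("weighted sum:", z)
print("ReLU output:", relu_out)
print("sigmoid output:", sigmoid_out)
```

Without such a function, stacking layers would only ever produce another weighted sum, which is why the non-linearity matters.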

In order for Gradient Descent to work, we must set the learning rate to an appropriate value. This parameter determines how fast or slow we will move towards the optimal weights. If the learning rate is very large we will skip the optimal solution.
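
The effect of the learning rate can be seen on a toy problem. This sketch minimizes f(w) = w squared, whose best value is w = 0; the function and the three learning rates are chosen only for illustration.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(w) = w**2; its gradient is 2*w and the optimum is w = 0."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w   # step against the gradient
    return w

too_small = gradient_descent(0.01)   # moves towards 0, but very slowly
reasonable = gradient_descent(0.1)   # gets close to 0 within 20 steps
too_large = gradient_descent(1.1)    # jumps over 0 and moves further away

print("learning rate 0.01:", too_small)
print("learning rate 0.1: ", reasonable)
print("learning rate 1.1: ", too_large)
```

With the large learning rate every step overshoots the optimum by more than the previous one, so the weight ends up further away than where it started.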

Softmax assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would. Softmax is implemented through a neural network layer just before the output layer.
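
A minimal softmax sketch in Python with NumPy is given below. The three scores are made-up values for three hypothetical classes. Subtracting the largest score before exponentiating is a common numerical safeguard and does not change the result.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that add up to 1."""
    shifted = scores - np.max(scores)   # avoids very large exponentials
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Made-up raw scores for three classes.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)

print(probs)
print("sum:", probs.sum())
print("predicted class:", int(np.argmax(probs)))
```

The class with the highest score receives the highest probability, and the probabilities always add up to 1.0.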

