With all the hustle and bustle around GPT-3, ChatGPT and recent results in image recognition, I realized that it’s about time to learn something in this realm so as not to fall too far behind. Hence, this time I’m going to present the most basic Convolutional Neural Network I could imagine that still exhibits some interesting traits and can serve as a reasonable ‘Hello World’ for everyone who would like to learn the basics.
This is what is going to happen:
- We will lay out the problem
- Then we’re going to present and analyze PyTorch-based CNN code that solves this problem
- Finally, you’ll have a chance to solve this problem on a sheet of paper to understand what our Convolutional Neural Network actually does
Disclaimer: I am no expert in the realm of machine learning or neural networks. This is just me having some fun with this incredibly interesting topic; follow along if you want.
Disclaimer 2: I will be using the term ‘neural network’ even though we actually train only a single layer. This might be misleading, but I want you to be focused on the process and the fun, not the names.
The case
Our neural network will be dealing with image recognition on a 3×3-pixel image. We’re going to train this network to recognize two types of crosses, namely:
- Type 0: a ‘plus’-shaped cross (the middle row and middle column set to 1)
- Type 1: a diagonal, ‘X’-shaped cross (the corners and the center set to 1)
The code
I will print the whole code here as a single block with comments, to make it clear that I don’t have anything besides this code – nothing is hidden and nothing else is required to run this network. All you have to do is paste the following code into a Python script, have PyTorch installed, and run it.
Pay attention to the evaluation part, where the input image will be purposefully missing one piece to verify network performance.
import random

import torch


def generate_type_0_image():
    return torch.tensor([[[0, 1, 0],
                          [1, 1, 1],
                          [0, 1, 0]]], dtype=torch.float32)


def generate_type_1_image():
    return torch.tensor([[[1, 0, 1],
                          [0, 1, 0],
                          [1, 0, 1]]], dtype=torch.float32)


def generate_image():
    # decide randomly which image to generate
    image_type = random.randint(0, 1)
    if image_type == 0:
        image = generate_type_0_image()
    else:
        image = generate_type_1_image()
    # return a tuple of the image and its type
    # so that the network can learn if it recognized it correctly
    return image, image_type


class NeuralNetwork(torch.nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        # define a single convolutional layer to 'scan' the image
        self.conv = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)
        # define a single pooling layer that 'wraps up'
        # what the convolutional layer has learned
        self.pool = torch.nn.AvgPool2d(3)

    # this is the place where the image is processed
    # the returned 'x' is a single floating point number
    # representing the image type (0.0 or 1.0)
    # the network is going to learn to predict it
    def forward(self, x):
        x = self.conv(x)
        x = self.pool(x)
        return x


# the model is initialized
model = NeuralNetwork().to('cpu')
# we pick some 'default' loss function
loss_fn = torch.nn.MSELoss()
# we pick some popular optimizer
optimizer = torch.optim.Adadelta(model.parameters())
# the model is set to training mode
model.train()
# interestingly, I need 5000 images before the network
# is reliable in this simple task
for i in range(5000):
    # each time a random 'image' is generated
    image, image_type = generate_image()
    # wrap the image type as a tensor of the proper shape
    # (I need these additional dimensions to match
    # the CNN output so that I can calculate the loss)
    expected_output = torch.tensor([[[image_type]]], dtype=torch.float32)
    # our model predicts if 'image' is of type '0' or '1'
    prediction = model(image)
    # the loss function calculates how inaccurate our model was
    loss = loss_fn(prediction, expected_output)
    # I print the loss to observe how it goes down during the learning process
    print('Loss at', i, '=', loss)
    # this kind of 'resets' the optimizer before the next iteration
    optimizer.zero_grad()
    # we tell our network to adjust depending on how badly it went this time
    loss.backward()
    # we take another step in the learning process
    optimizer.step()
# I assume the model is ready and switch it to evaluation mode
model.eval()
# now it's the fun part
# let's create a tensor that resembles a type '0' image but misses something
almost_type_0_image = torch.tensor([[[0, 1, 0],
                                     [1, 1, 0],
                                     [0, 1, 0]]], dtype=torch.float32)
# we'd expect the network to predict that this is a '0' type image
# (actually, ideally it should print 0.0)
# so let's find out how good it is
print('Almost type 0 image is predicted to be', model(almost_type_0_image))
# let's do a similar thing with a type '1' image
# now, ideally it should print 1.0
almost_type_1_image = torch.tensor([[[1, 0, 0],
                                     [0, 1, 0],
                                     [1, 0, 1]]], dtype=torch.float32)
print('Almost type 1 image is predicted to be', model(almost_type_1_image))
# finally, let's print what our network has actually learned
# or, in other words, what it really is
# we're going to need it in the next paragraph
print('Convolutional layer weight matrix is', model.conv.weight)
print('Convolutional layer bias is', model.conv.bias)
After running this code, your console should print something like:
Loss at 1 = tensor(0.1026, grad_fn=<MseLossBackward0>)
Loss at 2 = tensor(1.5652, grad_fn=<MseLossBackward0>)
Loss at 3 = tensor(1.5187, grad_fn=<MseLossBackward0>)
...
Loss at 4997 = tensor(8.7499e-06, grad_fn=<MseLossBackward0>)
Loss at 4998 = tensor(1.5388e-05, grad_fn=<MseLossBackward0>)
Loss at 4999 = tensor(2.6441e-05, grad_fn=<MseLossBackward0>)
Almost type 0 image is predicted to be tensor([[[0.3761]]], grad_fn=<AvgPool2DBackward0>)
Almost type 1 image is predicted to be tensor([[[1.1388]]], grad_fn=<AvgPool2DBackward0>)
Convolutional layer weight matrix is Parameter containing:
tensor([[[[-1.5447, -0.6216, -1.7912],
          [-0.8678,  1.9076, -0.7804],
          [-1.2586, -0.4520, -1.7062]]]], requires_grad=True)
Convolutional layer bias is Parameter containing:
tensor([2.2435], requires_grad=True)
Interpretation
Based on approximately 2,500 type-0 and 2,500 type-1 images, we trained our simple network to distinguish the two patterns to the extent that even if the input image is not perfect, i.e. it is missing something, the prediction is still correct:
- For ‘almost type 0 image’ the result is 0.3761, which is way closer to 0.0 than to 1.0
- For ‘almost type 1 image’ the result is 1.1388, which is way closer to 1.0 than to 0.0
Yeah, I know I could have used an additional ‘Sigmoid’ layer to squash the results between 0 and 1, but I wanted to cut even that to simplify the example, especially the following paragraph.
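For the curious, here is a minimal sketch of what that variant could look like; the class NeuralNetworkWithSigmoid and its use of torch.sigmoid are my own illustration, not part of the code above:

import torch


class NeuralNetworkWithSigmoid(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # same layers as in the original NeuralNetwork
        self.conv = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)
        self.pool = torch.nn.AvgPool2d(3)

    def forward(self, x):
        x = self.conv(x)
        x = self.pool(x)
        # squash the pooled value into the (0, 1) range;
        # rounding it would then give a hard 0/1 class label
        return torch.sigmoid(x)

With such a variant, a prediction above 0.5 could be read as ‘type 1’ and anything below as ‘type 0’.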


What actually happened?
The convolutional layer acts like a window that scans the ‘image’. At each pixel of the image, the convolution is applied so that, in our case, the 3×3 mask of the convolutional layer’s weights is multiplied element-wise with the image ‘pixels’ around that position and the results are summed up. On top of that, a constant bias (a single floating-point number) is added everywhere. Additional padding of zeros is applied to ‘spread’ our image just enough for the convolutional layer to be applied at every position without any conditional approach.
So, in each point of the second layer we have a kind of ‘view’ of an image part. All of these ‘views’ are averaged by the pooling layer to obtain a single floating point number: the closer it is to 0.0, the more it is a ‘type-0’ image, and the closer it is to 1.0, the more it is a ‘type-1’ image.
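To make this concrete, here is a minimal, hand-rolled sketch of the same forward pass, plugging in the weights and bias printed in the console output above (the helper names convolve_with_padding and predict are mine, purely for illustration):

# a hand-rolled version of conv (3x3 kernel, zero padding of 1) + average pooling
def convolve_with_padding(image, kernel):
    # surround the 3x3 image with a border of zeros -> 5x5
    padded = [[0] * 5 for _ in range(5)]
    for r in range(3):
        for c in range(3):
            padded[r + 1][c + 1] = image[r][c]
    # slide the 3x3 kernel over every position of the original image
    output = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            output[i][j] = sum(kernel[a][b] * padded[i + a][j + b]
                               for a in range(3) for b in range(3))
    return output


def predict(image, kernel, bias):
    conv_output = convolve_with_padding(image, kernel)
    # average pooling over the 3x3 result; the bias is added at every
    # position, so it simply adds the bias to the average
    return sum(sum(row) for row in conv_output) / 9 + bias


# the weights and bias printed by the trained network above
kernel = [[-1.5447, -0.6216, -1.7912],
          [-0.8678,  1.9076, -0.7804],
          [-1.2586, -0.4520, -1.7062]]
bias = 2.2435

almost_type_0_image = [[0, 1, 0],
                       [1, 1, 0],
                       [0, 1, 0]]
print(predict(almost_type_0_image, kernel, bias))  # ~0.376

The tiny gap between this 0.376 and the printed 0.3761 comes from the weights being shown with only four decimal places. Note also that PyTorch’s Conv2d computes a cross-correlation (it does not flip the kernel), which is why the loops above don’t either.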

In the picture I present a real example that I calculated by hand, both in a spreadsheet and with a calculator, to make sure the network works the way I think it works, and I believe it really does! You can relate all these numbers to the console output from the paragraph above and, if you want, you can repeat my calculations to find out whether I’m telling the truth! Thank you for this journey through my first steps in CNNs!
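If you prefer to let PyTorch do the arithmetic while still plugging in the printed numbers yourself, here is a small sketch using the functional API; the exact values below are copied from the console output above, and your own run will of course produce different weights:

import torch
import torch.nn.functional as F

# the weights and bias exactly as printed by the trained network above
weight = torch.tensor([[[[-1.5447, -0.6216, -1.7912],
                         [-0.8678,  1.9076, -0.7804],
                         [-1.2586, -0.4520, -1.7062]]]])
bias = torch.tensor([2.2435])

# the 'almost type 1' image, shaped as (batch, channels, height, width)
almost_type_1_image = torch.tensor([[[[1, 0, 0],
                                      [0, 1, 0],
                                      [1, 0, 1]]]], dtype=torch.float32)

# same operations as in NeuralNetwork.forward: convolution with zero padding,
# then average pooling over the whole 3x3 result
conv_output = F.conv2d(almost_type_1_image, weight, bias=bias, padding=1)
prediction = F.avg_pool2d(conv_output, 3)
print(prediction)  # ~1.139, matching the printed 1.1388 up to rounding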