This repository provides a guide for building Convolutional Neural Networks (CNNs) in PyTorch, aimed at beginners who want to understand how CNNs work and how to implement them. It combines theoretical explanations of key Deep Learning concepts, such as the network architecture, Cross-Entropy Loss, and the Adam optimizer, with code implementations that show how these components come together in practice.
CNNs are widely used in Computer Vision tasks, such as image classification, object detection, and image generation.
This repository is built for learning purposes and helps beginners get started with coding neural networks and understanding their key components.
A Convolutional Neural Network (CNN) is a type of feed-forward network that learns to perform tasks such as classification by optimizing its parameters (weights and biases). For each input, the network performs a forward pass, computing the outputs of each layer and storing intermediate activations. The loss is then calculated using the chosen loss function. During backpropagation, the gradients of the loss with respect to the parameters are computed, and a gradient descent algorithm uses them to update the parameters, moving toward a (local) minimum of the loss.
The network in this guide, referred to as the “Simple CNN”, consists of 6 layers: 2 convolutional layers, 2 pooling layers, and 2 fully-connected layers. It also employs dropout and batch normalization to improve generalization and training stability.
A Convolutional Neural Network contains several types of layers. We will discuss the types that are relevant to our SimpleCNN model.
The Fully Connected (FC) layer consists of weights and biases, where every neuron in the output layer is connected to every neuron in the input layer through these weights.
(In the image, the input layer is shown in blue, the output layer in red, and the connecting arcs represent the weights.)
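In this structure, each output neuron is influenced by all input neurons according to the corresponding weights. After this linear combination, a non-linear activation function is applied. The goal in this layer is to optimize both the weights and the biases. The formula below shows how to compute the output of the layer:

$$y = f(Wx + b)$$

where $x$ is the input vector, $W$ is the weight matrix, $b$ is the bias vector, and $f$ is the activation function.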
A network consisting solely of fully-connected layers is called a Multilayer Perceptron (MLP).
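As a minimal sketch of the computation a fully-connected layer performs, the plain-Python snippet below applies a weighted sum plus bias to one input vector and then a ReLU activation. The function names and the numeric values are illustrative assumptions, not part of the repository.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def fully_connected(x, weights, biases, activation=relu):
    """Compute activation(W x + b) for a single input vector x.

    `weights` holds one row per output neuron and one column per input neuron.
    """
    pre_activation = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(weights, biases)
    ]
    return activation(pre_activation)


# Illustrative values: 3 input neurons, 2 output neurons.
x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 1.0],
     [-1.0, 0.0, 0.1]]
b = [0.1, -0.2]
y = fully_connected(x, W, b)
```

Every output depends on every input, which is why the number of weights grows as (inputs × outputs).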
The convolutional layer is a fundamental building block of a CNN. It performs a dot product between two matrices: one is the kernel (or filter), which contains learnable parameters, and the other is a portion of the layer’s input. The parameters to optimize in this layer are the kernels themselves.
For a more intuitive explanation of convolution, 3Blue1Brown provides an excellent visual guide.
During a forward pass, each filter scans the input according to its specifications (stride, padding etc.). At each position, a convolution (cross-correlation) operation is applied to the corresponding portion of the input to produce a single output value (as illustrated in the image). For multi-channel inputs, the operation is applied across all channels of the input, and the results are summed to produce the final output of the filter.
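The sliding-window operation described above can be sketched in plain Python for a single-channel input without padding. The function name `cross_correlate2d` and the sample values are assumptions made for illustration; a real layer would also sum over input channels and add a bias.

```python
def cross_correlate2d(image, kernel, stride=1):
    """Valid (no padding) 2D cross-correlation of a single-channel image."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Dot product between the kernel and the current image patch.
            row.append(sum(
                image[top + di][left + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            ))
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = cross_correlate2d(image, kernel)
```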
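The size of the output along each spatial dimension can be calculated as follows:

$$O = \left\lfloor \frac{I - K + 2P}{S} \right\rfloor + 1$$

where $I$ is the input size, $K$ is the kernel size, $P$ is the padding, and $S$ is the stride.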
The number of output channels is the number of filters in the layer.
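As a quick check of the output-size arithmetic, the sketch below uses the standard relation `floor((input + 2*padding - kernel) / stride) + 1` to trace the spatial size through two 5x5 convolutions and two 2x2 max-pools. It assumes 28x28 input images (the standard MNIST resolution) and no padding; the helper name `output_size` is illustrative.

```python
def output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# (kernel_size, stride) for conv 5x5, max-pool 2x2, conv 5x5, max-pool 2x2.
layers = [(5, 1), (2, 2), (5, 1), (2, 2)]

size = 28  # assumed MNIST image height/width
trace = [size]
for kernel_size, stride in layers:
    size = output_size(size, kernel_size, stride)
    trace.append(size)

# With 64 output channels in the last convolution, the flattened feature size is:
flattened_features = 64 * size * size
```

Under these assumptions the spatial size shrinks 28 → 24 → 12 → 8 → 4, so the flattened vector has 64 · 4 · 4 = 1024 features, which matches the input size of the first fully-connected layer.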
A pooling layer in Convolutional Neural Networks (CNNs) is used to reduce the spatial dimensions (height and width) of feature maps while retaining the most important information. The pooling operation applies a kernel that slides over the input, producing a single output value for each region, depending on the type of pooling (e.g., Max-Pooling or Average-Pooling, as illustrated in the figure below). The output size along each dimension is determined by the input size, kernel size, stride (step size), and padding (if applied).
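To make the two pooling types concrete, here is a small plain-Python sketch of max and average pooling over a single-channel feature map without padding. The function name `pool2d` and the sample values are illustrative assumptions.

```python
def pool2d(feature_map, kernel_size=2, stride=2, mode="max"):
    """Max or average pooling over a 2D list, without padding."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    out_h = (len(feature_map) - kernel_size) // stride + 1
    out_w = (len(feature_map[0]) - kernel_size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the values covered by the kernel at this position.
            window = [
                feature_map[i * stride + di][j * stride + dj]
                for di in range(kernel_size)
                for dj in range(kernel_size)
            ]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        pooled.append(row)
    return pooled


feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]]
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
```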
A neural network is a model used for function approximation and estimation. With a non-linear activation function, even a network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units, as stated by the Universal Approximation Theorem. For this reason, non-linear activation functions are applied after fully connected and convolutional layers, since both are linear operations on their own. These activations introduce non-linearity, enabling the network to capture complex patterns in the data.
Common examples of activation functions:
- Sigmoid: $\sigma(x) = {1 \over {1+e^{-x}}}$
- Hyperbolic Tangent: $\tanh(x) = {{e^x - e^{-x}} \over {e^x + e^{-x}}}$
- Rectified Linear Unit (ReLU): $ReLU(x) = \max(0,x)$
- Leaky Rectified Linear Unit (Leaky ReLU): $LReLU(x) = \begin{cases}0.01x \; & if & x \le 0 \\\ x \; & if & x > 0\end{cases}$
- Softmax: $Softmax(x)_{i} = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$
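The activation functions listed above can be sketched in a few lines of plain Python. The sigmoid and softmax below use numerically stable forms (an implementation choice made here, not something the formulas require), and the `negative_slope` parameter name is an assumption.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, negative_slope=0.01):
    return x if x > 0 else negative_slope * x


def softmax(xs):
    """Softmax over a list; subtracting the max does not change the result."""
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
```

Unlike the other functions, softmax acts on a whole vector and returns values that sum to 1, which is why it is used to turn class scores into probabilities.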
The loss measures the discrepancy between the predicted output of a model and the true target output according to a chosen criterion. It quantifies how well the model performs on a single example, reflecting the “error” or “cost” associated with that prediction. A loss function is a mathematical formulation that maps predictions and true targets to a single real number representing this cost. It serves as a guiding signal for learning, indicating how the model parameters should be updated to minimize the loss.
Loss functions are task-specific measures of the discrepancy between predicted and true outputs. For example, regression tasks often use Mean Squared Error (MSE) or Mean Absolute Error (MAE), while classification tasks commonly use Cross-Entropy or Hinge Loss. They can also encode penalties for particular mistakes, such as misclassifying certain classes or emphasizing outliers. Minimizing the loss over the dataset forms the optimization objective in machine learning and provides the signal used to compute gradients and update model parameters during training.
Common examples of loss functions:
- Mean Squared Error (MSE): $MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - t_i)^2$
- Mean Absolute Error (MAE, L1 Loss): $MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - t_i|$
- Mean Bias Error (MBE): $MBE = \frac{1}{N} \sum_{i=1}^{N} (y_i - t_i)$
- Hinge (SVM): $H_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i}+1)$
- Cross-Entropy: $CE = -\frac{1}{N} \sum_{i=1}^{N} y_i \log{t_i}$
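The loss functions above can be sketched in plain Python as follows. To avoid ambiguity, the code uses its own explicit names, `pred` for model outputs and `target` for ground truth; these names, the `eps` clamp in the cross-entropy, and the sample values are assumptions made for illustration. The hinge and cross-entropy functions are written for a single example.

```python
import math


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mbe(pred, target):
    return sum(p - t for p, t in zip(pred, target)) / len(pred)


def multiclass_hinge(scores, true_class, margin=1.0):
    """Sum of margin violations of every wrong class against the true class."""
    return sum(
        max(0.0, s - scores[true_class] + margin)
        for j, s in enumerate(scores)
        if j != true_class
    )


def cross_entropy(pred_probs, target_probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(pred_probs, target_probs))


pred, target = [2.5, 0.0, 2.0], [3.0, -0.5, 2.0]
mse_value = mse(pred, target)
mae_value = mae(pred, target)
mbe_value = mbe(pred, target)
hinge_value = multiclass_hinge([3.0, 5.0, 1.0], true_class=0)
ce_value = cross_entropy([0.7, 0.2, 0.1], [1.0, 0.0, 0.0])
```

Note how the bias error is zero here even though the predictions are not perfect: positive and negative errors cancel, which is why MBE is a measure of systematic bias rather than accuracy.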
Mathematical optimization is the process of selecting the best element from a set of feasible options according to a given objective. An optimization problem is therefore the task of finding the solution that yields the minimum or maximum value of an objective function. Such problems can involve either continuous variables or discrete variables, depending on the context.
In machine learning, optimization refers to the process of adjusting a model’s parameters so that its predictions better match the target outputs. This is typically framed as minimizing a loss function, which quantifies the error between predicted and true values. Because modern models have millions of parameters and highly non-linear loss surfaces, analytical solutions are infeasible. Instead, iterative algorithms like Gradient Descent and its variants are used to gradually update parameters in the direction that reduces the loss.
Gradient descent is a first-order iterative optimization algorithm that uses the gradient of the loss function to guide parameter updates. Starting from an initial set of parameters, the algorithm computes the gradient of the loss with respect to each parameter and then moves the parameters in the opposite direction of the gradient, since this is the direction of steepest descent. The magnitude of each update is controlled by the learning rate (η), which determines how large a step is taken toward reducing the loss. Through repeated updates, the parameters gradually converge toward a local (or sometimes global) minimum of the loss function.
In the forward pass, the input propagates through the network to produce an output and compute the loss. In the backward pass (backpropagation), this loss is propagated backward through the network to determine how much each parameter contributed to the error. Using the chain rule, the gradients of the loss with respect to all weights and biases are calculated layer by layer. These gradients indicate how the parameters should change to reduce the loss. The Gradient Descent algorithm then updates the parameters in the opposite direction of the gradients. Repeating this cycle gradually improves the model by minimizing the loss function.
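As a minimal illustration of the chain rule at work, the sketch below runs a forward and a backward pass through a single sigmoid neuron with a squared-error loss, then compares the analytic gradients with finite-difference estimates. The tiny model and all names and values are assumptions chosen for illustration, not the network from this guide.

```python
import math


def forward(w, b, x, target):
    """Forward pass: z = w*x + b, y = sigmoid(z), loss = (y - target)^2."""
    z = w * x + b
    y = 1.0 / (1.0 + math.exp(-z))
    return y, (y - target) ** 2


def backward(w, b, x, target):
    """Backward pass: apply the chain rule from the loss back to w and b."""
    y, _ = forward(w, b, x, target)
    dloss_dy = 2.0 * (y - target)   # derivative of the squared error
    dy_dz = y * (1.0 - y)           # derivative of the sigmoid
    dloss_dz = dloss_dy * dy_dz
    return dloss_dz * x, dloss_dz   # dL/dw (since dz/dw = x), dL/db (dz/db = 1)


def numerical_gradients(w, b, x, target, h=1e-6):
    """Central finite differences, used only to check the analytic result."""
    dw = (forward(w + h, b, x, target)[1] - forward(w - h, b, x, target)[1]) / (2 * h)
    db = (forward(w, b + h, x, target)[1] - forward(w, b - h, x, target)[1]) / (2 * h)
    return dw, db


w, b, x, target = 0.5, -0.3, 2.0, 1.0
grad_w, grad_b = backward(w, b, x, target)
num_w, num_b = numerical_gradients(w, b, x, target)
```

In PyTorch the same bookkeeping is done automatically by autograd when `loss.backward()` is called.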
3Blue1Brown offers a clear and intuitive visualization of backpropagation.
The parameter update rule for gradient descent is:

$$w_i \leftarrow w_i - \eta \nabla L$$

Where:
- $\eta$ is the learning rate.
- $L$ is the loss function.
- $\nabla L$ is the gradient of the loss with respect to the parameter $w_i$.
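A minimal sketch of the update rule, applied to a one-parameter quadratic loss whose minimum is known to be at `w = 3`. The loss, the starting point, and the hyperparameter values are illustrative assumptions.

```python
def gradient_descent(grad_fn, w_init, learning_rate=0.1, steps=100):
    """Repeatedly move w against the gradient: w <- w - learning_rate * grad."""
    w = w_init
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
        history.append(w)
    return w, history


def loss(w):
    return (w - 3.0) ** 2


def grad(w):
    return 2.0 * (w - 3.0)


w_final, history = gradient_descent(grad, w_init=0.0)
losses = [loss(w) for w in history]
```

With this learning rate the distance to the minimum shrinks by a constant factor each step; a learning rate that is too large would make the iterates overshoot and diverge instead.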
The minimum conditions for applying gradient-based optimization are that the loss function must be differentiable with respect to the model’s parameters and that gradients can be computed efficiently. Differentiability allows the use of the chain rule, implemented through backpropagation, to propagate errors from the output layer back through the network. This ensures that each parameter receives an update signal that guides it toward reducing the loss.
Common variants of Gradient Descent:
- Batch Gradient Descent – uses the whole dataset per step.
- Stochastic Gradient Descent (SGD) – updates per single sample.
- Mini-batch Gradient Descent – compromise using small batches.
- Momentum – accelerates updates in consistent directions.
- Adagrad – adaptive learning rate based on past gradients.
- RMSProp – scales learning rates using moving averages.
- Adam – combines Momentum and RMSProp, widely used in deep learning.
Regularization refers to a set of techniques used to prevent a machine learning model from overfitting the training data, improving its generalization to unseen data. It works by constraining or penalizing the model’s complexity, encouraging simpler solutions that are less sensitive to noise in the data.
In our model, Simple CNN, we use Dropout and Batch Normalization methods.
Dropout is a regularization technique where, during training, each neuron is set to zero with a fixed probability $p$ (e.g. 50%) in each forward pass. This prevents co-adaptation and over-reliance on specific neurons, and encourages redundancy and robustness.
In the inverted-dropout formulation used by PyTorch, the surviving activations are scaled by $\frac{1}{1-p}$ during training, so that their expected value matches the value seen at inference time.
During inference, all units are used as-is: $y = x$
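The behavior in both modes can be sketched in plain Python. This follows the inverted-dropout convention (scaling by `1 / (1 - p)` during training, identity at inference), which is how `torch.nn.Dropout` is documented to behave; the function name and the seeded generator are assumptions for illustration.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1) for this sketch")
    if not training or p == 0.0:
        return list(values)  # inference: units are used as-is
    keep_prob = 1.0 - p
    # Keep each unit with probability keep_prob and rescale the survivors.
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)  # seeded only to make the example repeatable
activations = [1.0] * 10000
train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False)
train_mean = sum(train_out) / len(train_out)
```

Because of the rescaling, the mean activation during training stays close to the mean at inference, so no extra correction is needed when the model is switched to evaluation mode.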
Batch Normalization aims to stabilize and accelerate training by ensuring each channel’s activations have consistent statistics across mini-batches. It normalizes each feature channel’s activations to zero mean and unit variance over a mini-batch, thereby reducing internal covariate shift, and it can have a slight regularizing effect (due to batch noise).
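For a layer’s inputs $x_1, \dots, x_m$ in a mini-batch, we first compute the mean and variance and normalize:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \quad;\quad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2 \quad;\quad \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

Then we scale ($\gamma$) and shift ($\beta$):

$$y_i = \gamma \hat{x}_i + \beta$$

where $\gamma$ and $\beta$ are learnable parameters and $\epsilon$ is a small constant for numerical stability.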
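A minimal sketch of the training-time computation for one channel: the values of a mini-batch are normalized with the batch mean and (biased) variance, then scaled by `gamma` and shifted by `beta`. The function name and sample values are assumptions; the running statistics that a real layer keeps for inference are deliberately left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's values over a mini-batch, then scale and shift."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m  # biased variance over the batch
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]


out = batch_norm([2.0, 4.0, 6.0, 8.0])
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)

# A non-default scale and shift move the output statistics accordingly.
shifted = batch_norm([2.0, 4.0, 6.0, 8.0], gamma=2.0, beta=1.0)
shifted_mean = sum(shifted) / len(shifted)
```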
A dataset is a collection of data used to train, validate, and test machine learning models. It provides the examples from which the model learns patterns, relationships, or features relevant to a specific task, such as classification or regression. The quality, size, and diversity of the dataset directly influence the model’s performance and generalization ability, making dataset preparation and selection a critical step in any machine learning pipeline.
MNIST is a classic dataset of handwritten digits and is often referred to as the “hello world” of computer vision and machine learning. By providing a simple, standardized set of images paired with labels, it allows both newcomers and experts to quickly prototype and benchmark classification algorithms, explore feature learning, and compare new techniques against a familiar baseline.
This network is trained on the MNIST dataset, which consists of grayscale images of handwritten single-digit numbers (0–9). The task for the network is to classify each input image into the correct digit class. The dataset contains 70,000 images, with 60,000 images used for training and 10,000 images for testing.
For more information on the MNIST Dataset.
Our model consists of 6 layers:
- Convolution Layer with a kernel size of 5x5 and ReLU activation function.
- Max-pool Layer with a kernel size of 2x2.
- Convolution Layer with a kernel size of 5x5 and ReLU activation function.
- Max-pool Layer with a kernel size of 2x2.
- Fully-connected Layer with an input size of 1024 and an output size of 512, and ReLU activation function.
- Fully-connected Layer with an input size of 512 and an output size of 10 (classes), and Softmax activation function (applied inside the loss function in the implementation).
The Simple CNN also uses methods to accelerate and stabilize the convergence of training and to avoid overfitting. After the second and fourth layers (Max-pool) it applies Dropout, and after the first and third layers (Convolution) it applies Batch Normalization, before the activation.
The Simple CNN is implemented with PyTorch, using the torch.nn module for the network layers and methods. Every layer/method apart from the fully connected layers gets a 4-dimensional input (N,C,H,W), where N is the batch size, C is the number of channels, and H,W are the height and width of the images, respectively. The module offers many kinds of layers, methods and functions; for the Simple CNN we used:
- Conv2d - Applies a 2D convolution over an input signal composed of several input planes.
- MaxPool2d - Applies a 2D max pooling over an input signal composed of several input planes.
- Linear - Applies a linear transformation to the layer's input, $y = xA^T+b$. A 4D input is first flattened to 2D, from (N,C,H,W) to (N,C·H·W), with the same notation as above.
- Dropout - During training, randomly zeroes some of the elements of the input tensor with a given probability p using samples from a Bernoulli distribution. The elements are zeroed out independently on every forward call.
- BatchNorm2d - Applies Batch Normalization over a 4D input, slicing through C (the channel dimension) and computing the mean ($\mu$) and variance ($\sigma^2$) over each (N,H,W) slice. These statistics are then used to normalize each slice.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleCNN(nn.Module):
    """
    A simple Convolutional Neural Network (CNN) for MNIST classification.

    Architecture:
        - 2 Convolutional layers with ReLU and Batch Normalization.
        - 2 Max Pooling layers.
        - 2 Fully Connected (FC) layers.
        - Regularization:
            - 2 Dropout.
            - 2 Batch Normalization.
        - Note: No explicit Softmax (applied by nn.CrossEntropyLoss).
    """

    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=5)
        # Max-Pooling layers
        self.max1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.max2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Fully-Connected layers
        self.fc1 = nn.Linear(in_features=64 * 4 * 4, out_features=512)
        self.fc2 = nn.Linear(in_features=512, out_features=num_classes)
        # Dropout
        self.dropout1 = nn.Dropout(p=0.45)
        self.dropout2 = nn.Dropout(p=0.35)
        # Batch Normalization
        self.batch1 = nn.BatchNorm2d(num_features=32)
        self.batch2 = nn.BatchNorm2d(num_features=64)

    def forward(self, x):
        # Assumption: the original listing shows only __init__. This forward pass
        # follows the layer order described in the text:
        # conv -> batch norm -> ReLU -> max-pool -> dropout, applied twice.
        x = self.dropout1(self.max1(F.relu(self.batch1(self.conv1(x)))))
        x = self.dropout2(self.max2(F.relu(self.batch2(self.conv2(x)))))
        x = torch.flatten(x, start_dim=1)  # (N, 64, 4, 4) -> (N, 1024)
        x = F.relu(self.fc1(x))
        return self.fc2(x)  # raw logits; softmax is applied inside the loss
```

```python
for inputs, labels in train_loader:
    # Reset gradients
    optimizer.zero_grad()
    # Forward pass
    outputs = model(inputs)
    # Compute loss
    loss = criterion(outputs, labels)
    # Backpropagation
    loss.backward()
    # Update parameters
    optimizer.step()
```

```python
# Initialize model, loss function and optimizer
cnn_model, loss_fn, adam_optimizer, h_device = _setup_model_for_training(num_class, learning_rate)
# Initialize MNIST data loaders
train_loader, val_loader, test_loader = get_mnist_dataloaders(batch_size, validation_split)
# Train & Validation
train_losses, validation_losses = train_model(cnn_model, loss_fn, adam_optimizer, train_loader, val_loader, h_device)
# Test
test_accuracy, test_loss = evaluate_model(cnn_model, loss_fn, test_loader, h_device)
print(f"\nTest Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.2f}%")
# Plot Loss
plot_training_losses(train_losses, validation_losses)
```

Cross-Entropy Loss is widely used for classification tasks, as it measures the difference between the predicted probability distribution and the true distribution. Given a predicted probability vector $p$ over the classes and the true class $c$, the loss for a single example is $-\log p_c$, the negative log-probability assigned to the correct class.
This loss penalizes confident incorrect predictions more heavily than less certain ones, encouraging the model to assign higher probabilities to the correct classes. Minimizing cross-entropy effectively maximizes the likelihood of the correct labels under the model’s predicted distribution.
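The penalty on confident mistakes can be seen in a small plain-Python sketch that computes cross-entropy directly from raw class scores (logits) through a log-softmax. This mirrors the reason the model does not need an explicit Softmax layer when the loss works on logits; the function names and the sample logits are assumptions for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_sum_exp for z in logits]


def cross_entropy_from_logits(logits, true_class):
    """Negative log-probability assigned to the true class."""
    return -log_softmax(logits)[true_class]


confident_right = cross_entropy_from_logits([5.0, 0.0, 0.0], true_class=0)
unsure = cross_entropy_from_logits([0.0, 0.0, 0.0], true_class=0)
confident_wrong = cross_entropy_from_logits([5.0, 0.0, 0.0], true_class=1)
```

The three values are ordered `confident_right < unsure < confident_wrong`: a uniform guess over three classes costs `log(3)`, while a confident wrong prediction costs several times more.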
Adam (Adaptive Moment Estimation) is a widely used optimization algorithm in machine learning. It combines the benefits of Momentum and RMSProp, maintaining running estimates of both the mean and the uncentered variance of gradients to adaptively adjust the learning rate for each parameter. By using these adaptive estimates, Adam can converge faster and more reliably on complex models, handle noisy gradients, and often requires less manual tuning of the learning rate compared to standard stochastic gradient descent. Its adaptive nature makes Adam particularly effective for large-scale problems and deep neural networks, where gradients can vary significantly across parameters.
- $\theta_t$: parameters at time step $t$.
- $\beta_1, \beta_2$: exponential decay rates for the moment estimates.
- $\alpha$: learning rate.
- $\epsilon$: small constant to prevent division by zero.
- $\lambda$: weight decay coefficient (optional; not used in the update below).

Starting from $m_0 = v_0 = 0$, each step $t = 1, 2, \dots$ performs:

- Compute gradients:
  $$g_t = \nabla_\theta J(\theta_t)$$
- Update moment estimates:
  $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad;\quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
- Bias correction:
  $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \quad;\quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
- Parameter update:
  $$\theta_{t+1} = \theta_t - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
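The four steps above can be sketched for a single scalar parameter in plain Python. The quadratic objective, the starting point, and the hyperparameter values are illustrative assumptions, and weight decay is left out.

```python
import math


def adam_minimize(grad_fn, theta, alpha=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Run Adam on one scalar parameter and return the final value."""
    m, v = 0.0, 0.0  # first and second moment estimates
    for t in range(1, steps + 1):
        g = grad_fn(theta)                         # compute gradient
        m = beta1 * m + (1 - beta1) * g            # update first moment
        v = beta2 * v + (1 - beta2) * g * g        # update second moment
        m_hat = m / (1 - beta1 ** t)               # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)  # parameter update
    return theta


# Minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_final = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

In PyTorch the same algorithm is available as `torch.optim.Adam`, which keeps one pair of moment estimates per parameter tensor.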
The model's performance is evaluated mainly by the loss. The loss is the key measure of how far predictions deviate from the targets, driving the optimization process to adjust model parameters and minimize this error. Training loss reflects performance on the data used for learning, while validation loss measures performance on unseen data to assess generalization; a widening gap between them often indicates overfitting.
The Back Propagation Method for CNN
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Improving neural networks by preventing co-adaptation of feature detectors
