About Convolution (2D)
Convolution is the core operation in Convolutional Neural Networks (CNNs). A small matrix called a kernel (or filter) slides across a 2D input (such as an image); at each position it computes the element-wise product between the kernel and the overlapping patch of the input, then sums the products to produce a single output value. This process generates an output feature map that highlights where certain features (edges, textures, patterns) appear in the input. (Strictly speaking, deep-learning frameworks compute cross-correlation, applying the kernel without flipping it, but the operation is conventionally called convolution.) The key insight of convolution is weight sharing — the same kernel weights are reused across all spatial positions, dramatically reducing the number of parameters compared to a fully connected layer. Convolution was popularized by LeCun et al. (1989) for handwritten digit recognition and remains the backbone of modern computer vision.
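The operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation; the function name conv2d and the example image and kernel are chosen for this sketch:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no-padding, stride-1) 2D convolution as used in CNNs.

    image:  (H, W) array
    kernel: (K, K) array
    returns an (H-K+1, W-K+1) feature map
    """
    H, W = image.shape
    K = kernel.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product with the overlapping patch, then sum.
            patch = image[i:i+K, j:j+K]
            out[i, j] = np.sum(patch * kernel)
    return out

# Vertical-edge detector on a toy image: dark left half, bright right half.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)
sobel_x = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=float)
print(conv2d(image, sobel_x))  # every position straddles the edge: all 4s
```

Note that, as in CNN frameworks, the kernel is applied without flipping (cross-correlation).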
Complexity Analysis
- Time Complexity
- O(H × W × K² × C), for an H × W input, a K × K kernel, and C channels
- Space Complexity
- O(H × W + K²): the output feature map plus the kernel weights
- Difficulty
- intermediate
Key Concepts
Kernel Sliding
The kernel slides across the input one position at a time (stride=1). At each position, it computes the dot product between the kernel and the overlapping input patch. This produces one value in the output feature map.
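The sliding loop generalizes naturally to larger strides, which step the kernel by more than one position and shrink the output accordingly. A sketch (conv2d_stride is an illustrative name, not a library function):

```python
import numpy as np

def conv2d_stride(image, kernel, stride=1):
    """Valid 2D convolution with a configurable stride."""
    H, W = image.shape
    K = kernel.shape[0]
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the patch
            out[i, j] = np.sum(image[r:r+K, c:c+K] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0            # 3x3 mean filter

print(conv2d_stride(image, kernel, stride=1).shape)  # (3, 3)
print(conv2d_stride(image, kernel, stride=2).shape)  # (2, 2)
```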
Weight Sharing
The same kernel weights are applied at every position in the input. This means the network detects the same feature regardless of where it appears — a property called translation equivariance.
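Translation equivariance can be checked directly: shifting the input shifts the output feature map by the same amount. A small sketch (the conv2d helper is illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid stride-1 2D convolution; illustrative helper."""
    H, W = image.shape
    K = kernel.shape[0]
    return np.array([[np.sum(image[i:i+K, j:j+K] * kernel)
                      for j in range(W - K + 1)]
                     for i in range(H - K + 1)])

kernel = np.array([[0., 1.],
                   [1., 0.]])

# A single bright pixel, and the same pixel shifted one step right.
img = np.zeros((5, 5))
img[1, 1] = 1.0
shifted = np.roll(img, 1, axis=1)

out = conv2d(img, kernel)
out_shifted = conv2d(shifted, kernel)

# The response to the shifted input is the shifted response.
print(np.allclose(np.roll(out, 1, axis=1), out_shifted))  # True
```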
Feature Maps
Each kernel produces one feature map. Using multiple kernels (channels) allows the network to detect different features (edges, textures, shapes) simultaneously.
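For example, applying a small bank of kernels to one input yields one feature map per kernel, which frameworks stack along a channel axis. A sketch with hand-picked kernels (all names here are illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid stride-1 2D convolution; illustrative helper."""
    H, W = image.shape
    K = kernel.shape[0]
    return np.array([[np.sum(image[i:i+K, j:j+K] * kernel)
                      for j in range(W - K + 1)]
                     for i in range(H - K + 1)])

image = np.arange(64, dtype=float).reshape(8, 8)

# A small bank of kernels, each tuned to a different feature.
kernels = {
    "vertical_edge":   np.array([[-1., 0., 1.]] * 3),
    "horizontal_edge": np.array([[-1., 0., 1.]] * 3).T,
    "blur":            np.ones((3, 3)) / 9.0,
}

# One feature map per kernel, stacked into (num_kernels, H', W').
feature_maps = np.stack([conv2d(image, k) for k in kernels.values()])
print(feature_maps.shape)  # (3, 6, 6)
```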
Common Pitfalls
Output size reduction
Without padding (and with stride 1), convolution reduces spatial dimensions: output_size = input_size - kernel_size + 1. Use padding='same' to keep the output the same size as the input.
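The size arithmetic can be captured in one helper. This generalizes the formula above with padding and stride (output_size is a hypothetical name, not a framework function):

```python
def output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial size of a convolution output along one dimension.

    With padding=0 and stride=1 this reduces to
    input_size - kernel_size + 1.
    """
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(output_size(32, 3))             # 30 (valid: 32 - 3 + 1)
print(output_size(32, 3, padding=1))  # 32 ('same' padding for a 3x3 kernel)
print(output_size(32, 3, stride=2))   # 15
```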
Kernel size must be ≤ input size
The kernel cannot be larger than the input in any dimension. This is a hard constraint of the convolution operation.