About Convolution (2D)
Convolution is the core operation in Convolutional Neural Networks (CNNs). A small matrix called a kernel (or filter) slides across a 2D input (such as an image); at each position it computes the element-wise product between the kernel and the overlapping patch of the input, then sums the products to produce a single output value. This process generates an output feature map that highlights where certain features (edges, textures, patterns) appear in the input. (Strictly speaking, deep-learning frameworks compute cross-correlation, applying the kernel without flipping it, but the operation is conventionally called convolution.) The key insight of convolution is weight sharing — the same kernel weights are reused across all spatial positions, dramatically reducing the number of parameters compared to a fully connected layer. Convolution was popularized by LeCun et al. (1989) for handwritten digit recognition and remains the backbone of modern computer vision.
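The operation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation; the function name conv2d and the example image and kernel are chosen for this sketch:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no-padding, stride-1) 2D convolution as used in CNNs.

    image:  (H, W) array
    kernel: (K, K) array
    returns an (H-K+1, W-K+1) feature map
    """
    H, W = image.shape
    K = kernel.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product with the overlapping patch, then sum.
            patch = image[i:i+K, j:j+K]
            out[i, j] = np.sum(patch * kernel)
    return out

# Vertical-edge detector on a toy image: dark left half, bright right half.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)
sobel_x = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
], dtype=float)
print(conv2d(image, sobel_x))  # every position straddles the edge: all 4s
```

Note that, as in CNN frameworks, the kernel is applied without flipping (cross-correlation).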
Complexity Analysis
- Time Complexity
- O(H × W × K² × C), for an H × W input, a K × K kernel, and C channels
- Space Complexity
- O(H × W + K²): the output feature map plus the kernel weights
- Difficulty
- intermediate
Key Concepts
Kernel Sliding
The kernel slides across the input one position at a time (stride=1). At each position, it computes the dot product between the kernel and the overlapping input patch. This produces one value in the output feature map.
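The sliding loop generalizes naturally to larger strides, which step the kernel by more than one position and shrink the output accordingly. A sketch (conv2d_stride is an illustrative name, not a library function):

```python
import numpy as np

def conv2d_stride(image, kernel, stride=1):
    """Valid 2D convolution with a configurable stride."""
    H, W = image.shape
    K = kernel.shape[0]
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the patch
            out[i, j] = np.sum(image[r:r+K, c:c+K] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0            # 3x3 mean filter

print(conv2d_stride(image, kernel, stride=1).shape)  # (3, 3)
print(conv2d_stride(image, kernel, stride=2).shape)  # (2, 2)
```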
Weight Sharing
The same kernel weights are applied at every position in the input. This means the network detects the same feature regardless of where it appears — a property called translation equivariance.
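Translation equivariance can be checked directly: shifting the input shifts the output feature map by the same amount. A small sketch (the conv2d helper is illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid stride-1 2D convolution; illustrative helper."""
    H, W = image.shape
    K = kernel.shape[0]
    return np.array([[np.sum(image[i:i+K, j:j+K] * kernel)
                      for j in range(W - K + 1)]
                     for i in range(H - K + 1)])

kernel = np.array([[0., 1.],
                   [1., 0.]])

# A single bright pixel, and the same pixel shifted one step right.
img = np.zeros((5, 5))
img[1, 1] = 1.0
shifted = np.roll(img, 1, axis=1)

out = conv2d(img, kernel)
out_shifted = conv2d(shifted, kernel)

# The response to the shifted input is the shifted response.
print(np.allclose(np.roll(out, 1, axis=1), out_shifted))  # True
```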
Feature Maps
Each kernel produces one feature map. Using multiple kernels (channels) allows the network to detect different features (edges, textures, shapes) simultaneously.
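For example, applying a small bank of kernels to one input yields one feature map per kernel, which frameworks stack along a channel axis. A sketch with hand-picked kernels (all names here are illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid stride-1 2D convolution; illustrative helper."""
    H, W = image.shape
    K = kernel.shape[0]
    return np.array([[np.sum(image[i:i+K, j:j+K] * kernel)
                      for j in range(W - K + 1)]
                     for i in range(H - K + 1)])

image = np.arange(64, dtype=float).reshape(8, 8)

# A small bank of kernels, each tuned to a different feature.
kernels = {
    "vertical_edge":   np.array([[-1., 0., 1.]] * 3),
    "horizontal_edge": np.array([[-1., 0., 1.]] * 3).T,
    "blur":            np.ones((3, 3)) / 9.0,
}

# One feature map per kernel, stacked into (num_kernels, H', W').
feature_maps = np.stack([conv2d(image, k) for k in kernels.values()])
print(feature_maps.shape)  # (3, 6, 6)
```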
Common Pitfalls
Output size reduction
Without padding (and with stride 1), convolution reduces spatial dimensions: output_size = input_size - kernel_size + 1. Use padding='same' to keep the output the same size as the input.
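The size arithmetic can be captured in one helper. This generalizes the formula above with padding and stride (output_size is a hypothetical name, not a framework function):

```python
def output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial size of a convolution output along one dimension.

    With padding=0 and stride=1 this reduces to
    input_size - kernel_size + 1.
    """
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(output_size(32, 3))             # 30 (valid: 32 - 3 + 1)
print(output_size(32, 3, padding=1))  # 32 ('same' padding for a 3x3 kernel)
print(output_size(32, 3, stride=2))   # 15
```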
Kernel size must be ≤ input size
The kernel cannot be larger than the input in any dimension. This is a hard constraint of the convolution operation.