
Convolution (2D)

Category: Deep Learning
Difficulty: Intermediate
Time Complexity: O(H × W × K² × C)
Space Complexity: O(H × W + K²)

Convolution is the core operation in Convolutional Neural Networks (CNNs). A small matrix called a kernel (or filter) slides across a 2D input (such as an image), computing the element-wise product between the kernel and each overlapping patch of the input, then summing the products to produce a single output value. This process generates an output feature map that highlights where certain features (edges, textures, patterns) appear in the input. The key insight of convolution is weight sharing — the same kernel weights are reused across all spatial positions, dramatically reducing the number of parameters compared to a fully connected layer. Convolution was popularized by LeCun et al. (1989) for handwritten digit recognition and remains the backbone of modern computer vision.

{
  "input": [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [0, 1, 2, 3]
  ],
  "kernel": [
    [1, 0],
    [0, -1]
  ]
}
{
  "input": [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0]
  ],
  "kernel": [
    [-1, -1, -1],
    [-1, 8, -1],
    [-1, -1, -1]
  ]
}
function convolve2D(input, kernel):
    H, W = dimensions(input)
    kH, kW = dimensions(kernel)
    outH = H - kH + 1
    outW = W - kW + 1
    output = zeros(outH, outW)
    for row in range(outH):
        for col in range(outW):
            sum = 0
            for kr in range(kH):
                for kc in range(kW):
                    sum += input[row+kr][col+kc] * kernel[kr][kc]
            output[row][col] = sum
    return output
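The pseudocode translates directly into runnable Python. The sketch below applies it to both worked examples above (the 4×4 input with the 2×2 difference kernel, and the 5×5 input with the 3×3 edge-detection kernel):

```python
# Direct Python translation of the pseudocode (stride 1, no padding).
def convolve2D(inp, kernel):
    H, W = len(inp), len(inp[0])
    kH, kW = len(kernel), len(kernel[0])
    outH, outW = H - kH + 1, W - kW + 1
    output = [[0] * outW for _ in range(outH)]
    for row in range(outH):
        for col in range(outW):
            s = 0
            for kr in range(kH):
                for kc in range(kW):
                    s += inp[row + kr][col + kc] * kernel[kr][kc]
            output[row][col] = s
    return output

# First example from above: 4x4 input, 2x2 kernel -> 3x3 output.
inp = [[1, 2, 3, 0],
       [4, 5, 6, 1],
       [7, 8, 9, 2],
       [0, 1, 2, 3]]
kernel = [[1, 0],
          [0, -1]]
print(convolve2D(inp, kernel))
# → [[-4, -4, 2], [-4, -4, 4], [6, 6, 6]]

# Second example: 5x5 input, 3x3 edge-detection kernel -> 3x3 output.
inp2 = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
kernel2 = [[-1, -1, -1],
           [-1, 8, -1],
           [-1, -1, -1]]
print(convolve2D(inp2, kernel2))
# → [[5, 3, 5], [3, 0, 3], [5, 3, 5]]
```

Note that the edge-detection kernel produces 0 at the uniform interior and large values along the boundary of the block of ones, which is exactly the "highlight where features appear" behavior described above.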

The kernel slides across the input one position at a time (stride=1). At each position, it computes the dot product between the kernel and the overlapping input patch. This produces one value in the output feature map.

The same kernel weights are applied at every position in the input. This means the network detects the same feature regardless of where it appears — a property called translation equivariance.
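Translation equivariance can be checked directly: shifting the input shifts the feature map by the same amount. A small NumPy sketch (the `conv2d` helper is a minimal stride-1, no-padding implementation; the kernel and shift are arbitrary illustrative choices):

```python
import numpy as np

# Minimal 2D "convolution" (cross-correlation, as in the pseudocode):
# stride 1, no padding.
def conv2d(x, k):
    kH, kW = k.shape
    return np.array([[np.sum(x[r:r + kH, c:c + kW] * k)
                      for c in range(x.shape[1] - kW + 1)]
                     for r in range(x.shape[0] - kH + 1)])

k = np.array([[1.0, 0.0], [0.0, -1.0]])
x = np.zeros((6, 6))
x[1, 1] = 1.0                              # a lone "feature" at (1, 1)
shifted = np.roll(x, (2, 2), axis=(0, 1))  # same feature moved to (3, 3)

# Shifting the input shifts the feature map by the same offset:
equivariant = np.array_equal(conv2d(shifted, k),
                             np.roll(conv2d(x, k), (2, 2), axis=(0, 1)))
print(equivariant)  # → True
```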

Each kernel produces one feature map. Using multiple kernels (channels) allows the network to detect different features (edges, textures, shapes) simultaneously.
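In code, a multi-kernel layer is just one convolution per kernel, with the resulting feature maps stacked along a channel axis. A minimal sketch (the two 2×2 kernels here are hypothetical hand-picked examples; a real layer learns its kernel weights):

```python
import numpy as np

# Minimal single-channel convolution (cross-correlation), stride 1, no padding.
def conv2d(x, k):
    kH, kW = k.shape
    return np.array([[np.sum(x[r:r + kH, c:c + kW] * k)
                      for c in range(x.shape[1] - kW + 1)]
                     for r in range(x.shape[0] - kH + 1)])

x = np.arange(16, dtype=float).reshape(4, 4)

# Two illustrative 2x2 kernels detecting different local patterns.
kernels = [
    np.array([[1.0, 0.0], [0.0, -1.0]]),   # diagonal difference
    np.array([[0.0, 1.0], [-1.0, 0.0]]),   # anti-diagonal difference
]

# One feature map per kernel, stacked along a leading channel axis.
feature_maps = np.stack([conv2d(x, k) for k in kernels])
print(feature_maps.shape)  # → (2, 3, 3)
```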

  • Output size reduction: Without padding, convolution reduces spatial dimensions: output_size = input_size - kernel_size + 1. Use padding='same' to maintain dimensions.
  • Kernel size must be ≤ input size: The kernel cannot be larger than the input in any dimension. This is a hard constraint of the convolution operation.

Q1: What is the output size of convolving a 5×5 input with a 3×3 kernel (no padding, stride 1)?

  • A) 5×5
  • B) 4×4
  • C) 3×3
  • D) 2×2

Answer: C) 3×3

Output size = input_size - kernel_size + 1 = 5 - 3 + 1 = 3. So the output is 3×3.