CNN architectures in the DS interview

Why interviewers still grill you on CNNs

Computer vision is the one DL sub-field that every DS interview loop touches, even at companies that ship zero image models. CNN history is the cleanest narrative arc in deep learning, and it's a fast way to test whether a candidate knows the why behind architecture choices, not just the API surface. A hiring manager at Meta or Anthropic can spend ten minutes on Conv2d and walk away with a sharp signal on your engineering taste.

At junior level you'll get the mechanics: what a kernel does, why we pool, what padding controls. At mid and senior level, the bar shifts to evolution and trade-offs: why ResNet beats plain stacking past 30 layers, what compound scaling buys you over arbitrary depth bumps, when a Vision Transformer is the right call and when it's a footgun. Saying "I just use ResNet" without context reads as cargo-cult.

Load-bearing answer: the interviewer wants to hear that you pick architecture based on dataset size, compute budget, and latency target — never out of habit.

The field has shifted twice in the last five years. CNNs lost the throne to ViT around 2021, ConvNeXt clawed back parity in 2022, and Swin-style hybrids dominate most production CV stacks in 2026. Answer with 2019 vocabulary and you're telling the panel you stopped reading papers three jobs ago.

Convolution and pooling fundamentals

A convolution slides a small kernel across the input and, at each location, computes an element-wise product followed by a sum. Four hyperparameters control the behaviour: kernel_size (typically 3×3 or 5×5), stride (1 keeps resolution, 2 halves it), padding ("same" keeps spatial dims at stride 1), and dilation (spaced-out kernel taps for a wider receptive field at no parameter cost).
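To make the sliding-window description concrete, here is a minimal pure-Python sketch of a single-channel convolution (strictly a cross-correlation, which is what deep learning frameworks compute). The function name and the toy image are illustrative assumptions, and dilation is left out to keep it short.

```python
def conv2d_single_channel(image, kernel, stride=1, padding=0):
    """Naive 2D cross-correlation on one channel (lists of lists)."""
    k_h, k_w = len(kernel), len(kernel[0])

    # Zero-pad the image on all four sides.
    width = len(image[0]) + 2 * padding
    padded = [[0.0] * width for _ in range(padding)]
    for row in image:
        padded.append([0.0] * padding + list(row) + [0.0] * padding)
    padded += [[0.0] * width for _ in range(padding)]

    out_h = (len(padded) - k_h) // stride + 1
    out_w = (width - k_w) // stride + 1

    output = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            # Element-wise product of kernel and window, then sum.
            acc = 0.0
            for di in range(k_h):
                for dj in range(k_w):
                    acc += padded[i * stride + di][j * stride + dj] * kernel[di][dj]
            out_row.append(acc)
        output.append(out_row)
    return output


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
box_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # sums each 3x3 window

valid = conv2d_single_channel(image, box_kernel)            # [[54.0, 63.0], [90.0, 99.0]]
same = conv2d_single_channel(image, box_kernel, padding=1)  # 4x4, same size as the input
```

With no padding the 4×4 input shrinks to 2×2; padding=1 keeps it at 4×4, which is the "same" behaviour for a 3×3 kernel at stride 1.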

The output shape follows a formula every interviewer expects you to recite:

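H_out = floor((H_in + 2*padding - kernel_size) / stride) + 1

The floor matters whenever the stride does not divide evenly. With dilation d, substitute d*(kernel_size - 1) + 1 for kernel_size.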

Parameter count for a conv layer is (C_in × kernel_h × kernel_w + 1) × C_out, where the +1 is the bias. This matters because depthwise separable convs in MobileNet and EfficientNet factor that product into a per-channel spatial filter plus a 1×1 channel mixer, which cuts the parameters and FLOPs of a dense 3×3 by roughly 8-9× (close to 90%).
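The output-size and parameter-count formulas are easy to check by hand. The sketch below is one way to do it in plain Python; the helper names are hypothetical, and the dilation term follows the usual effective-kernel-size convention.

```python
def conv_output_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    """Spatial output size of a conv layer along one dimension."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (size_in + 2 * padding - effective_kernel) // stride + 1


def conv_params(c_in, c_out, kernel_size, bias=True):
    """Parameters of a dense conv: (C_in * k * k + 1) * C_out."""
    return (c_in * kernel_size * kernel_size + (1 if bias else 0)) * c_out


def depthwise_separable_params(c_in, c_out, kernel_size, bias=True):
    """Depthwise k x k per input channel, then a 1x1 pointwise conv."""
    b = 1 if bias else 0
    depthwise = (kernel_size * kernel_size + b) * c_in
    pointwise = (c_in + b) * c_out
    return depthwise + pointwise


# A 7x7 stride-2 stem on a 224-pixel input, padding 3.
stem_out = conv_output_size(224, kernel_size=7, stride=2, padding=3)  # 112

dense = conv_params(256, 256, 3)                     # 590080
separable = depthwise_separable_params(256, 256, 3)  # 68352
print(stem_out, dense, separable, round(dense / separable, 1))
```

At 256 channels the separable version needs about 8.6× fewer parameters than the dense 3×3, which is where the "close to 90%" figure comes from.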

Pooling is the non-learnable sibling. Max pooling keeps the strongest activation in a window, which preserves sharp features. Average pooling smooths. Global average pooling at the head of a network replaces the giant fully connected layer of older CNNs; GoogLeNet already used it in 2014, and it has been the modern default since ResNet popularized it in 2015.
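The difference between the three pooling variants is easiest to see on a small feature map. This is a rough pure-Python sketch; pool2d and global_average_pool are hypothetical helper names, not library functions.

```python
def pool2d(feature_map, window=2, stride=2, mode="max"):
    """Max or average pooling over one channel (list of lists)."""
    agg = max if mode == "max" else (lambda vals: sum(vals) / len(vals))
    out_h = (len(feature_map) - window) // stride + 1
    out_w = (len(feature_map[0]) - window) // stride + 1
    return [
        [
            agg([
                feature_map[i * stride + di][j * stride + dj]
                for di in range(window)
                for dj in range(window)
            ])
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def global_average_pool(channels):
    """Collapse each H x W channel to a single number."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in channels]


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 1, 7, 8],
]

max_pooled = pool2d(fmap, mode="max")  # [[4, 2], [1, 8]]
avg_pooled = pool2d(fmap, mode="avg")  # [[2.5, 1.0], [0.5, 6.5]]
gap = global_average_pool([fmap])      # [2.625]
```

Global average pooling turns a C × H × W tensor into a length-C vector with no learned weights, which is why a single linear layer after it is enough for classification.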

Receptive field is the slice of the input that influences a given neuron. It grows with depth, and dilated convolutions blow it up cheaply without inflating parameters — a trick that powers semantic segmentation networks like DeepLab.
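Receptive field growth can be computed layer by layer with a standard recurrence: each layer adds (kernel_size - 1) × dilation × the product of all earlier strides. The sketch below assumes a simple sequential stack; the function name is illustrative.

```python
def receptive_field(layers):
    """layers: (kernel_size, stride, dilation) tuples, ordered input to output."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel_size, stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


plain = receptive_field([(3, 1, 1)] * 3)                        # 7
dilated = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)])    # 15
with_pool = receptive_field([(3, 1, 1), (2, 2, 1), (3, 1, 1)])  # 8
```

Three plain 3×3 convs see a 7×7 patch, while the same three layers with dilations 1, 2, 4 see 15×15 at identical parameter cost.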

AlexNet, VGG, Inception

The pre-ResNet era produced three architectures interviewers still namecheck. AlexNet (2012) was the ImageNet breakout — 5 conv plus 3 FC layers, ReLU, dropout, dual-GPU. Citation value only in 2026; nobody ships it.

VGG-16/19 (2014) proved that uniform stacks of 3×3 convs plus 2×2 max-pool, made deep enough, beat exotic kernels. The lesson — depth matters more than kernel cleverness — outlived the architecture. Downside: 138M parameters and brutal inference cost.

Inception / GoogLeNet (2014) introduced parallel 1×1, 3×3, 5×5 convs plus max-pool, concatenated. The 1×1 conv served as a bottleneck before the expensive 5×5. Twenty-two layers, only 7M parameters. Inception v3 added factorized convs, label smoothing, batch norm; still shows up as a baseline on small custom datasets.

| Architecture | Year | Layers | Params | Modern role |
|---|---|---|---|---|
| AlexNet | 2012 | 8 | 60M | Historical reference only |
| VGG-16 | 2014 | 16 | 138M | Rarely; teaching tool |
| Inception v3 | 2015 | 48 | 24M | Occasional baseline |
| ResNet-50 | 2015 | 50 | 25M | Default CV baseline |
| EfficientNet-B3 | 2019 | - | 12M | Mobile and edge |
| Swin-B | 2021 | - | 88M | Production default |
| ConvNeXt-B | 2022 | - | 89M | CNN parity with ViT |

ResNet and residual connections

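Past roughly 30 layers, naive stacks of conv-BN-ReLU get harder, not easier, to train. The cause is not overfitting: the training error itself goes up. He et al. (2015) called this the degradation problem, attributed it to optimization difficulty rather than vanishing gradients (which batch norm already mitigates), and introduced the skip connection: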

y = F(x) + x

Here F is a small block of conv + batch norm + ReLU. The network learns the residual — the delta from identity — which is easier to optimise than the full mapping. This trick is so general it now lives inside every Transformer block too.
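One way to see why the residual form is easier to optimise: when F outputs zero, the block is exactly the identity, so extra depth can never hurt by construction. The toy sketch below uses small dense layers as a stand-in for conv + BN + ReLU; that substitution and all names are assumptions made for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, weights):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def plain_block(x, w1, w2):
    """F(x): two layers with a ReLU in between."""
    return linear(relu(linear(x, w1)), w2)


def residual_block(x, w1, w2):
    """y = F(x) + x"""
    return [f + xi for f, xi in zip(plain_block(x, w1, w2), x)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

plain_out = plain_block(x, zeros, zeros)        # [0.0, 0.0, 0.0]: the signal is gone
residual_out = residual_block(x, zeros, zeros)  # [1.0, -2.0, 3.0]: identity for free
```

A plain block has to learn weights that reproduce the identity; the residual block gets it at zero weights and only has to learn the correction.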

For deeper variants (ResNet-50/101/152) the bottleneck block compresses channels with 1×1, runs the 3×3, then expands back with 1×1:

1x1 conv (reduce C) -> 3x3 conv -> 1x1 conv (restore C) + skip
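A quick weight count shows what the bottleneck buys. The sketch below assumes 256 channels with a 4× reduction and bias-free convs (batch norm follows each conv); the helper names are hypothetical.

```python
def conv_weights(c_in, c_out, kernel_size):
    """Weights in a bias-free conv layer."""
    return c_in * c_out * kernel_size * kernel_size


def bottleneck_weights(channels, reduction=4):
    """1x1 reduce -> 3x3 -> 1x1 restore."""
    mid = channels // reduction
    return (
        conv_weights(channels, mid, 1)
        + conv_weights(mid, mid, 3)
        + conv_weights(mid, channels, 1)
    )


def basic_block_weights(channels):
    """Two full-width 3x3 convs, as in the shallower ResNets."""
    return 2 * conv_weights(channels, channels, 3)


bottleneck = bottleneck_weights(256)  # 69632
basic = basic_block_weights(256)      # 1179648
print(bottleneck, basic, round(basic / bottleneck, 1))
```

At this width the bottleneck block is about 17× cheaper than two full 3×3 convs, which is what makes 50+ layer variants affordable.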

Interview summary of why this still matters in 2026: ResNet-50 is the universal baseline, pre-trained ImageNet weights are everywhere, fine-tuning is cheap, and skip connections seeded the architectural DNA of DenseNet, U-Net, and every Transformer encoder. Recruiters at DoorDash or Stripe expect you to start any new CV task with "I'd try a ResNet-50 baseline first" and have a reason if you don't.

EfficientNet and compound scaling

Tan and Le (2019) asked a simple question: when you scale a CNN, you can grow depth, width, or input resolution — but how should you balance them? Their answer was compound scaling, tying all three to one parameter φ:

depth: d = alpha^phi
width: w = beta^phi
resolution: r = gamma^phi
alpha * beta^2 * gamma^2 ~ 2

EfficientNet-B0 through B7 are the same network, nominally at φ = 0, 1, ..., 7. The backbone block is MBConv: a mobile inverted bottleneck with depthwise separable convs and squeeze-and-excitation. EfficientNet-B3 reaches higher ImageNet accuracy than ResNet-50 with roughly half the parameters (12M versus 25M) and less than half the FLOPs.
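The scaling rule is simple enough to compute directly. The sketch below uses α = 1.2, β = 1.1, γ = 1.15, the grid-searched values reported by Tan and Le; the function name is an assumption, and released B1 to B7 configurations may use rounded or hand-adjusted values, so treat the output as illustrative.

```python
import math

# Coefficients reported in the EfficientNet paper (found by grid search on B0).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Return (depth, width, resolution, approx FLOPs) multipliers for phi."""
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    # FLOPs scale linearly with depth, quadratically with width and resolution.
    flops = (alpha * beta ** 2 * gamma ** 2) ** phi
    return depth, width, resolution, flops


constraint = ALPHA * BETA ** 2 * GAMMA ** 2  # about 1.92, i.e. roughly 2

for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, res x{r:.2f}, FLOPs x{f:.2f}")
```

Because the constraint is close to 2, each increment of φ roughly doubles the FLOPs budget while keeping depth, width, and resolution in balance.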

Gotcha: parameter count is not latency. EfficientNet-B0 has 5M params but its depthwise convs are memory-bandwidth-bound on GPUs, so on an A100 a ResNet-50 with 25M params often serves requests faster. Always benchmark before believing the parameter chart.

In 2026 EfficientNet remains the default for mobile and edge workloads — Apple's on-device vision pipelines and most Android camera ML use MBConv-style backbones. On servers with abundant GPU, ViT and Swin tend to win on accuracy at comparable cost.

ViT, Swin and ConvNeXt

The Vision Transformer (Dosovitskiy et al., 2020) took the NLP Transformer encoder and pointed it at images. Recipe: chop the input into non-overlapping 16×16 patches, linearly embed each into a token, add positional encodings, run a stack of standard Transformer blocks. No convolution anywhere.
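The patching step is plain bookkeeping, sketched below in pure Python. The 224-pixel RGB input is an assumed, common configuration, and the helper names are hypothetical; the learned linear embedding and positional encodings are omitted.

```python
def vit_token_shape(image_size=224, patch_size=16, channels=3):
    """Return (number of patch tokens, raw values per patch before embedding)."""
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side, patch_size * patch_size * channels


def patchify(image, patch_size):
    """Split an H x W single-channel image into flattened non-overlapping patches."""
    patches = []
    for top in range(0, len(image), patch_size):
        for left in range(0, len(image[0]), patch_size):
            patches.append([
                image[top + di][left + dj]
                for di in range(patch_size)
                for dj in range(patch_size)
            ])
    return patches


tokens, patch_dim = vit_token_shape()  # (196, 768)

tiny = [[r * 4 + c for c in range(4)] for r in range(4)]  # values 0..15
tiny_patches = patchify(tiny, 2)
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

A 224×224 image becomes a sequence of 196 tokens, and self-attention cost grows with the square of that count, which is the scaling issue Swin later addresses.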

The key trade-off is inductive bias. CNNs hard-code locality and translation equivariance — ViT learns these from data. That makes ViT data-hungry: the original paper used JFT-300M (300 million labeled images) to beat CNNs. Fine-tuning a pre-trained ViT works great; training from scratch on 50k images is malpractice.

Swin Transformer (2021) patched the practical issues: hierarchical feature maps like a CNN, attention restricted to local windows, shifted windows between layers to mix information across boundaries. The result is linear complexity in image size instead of quadratic, and a backbone that drops into detection and segmentation heads without surgery. In 2026, Swin-B and Swin-L are default backbones in most production CV stacks.
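The linear-versus-quadratic claim can be checked with the cost expressions given in the Swin paper for global and windowed self-attention. The sketch below assumes a 56×56 feature map with 96 channels and 7×7 windows; function names are illustrative.

```python
def global_attention_flops(h, w, c):
    """Global multi-head self-attention: 4*hw*C^2 + 2*(hw)^2*C."""
    hw = h * w
    return 4 * hw * c * c + 2 * hw * hw * c


def window_attention_flops(h, w, c, window=7):
    """Window attention: 4*hw*C^2 + 2*M^2*hw*C for M x M windows."""
    hw = h * w
    return 4 * hw * c * c + 2 * window * window * hw * c


small_global = global_attention_flops(56, 56, 96)
small_window = window_attention_flops(56, 56, 96)
large_global = global_attention_flops(112, 112, 96)
large_window = window_attention_flops(112, 112, 96)

print(f"global: x{large_global / small_global:.1f} when the side length doubles")
print(f"window: x{large_window / small_window:.1f} when the side length doubles")
```

Doubling the side length quadruples the pixel count; window attention cost grows by exactly that factor of 4, while global attention grows by far more because of the (hw)² term.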

ConvNeXt (2022) is the empire striking back: a pure CNN redesigned with Swin's vocabulary (large 7×7 kernels, LayerNorm instead of BN, GELU instead of ReLU, fewer activations per block). Matches Swin on ImageNet accuracy with no attention — a great answer to "is CNN dead?"

Picking the right backbone

This is the senior-level question on most DS loops. There's no single right answer, but there is a defensible decision tree based on dataset size, compute, and deployment target.

| Dataset size or target | Compute | Recommended backbone |
|---|---|---|
| < 50k images | Any | Pre-trained ResNet-50, freeze early layers |
| 100k - 1M | Single GPU | EfficientNet-B3 or ResNet-50 |
| 100k - 1M | Multi-GPU | Swin-T or Swin-B |
| 1M+ | Cluster | ViT-L, Swin-L, ConvNeXt-L |
| Mobile / edge | < 1 GFLOPs | MobileNetV3, EfficientNet-B0, MobileViT |
| Real-time detection | GPU | YOLOv8/v9 backbone, RT-DETR |

For small datasets the inductive bias of CNNs still wins, and a pre-trained ResNet with fine-tuned head is the strongest baseline you can ship in an afternoon. For 1M+ images, self-supervised pre-training (DINO, MAE) on top of a ViT or Swin nearly always adds 2-4 accuracy points.

Real-time object detection in 2026 still leans CNN-heavy because latency budgets favor the cleaner memory access patterns of conv stacks over attention.
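The decision table in this section can be written down as a small function, which is a handy way to rehearse the reasoning out loud. This is only a sketch: the function name, argument values, and the handling of the 50k to 100k range (not covered by the table, treated here as small data) are assumptions.

```python
def pick_backbone(num_images, compute="single_gpu", target="server"):
    """Suggest a starting backbone from dataset size, compute, and deployment target."""
    # Deployment constraints override dataset size.
    if target == "edge":
        return "MobileNetV3, EfficientNet-B0, or MobileViT"
    if target == "realtime_detection":
        return "YOLOv8/v9 backbone or RT-DETR"

    if num_images < 100_000:
        # Assumption: the 50k-100k gap is handled like the small-data row.
        return "Pre-trained ResNet-50, early layers frozen"
    if num_images < 1_000_000:
        # Assumption: a cluster is treated like multi-GPU in this range.
        if compute in ("multi_gpu", "cluster"):
            return "Swin-T or Swin-B"
        return "EfficientNet-B3 or ResNet-50"
    return "ViT-L, Swin-L, or ConvNeXt-L"


print(pick_backbone(20_000))
print(pick_backbone(400_000, compute="multi_gpu"))
print(pick_backbone(5_000_000, compute="cluster"))
print(pick_backbone(400_000, target="edge"))
```

The ordering matters: latency and hardware constraints are checked first, because no amount of data makes a ViT-L fit on a phone.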

Common pitfalls

Picking VGG in 2026 is the canonical red flag. Junior candidates sometimes name it because that's what their bootcamp covered, but the network is 138M parameters of waste — ResNet-50 beats it on every axis with one-fifth the storage. If you mention VGG, only do so in the context of "I'd use ResNet-50 instead because…"

Training a ViT from scratch on a small dataset is the senior-candidate trap. Because ViT lacks the locality bias of CNNs, it needs hundreds of thousands of images minimum, and ideally millions, before it stops underfitting. The fix is always to start from an ImageNet- or JFT-pretrained ViT and fine-tune. If the interviewer pushes back on data scarcity, your answer should be "I'd use a CNN-based architecture, or a pre-trained ViT — not a ViT trained from scratch."

Skipping pre-trained weights is another common error, especially among candidates from tabular ML. Unless your domain is truly alien (medical microscopy, satellite spectra in unusual bands), ImageNet pre-training is free accuracy. Loading torchvision.models.resnet50(weights='IMAGENET1K_V2') and fine-tuning beats training from scratch on virtually any natural-image task with under a million examples.

Forgetting augmentation is a quiet killer. The modern stack — random crop, horizontal flip, color jitter, RandAugment or AutoAugment, MixUp or CutMix — typically adds 2-5 points to top-1 accuracy. Designing experiments without specifying augmentations gets you marked down for incomplete engineering.

Mishandling batch norm during fine-tuning trips up even experienced practitioners. When you fine-tune with a tiny batch size (8 or 16), BN running statistics get noisy. The fix is either to freeze BN layers (bn.eval()) or to switch to GroupNorm. This batch-size sensitivity is one reason ConvNeXt, like Swin, uses LayerNorm.

Treating parameter count as latency is the EfficientNet trap. On a GPU, FLOPs and memory bandwidth dominate; on CPU or mobile NPU, parameter count and quantization-friendliness matter more. Always benchmark on the target hardware.

Using a large fully connected layer at the head is a 2014-era reflex. Global average pooling plus a single linear projection is the modern default.

If you want to drill questions like these every day with hint-and-explain feedback, NAILDD is launching with 1500+ DS interview problems across exactly this pattern.

FAQ

What is a 1×1 convolution actually doing?

A 1×1 conv has no spatial extent — it only mixes channels at each pixel. It's used for cheap channel bottlenecks (Inception, ResNet) and cross-channel feature mixing without the cost of a 3×3. The MBConv block in EfficientNet uses 1×1 expansion-projection pairs around a depthwise 3×3 for the same reason.
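Seen as code, a 1×1 convolution is just the same small matrix multiply applied independently at every pixel. The sketch below uses nested lists in H × W × C layout; the layout and the function name are assumptions for illustration, and bias is omitted.

```python
def conv1x1(feature_map, weights):
    """feature_map: H x W x C_in nested lists; weights: C_out x C_in."""
    return [
        [
            [sum(w * x for w, x in zip(out_row, pixel)) for out_row in weights]
            for pixel in row
        ]
        for row in feature_map
    ]


# One row, two pixels, three input channels.
fmap = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]

# Two output channels: copy channel 0, and average channels 0 and 1.
weights = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

mixed = conv1x1(fmap, weights)  # [[[1.0, 1.5], [4.0, 4.5]]]
```

The spatial size is untouched and only the channel count changes (3 to 2 here), which is exactly the bottleneck role it plays in Inception and ResNet.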

Batch norm versus layer norm — which when?

BN normalizes each channel across the batch dimension, computing mean and variance over all samples in the mini-batch. It works well for CNNs trained with large batches and is the historical default. LN normalizes across the feature dimension within a single sample, which makes it batch-size-invariant and well suited to Transformers. Since 2022 the trend has been to use LayerNorm or GroupNorm even in CNNs (ConvNeXt) to sidestep small-batch issues.
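The difference comes down to which axis the statistics are computed over. The sketch below works on a batch of flat feature vectors and leaves out the learnable scale and shift and BN's running statistics; the function names are illustrative.

```python
def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def batch_norm(batch):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [normalize(col) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [normalize(sample) for sample in batch]


batch = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
]
bigger_batch = batch + [[10.0, 0.0, -5.0]]

# LN output for a sample does not depend on what else is in the batch.
ln_same = layer_norm(batch)[0] == layer_norm(bigger_batch)[0]  # True

# BN output for the same sample changes when the batch changes.
bn_same = batch_norm(batch)[0] == batch_norm(bigger_batch)[0]  # False
```

That dependence on batch contents is the root of BN's small-batch problems.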

How do I pick a learning rate for a CNN?

For training from scratch, the modern recipe is linear warmup over the first 5-10% of steps, then cosine decay to near zero. Peak learning rate scales with batch size — roughly lr = 0.1 × batch_size / 256 for SGD on ImageNet-style tasks. For fine-tuning, drop the peak by one to two orders of magnitude (1e-4 to 1e-5 range) with discriminative learning rates where lower layers train slower than the head.
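The warmup-plus-cosine recipe fits in a few lines. The sketch below is one common formulation under stated assumptions (5% warmup, decay to zero, per-step updates); the function names are hypothetical and real training code would normally use the framework's scheduler.

```python
import math


def peak_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: lr = 0.1 * batch_size / 256."""
    return base_lr * batch_size / base_batch


def lr_at_step(step, total_steps, peak, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup to `peak`, then cosine decay to `min_lr`."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * progress))


peak = peak_lr(1024)  # 0.4
schedule = [lr_at_step(s, 1000, peak) for s in range(1001)]
print(schedule[0], schedule[49], schedule[525], schedule[1000])
```

For fine-tuning, the same schedule shape applies with a much smaller peak, typically 1e-4 to 1e-5 as noted above.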

Reflect padding or zero padding?

Zero padding is the default and works for the vast majority of natural-image tasks. Reflect padding helps when the model is sensitive to edge artifacts — denoising, super-resolution, scientific imaging. Honest interview answer: start with zero, switch if you see edge artifacts in qualitative analysis.
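A one-dimensional example shows what each mode feeds the kernel at the border. This is a minimal sketch with a hypothetical helper; the reflect variant mirrors around the edge value without repeating it, which matches the usual "reflect" convention.

```python
def pad_1d(row, pad, mode="zero"):
    """Pad a 1D sequence on both sides with zeros or a mirrored copy."""
    if mode == "zero":
        return [0] * pad + list(row) + [0] * pad
    if mode == "reflect":
        if pad >= len(row):
            raise ValueError("reflect padding must be smaller than the input")
        left = [row[i] for i in range(pad, 0, -1)]
        right = [row[-1 - i] for i in range(1, pad + 1)]
        return left + list(row) + right
    raise ValueError(f"unknown mode: {mode}")


row = [1, 2, 3, 4]
zero_padded = pad_1d(row, 2, mode="zero")        # [0, 0, 1, 2, 3, 4, 0, 0]
reflect_padded = pad_1d(row, 2, mode="reflect")  # [3, 2, 1, 2, 3, 4, 3, 2]
```

Zero padding introduces an artificial dark border, while reflect padding continues the local signal, which is why it helps on tasks where edge pixels matter.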

ResNet-50 or EfficientNet-B0 for production?

On mobile or edge with strict FLOPs budgets, EfficientNet-B0 is the better pick: roughly 10× fewer FLOPs and 5× fewer parameters at comparable ImageNet accuracy. On a server with GPU, ResNet-50 is often faster in practice because dense convs map better to GPU memory layouts than depthwise separable convs, and the ecosystem (CUDA kernels, TensorRT plugins) is more mature. Benchmark on your hardware before deciding.

Is the answer to this interview always "use a Transformer"?

No, and defaulting to "use a Transformer" flags you as someone who reads blog posts but not benchmarks. For datasets under 1M images, CNNs with strong pre-training and augmentation routinely match or beat ViT. For latency-sensitive real-time detection, CNNs still dominate. The sophisticated answer: architecture choice is dictated by data scale, compute budget, and latency target.