GPT architecture in a data science interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

What an interviewer actually wants

The question "walk me through GPT architecture" sounds open-ended, but the rubric behind it is narrow. The interviewer wants to hear five load-bearing claims in roughly this order: decoder-only stack, causal self-attention, Pre-LN with RMSNorm, rotary positional embeddings (RoPE), and SwiGLU in the feed-forward block. If you also bring up grouped-query attention (GQA) and a rough sense of parameter scaling, you are now answering at staff-DS level.

A good answer is not a monologue. Sketch a Transformer block on the whiteboard, name each component, and explain why the modern stack diverges from the 2017 paper. Choices that matter: why Pre-LN trains more stably than Post-LN in deep stacks (roughly 24 layers and up), why RoPE handles longer context better than learned positional embeddings, and why GQA has become the default in production models, from 7-8B open-weight models up to 70B and beyond.

Load-bearing trick: when an interviewer asks "what changed from the original Transformer to a modern LLM," answer in five bullets — decoder-only, Pre-LN, RMSNorm, RoPE, SwiGLU — and only then add GQA and the parameter math. That sequence matches the actual evolution.

Decoder-only and causal masking

GPT is a decoder-only Transformer: there is no encoder block, no cross-attention, just a stack of identical decoder layers. Each layer has two sublayers — a causal self-attention block and a position-wise feed-forward network — both wrapped in residual connections and layer norm.

The model is trained on next-token prediction. Given tokens [t1, t2, t3], the network produces a distribution P(t4 | t1, t2, t3), and the loss is cross-entropy against the actual t4. Because the same forward pass produces predictions for every position in parallel, you compute the loss at every position in a single sequence — that is what makes training data-efficient.
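The shift-by-one structure of that loss can be sketched in a few lines. This is a minimal illustration in plain Python, not a training loop: `next_token_loss` and the toy probabilities are made up for the example, and a real model would produce the distributions from logits.

```python
import math

def next_token_loss(probs_per_position, tokens):
    """Mean cross-entropy over all next-token predictions in one sequence.

    probs_per_position[i] is the (hypothetical) model distribution over the
    vocabulary after reading tokens[0..i]; its target is tokens[i + 1].
    """
    targets = tokens[1:]  # targets are the inputs shifted left by one
    losses = [
        -math.log(probs_per_position[i][target])
        for i, target in enumerate(targets)
    ]
    return sum(losses) / len(losses)

# Toy vocabulary of size 3 and a 3-token sequence: two prediction positions.
tokens = [0, 2, 1]
probs = [
    [0.1, 0.2, 0.7],    # after [0]: 0.7 on the true next token, 2
    [0.25, 0.5, 0.25],  # after [0, 2]: 0.5 on the true next token, 1
]
loss = next_token_loss(probs, tokens)
```

Every position except the last contributes one term, which is why a sequence of n tokens yields n - 1 training signals from a single forward pass.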

# Causal mask: token at position i can attend to positions 0..i only.
# Shape: (seq_len, seq_len), 1 = allowed, 0 = masked out.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Example for seq_len=4:
# [[1, 0, 0, 0],
#  [1, 1, 0, 0],
#  [1, 1, 1, 0],
#  [1, 1, 1, 1]]

The mask is applied to the attention scores before softmax: wherever the boolean mask is 0, the score is set to -inf (equivalently, an additive mask of 0 / -inf is added to the scores), so the softmax assigns zero weight to future tokens. This prevents the future-leak problem: without the mask, the model would see the answer it is supposed to predict and learn nothing.
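To make the effect on the softmax concrete, here is a small sketch in plain Python that masks one score matrix row by row. The helper names are illustrative, and a real implementation would do this on batched tensors.

```python
import math

def masked_softmax_row(scores, allowed):
    """Softmax over one row of attention scores; disallowed slots get -inf."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    row_max = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Apply a lower-triangular mask to a square score matrix, row by row."""
    n = len(scores)
    return [
        masked_softmax_row(scores[i], [j <= i for j in range(n)])
        for i in range(n)
    ]

scores = [[0.0, 5.0, 5.0],
          [1.0, 1.0, 9.0],
          [2.0, 2.0, 2.0]]
weights = causal_attention_weights(scores)
```

Row 0 puts all of its weight on position 0 even though the raw scores for positions 1 and 2 are larger, because those entries are masked before normalization.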

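An interviewer will sometimes ask whether the mask is needed at inference. The honest answer is: during token-by-token decoding with a KV-cache you can skip it, because the single new query is allowed to attend to everything already in the cache. You still need it when the prompt is processed in parallel (prefill), and in practice most implementations keep it throughout for code symmetry between training and serving.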

Layer normalization placement

This is where the modern stack diverges sharply from the 2017 paper. Three options exist, and a senior candidate names all three.

Post-LN was the original choice. The norm sits after the residual addition. For shallow networks (≤12 layers) it works, but at 24+ layers the gradient through the residual path explodes or vanishes depending on initialization, and you need a brittle warmup schedule to train at all.

Pre-LN moves the norm to before the attention and FFN sublayers. The residual stream is now norm-free, which means gradients flow cleanly through the skip path. GPT-2 already adopted this arrangement, and it is a large part of why GPT-3 could train at 96 layers without exotic tricks. Nearly every modern open-weight model uses Pre-LN or a close variant.
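The difference between the two wirings is easiest to see with the sublayer switched off. The sketch below uses plain lists, a stand-in norm without learned parameters, and a hypothetical `zero_sublayer`; it only illustrates where the norm sits, not a full block.

```python
def post_ln_step(x, sublayer, norm):
    """Original 2017 wiring: normalize after the residual addition."""
    return norm([xi + si for xi, si in zip(x, sublayer(x))])

def pre_ln_step(x, sublayer, norm):
    """Modern wiring: normalize only the sublayer input; skip path untouched."""
    return [xi + si for xi, si in zip(x, sublayer(norm(x)))]

def unit_rms(x, eps=1e-6):
    """Stand-in norm (no learned gain) so the example runs on plain lists."""
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    return [v / rms for v in x]

def zero_sublayer(x):
    """Hypothetical sublayer that outputs zeros, to isolate the skip path."""
    return [0.0 for _ in x]

x = [3.0, -4.0]
pre = pre_ln_step(x, zero_sublayer, unit_rms)    # skip path is the identity
post = post_ln_step(x, zero_sublayer, unit_rms)  # x is rescaled by the norm
```

With Pre-LN the input passes through unchanged, so stacking many layers keeps an identity path from the loss back to the embeddings. With Post-LN every layer rescales the stream, and the gradient has to pass through each of those norms.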

RMSNorm drops the centering step of layer norm: it only rescales by root-mean-square, with no mean subtraction. The accuracy difference is negligible, and RMSNorm is cheaper than full LayerNorm because it has one fewer reduction; the saving is often put at roughly 10-20% for the norm operation itself, though the exact figure depends on hardware and kernel fusion. Llama, Mistral, and most open models from 2023 onward ship RMSNorm.
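Side by side, the two normalizations look like this. It is a dependency-free sketch on a single vector; the function names and the `eps` default are choices made for the example.

```python
def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / (var + eps) ** 0.5 + b
            for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root-mean-square only; no mean, no bias."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [g * v / (mean_sq + eps) ** 0.5 for v, g in zip(x, gain)]

x = [1.0, 2.0, 3.0, 6.0]
ones, zeros = [1.0] * 4, [0.0] * 4
ln_out = layer_norm(x, ones, zeros)  # zero mean, unit variance
rms_out = rms_norm(x, ones)          # unit RMS, mean left in place
```

LayerNorm needs two statistics per vector (mean and variance) while RMSNorm needs one (mean of squares), which is the "one fewer reduction" in the text.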

Sanity check: if a candidate says "GPT uses layer norm" and stops there, probe further. The right answer in 2026 is "Pre-LN with RMSNorm, applied to the input of each sublayer."

Positional encodings: RoPE and ALiBi

Self-attention is permutation-equivariant: without a position signal (and ignoring the causal mask), "dog bites man" and "man bites dog" produce the same per-token hidden states, just in a different order. The original paper used fixed sinusoidal embeddings added to the input. GPT-2 and GPT-3 used learned absolute position embeddings, capped at 1024 and 2048 positions respectively. Both approaches share a serious weakness at scale: the model generalizes poorly, or not at all, to sequences longer than those it saw at training time.

RoPE (Rotary Position Embedding) addresses this by encoding position as a rotation in the embedding space. Specifically, the query and key vectors at position m are rotated by an angle proportional to m before the dot product is computed. The dot product of a rotated q at position m and a rotated k at position n depends only on m - n: the relative offset, not the absolute positions. Vanilla RoPE still degrades once you go well past the training length, but the relative formulation is what makes context-extension methods work: with position interpolation, NTK-aware scaling, or YaRN on top, a model trained with 4k context can be served at 8k or 16k with mild degradation and little or no fine-tuning.
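The relative-offset property can be checked numerically. The sketch below rotates consecutive pairs of a small vector in plain Python; `rope_rotate`, the base of 10000, and the toy vectors are assumptions for illustration, and real implementations differ in how they pair up dimensions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by position * theta."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)  # one frequency per pair, high to low
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

# Same relative offset (m - n = 3) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
# A different offset (m - n = 1) gives a different score.
score_other = dot(rope_rotate(q, 5), rope_rotate(k, 4))
```

`score_near` and `score_far` agree up to floating-point error, while `score_other` does not, which is the sense in which the attention score sees only m - n.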

ALiBi (Attention with Linear Biases) takes a different route: instead of modifying queries and keys, it adds a linear penalty to the attention scores proportional to the token distance. Each head gets a different slope. ALiBi was used in BLOOM and MPT and shows even stronger extrapolation than vanilla RoPE on very long context, but the field has converged on RoPE plus scaling tricks because RoPE composes more cleanly with KV-cache implementations.
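A minimal sketch of the ALiBi bias, assuming a power-of-two head count and the geometric slope schedule usually associated with the method. The function names are illustrative, and the causal mask is folded into the bias here purely for convenience.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for n heads (n a power of 2)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to the scores: -slope * distance to the key, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)         # 1/2, 1/4, ..., 1/256
bias = alibi_bias(4, slopes[0])  # head 0, slope 0.5
```

Heads with a steep slope focus on nearby tokens and heads with a shallow slope can look far back. Nothing in the bias depends on the training length, which is the intuition behind its extrapolation behavior.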

Attention variants: MHA, MQA, GQA

A modern interview will almost always probe the difference between multi-head attention and its production variants. The motivation is inference memory: at generation time, you cache the key and value tensors for every previous token so you do not recompute them. That KV-cache takes 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_value bytes, where the leading 2 covers keys and values and bytes_per_value is 2 in fp16. With plain MHA, num_kv_heads equals the number of query heads. At long context the cache dominates memory.
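The cache-size formula is simple enough to keep as a helper. The sketch below assumes a hypothetical 80-layer model with head dimension 128 and compares 64 KV heads against 8; the function and parameter names are made up for the example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values (the leading 2) for a batch."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch * bytes_per_value)

# Hypothetical 80-layer model, head_dim 128, 4096-token context, fp16.
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)
```

Under these assumptions the 64-head cache is exactly 10 GiB for a single 4096-token request and the 8-head cache is one eighth of that. Both grow linearly in `seq_len` and `batch`.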

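| Variant | Q heads | K/V heads | KV-cache size vs MHA | Quality | Used by |
|---|---|---|---|---|---|
| MHA (multi-head) | H | H | 1.0x | Best | GPT-2, GPT-3 |
| MQA (multi-query) | H | 1 | 1/H | Slight drop | PaLM, Falcon |
| GQA (grouped-query) | H | g | g/H | ~MHA | Llama 2 (70B), Llama 3, Mistral |

Here g is the number of KV groups, that is, the number of distinct K/V heads, with 1 < g < H.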

For a 70B model with 64 query heads and 8 KV groups, GQA cuts KV-cache memory by 8x versus MHA with little measurable quality loss on standard benchmarks. That is why GQA is the default at production scale: at 128k context it brings the fp16 cache for a single request from hundreds of gigabytes down to tens of gigabytes, which is what makes long-context serving practical before you reach for KV-cache quantization.
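How query heads share K/V heads can be sketched as a simple index mapping. This assumes contiguous grouping, which is the common layout; `kv_head_for_query_head` is a name invented for the example.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Index of the shared K/V head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# 64 query heads sharing 8 K/V heads: 8 query heads per group.
gqa_map = [kv_head_for_query_head(h, 64, 8) for h in range(64)]
# MQA and MHA are the two extremes of the same mapping.
mqa_map = [kv_head_for_query_head(h, 64, 1) for h in range(64)]
mha_map = [kv_head_for_query_head(h, 64, 64) for h in range(64)]
```

Only the K and V projections (and therefore the cache) shrink. The number of query heads, and with it the attention computation per token, stays the same.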

A rough parameter sketch helps too. For a transformer block with hidden size d, the breakdown is approximately:

| Component | Parameters | Share at d=4096 |
|---|---|---|
| Attention (QKV + output proj, MHA) | 4 * d^2 | ~33% |
| FFN (with SwiGLU, ~2.67 * d hidden) | ~8 * d^2 | ~66% |
| Norms | ~2 * d | <0.1% |
| Total per layer | ~12 * d^2 | 100% |

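For Llama-3 8B with d=4096 and 32 layers, that comes out to roughly 32 * 12 * 4096^2 ≈ 6.4B parameters in the blocks. The embedding and unembedding matrices add about 0.5B each (vocab size 128k * 4096 ≈ 0.52B), so roughly 1.05B when they are untied, as they are in Llama-3 8B, and half that when they are tied. The estimate lands a little low because the real model pairs GQA (smaller K/V projections) with a wider FFN (3.5d hidden), which puts the actual block total near 7.0B. Memorizing the back-of-envelope formula 12 * d^2 * L is what separates a candidate who has trained a model from one who has only read about them.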
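The rule of thumb can be compared against an exact count for a bias-free decoder. The sketch below is a rough calculator, and the configuration passed in at the bottom (8 KV heads, FFN hidden size 14336, vocabulary 128256, untied embeddings) is an assumption taken to resemble the released Llama-3 8B config; check it against the model card before quoting it.

```python
def rule_of_thumb_params(d_model, num_layers):
    """The 12 * d^2 * L estimate for the Transformer blocks."""
    return 12 * d_model ** 2 * num_layers

def decoder_params(d_model, num_layers, num_heads, num_kv_heads,
                   ffn_hidden, vocab_size, tied_embeddings=False):
    """Exact count for a bias-free decoder with GQA, SwiGLU and RMSNorm.

    Returns (parameters inside the blocks, parameters outside the blocks).
    """
    head_dim = d_model // num_heads
    kv_dim = num_kv_heads * head_dim
    attention = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q, O, then K, V
    ffn = 3 * d_model * ffn_hidden                            # gate, up, down
    norms = 2 * d_model                                       # two RMSNorm gains
    blocks = num_layers * (attention + ffn + norms)
    embeddings = vocab_size * d_model * (1 if tied_embeddings else 2)
    final_norm = d_model
    return blocks, embeddings + final_norm

estimate = rule_of_thumb_params(4096, 32)
# Assumed Llama-3-8B-like config values (not taken from the article).
blocks, outside = decoder_params(4096, 32, 32, 8, 14336, 128256)
```

Under these assumptions the rule of thumb gives about 6.44B, the exact block count is about 6.98B, and about 1.05B sits outside the blocks, for a total just over 8.03B.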

Activations and the FFN: SwiGLU

The feed-forward block is the larger half of every Transformer layer. The original recipe is FFN(x) = max(0, xW_1) W_2 (biases omitted), a two-layer MLP with ReLU and a 4x hidden expansion. The GPT series swapped ReLU for GELU, which smooths the activation around zero and gives a small but consistent quality bump.

The modern recipe is SwiGLU, a gated variant. It uses three weight matrices instead of two and computes SwiGLU(x) = (SiLU(xW_gate) ⊗ xW_up) W_down, where ⊗ is elementwise multiplication and SiLU(z) = z * sigmoid(z). The gating mechanism lets the network learn to suppress channels dynamically. To keep the parameter count comparable to a plain FFN, the hidden dimension is shrunk from 4d to roughly 2.67d (two thirds of 4d, because there are now three matrices instead of two).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden_mult: float = 2.67):
        super().__init__()
        hidden = int(dim * hidden_mult)
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up   = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

In the ablations that introduced it, SwiGLU beat GELU by a small but consistent perplexity margin at constant compute, which is why every Llama-family model uses it.
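The parameter-parity argument for the 2.67d hidden size can be verified with a quick count. The helper names below are illustrative and biases are ignored.

```python
def plain_ffn_params(dim, expansion=4):
    """Two matrices: dim -> expansion * dim -> dim (biases ignored)."""
    return 2 * dim * (expansion * dim)

def swiglu_params(dim, hidden_mult=2.67):
    """Three matrices (gate, up, down) at the reduced hidden width."""
    hidden = int(dim * hidden_mult)
    return 3 * dim * hidden

dim = 4096
ratio = swiglu_params(dim) / plain_ffn_params(dim)
```

At dim 4096 the two counts differ by roughly a tenth of a percent, so a SwiGLU block at 2.67d is a like-for-like swap for a 4d GELU block in terms of parameters.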

Common pitfalls

The first trap candidates fall into is conflating encoder-decoder and decoder-only. T5 and the original Transformer are encoder-decoder; BERT is encoder-only; GPT, Llama, Mistral, and Qwen are decoder-only. The error usually shows up as "GPT has cross-attention between encoder and decoder," which is wrong — there is no encoder, only causal self-attention. The fix is to draw the block on the board and label every arrow before you start narrating.

Another common slip is to claim that causal masking happens at the embedding layer. It does not. The mask is applied to the attention scores inside each self-attention sublayer, after the QK dot product and before the softmax. Embeddings are fully shared and unmasked — every token sees the same embedding table. Confusing where the mask lives suggests you have only read about Transformers, never implemented one.

A third pitfall is treating layer norm placement as a stylistic choice. Post-LN versus Pre-LN is a training-stability question. If you say "either works, just pick one" in an interview at a place that actually trains models, that is a flag. The right answer is: Post-LN is fine for shallow encoders like the original BERT, Pre-LN is mandatory past about 24 layers, and RMSNorm is a free speedup on top of Pre-LN.

The fourth pitfall is forgetting that the unembedding matrix is huge. Vocab size 128k times hidden size 4096 is about 524M parameters in a single matrix. Many candidates quote 8B for Llama-3 8B and act surprised when you point out that about 1B of those parameters live outside the transformer blocks, split between the input embedding and an untied output projection. If the interviewer probes parameter counts, name the embedding matrix and whether it is tied to the output projection.

The fifth trap is mixing up RoPE with the original sinusoidal embeddings. Sinusoids are added to the input embeddings once and then forgotten; RoPE rotates queries and keys inside every attention layer. They are not interchangeable, and the rotation property is precisely what gives RoPE its relative-position behavior.

If you want to drill ML-systems questions at this depth every day, NAILDD is launching with hundreds of data science interview problems across exactly this pattern.

FAQ

Is GPT-4 still a pure decoder-only Transformer?

The public information is thin: the architecture has not been published. Unconfirmed reports describe it as a mixture-of-experts decoder-only Transformer, in which each FFN block routes tokens to a small subset of expert sub-networks rather than computing a single dense FFN. The attention path is presumably still causal self-attention, though details such as whether it uses GQA are not public. So at the block level it would still be decoder-only, just with sparse FFNs. If an interviewer asks "what is the next big architectural shift after GPT-3," sparse MoE is the safe answer.

Why is RMSNorm preferred over LayerNorm in modern LLMs?

RMSNorm drops the mean-subtraction step and only divides by the root-mean-square of the activations. The accuracy difference at scale is statistically indistinguishable in most ablations, but you save one reduction kernel and a small amount of memory traffic per layer. At 100+ layers and trillions of tokens, that compounds into measurable wall-clock training savings. There is also a softer argument that mean-subtraction can interact badly with the residual stream in very deep networks, though the empirical evidence on that is mixed.

How does GQA differ from MQA at training time?

At training time both are cheap to adopt: you just reduce the number of K and V projections. The difference shows at inference: MQA shares a single KV head across all query heads (maximum compression, slight quality loss), while GQA gives each group of query heads its own shared KV head (tunable compression, quality near MHA). GQA was introduced in 2023 as a middle ground, because MQA's quality drop and occasional training instability, though small, were harder to accept in larger models.

How does the KV-cache scale with context length?

The KV-cache grows linearly. For a 70B model with GQA (8 KV groups, 128 head dim, 80 layers) in fp16, a single token costs roughly 2 * 80 * 8 * 128 * 2 bytes = 327 KB. At 128k context that is ~42 GB per request, which is why long-context serving needs GQA, quantized KV cache, or paged attention.

Is there a simple heuristic for how many parameters live in attention versus FFN?

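For a standard decoder block with hidden size d, attention contributes roughly 4 * d^2 parameters (Q, K, V, output projections) and SwiGLU FFN contributes roughly 8 * d^2 (three matrices at ~2.67d hidden). So attention is about a third of the block, FFN is two thirds, and norms are a rounding error. That ratio holds approximately across model scales from 1B to 400B (GQA shrinks the attention share somewhat) and is the answer you should give if asked "where do the parameters go in a Transformer." For compute the same split is a good first approximation at short context; at long context the attention-score computation, which grows with the square of sequence length, adds a term that has no parameters behind it.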