NLP for the data science interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

What interviewers actually ask

NLP shows up in nearly every modern data science loop, and the depth of the questioning scales with the role. A generalist DS candidate at Stripe or DoorDash will get subword tokenization, embeddings, and one BERT classification scenario. An NLP specialist at Meta or Anthropic will get pushed into attention math, fine-tuning trade-offs, and evaluation design. An applied LLM engineer at OpenAI or Notion will get prompt engineering, RAG pipeline failure modes, and alignment.

The load-bearing trick across all three levels is the same: the interviewer wants you to map a business task to the right method, not recite paper titles. If somebody asks "how would you classify support tickets by intent at 10M tickets a day?" the wrong answer is "I'd use GPT-4". The right answer walks through a fine-tuned encoder (DistilBERT or a small RoBERTa) for ~$200/month inference, with reasoning about latency and label budget.

Load-bearing trick: Memorize the method evolution (bag-of-words → word2vec → BERT → GPT) and the task matrix (classification, NER, QA, summarization). If you can draw both on the whiteboard inside two minutes, you have already passed the NLP screen at most companies.

Tokenization

Text becomes a sequence of integer IDs before any model sees it. The three families you need to name:

Word-level treats each whitespace-delimited word as its own token. Vocabulary balloons past 1M for any non-toy corpus, and any unseen word at inference time becomes the dreaded <UNK> token. Nobody ships this in 2026 outside of legacy systems.

Character-level treats each character as a token. Vocabulary stays tiny — a few hundred symbols — but sequences become 5-10x longer, which kills self-attention compute (it is quadratic in sequence length). Used mostly in niche settings like protein sequences or noisy OCR text.

Subword is what every modern model uses. Frequent words stay whole; rare words get split into recognizable sub-pieces. The three flavors that come up in interviews:

Tokenizer | Algorithm | Used by | Vocab size
BPE | Iteratively merge most-frequent byte pairs | GPT-2/3/4, LLaMA | 32k-128k
WordPiece | Merge by likelihood gain, not raw frequency | BERT, DistilBERT | ~30k
SentencePiece | Treats raw text as a stream, no pre-tokenization | T5, mBART, multilingual | 32k-256k

The interview question is almost always "why subword?" The answer has three parts: it solves OOV (a new word like "tokenomics" splits into token + ##omics), it shrinks the vocabulary by 10-30x vs word-level, and it handles morphologically rich languages where a stem combines with dozens of suffixes.
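
To make the OOV argument concrete, here is a minimal sketch of BPE training and encoding on a toy corpus. It works on characters rather than bytes and skips end-of-word markers, and the function names (`learn_bpe_merges`, `merge_pair`, `encode`) are illustrative, not any library's API.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn an ordered list of merges from a list of words (toy, character-level)."""
    vocab = Counter(tuple(w) for w in words)  # word as symbol tuple -> frequency
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def encode(word, merges):
    """Split a word into subword pieces by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe_merges(corpus, num_merges=4)
print(encode("lowest", merges))  # ['low', 'est'], although "lowest" never appears in the corpus
```

The unseen word falls apart into pieces the model has already seen, which is the whole point of subword vocabularies. Production tokenizers add byte-level fallback, pre-tokenization rules, and special tokens on top of this loop.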

The embedding evolution

This is the question that separates candidates who read one blog post from those who actually understand the field. The method timeline matters because each step fixed a specific failure of the previous one.

Era | Method | Core idea | Killer limitation
Pre-2013 | Bag-of-words / TF-IDF | Count words, weight by rarity | No notion of meaning; "great" and "excellent" are orthogonal
2013-2014 | word2vec / GloVe | Dense vector per word from co-occurrence | One vector per word; "bank" means river-bank and money-bank simultaneously
2018 | ELMo, BERT | Contextual embeddings from a deep encoder | Bidirectional, but expensive; not generative
2018-now | GPT family | Causal decoder, scale-driven emergent abilities | Costly per-token; weaker on pure classification than a fine-tuned encoder
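
The "orthogonal" limitation in the bag-of-words row can be checked directly. This is a small sketch with a three-word vocabulary; the helper names `bow_vector` and `cosine` are made up for the illustration.

```python
import math
from collections import Counter

def bow_vector(text, vocab):
    """Count-based bag-of-words vector over a fixed vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["excellent", "great", "movie"]

# Synonyms share no dimension, so their similarity is exactly zero.
sim_words = cosine(bow_vector("great", vocab), bow_vector("excellent", vocab))

# Two near-paraphrases only get credit for the literally shared word "movie".
sim_docs = cosine(bow_vector("great movie", vocab), bow_vector("excellent movie", vocab))

print(sim_words)            # 0.0
print(round(sim_docs, 3))   # 0.5
```

Dense embeddings exist to close exactly this gap: they place "great" and "excellent" near each other instead of on separate axes.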

The classic word2vec demonstration, vector("king") - vector("man") + vector("woman") ≈ vector("queen"), is still cited in interviews, but most production NLP stacks today use contextual embeddings, where the vector for "bank" depends on whether the surrounding tokens say "river" or "deposit". Plain word2vec survives as a fast baseline for retrieval and as a teaching example.

If you can articulate why contextual beats static embeddings in three sentences, you have a leg up on most candidates.

BERT and encoder models

BERT (Bidirectional Encoder Representations from Transformers) is encoder-only. Pretraining is Masked Language Modeling: ~15% of tokens are selected for prediction (in the original recipe, 80% of those are replaced with [MASK], 10% with a random token, and 10% are left unchanged), and the model learns to predict the originals from both sides of context. The original paper also used Next Sentence Prediction, but RoBERTa showed that dropping NSP did not hurt downstream results, and most later variants drop or replace it.
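
A rough sketch of the masking step may help here. It follows the original BERT recipe, in which about 15% of positions are selected and, of those, 80% become the mask token, 10% become a random token, and 10% stay unchanged. The function name, the token ids, and the use of -100 as an "ignore this position" label are assumptions for the illustration (-100 is the default ignore index of PyTorch's cross-entropy loss), and special tokens are not excluded as a real implementation would do.

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_id, select_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    rng = rng or random.Random(0)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions the loss should skip
    for i, tok in enumerate(token_ids):
        if rng.random() >= select_prob:
            continue  # position not selected: input unchanged, no label
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels

# Hypothetical ids, chosen only for the demo.
example_ids = [2023, 2003, 1037, 7099, 6251, 2005, 1996, 4013]
inputs, labels = mask_for_mlm(example_ids, vocab_size=30000, mask_id=103)
```

Only positions with a real label contribute to the loss, which is why the model sees bidirectional context everywhere but is graded on roughly one token in seven.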

Encoders shine on tasks where you need a representation of an entire span and you can afford bidirectional attention:

  • Text classification with a small head on the [CLS] token (sentiment, intent, spam).
  • Named entity recognition with a per-token classification head.
  • Extractive QA with two heads predicting answer-start and answer-end spans.
  • Sentence-level embeddings via Sentence-BERT for retrieval and clustering.

The interview question is "why bidirectional?" The answer: understanding the word "bank" benefits from both the left context ("the river") and the right context ("was flooding"). A pure decoder like GPT only ever sees left context, and it tends to end up weaker on classification at a fixed parameter budget.

GPT and decoder models

GPT (Generative Pre-trained Transformer) is decoder-only with causal masking — each token attends only to previous tokens. Training objective: predict the next token. The scale arc is worth memorizing because interviewers love it:

Model | Year | Parameters | Context window (tokens)
GPT-1 | 2018 | 117M | 512
GPT-2 | 2019 | 1.5B | 1,024
GPT-3 | 2020 | 175B | 2,048
GPT-4 | 2023 | undisclosed (unconfirmed reports suggest ~1T+, mixture-of-experts) | 8k-128k
GPT-4o / Claude 3.5 / Gemini 1.5 | 2024 | undisclosed | 128k-2M
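
The causal masking and next-token objective described at the top of this section can be sketched in a few lines. Both helper names are made up for the illustration, and real implementations build the mask as a tensor and apply it inside attention rather than materializing Python lists.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def next_token_pairs(token_ids):
    """(context, target) pairs implied by the next-token prediction objective."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

for row in causal_mask(3):
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]

print(next_token_pairs([5, 6, 7]))  # [([5], 6), ([5, 6], 7)]
```

The lower-triangular mask is what lets one forward pass train on every prefix of the sequence at once.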

Decoders fit naturally for generation, few-shot in-context learning, agentic tool use, and conversational interfaces. The trap candidates fall into: assuming "bigger model = always better". For a single-domain classification task with 100k labeled examples, a fine-tuned DistilBERT will typically match or beat GPT-4 on accuracy while winning clearly on latency and cost.

Gotcha: "We need to classify 50M support emails by topic" is not a job for GPT-4. The right answer is a fine-tuned encoder with ~$0.0001 per inference, not a frontier model at ~$0.01 per inference. Pick the model that fits the loop's economics, not the one that sounds impressive.
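
The economics are worth working out on the whiteboard. A back-of-the-envelope sketch using the two per-inference figures quoted above (both are the article's rough estimates, not measured prices):

```python
def total_cost(num_items, cost_per_inference):
    """Total spend in dollars for one pass over the dataset."""
    return num_items * cost_per_inference

emails = 50_000_000
encoder_cost = total_cost(emails, 0.0001)  # fine-tuned encoder, about $5,000
frontier_cost = total_cost(emails, 0.01)   # frontier model, about $500,000
ratio = frontier_cost / encoder_cost       # about 100x

print(f"encoder:  ${encoder_cost:,.0f}")
print(f"frontier: ${frontier_cost:,.0f}")
print(f"ratio:    {ratio:.0f}x")
```

A two-orders-of-magnitude gap on a single labeling pass is usually enough to end the discussion before accuracy even comes up.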

Fine-tuning vs prompt engineering

Once you have a pretrained model, you have two roads to a task-specific system.

Fine-tuning updates model weights on your labeled data. The three variants:

  • Full fine-tuning updates all parameters. Best quality, highest GPU cost. Practical up to ~10B parameters on a single A100/H100 node.
  • LoRA / adapters insert small trainable matrices and freeze the base weights. ~0.1-1% of parameters trained, near-full-FT quality on most tasks. The default for any model over 7B.
  • Prompt tuning / prefix tuning trains a small soft-prompt embedding and freezes everything else. Cheapest, weakest, useful for very narrow tasks.
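
To see where the small trainable fraction for LoRA comes from, here is a minimal sketch for a single weight matrix. It assumes the usual LoRA formulation, in which the frozen weight W gets a low-rank update scaled by alpha / rank, with B initialized to zero. The helper names and the 4096-wide layer are illustrative, and the fraction for a whole model depends on which matrices receive adapters.

```python
def lora_param_counts(d_in, d_out, rank):
    """Parameters in the full matrix vs. in the two LoRA factors A (rank x d_in) and B (d_out x rank)."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora

def matvec(matrix, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(full, lora, f"{100 * lora / full:.2f}%")  # 16777216 65536 0.39%

# With B initialized to zero, the adapted layer starts out identical to the base layer.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]      # rank 1: shape (1, 2)
B = [[0.0], [0.0]]    # shape (2, 1), zero-initialized
y = lora_forward(W, A, B, [1.0, 2.0], alpha=16, rank=1)
print(y)  # [1.0, 2.0]
```

The zero initialization matters in practice: fine-tuning starts from exactly the pretrained behavior and only drifts as B is updated.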

Prompt engineering leaves the model frozen and changes the input:

  • Zero-shot: describe the task in the prompt, no examples.
  • Few-shot: include 2-8 worked examples (in-context learning).
  • Chain-of-thought: ask the model to "think step by step", which improves reasoning on math, multi-hop QA, and code.
  • RAG (Retrieval-Augmented Generation): retrieve relevant passages from a vector DB and stuff them in the prompt. The default architecture for any question-answering product touching domain documents.
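
The retrieve-then-stuff pattern from the RAG bullet can be sketched without any infrastructure. In this toy version a word-overlap score stands in for the embedding similarity a vector DB would compute, and the function names, prompt wording, and passages are all invented for the illustration.

```python
def overlap_score(question, passage):
    """Toy relevance score: number of lowercase words shared by question and passage."""
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p)

def build_rag_prompt(question, passages, k=2):
    """Pick the k best-scoring passages and place them in the prompt ahead of the question."""
    top = sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)[:k]
    context = "\n".join(f"- {p}" for p in top)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

passages = [
    "Refunds are issued within 5 business days.",
    "Our office is closed on public holidays.",
    "Refunds require the original receipt.",
]
prompt = build_rag_prompt("How long do refunds take?", passages, k=2)
print(prompt)
```

The frozen model never changes; only the context it is handed does, which is why RAG copes well with facts that change faster than you could retrain.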

The interview question is "when do you fine-tune vs prompt?" and the answer has three rows:

Choose | When
Fine-tuning | Stable schema, ≥1k labeled examples, latency or unit-cost matters, narrow domain
Prompting + RAG | Frequently changing facts, no labels yet, latency budget allows 1-3s, broad domain
Hybrid (small FT + RAG) | Production NLP at scale, ~2024-2026 industry default

Task-to-method matrix

This is the cheat sheet interviewers expect you to draw from memory. The matrix matches the canonical NLP tasks to the model family that fits, plus the evaluation metric you would actually report.

Task | Best fit | Why | Standard metric
Text classification | Fine-tuned encoder (BERT, RoBERTa, DistilBERT) | Bidirectional context, cheap inference, fixed-shape output | F1 (macro for imbalance), PR-AUC
NER | Encoder + token classification head | Per-token labels, span boundaries matter | F1 over spans (exact-match)
Extractive QA | Encoder with start/end span heads | Answer lives inside the passage, no generation needed | Exact Match, F1 over tokens
Abstractive summarization | Encoder-decoder (T5, BART) or LLM | Output is new text, length-controlled | ROUGE-1/2/L, plus human eval
Open-domain QA | RAG: retriever + decoder LLM | External knowledge required, freshness matters | Retrieval recall@k, answer F1, faithfulness
Translation | Encoder-decoder (NLLB, mBART) or LLM | Source-to-target sequence mapping | BLEU, COMET, chrF
Chat / instruction following | Decoder LLM with RLHF/DPO | Open-ended generation, multi-turn | Human preference, MT-Bench, harm rate
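
The two extractive-QA metrics in the matrix are simple enough to write from memory. This is a simplified sketch: the official SQuAD evaluation also strips punctuation and articles before comparing, which is omitted here, and the function names are illustrative.

```python
from collections import Counter

def exact_match(pred, gold):
    """True when prediction and gold answer are identical after trimming and lowercasing."""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall between prediction and gold answer."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))         # False
print(round(token_f1("the Eiffel Tower", "Eiffel Tower"), 2))  # 0.8
```

Exact Match punishes the extra "the" completely, while token F1 gives partial credit, which is why the two are reported together.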

A senior candidate at Anthropic or Snowflake will be pushed further: what if classification labels are added monthly? (Hybrid: encoder for the frozen 90%, few-shot LLM gate for new labels.) What if QA documents update hourly? (RAG with a freshness budget on the retriever index.) The matrix is the starting point, not the answer.

Common pitfalls

The mistake junior candidates make most often is ignoring language and domain mismatch. An English-only BERT can lose on the order of 30 points of absolute accuracy on French or Japanese tickets. The fix is a multilingual checkpoint (XLM-R, mBERT) or a language-specific variant. Domain mismatch is the same problem: a Wikipedia-pretrained BERT underperforms a domain-tuned variant on medical or legal corpora by 5-15 F1 points.

A second pitfall is using accuracy on imbalanced classification. If 98% of support tickets are "general inquiry", a model that always predicts that class scores 98% accuracy and is useless. Macro-F1 or per-class PR-AUC is the right reporting target. Weight the loss (class_weight='balanced') or oversample the minority, then report per-class precision and recall.
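
The 98% example is easy to reproduce. Below is a small sketch with hand-rolled metrics, so the arithmetic is visible; the label names are made up, and in practice you would likely reach for a library implementation instead.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)

# 98 "general" tickets, 2 "billing" tickets, and a model that always says "general".
y_true = ["general"] * 98 + ["billing"] * 2
y_pred = ["general"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                            # 0.98
print(round(macro_f1(y_true, y_pred), 3))  # 0.495
```

The minority class contributes an F1 of zero, which drags the macro average down to roughly half and exposes what accuracy hides.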

A third trap is ignoring context length limits. Vanilla BERT caps at 512 tokens — roughly 350-400 English words. Longer documents get truncated, and most candidates do not realize their model is silently losing the back half of every legal contract. Fixes: sliding windows, hierarchical encoders, or long-context architectures like Longformer (4k), BigBird (4k), or flash-attention encoders pushing 16k-32k.
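
The sliding-window fix can be sketched as follows. The function name and the default overlap of 128 tokens are assumptions for the illustration, and a real pipeline would also reserve room in each window for special tokens such as [CLS] and [SEP].

```python
def sliding_windows(token_ids, max_len=512, overlap=128):
    """Split a long sequence into windows of at most max_len tokens that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    windows, start = [], 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
        start += step
    return windows

doc = list(range(1000))  # stand-in for 1,000 token ids
windows = sliding_windows(doc, max_len=512, overlap=128)
print([(w[0], w[-1], len(w)) for w in windows])
# [(0, 511, 512), (384, 895, 512), (768, 999, 232)]
```

Each window is classified separately and the predictions are then pooled (max, mean, or a small aggregator), so the back half of the contract is no longer silently dropped.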

A fourth one is reaching for an LLM when an encoder fits better. Asking GPT-4 to label 10M emails by sentiment is a way to spend $100k on a job a $200 fine-tuned RoBERTa would do better, faster, and more reproducibly. Use frontier LLMs where their strengths matter — open-ended generation, few-shot adaptation, complex reasoning — not where you have plenty of labels and a fixed schema.

A fifth, increasingly common pitfall is shipping a RAG system without measuring retrieval recall separately from generation. If the retriever misses the relevant passage, no LLM will recover. Measure recall@5 on a held-out QA set first; only then evaluate end-to-end faithfulness.
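
Measuring retrieval on its own takes only a few lines. This sketch uses the hit-rate definition common in RAG evaluation: the fraction of questions for which at least one relevant passage appears in the top k results. The function name, data layout, and ids are invented for the example.

```python
def recall_at_k(results, k=5):
    """results: list of (retrieved_ids_in_rank_order, set_of_relevant_ids), one entry per question."""
    if not results:
        return 0.0
    hits = sum(
        1 for retrieved, relevant in results
        if relevant & set(retrieved[:k])
    )
    return hits / len(results)

results = [
    (["d3", "d1", "d9"], {"d1"}),  # relevant passage found at rank 2
    (["d4", "d5", "d6"], {"d7"}),  # relevant passage never retrieved
]
print(recall_at_k(results, k=5))  # 0.5
print(recall_at_k(results, k=1))  # 0.0
```

If this number is low, prompt changes and model upgrades will not help: the generator is answering without the evidence.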

If you want to drill NLP scenarios like this every day, NAILDD is launching with hundreds of NLP and ML system-design questions from real DS interview loops.

FAQ

What is attention in one paragraph?

Attention is a learned weighted average over a sequence. When the model processes a target token, it computes a similarity score between that token's query and every key (a scaled dot product), normalizes the scores into weights with a softmax, and uses those weights to combine the value vectors. Self-attention is the same operation with queries, keys, and values all derived from one sequence; stacking many such layers is what makes a transformer. It replaced RNNs because every position is computed in parallel, so training runs over whole sequences on a GPU instead of stepping through time.
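
For the whiteboard version, scaled dot-product attention fits in a dozen lines. This is a minimal pure-Python sketch of softmax(QK^T / sqrt(d_k)) V for a single head; the learned projection matrices that produce queries, keys, and values are omitted, and the function names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, return the softmax-weighted average of the value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors, one output dimension at a time.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two identical keys get equal weight, so the output is the plain mean of the values.
out = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(out)  # [[2.0, 4.0]]
```

Each query's output is computed independently of the others, which is the parallelism argument in code form.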

Word embeddings vs sentence embeddings — which do I want?

Word embeddings (word2vec, fastText) give one vector per word and are the right primitive for token-level tasks or features into a classical model. Sentence embeddings (Sentence-BERT, OpenAI text-embedding-3, Cohere embed-v3) give one vector per phrase or document — what you want for semantic search, deduplication, clustering, or any "is A similar to B" question. Production systems now use sentence-level contextual embeddings; raw word2vec is mostly a teaching artifact.

How much NLP project work do I need for a junior DS role?

Two end-to-end projects is the realistic bar at companies like Linear, Airbnb, or DoorDash for entry-level applied DS. One should be a fine-tuning project: pick a public dataset (AG News, IMDB, CoNLL-2003), fine-tune a DistilBERT, and report F1 with a confusion matrix. The second should be a RAG or embedding-search project: load a small corpus, build a FAISS index, and answer questions through retrieval + LLM. Having both on GitHub with clear READMEs credibly demonstrates that you can ship.

Is fine-tuning still relevant when LLMs are this good?

Yes, and it is becoming more important, not less. A fine-tuned 1B-parameter open-weight model on your own infrastructure costs roughly 50-200x less per inference than a frontier API call, and for narrow tasks it matches or beats the API on quality. The 2026 production pattern at DS-mature companies is hybrid: route easy / high-volume / well-labeled traffic to a fine-tuned small model, and route long-tail / novel / open-ended traffic to a frontier LLM through prompting and RAG.

Are these answers official?

No. This article is built from the canonical papers (Vaswani 2017 on attention, Devlin 2018 on BERT, Brown 2020 on GPT-3, plus the LoRA and RAG papers) and from candidate debriefs across applied DS and ML loops at large tech companies. Treat it as a study guide, not a substitute for the originals.