Few-shot learning in DS interviews
Why interviewers ask about few-shot
Few-shot learning is the default modern answer to small-data problems, and that is exactly why hiring managers at Google, OpenAI, Anthropic, Stripe, and Notion now slot it into the ML-system round. The question is rarely "define few-shot." It is usually "you have only 30 labeled examples per class — what do you do?" The bad candidate jumps to "collect more data." The good candidate sketches three or four concrete options, picks one, and defends the choice in under two minutes.
The reason this question is so common in 2026 is that real production teams hit small-data walls constantly: new product verticals, new languages, new fraud patterns, new content categories. Labeling a million examples is not the move when the category may not exist in six months. Few-shot is the budget-conscious answer, and interviewers want to see whether you can map the toolbox — prompting, prototypical networks, LoRA, in-context learning — to the constraint at hand.
The trap is treating "few-shot" as one technique. It is a family. Below is the structured way to walk through it on a whiteboard, in roughly the order an experienced interviewer expects you to mention them.
Load-bearing trick: Always start your answer by naming the constraint — labels, latency, or budget — and then pick the technique. "Few examples, no GPU budget, latency-tolerant" → prompting. "Few examples, owned model, latency-sensitive" → LoRA.
Few-shot prompting in LLMs
The simplest few-shot technique is stuffing labeled examples directly into the prompt of a large language model and letting in-context learning do the rest. No training, no gradient updates, no infra — you ship the same day. This is why most product teams reach for it first when a new classification or extraction task lands on Monday morning.
Classify sentiment as positive, negative, or neutral.
Example 1: "Loved this movie!" -> positive
Example 2: "Worst experience ever" -> negative
Example 3: "Pretty average, no complaints" -> neutral
Now classify: "{user_input}"
The trade-offs are real and worth saying out loud in an interview. Latency rises roughly linearly with the number of in-prompt examples because every inference re-reads them. Cost per call goes up for the same reason. And the model never globally learns the task; it re-derives the pattern on every request, which means a prompt change at example 8 can flip behavior for every user simultaneously.
Sensible defaults you can quote: 3 to 8 examples per class for classification, ordered most-to-least representative; a clear schema in the system prompt; and a low temperature (0 or 0.1) for near-deterministic output in production. If you are above 16 examples per class, you are usually better off moving to LoRA or a small distilled model.
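As a minimal sketch of how the prompt above might be assembled in code, the helper below joins an instruction, the labeled examples, and the query into one string. The function name `build_few_shot_prompt` and the sample query are hypothetical, and the sketch stops short of calling any model API.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    user_input: str,
) -> str:
    """Join the instruction, numbered labeled examples, and the query."""
    lines = [instruction]
    for i, (text, label) in enumerate(examples, start=1):
        lines.append(f'Example {i}: "{text}" -> {label}')
    lines.append(f'Now classify: "{user_input}"')
    return "\n".join(lines)


# The three examples from the prompt above, kept in a fixed order.
EXAMPLES = [
    ("Loved this movie!", "positive"),
    ("Worst experience ever", "negative"),
    ("Pretty average, no complaints", "neutral"),
]

prompt = build_few_shot_prompt(
    "Classify sentiment as positive, negative, or neutral.",
    EXAMPLES,
    "The plot dragged but the acting was great",  # hypothetical query
)
print(prompt)
```

Keeping the examples in one versioned list makes the ordering explicit, which matters because every request sees exactly the same examples.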
Prototypical networks
When the task is image or embedding-based — visual product categorization, face verification, document type detection — the textbook few-shot move is the prototypical network from Snell et al. 2017. The idea is to compute a single embedding per class from your support set and classify queries by nearest centroid in embedding space.
Support set: {(image_dog_1, dog), (image_cat_1, cat), ...}
For each class c:
    prototype_c = mean(encoder(image) for image in support[c])
For a query image q:
    predicted_class = argmin_c distance(encoder(q), prototype_c)
The encoder is usually a pre-trained vision backbone (CLIP, DINOv2, or a domain-specific ResNet), frozen at inference. You typically need 1 to 5 examples per class in the support set, hence the names "1-shot" and "5-shot" learning. Distance is cosine in modern setups, Euclidean in the original paper; cosine usually wins with encoders such as CLIP because they are trained with a cosine-similarity objective, so direction carries the signal rather than vector length.
| Technique | Typical examples per class | Training cost | Inference latency | Best for |
|---|---|---|---|---|
| Few-shot prompting | 3-8 | none | high (LLM call) | text, fast iteration |
| Prototypical networks | 1-5 | none (frozen encoder) | low | images, embeddings |
| LoRA fine-tuning | 50-500 | minutes on 1 GPU | low | owned model, custom domain |
| Linear probe | 20-200 | seconds | very low | quick baselines |
| Full fine-tune | 1k+ | hours | low | last resort on tiny data |
Prototypical networks are embarrassingly cheap at inference — one encoder pass plus a tiny distance computation — which is why they still get picked over LLM-based approaches whenever a team needs sub-100ms responses on visual data.
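The nearest-centroid step can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's implementation: the three-dimensional vectors stand in for real encoder outputs, and the helper names (`build_prototypes`, `classify`) are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 minus cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def build_prototypes(support: Dict[str, List[Vector]]) -> Dict[str, List[float]]:
    """One prototype per class: the mean of that class's support embeddings."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def classify(query: Vector, prototypes: Dict[str, List[float]]) -> str:
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: cosine_distance(query, prototypes[label]))


# Toy embeddings standing in for frozen-encoder outputs (assumption).
support = {
    "dog": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "cat": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],
}
prototypes = build_prototypes(support)
prediction = classify([0.7, 0.3, 0.0], prototypes)
print(prediction)
```

Adding a class only requires a new entry in `support`; nothing is retrained, which is where the low inference cost comes from.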
Fine-tuning on small datasets
If you own the model and the task is going to stick around, fine-tuning beats prompting on cost-per-call after a few thousand requests per day. The question is how to fine-tune without overfitting your 47 labeled examples.
Gotcha: Full fine-tuning a 7B-parameter model on 50 examples is likely to overfit badly within the first couple of epochs. Do not propose this in an interview answer without flagging the risk.
The three options worth naming, in order of how much you should trust them on small data:
The linear probe freezes the entire backbone and trains only a fresh classifier head. A single linear layer, typically a few thousand parameters (embedding dimension times number of classes), fits in seconds. It is the right first move because it tells you whether the pre-trained representation already separates your classes, and if it does not, no fancier method will save you. Strong baseline, often shipped as-is.
The LoRA adapter trains pairs of low-rank matrices added to selected weight matrices, typically the attention projections in each block. You touch maybe 0.1% to 1% of total parameters, the original weights stay frozen, and you can hot-swap adapters per task. For a 7B model this means a few minutes on a single consumer GPU and a 5-50 MB adapter file. LoRA has become the default modern fine-tune for LLMs and vision-language models; quote it in interviews.
A full fine-tune unlocks the most capacity but is the riskiest on small data. Reserve it for cases where you have at least a few thousand labeled examples and the linear probe and LoRA both underperform. The empirical rule from the LoRA paper and follow-ups: LoRA matches full fine-tune within 1-2 points on most downstream tasks while using a fraction of the compute.
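To make the LoRA parameter and file-size figures concrete, here is a back-of-the-envelope sketch. Every shape in it is an assumption chosen for illustration: a hypothetical 7B-class model with hidden size 4096, 32 layers, four square attention projections adapted per layer, rank 8, and 16-bit adapter weights.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA pair is A (rank x d_in) plus B (d_out x rank)."""
    return rank * d_in + d_out * rank


def adapter_summary(
    hidden_size: int,
    num_layers: int,
    adapted_per_layer: int,
    rank: int,
    base_params: int,
    bytes_per_param: int = 2,  # assumes 16-bit adapter weights
) -> dict:
    """Total trainable parameters, share of the base model, and file size."""
    per_matrix = lora_trainable_params(hidden_size, hidden_size, rank)
    total = per_matrix * adapted_per_layer * num_layers
    return {
        "trainable_params": total,
        "fraction_of_base": total / base_params,
        "adapter_megabytes": total * bytes_per_param / 1e6,
    }


# Hypothetical 7B-class shapes, not taken from any specific model card.
summary = adapter_summary(
    hidden_size=4096,
    num_layers=32,
    adapted_per_layer=4,
    rank=8,
    base_params=7_000_000_000,
)
print(summary)
```

Under these assumptions the adapter has about 8.4 million trainable parameters, roughly 0.12% of the base model and about 17 MB on disk, which sits inside the ranges quoted above. Raising the rank or adapting more matrices scales all three numbers linearly.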
In-context learning
In-context learning is the emergent capability behind few-shot prompting, and it deserves its own answer when interviewers dig deeper. The phenomenon: a large enough language model, shown a handful of input-output pairs in the prompt, will infer the pattern and apply it to a new input — with zero gradient updates.
Translate English to French.
"Hello" -> "Bonjour"
"Goodbye" -> "Au revoir"
"Thank you" -> ?
The model returns "Merci." Nothing was trained. The pattern was inferred entirely from context. This is what the Brown et al. 2020 GPT-3 paper introduced as a major finding, and it is what makes modern LLMs feel almost-magical on novel tasks.
Three properties worth memorizing for the interview:
It is emergent with scale. Small models — under roughly 1B parameters — show weak in-context learning. The capability sharpens as models grow, which is why GPT-2 felt mediocre at few-shot and GPT-4 feels strong. Quote this phase transition; interviewers love it.
It is bounded by the context window. With a 128k-token window, you can fit a few hundred examples; with a 4k window, maybe a dozen. The window grew roughly 30x in the last three years, which is why "many-shot prompting" — 100+ examples in-context — is now a viable technique on its own.
It is brittle to ordering. The same examples in a different order can change accuracy by 5-15 points on classification benchmarks. Production teams either fix an ordering and A/B test changes, or use retrieval to pick examples per query.
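One way to picture retrieval-based example selection is the sketch below. It uses word overlap (Jaccard similarity) as a stand-in for the embedding similarity a production system would more likely use, and breaks ties by pool position so the ordering is reproducible. The pool, labels, and function names are all hypothetical.

```python
from typing import List, Set, Tuple

Example = Tuple[str, str]


def tokenize(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a crude stand-in for an embedding."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two token sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(query: str, pool: List[Example], k: int) -> List[Example]:
    """Pick the k pool examples most similar to the query, in a stable order."""
    q = tokenize(query)
    scored = [(jaccard(q, tokenize(text)), idx) for idx, (text, _) in enumerate(pool)]
    # Highest similarity first; ties fall back to pool order for reproducibility.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [pool[idx] for _, idx in ranked[:k]]


# Hypothetical labeled pool for a support-ticket classifier.
pool = [
    ("refund not received after return", "billing"),
    ("app crashes when I open settings", "bug"),
    ("how do I change my password", "account"),
    ("charged twice for the same order", "billing"),
]
chosen = select_examples("I was charged twice and want a refund", pool, k=2)
print(chosen)
```

Because the selection is deterministic for a given query and pool, an ordering change can only come from a change to the pool or the scoring function, which makes regressions easier to trace.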
Common pitfalls
The most common mistake in interview answers is conflating few-shot prompting with few-shot fine-tuning. They are not the same. Prompting puts examples in context at inference time and updates no weights. Fine-tuning, even with LoRA, updates parameters offline and bakes the behavior into the model. Interviewers will probe this distinction by asking "but what if the prompt gets too long?" — and if you cannot explain why fine-tuning solves the latency problem, you have lost the round.
Another trap is claiming few-shot beats supervised learning when you actually have plenty of labels. If a team has 50,000 labeled examples, a standard supervised classifier almost always wins on accuracy, latency, and cost. Few-shot is for the small-data regime — say so, and gate your answer with "if labels are scarce." A senior interviewer will mentally promote you for naming the boundary.
A third pitfall is ignoring the support-set distribution in prototypical networks. If your support examples for the "dog" class are all golden retrievers, your prototype is a golden retriever, not a dog. The fix is either to sample diverse support examples or to use multiple prototypes per class with a soft nearest-neighbor classifier. This bites in production constantly — interviewers from CV-heavy teams will ask about it directly.
A fourth, increasingly common one in 2026: forgetting that in-context learning is non-stationary. Models drift across versions. A prompt that scored 92% on GPT-4 in 2024 may score 87% on a newer release without notice. Production few-shot systems need an evaluation harness pinned to a labeled holdout — otherwise you are flying blind on every model upgrade.
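A pinned evaluation harness can be very small. The sketch below is one possible shape, with assumptions labeled: `keyword_model` is a toy stand-in for whatever prompt-plus-model pipeline is under test, the four-row holdout is illustrative, and the 2-point tolerance is an arbitrary choice.

```python
from typing import Callable, List, Tuple

Holdout = List[Tuple[str, str]]


def holdout_accuracy(predict: Callable[[str], str], holdout: Holdout) -> float:
    """Share of holdout rows where the prediction matches the label."""
    correct = sum(1 for text, label in holdout if predict(text) == label)
    return correct / len(holdout)


def check_regression(
    predict: Callable[[str], str],
    holdout: Holdout,
    baseline_accuracy: float,
    tolerance: float = 0.02,  # assumed acceptable drop before failing
) -> dict:
    """Compare current accuracy against the recorded baseline."""
    acc = holdout_accuracy(predict, holdout)
    return {"accuracy": acc, "passed": acc >= baseline_accuracy - tolerance}


def keyword_model(text: str) -> str:
    """Toy classifier standing in for the real few-shot pipeline."""
    return "positive" if "love" in text.lower() else "negative"


# Tiny illustrative holdout; a real one would be larger and version-controlled.
HOLDOUT = [
    ("Loved this movie!", "positive"),
    ("Worst experience ever", "negative"),
    ("I love the new update", "positive"),
    ("Great value", "positive"),
]

report = check_regression(keyword_model, HOLDOUT, baseline_accuracy=0.92)
print(report)
```

Running the same check before and after every model or prompt change turns silent drift into a failing gate.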
Finally, candidates often skip the cost analysis. Few-shot prompting at scale is expensive. A 1k-token prompt at $3 per million input tokens plus a few hundred output tokens runs roughly $0.005 per call — at 1M calls/day that is $5k/day, or $1.8M per year. LoRA serving the same task costs cents per thousand requests on a single GPU. State this trade-off out loud; interviewers want to hear that you think about unit economics.
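The prompting cost arithmetic can be reproduced in a few lines. The input price, prompt length, and call volume come from the paragraph above; the output price of $10 per million tokens and the 200 output tokens are assumptions picked so that the per-call figure lands near the quoted $0.005.

```python
def cost_per_call(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Dollar cost of one call given token counts and per-million prices."""
    input_cost = input_tokens * input_price_per_million / 1e6
    output_cost = output_tokens * output_price_per_million / 1e6
    return input_cost + output_cost


per_call = cost_per_call(
    input_tokens=1000,
    output_tokens=200,               # assumed value for "a few hundred"
    input_price_per_million=3.0,
    output_price_per_million=10.0,   # assumed; not stated in the text
)
per_day = per_call * 1_000_000       # 1M calls per day
per_year = per_day * 365

print(f"per call: ${per_call:.4f}")
print(f"per day:  ${per_day:,.0f}")
print(f"per year: ${per_year:,.0f}")
```

With these inputs the input tokens account for $0.003 of each call, so trimming in-prompt examples is the most direct lever on the bill.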
FAQ
What is the difference between zero-shot and few-shot learning?
Zero-shot means the model receives the task description but zero examples of correct outputs — "classify this review as positive or negative" with no exemplars. Few-shot adds a handful of input-output pairs to the prompt or support set. Empirically, jumping from zero to even three examples usually gives the largest accuracy boost; gains taper sharply after 8-16 examples and can even regress at 50+ because of context dilution. Interviewers like candidates who can quote this curve.
When should I pick prototypical networks over LoRA fine-tuning?
When you genuinely have 1 to 5 examples per class, the task is embedding-friendly (vision, audio, document type), and you need sub-100ms inference. Prototypical networks need no training and have minimal infra. LoRA wins when you have 50+ examples per class, the task is generative or sequence-to-sequence, and you control the serving stack. The rule of thumb: prototypes for retrieval-like tasks, LoRA for behavior-shaping tasks.
How many examples is "few" in few-shot learning?
The literature is loose, but the practical convention is 1 to 32 examples per class. Zero examples is zero-shot. Above 32, most teams stop calling it few-shot and start calling it low-resource supervised learning. The interesting boundary is between 8 and 16 examples per class: that is the regime where prompting, LoRA, and linear probes all become viable simultaneously and the choice depends on serving constraints, not accuracy.
Does in-context learning actually "learn"?
Strictly, no: no weights change. Mechanistically, a leading view as of 2026 is that the model performs implicit Bayesian inference over plausible tasks given the examples, and that the attention layers act like a tiny in-context optimizer. Interviewers do not expect you to prove this; they want you to say "the model is not learning in the gradient-descent sense, but the behavior is functionally similar over the prompt." That phrasing scores points.
What is the catch with many-shot prompting?
Many-shot — fitting 100+ examples in a long context window — narrows the gap with fine-tuning on several benchmarks, but it is token-expensive at inference, sensitive to example ordering, and slow on long contexts. The right framing in an interview: many-shot is a strong "no-training" baseline you reach for when you have labels but no GPU budget, with the understanding that you will eventually graduate to LoRA once traffic justifies it.
Is few-shot learning still relevant given foundation models keep improving?
Yes, and arguably more relevant. Bigger models make prompting cheaper to get right, LoRA cheaper to train, and prototypical networks stronger because their frozen encoders are themselves bigger and better. Few-shot is not a 2020-era technique that will become obsolete; it is the layer where small-data problems will keep being solved, just on top of progressively better foundations.