Continual learning for Data Science interviews
Why continual learning shows up in interviews
Picture this. You ship a fraud model on Monday. On Friday the attackers shift, your data drifts, and the head of risk asks why precision tanked from 0.91 to 0.62 in four days. Retraining from scratch every week is wasteful. Retraining only on the new fraud signatures wipes out the model's memory of last quarter's attacks. This tension — adapt fast, forget nothing — is exactly what continual learning tries to solve, and it is exactly why senior DS interviewers at OpenAI, Anthropic, Meta, and Stripe started asking about it.
The term covers a family of techniques for training a single model on a stream of tasks or distributions rather than one fixed dataset. The interviewer wants to see you pick the right tool when production reality is messier than a Kaggle leaderboard. Most candidates can name catastrophic forgetting. Far fewer can explain when replay beats EWC, or why progressive networks still ship at Tesla and DoorDash for tail tasks.
Catastrophic forgetting in plain English
A neural network trained sequentially (Task A first, then Task B) typically loses most of its Task A performance by the time you finish fine-tuning on B. Accuracy on the original validation set can collapse from 94% to under 30% within a few hundred gradient steps. That is catastrophic forgetting, and it has a clean mechanical cause: gradient descent on Task B's loss has no incentive to preserve Task A's optimum. If a weight that was critical for Task A can also be moved to reduce Task B's loss, it gets overwritten. The network does not "know" it had a previous job.
A useful mental model: picture the loss landscape as overlapping bowls. Task A has its bowl, Task B has its bowl, and their minima usually live in different valleys. Vanilla SGD slides down whichever bowl you put in front of it, no memory of where it was before. Three families of fixes exist, and a strong interview answer names all three before going deep on one.
Load-bearing trick: If you only remember one thing, remember this: continual learning is about limiting how much a new task can disturb what the weights learned for previous tasks, whether through extra data (replay), extra penalty terms (regularization), or extra parameters (architecture).
Replay-based methods
The simplest cure is to keep showing the model some old data while it learns the new task. This is called experience replay (also known as rehearsal), and it has long been the workhorse of practical continual learning. You maintain a memory buffer, usually a few thousand to a few hundred thousand examples, and at each training step you mix new-task batches with sampled old-task batches.
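```python
import torch

# Minimal replay loop. ReservoirBuffer, model, criterion, optimizer, and
# new_task_loader are assumed to be defined elsewhere; the buffer is assumed
# to expose add(), sample(), and __len__().
buffer = ReservoirBuffer(capacity=20_000)
for x_new, y_new in new_task_loader:
    x, y = x_new, y_new
    if len(buffer) > 0:  # nothing to replay until the buffer has data
        x_old, y_old = buffer.sample(batch_size=64)
        x = torch.cat([x_new, x_old])
        y = torch.cat([y_new, y_old])
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    buffer.add(x_new, y_new)
```

The math is uninteresting. The engineering is not. Reservoir sampling keeps the buffer distribution close to the long-run task mixture without unbounded growth. Class-balanced sampling beats uniform when the new task is heavily skewed, which is almost always the case in fraud and abuse domains.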
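The replay loop relies on a buffer class that the post does not define. Below is one possible sketch of it using classic reservoir sampling (Algorithm R), written in plain Python so it stays framework-agnostic. The class name matches the loop, but the `seed` argument, the `n_seen` counter, and the choice to return two lists from `sample` are assumptions made for illustration; with tensors you would stack the sampled items (for example with `torch.stack`) before concatenating them with the new batch.

```python
import random


class ReservoirBuffer:
    """Fixed-size buffer holding a uniform random sample of everything seen so far."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []          # stored (x, y) pairs
        self.n_seen = 0          # total examples offered to the buffer
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.items)

    def add(self, xs, ys):
        """Offer a batch of examples; each one is kept with probability capacity / n_seen."""
        for x, y in zip(xs, ys):
            self.n_seen += 1
            if len(self.items) < self.capacity:
                self.items.append((x, y))
            else:
                # Pick a slot in [0, n_seen); only slots inside the buffer cause a replacement.
                j = self.rng.randrange(self.n_seen)
                if j < self.capacity:
                    self.items[j] = (x, y)

    def sample(self, batch_size):
        """Draw up to batch_size stored examples without replacement."""
        k = min(batch_size, len(self.items))
        picked = self.rng.sample(self.items, k)
        return [x for x, _ in picked], [y for _, y in picked]


# Toy stream of 10,000 integer "examples" with a parity label.
buf = ReservoirBuffer(capacity=100, seed=0)
buf.add(range(10_000), [i % 2 for i in range(10_000)])
xs, ys = buf.sample(64)
```

Because every example in the stream has the same chance of ending up in the buffer, memory stays bounded while the stored sample tracks the long-run mixture rather than only the most recent data.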
Generative replay is the high-budget cousin: instead of storing raw samples, you train a VAE or diffusion model on old data and sample synthetic examples on demand. Netflix and Spotify recommendation teams reach for this when privacy or storage rules forbid keeping raw user logs. The cost: two models to maintain, and the generator's failure modes become the classifier's.
| Method | Memory cost | Sample quality | When it shines |
|---|---|---|---|
| Reservoir replay | Linear in buffer size | Perfect (real data) | Default choice, small-to-medium streams |
| Class-balanced replay | Same | Perfect | Heavy class imbalance across tasks |
| Generative replay | Constant after training | Approximate | Privacy or storage constraints |
| Latent replay | Lower than raw replay | Lossy | Vision backbones with large frozen layers |
Regularization-based methods (EWC and friends)
If you cannot store old data (say, the original training set is locked behind a compliance wall), you can instead penalize the model for moving the weights that mattered for the previous task. Elastic Weight Consolidation (EWC) was the breakthrough method here. The idea: estimate which weights are important for Task A using the Fisher information matrix, then add a quadratic penalty to the new task's loss that pulls those weights back toward their Task A values.
L_total = L_new + (λ / 2) * Σ_i F_i * (θ_i - θ_i*)²

Here F_i is the i-th diagonal entry of the Fisher information matrix (the importance estimate for weight i), θ_i* is the value of weight i after Task A, and λ controls how much old knowledge you defend. High λ = stiffer, less plasticity; low λ = forgets faster. In practice you tune λ between 0.1 and 100 on a held-out set per task transition.
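The penalty above can be sketched in PyTorch as follows. This is an illustration under assumptions, not a reference implementation: it estimates the diagonal Fisher as the mean squared gradient of the per-sample log-likelihood using the observed labels (the so-called empirical Fisher, a common shortcut), and the helper names `diagonal_fisher` and `ewc_penalty`, the toy linear model, and the random data are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def diagonal_fisher(model, inputs, targets):
    """Empirical diagonal Fisher: mean squared per-sample gradient of the log-likelihood."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in zip(inputs, targets):
        model.zero_grad()
        log_probs = F.log_softmax(model(x.unsqueeze(0)), dim=1)
        F.nll_loss(log_probs, y.unsqueeze(0)).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
    model.zero_grad()  # do not leak Fisher gradients into later training steps
    return {n: f / len(inputs) for n, f in fisher.items()}


def ewc_penalty(model, fisher, anchor, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_i_star)^2 over all parameters."""
    total = sum(
        (fisher[n] * (p - anchor[n]) ** 2).sum()
        for n, p in model.named_parameters()
    )
    return 0.5 * lam * total


# After finishing Task A: record importances and the anchor weights theta*.
torch.manual_seed(0)
model = nn.Linear(4, 3)  # stand-in for a model already trained on Task A
x_a, y_a = torch.randn(32, 4), torch.randint(0, 3, (32,))
fisher = diagonal_fisher(model, x_a, y_a)
anchor = {n: p.detach().clone() for n, p in model.named_parameters()}

# During Task B: add the penalty to the ordinary task loss.
x_b, y_b = torch.randn(32, 4), torch.randint(0, 3, (32,))
loss = F.cross_entropy(model(x_b), y_b) + ewc_penalty(model, fisher, anchor, lam=10.0)
loss.backward()
```

Right after Task A the penalty is zero because the weights still equal the anchor; it grows only as Task B training pulls important weights away.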
EWC's appeal: zero extra data, zero extra parameters. Its weakness: the Fisher diagonal is a crude approximation of weight importance — it ignores correlations between weights and degrades the longer the task stream gets. By task 10 or so, plain EWC usually breaks. Online EWC and Synaptic Intelligence (SI) keep a running importance estimate and stretch the horizon, but reach for replay if you have more than five or six tasks ahead.
Interviewer hint: if asked "why not just use EWC for everything," the right answer mentions the Fisher diagonal approximation and the task-count ceiling.
Architectural methods
The third family side-steps forgetting by allocating fresh capacity per task. Progressive networks literally add a new column of layers for each new task, freeze the old columns, and add lateral connections so the new column can reuse old features. Zero forgetting by construction. The cost is obvious: parameter count grows linearly in the number of tasks.
The interview-friendly version — and one in active production use — is LoRA-style adapters. You freeze the base model and train small low-rank update matrices per task, usually adding 0.1% to 1% extra parameters per adapter. To serve task K, swap in adapter K. To serve a mixture, average or route between adapters. Multilingual chatbots, multi-tenant LLM products, and large recommender backbones at Anthropic and Linear use this for per-customer specialization without retraining the trunk.
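To make the adapter idea concrete, here is a minimal sketch of a LoRA-style wrapper around a single linear layer. The class name `LoRALinear`, the `rank` and `alpha` defaults, and the initialization scale are illustrative assumptions rather than any particular library's API; the one structural point taken from the technique is that the base weights are frozen and only the low-rank pair is trained.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + scale * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the trunk is never updated
        # A gets a small random init and B starts at zero, so the adapter
        # initially leaves the base layer's output unchanged.
        self.A = nn.Parameter(torch.empty(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.A, std=0.01)
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


torch.manual_seed(0)
base = nn.Linear(1024, 1024)
layer = LoRALinear(base, rank=4)
x = torch.randn(2, 1024)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable share: {trainable / total:.4%}")
```

For per-task serving you would keep one such (A, B) pair per task over the same frozen base and select the pair that matches the incoming request.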
| Architectural pattern | Forgetting | Parameter growth | Inference cost |
|---|---|---|---|
| Progressive networks | Zero | Linear in tasks | High (all columns active) |
| LoRA / adapter modules | Zero per task | ~0.1% to 1% per task | Low (swap or merge) |
| Mixture-of-experts routing | Near-zero | Sublinear (shared experts) | Medium |
| Prompt tuning / soft prompts | Zero | Tiny (<0.01%) | Lowest |
Sanity check: Adapters are the right answer when the base model is huge and frozen. Replay is the right answer when the base model is small and retraining is cheap. EWC is the right answer when you cannot keep old data and you have a short task horizon.
Common pitfalls
The most common interview-killing mistake is conflating continual learning with online learning. Online learning means one-pass-per-sample, optimized for streaming throughput. Continual learning means a sequence of tasks or distributions, optimized for preserving accuracy on each. A trading desk doing online SGD on tick data is not doing continual learning. A fraud team that quarterly fine-tunes on the new attack patterns while protecting the old ones is. If you blur this in an interview, the next question becomes a trap. Be precise about the regime — task-incremental, class-incremental, or domain-incremental — before you propose a fix.
A second trap is using EWC for too many tasks. The Fisher diagonal becomes a worse and worse approximation as you accumulate task transitions, because the penalty terms compound and the loss surface gets over-constrained. Teams that adopt EWC for what they think is a five-task problem and then quietly grow it to twenty-five tasks discover their model has become unmovable: new tasks plateau early because every weight is anchored. The fix is either to switch to replay with a reservoir buffer, or to adopt Online EWC with a decaying importance estimate so the model can forget what it should forget.
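The "every weight is anchored" effect can be seen with simple arithmetic. The sketch below compares summing per-task importances with a decayed running estimate of the form gamma * previous + new, which is the general shape of the Online EWC update. The function name, the toy two-weight importances, and the choice gamma = 0.9 are assumptions for illustration only.

```python
def accumulate_importance(per_task_fisher, gamma):
    """Running importance per weight: gamma * previous + new (gamma=1.0 is a plain sum)."""
    running = [0.0] * len(per_task_fisher[0])
    for fisher in per_task_fisher:
        running = [gamma * r + f for r, f in zip(running, fisher)]
    return running


# 25 task transitions, each reporting the same importance for two weights.
per_task = [[1.0, 0.1]] * 25

plain = accumulate_importance(per_task, gamma=1.0)    # grows linearly with task count
decayed = accumulate_importance(per_task, gamma=0.9)  # bounded by f / (1 - gamma)

print("summed :", plain)
print("decayed:", decayed)
```

With a plain sum the effective stiffness on each weight keeps growing with the number of tasks, while the decayed estimate levels off, which leaves room for later tasks to move the weights.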
A third pitfall is forgetting to evaluate on every previous task. The whole point of continual learning is to track accuracy on Task 1 through Task K after training on Task K. If you only report the new-task metric, you have measured nothing useful. Senior interviewers will probe this — they want to see you mention backward transfer (does learning a new task help old ones), forward transfer (does prior learning help new tasks), and average accuracy across all tasks seen so far. Anything less is just calling fine-tuning by a fancier name.
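One common way to formalize these three numbers uses an accuracy matrix R, where R[i][j] is the test accuracy on task j after training through task i. The sketch below follows that convention; the function name, the toy matrix, and the `baseline` vector (accuracy of an untrained model on each task, needed for forward transfer) are illustrative assumptions, and other definitions exist.

```python
def continual_metrics(R, baseline=None):
    """Average accuracy, backward transfer, and forward transfer from an accuracy matrix.

    R[i][j]  : accuracy on task j after training on tasks 0..i (needs at least 2 tasks)
    baseline : accuracy of an untrained model on each task (optional, for forward transfer)
    """
    T = len(R)
    final = R[T - 1]
    avg_acc = sum(final) / T
    # Backward transfer: change on each old task between "just learned" and "end of stream".
    bwt = sum(final[j] - R[j][j] for j in range(T - 1)) / (T - 1)
    fwt = None
    if baseline is not None:
        # Forward transfer: accuracy on task j before training on it, relative to baseline.
        fwt = sum(R[j - 1][j] - baseline[j] for j in range(1, T)) / (T - 1)
    return avg_acc, bwt, fwt


# Toy numbers for a three-task stream.
R = [
    [0.90, 0.20, 0.10],
    [0.70, 0.88, 0.25],
    [0.60, 0.80, 0.85],
]
baseline = [0.10, 0.10, 0.10]

avg_acc, bwt, fwt = continual_metrics(R, baseline)
print(f"avg acc {avg_acc:.3f}, BWT {bwt:+.3f}, FWT {fwt:+.3f}")
```

A negative backward transfer is the forgetting figure; here the model gave up accuracy on the first two tasks while learning the third, even though the average still looks respectable.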
A fourth pitfall is ignoring drift in the buffer's own distribution. With a FIFO or sliding-window buffer on a non-stationary stream, the buffer quietly fills up with recent samples and loses early ones. Six months in, the "replay" is mostly last week. A true reservoir avoids that recency bias, but it represents each task in proportion to its share of the stream, so a small early task can shrink to a handful of samples. Class-balanced or task-balanced sampling, plus a buffer capacity decision tied to expected task count, are the practical fixes. Buffer size of 5,000 to 50,000 samples is a common sweet spot for tabular and vision pipelines.
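Task-balanced sampling can be sketched in a few lines. The version below assumes the buffer is a list of (task_id, example) pairs and splits each replay batch evenly across the tasks present, sampling with replacement so that small tasks are oversampled rather than dropped. The function name and the buffer layout are hypothetical.

```python
import random
from collections import defaultdict


def task_balanced_sample(buffer, batch_size, rng=random):
    """Draw a replay batch with (near-)equal counts per task, regardless of buffer skew."""
    by_task = defaultdict(list)
    for task_id, example in buffer:
        by_task[task_id].append(example)
    if not by_task:
        return []
    tasks = sorted(by_task)
    per_task, extra = divmod(batch_size, len(tasks))
    batch = []
    for i, task_id in enumerate(tasks):
        k = per_task + (1 if i < extra else 0)  # spread the remainder over the first tasks
        # choices() samples with replacement, so a task with few examples is oversampled.
        batch.extend((task_id, ex) for ex in rng.choices(by_task[task_id], k=k))
    return batch


# A buffer dominated by the newest task: 990 "new" examples versus 10 "old" ones.
rng = random.Random(0)
buffer = [("old", i) for i in range(10)] + [("new", i) for i in range(990)]
batch = task_balanced_sample(buffer, 64, rng)
```

Uniform sampling from this buffer would put roughly one "old" example in every hundred; the balanced draw gives each task half of the batch.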
A fifth and underrated pitfall is shipping without a rollback path. A bad task transition can degrade aggregate accuracy in ways invisible until a customer complains. Keep the previous checkpoint, shadow-deploy the new one, and define a rollback trigger in advance — for example, roll back if any task's recall drops more than 5 percentage points. Continual learning without monitoring is a foot-gun.
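The rollback trigger in that example is easy to encode. The sketch below flags every task whose recall dropped by more than a threshold in percentage points between the previous and the candidate checkpoint; the function name, the task names, and the recall values are made up for illustration.

```python
def tasks_to_roll_back(old_recall, new_recall, max_drop_pp=5.0):
    """Return {task: drop in percentage points} for tasks that regressed past the threshold."""
    flagged = {}
    for task, old in old_recall.items():
        # Round to avoid floating-point noise deciding a borderline case.
        drop_pp = round((old - new_recall[task]) * 100, 6)
        if drop_pp > max_drop_pp:
            flagged[task] = drop_pp
    return flagged


# Per-task recall from shadow deployment (hypothetical numbers).
old_recall = {"card_testing": 0.91, "account_takeover": 0.88, "refund_abuse": 0.80}
new_recall = {"card_testing": 0.90, "account_takeover": 0.79, "refund_abuse": 0.83}

flagged = tasks_to_roll_back(old_recall, new_recall)
if flagged:
    print("roll back, regressions:", flagged)
```

Checking each task separately matters because an aggregate metric can stay flat while one task regresses badly, as the second task does here.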
How to answer this in an interview
Strong answers follow four beats. One: define the regime — task-, class-, or domain-incremental. Two: name the three families and one canonical method each. Three: pick the method for the interviewer's scenario and justify it on memory budget, expected task count, and whether old data is accessible. Four: mention evaluation — average accuracy, backward transfer, forward transfer — because this separates a candidate who read a paper from one who shipped.
If you want to drill scenario-based ML interview questions like this against a timer, NAILDD has a growing bank of Data Science problems targeted at exactly this format.
Related reading
- Transformer architecture for DS interviews
- MLOps monitoring for DS interviews
- Feature store for DS interviews
- Deep learning for DS interviews
- Canary and shadow deployment for DS interviews
FAQ
Is continual learning the same as transfer learning?
No, and confusing the two will cost you credibility fast. Transfer learning is a one-shot move: adapt a pre-trained model to one new task, with no requirement to preserve source performance. Continual learning is the sequential, multi-task version where you must keep performing on every previous task as new ones arrive. Transfer learning happily forgets the source. Continual learning treats forgetting as the central failure mode.
When should I prefer replay over EWC in practice?
Reach for replay whenever you can legally keep at least some samples from previous tasks and you expect more than a handful of task transitions. Replay degrades gracefully — you can always grow the buffer or rebalance sampling. EWC degrades sharply once the Fisher approximation breaks down, usually somewhere between five and ten sequential tasks. The exception is privacy-sensitive domains where storing raw user data is forbidden; there, EWC or generative replay become the realistic options.
What is the typical memory buffer size I should propose?
For tabular fraud or churn models, a reservoir of 10,000 to 50,000 samples is usually plenty. For vision pipelines using latent replay (storing embeddings instead of raw images), you can stretch to 100,000 to 500,000 without breaking the storage budget. For language models with adapters, you often skip the buffer entirely and rely on the per-task LoRA weights to carry task identity. Quote a number in your interview answer; vague answers read as inexperienced.
How do LoRA adapters relate to continual learning?
LoRA adapters are the production-friendly face of architectural continual learning. Instead of growing a full column per task, you train a small low-rank update — typically rank 4 to 64 — that adjusts the frozen base for one task. Adapters give zero forgetting by construction, scale to dozens of tasks with sub-1% parameter overhead, and hot-swap at inference. This is what most multi-tenant LLM products use today.
What metrics should I report for a continual learning experiment?
Three at minimum. Average accuracy across all tasks seen so far — the headline number. Backward transfer — how much each old task's accuracy changed after learning later tasks; negative values quantify forgetting. Forward transfer — how much faster or better new tasks learn because of prior tasks. Reporting only the latest task's accuracy is the rookie mistake interviewers screen for, so lead with the average and break out per-task numbers in a small table when you write up results.
Is this an officially endorsed answer?
No. This post is independent guidance grounded in the standard references — Kirkpatrick et al. 2017 on EWC, McCloskey and Cohen 1989 on catastrophic forgetting, Rusu et al. 2016 on progressive networks, and Hu et al. 2021 on LoRA — combined with what currently ships in production ML systems. Verify against the specific company's published research before quoting numbers in a final-round interview.