MLOps in a Data Science interview
Why MLOps shows up in DS loops
MLOps is the boring half of ML that decides whether your model ever sees a real user. At companies like Stripe, DoorDash, and Netflix, the data scientist who can ship a model and keep it alive is paid roughly $40k to $60k more than the one who hands a notebook to engineering and walks away. That gap is the entire reason this topic exists in the interview loop.
Recruiters at Google, Meta, and Snowflake screen for three things: do you understand the path from training to prod, can you reason about serving latency under 100ms, and do you know what to monitor once the model is live. Junior loops barely scratch this. Mid-level and senior loops fail candidates who treat the model as the finish line.
Load-bearing trick: Hiring panels do not test MLOps trivia. They test whether you have ever owned a model in production for at least one quarter. Frame every answer around that ownership story — what broke, what you measured, what you rolled back.
A scenario you should expect: the interviewer says your fraud model has been live for six weeks and precision has dropped from 0.82 to 0.71. What do you do in the next thirty minutes? The candidate who answers in terms of dashboards, drift checks, and rollback wins. The one who jumps to retraining wastes the slot.
The end-to-end pipeline and maturity ladder
The textbook lifecycle has eight stages and your job is to name them in order. Data collection pulls from warehouse, event stream, or API. Data validation checks schema, null rates, and value ranges. Feature engineering transforms raw data through a shared library so train and serve match. Training runs the algorithm. Evaluation scores the trained model against a baseline. Registration writes the artifact to a registry. Deployment promotes to a serving target. Monitoring watches and triggers retraining when guardrails break.
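The validation stage is the easiest one to make concrete on a whiteboard. A minimal sketch, assuming rows arrive as plain dicts; `ColumnRule` and `validate_batch` are hypothetical names for illustration, not the API of any validation library:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnRule:
    # Hypothetical rule shape: allowed Python types, null-rate ceiling, value range.
    types: tuple
    max_null_rate: float
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def validate_batch(rows, rules):
    """Return a list of violations; an empty list means the batch passes."""
    problems = []
    for col, rule in rules.items():
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        null_rate = 1 - len(present) / len(values) if values else 1.0
        if null_rate > rule.max_null_rate:
            problems.append(f"{col}: null rate {null_rate:.2f} > {rule.max_null_rate:.2f}")
        if not all(isinstance(v, rule.types) for v in present):
            problems.append(f"{col}: unexpected type")
            continue  # range checks are meaningless on mistyped values
        if rule.min_value is not None and any(v < rule.min_value for v in present):
            problems.append(f"{col}: value below {rule.min_value}")
        if rule.max_value is not None and any(v > rule.max_value for v in present):
            problems.append(f"{col}: value above {rule.max_value}")
    return problems
```

In a real pipeline a non-empty result would typically fail the run before training starts, so a bad upstream load never reaches the registry.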
| Stage | Job to be done | Common tooling |
|---|---|---|
| Orchestration | Schedule the pipeline, retry on failure | Airflow, Kubeflow, Vertex AI |
| Data versioning | Reproduce yesterday's training set | DVC, lakeFS, Delta Lake |
| Experiment tracking | Compare 200 runs by metric | MLflow, Weights & Biases |
| Feature store | Identical features in train and serve | Feast, Tecton, Hopsworks |
| Model registry | One source of truth for artifacts | MLflow Registry, SageMaker, Vertex |
| Serving | Wrap the model in an API or batch job | BentoML, TorchServe, KServe |
| Monitoring | Alert when drift or latency breaks | Evidently, WhyLabs, Arize, Prometheus |
Google's MLOps levels are the framework hiring managers reach for when they ask how mature your current setup is. An honest answer is more useful than claiming Level 2 and getting cornered.
| Dimension | Level 0: Manual | Level 1: Pipeline automation | Level 2: CI/CD for ML |
|---|---|---|---|
| Training | Notebook, run by hand | Scheduled, parameterized pipeline | Triggered by code, data, or drift |
| Deployment | Engineer copies the pickle | Auto push to staging, manual prod | Canary plus auto rollback |
| Retraining | Quarterly, when someone notices decay | Weekly or monthly cadence | Drift-triggered, no human in the loop |
| Tracking | Spreadsheet, sometimes nothing | Experiment tracker plus registry | Registry plus lineage end-to-end |
| Monitoring | Business team complains | Dashboards on latency and ML metrics | Auto alerts on drift and guardrails |
| Team shape | One DS, no platform support | DS plus part-time MLE | DS, MLE, platform engineer per pod |
Sanity check: If you cannot whiteboard your own system in two minutes — training, registry, serving, monitoring, retraining trigger — you are Level 0 in the interviewer's eyes regardless of your resume.
A useful phrase on the call: we sit at Level 1 on training and deployment, but Level 0 on monitoring. That self-awareness lands better than blanket Level 2 claims.
Batch vs online vs streaming serving
The first architectural question is batch or online? The answer is dictated by two numbers: acceptable latency from event to prediction, and how stale features are allowed to be. Everything else is detail.
Batch inference runs the model on a schedule — nightly, hourly, every six hours. Predictions land in a warehouse or key-value store, the app reads from there. Read-time latency is single-digit milliseconds because you hit a cache, not a model. Event-to-prediction latency can be hours. Fits daily recommendations, weekly churn scores, lead scoring.
Online inference wraps the model in a service answering in 10 to 100ms at p95. Infrastructure cost is real: autoscaling, warm pools, request-level logging. Stack at most Series-B startups: FastAPI or BentoML on Kubernetes, fronted by an autoscaler, monitored with Prometheus and Grafana. Fits ad ranking, fraud, dynamic pricing.
Streaming inference is the hybrid. Events flow through Kafka or Kinesis, a consumer scores them, results go to a downstream sink. Latency is seconds, not milliseconds, with the auditability of a log. Fits anti-money-laundering, IoT scoring, content moderation.
The trap: when do you pick batch? When event-to-prediction can wait at least an hour and the prediction does not depend on features younger than that window. Anything tighter and batch is wrong.
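The two-number rule can be written down as a tiny decision function. This is a sketch under stated assumptions: the one-hour cutoff for batch comes from the rule above, while the one-second boundary between streaming and online is an illustrative choice, and `choose_serving_mode` is a hypothetical helper.

```python
def choose_serving_mode(max_event_to_prediction_s: float,
                        max_feature_staleness_s: float) -> str:
    """Pick a serving mode from the two numbers that drive the decision."""
    # The tighter of the two requirements is the one that binds.
    tightest = min(max_event_to_prediction_s, max_feature_staleness_s)
    if tightest >= 3600:   # can wait an hour, features may be an hour old
        return "batch"
    if tightest >= 1:      # seconds are fine (assumed boundary)
        return "streaming"
    return "online"        # sub-second budget


print(choose_serving_mode(24 * 3600, 24 * 3600))  # daily churn scores
print(choose_serving_mode(7200, 60))              # slow decision, fresh features
print(choose_serving_mode(0.1, 1))                # fraud at checkout
```

The second call is the one interviewers probe: the decision can wait two hours, but the features cannot be older than a minute, so batch is still wrong.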
Monitoring and drift
Monitoring splits into three layers. Technical metrics catch infrastructure failures: p50, p95, p99 latency, 5xx error rate, QPS, memory, CPU. ML metrics catch model failures: input distributions, prediction distributions, calibration, ground-truth performance when labels arrive. Business metrics catch the failures that cost money: CTR, conversion, fraud catch rate, revenue lift.
Gotcha: Ground-truth performance is the metric everyone wants and almost nobody has in real time. Fraud labels arrive 30 to 90 days later when chargebacks settle. LTV labels arrive months later. Drift metrics on inputs and predictions are your early-warning system precisely because you cannot measure quality directly.
Data drift means input distributions shifted — average user age moved, a new product category launched. Concept drift means the relationship between features and target shifted, the way the pandemic broke every retail forecasting model in March 2020.
Detection metrics depend on feature type. Use PSI for categorical and binned numerical features, with PSI < 0.1 meaning stable, 0.1 to 0.25 moderate drift worth investigating, and above 0.25 significant. Use Jensen-Shannon divergence when you want a symmetric, bounded distance; KL divergence is asymmetric, so the result depends on which distribution you treat as the reference. Use Kolmogorov-Smirnov for continuous features when you want a p-value.
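PSI is simple enough to write from memory in an interview. A minimal sketch using only the standard library, assuming equal-width bins taken from the reference sample (many teams use quantile bins instead) and a small `eps` floor to avoid dividing by zero on empty bins; `psi` and `psi_label` are illustrative names:

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the reference `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    # PSI = sum over bins of (actual - expected) * ln(actual / expected)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def psi_label(value):
    """Map a PSI value onto the thresholds quoted above."""
    if value < 0.1:
        return "stable"
    if value <= 0.25:
        return "moderate"
    return "significant"


reference = [i / 100 for i in range(1000)]
shifted = [v + 5 for v in reference]
print(psi_label(psi(reference, reference)), psi_label(psi(reference, shifted)))
```

Comparing a sample with itself gives zero, and a large shift lands well past the 0.25 line.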
In practice teams trigger retraining on data drift signals and confirm with concept drift once labels arrive.
A/B testing models in production
Shipping a new model without an A/B test is the most common career-limiting mistake on the MLOps side. Control is the current production model. Treatment is the candidate. Randomization is at the user level, deterministic by user_id hash. Initial split is 90/10 or 95/5 to limit blast radius, then 50/50 once you have signal.
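Deterministic assignment by user_id hash usually looks something like the sketch below. Salting the hash with an experiment name is an assumption added here so that different experiments get independent splits; `assign_variant` is a hypothetical helper, not a specific platform's API.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.10) -> str:
    """Map a user to 'control' or 'treatment' the same way on every request."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    # First 8 hex chars -> integer in [0, 2**32), scaled to [0, 1).
    bucket = int(hashlib.sha256(key).hexdigest()[:8], 16) / 0x1_0000_0000
    return "treatment" if bucket < treatment_share else "control"


users = [f"user-{i}" for i in range(10_000)]
share = sum(assign_variant(u, "fraud-v2") == "treatment" for u in users) / len(users)
print(round(share, 2))
```

Because a user's bucket never changes, ramping from 90/10 to 50/50 only adds users to treatment; nobody who already saw the new model gets flipped back to control.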
Primary metric is the business outcome. Secondary is the ML-side metric — recall, precision, AUC. Guardrails are non-negotiables — p95 latency, error rate, fairness across protected segments. Duration is whatever power analysis says, never less than a full week.
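When the primary metric is a conversion-style rate, the power analysis reduces to a short formula. The sketch below uses the standard normal-approximation sample size for comparing two proportions and assumes equal-sized arms; for a 90/10 split, waiting until the smaller arm reaches that size is a conservative shortcut. The function names and the example rates are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Users needed in each arm to detect p_control vs p_treatment (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def min_duration_days(n_per_arm, daily_users, treatment_share):
    """Days until the smaller arm is filled, never less than a full week."""
    smaller_arm_daily = daily_users * min(treatment_share, 1 - treatment_share)
    return max(7, math.ceil(n_per_arm / smaller_arm_daily))


n = sample_size_per_arm(0.10, 0.12)
print(n, min_duration_days(n, daily_users=1_000, treatment_share=0.10))
```

The seven-day floor encodes the full-week rule: even when traffic fills the arms in two days, weekday and weekend behavior both need to be in the sample.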
The classic trap: new model has better recall but worse precision, do you ship? The answer that lands: it depends on which one was the pre-registered primary metric. If precision was primary and dropped, you do not ship. If recall was primary and precision is still within guardrail, you ship. The discipline is deciding before launch, not while staring at the dashboard.
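The discipline is easier to keep when the decision rule is written before launch. A minimal sketch, assuming the primary metric is reported as a treatment-minus-control delta where higher is better, and each guardrail has already been reduced to a pass/fail flag; `ship_decision` is a hypothetical helper.

```python
def ship_decision(primary_delta: float, primary_significant: bool,
                  guardrails_ok: dict) -> str:
    """Apply the pre-registered rule: guardrails first, then the primary metric."""
    breached = sorted(name for name, ok in guardrails_ok.items() if not ok)
    if breached:
        return "do not ship: guardrail breached (" + ", ".join(breached) + ")"
    if primary_delta > 0 and primary_significant:
        return "ship"
    return "do not ship: primary metric did not improve"


# Recall was primary and improved; precision slipped but stayed within its guardrail.
print(ship_decision(0.03, True, {"precision": True, "p95_latency": True}))
# Precision was primary and dropped.
print(ship_decision(-0.02, True, {"p95_latency": True}))
```

The secondary metric never appears in the function signature, which is the point: it informs the write-up, not the launch call.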
Multi-armed bandits show up in senior loops as the alternative. The pitch is that traffic shifts toward the winning arm continuously. The catch is that bandits assume stationary rewards and short feedback loops, which breaks for delayed-label problems like fraud or LTV. Recommendations and ad ranking are the canonical fits.
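If the interviewer asks what "traffic shifts toward the winning arm" means mechanically, Thompson sampling is one common way to do it. The simulation below is a sketch that bakes in exactly the assumptions that limit bandits: each arm has a fixed conversion rate, and the reward is observed immediately after the pull. The rates and the function name are made up for illustration.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; return pulls per arm."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(n_rounds):
        # Draw a plausible rate for each arm from its Beta posterior (uniform prior).
        draws = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        arm = draws.index(max(draws))
        pulls[arm] += 1
        # Immediate, stationary reward: the assumption that fails for fraud or LTV.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


print(thompson_sampling([0.05, 0.15], n_rounds=5_000))
```

With delayed labels the posterior update in the loop cannot happen for weeks, so the bandit keeps splitting traffic on stale evidence.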
Experiment tracking and feature stores
Past a team of two, experiment tracking is non-negotiable. Every run logs hyperparameters, metrics, dataset version, code commit, and artifact. MLflow is the open-source default, Weights & Biases is the SaaS default, Neptune competes in the same slot. Pick one and standardize.
Feature stores solve the training-serving skew problem, the silent killer of production models. The story is always the same: a feature was computed one way in training (pandas groupby on a snapshot) and another way in serving (streaming aggregation on Kafka). Definitions diverge, predictions degrade, nobody can find the cause for a month.
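The core idea, one definition called from both paths, fits in a few lines. This is an illustrative sketch and not how Feast or any other feature store is implemented; the feature `txn_count_7d` and all function names are hypothetical.

```python
from datetime import datetime, timedelta


def txn_count(timestamps, as_of, window_days=7):
    """Single feature definition: transactions in (as_of - window, as_of]."""
    start = as_of - timedelta(days=window_days)
    return sum(1 for ts in timestamps if start < ts <= as_of)


def training_rows(history, labeled_events):
    """Offline path: compute the feature as of each label time (point-in-time)."""
    return [
        {"user": user, "as_of": t, "txn_count_7d": txn_count(history[user], t)}
        for user, t in labeled_events
    ]


def serving_features(history, user, now):
    """Online path: same function, evaluated at request time."""
    return {"txn_count_7d": txn_count(history.get(user, []), now)}


history = {"u1": [datetime(2024, 1, 1), datetime(2024, 1, 5), datetime(2024, 1, 9)]}
print(training_rows(history, [("u1", datetime(2024, 1, 2))]))
print(serving_features(history, "u1", datetime(2024, 1, 10)))
```

The training row for January 2 counts only the one transaction that had happened by then, which is the leakage guard that point-in-time joins provide at scale.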
| Concern | Without feature store | With feature store |
|---|---|---|
| Train-serve consistency | Two implementations, drift inevitable | One definition, two backends |
| Feature reuse across teams | Copy-paste SQL, divergent versions | Registry of features and owners |
| Backfill for new model | Manual, error-prone | Built-in point-in-time joins |
| Latency at serve time | Whatever your service returns | Sub-10ms key-value lookups |
| Lineage | Ask the original author | Logged source to prediction |
Feast is the open-source baseline. Tecton is the managed offering, founded by the team that built Uber's Michelangelo platform. Hopsworks rounds out the trio. At smaller companies, Redis or DynamoDB for online plus a warehouse table for offline is a pragmatic start.
Common pitfalls
The most damaging pitfall is treating Jupyter as production. A notebook with no orchestration, no data versioning, and no registered artifact will never survive contact with a real system. The fix is to refactor every reusable notebook into a package with entry points, push it through Airflow or Kubeflow, and log every run.
A close second is skipping data versioning. When the data behind a model changes and nothing pins what trained the artifact, reproducing yesterday's results becomes guesswork. DVC, lakeFS, or Delta Lake all solve this; the choice matters less than committing to one. Without this you cannot prove a regression is the model's fault or the data's.
Ignoring serving latency kills strong models in evaluation. A gradient-boosted tree with AUC 0.91 and 30ms inference ships. A transformer with AUC 0.93 and 400ms inference does not, because the product team has a p95 budget of 150ms. Price latency into the metric trade during selection, not after deployment.
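Pricing latency into selection can be as simple as filtering on the budget before ranking on quality. A sketch using the numbers from the example above, treating the quoted inference times as p95 measurements (an assumption); `pick_model` and the candidate dict layout are hypothetical.

```python
def pick_model(candidates, p95_budget_ms):
    """Return the name of the best-AUC model that fits the latency budget, or None."""
    eligible = [c for c in candidates if c["p95_ms"] <= p95_budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["auc"])["name"]


candidates = [
    {"name": "gbt", "auc": 0.91, "p95_ms": 30},
    {"name": "transformer", "auc": 0.93, "p95_ms": 400},
]
print(pick_model(candidates, p95_budget_ms=150))  # gbt
```

Returning `None` when nothing fits is deliberate: it forces a conversation about the budget or about distillation instead of silently shipping the slow model.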
Skipping input distribution monitoring is how teams discover failure two months late through a business metric. Shift on top features is visible within days if logged. A daily job computing PSI versus training is cheap; skipping it costs a quarter of degraded predictions.
Deploying without an A/B test is the final classic. The new model performed better on last quarter's holdout; production today is a different distribution; the new model ships and quietly underperforms. A/B is the only way to know what a model does on today's traffic against the current control.
FAQ
How much MLOps does a junior DS actually need?
Enough to be dangerous and not enough to own the platform. Be fluent in batch versus online inference and able to defend a choice on a specific use case. Know what an experiment tracker is and have used one — MLflow is the cheapest to learn. Be able to wrap a trained model in a FastAPI endpoint and return a prediction. Anything beyond that is a bonus on a junior loop and rarely the deciding factor.
What does a mid-level DS need on top of that?
Drift detection metrics by feature type, including when to use PSI versus KS versus JS divergence. A working understanding of feature stores and the train-serve skew they solve. The ability to design an A/B test with primary, secondary, and guardrail metrics named in advance. Familiarity with at least one deployment pattern beyond a single endpoint — shadow mode, canary, blue-green. Most rejections at the senior boundary come from weak monitoring and A/B answers, not weak training knowledge.
How is MLOps different from DevOps?
DevOps deploys code that does the same thing tomorrow as it does today. MLOps deploys models and the data they depend on, both of which degrade silently even when no engineer touches the system. Monitoring and retraining are first-class parts of the lifecycle, not afterthoughts. The other practical difference is that an ML deployment has two artifacts to version — the model binary and the training data — where a DevOps deployment has one.
When should retraining be on a schedule versus drift-triggered?
Schedule-based retraining is the right starting point because it is simple to operate and easy to explain on a postmortem. Move to drift-triggered once you have stable drift metrics with a known false-positive rate and a retraining pipeline you trust unattended. The hybrid most teams land on: schedule-based as the floor (weekly or monthly), drift-triggered as the override when a signal crosses a threshold.
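The hybrid policy is a two-condition check. A minimal sketch, assuming a single scalar drift score such as PSI on a top feature; the weekly floor and the 0.25 threshold are illustrative defaults, and `should_retrain` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def should_retrain(last_trained, now, drift_score,
                   max_age=timedelta(days=7), drift_threshold=0.25):
    """Return (decision, reason): drift overrides, the schedule is the floor."""
    if drift_score > drift_threshold:
        return True, "drift"
    if now - last_trained >= max_age:
        return True, "schedule"
    return False, "none"


now = datetime(2024, 3, 10)
print(should_retrain(datetime(2024, 3, 8), now, drift_score=0.31))  # drift override
print(should_retrain(datetime(2024, 3, 1), now, drift_score=0.05))  # weekly floor
print(should_retrain(datetime(2024, 3, 8), now, drift_score=0.05))  # nothing to do
```

Logging the returned reason alongside each retraining run makes the postmortem question "why did this model retrain on Tuesday?" a lookup instead of an investigation.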
Is this official guidance from any specific company?
No. This is a synthesis of patterns that recur in DS interview loops across mid-size and large tech employers, plus public engineering blogs from Uber, Netflix, and the major cloud vendors. Use it as a study scaffold, not a rulebook.