MLflow and DVC in a Data Science interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why this comes up

When a Stripe MLE recruiter asks "walk me through how you track experiments," they are not testing whether you can spell MLflow. They are checking whether you have lived through the reproducibility wall — the moment a model that scored AUC 0.92 on Tuesday quietly drops to 0.81 on Friday and nobody can explain why. That gap is almost always a missing artifact: a different training split, a forgotten random seed, a feature table that got silently overwritten. Senior loops at Databricks, Netflix, and Airbnb push deeper into model registry stages, deployment hooks, and how you would version a petabyte feature store without checking parquet files into git.

The two tools the interviewer expects you to name are MLflow for experiment tracking plus model registry, and DVC for data and pipeline versioning. Knowing them is table stakes; knowing when to reach for each one — and when to skip both — is the senior signal.

Load-bearing trick: MLflow and DVC are not competitors. MLflow tracks what happened during training; DVC tracks what the inputs and outputs were. A real team usually runs both.

The reproducibility problem in ML

A trained ML model is the product of four moving parts: source code, training data, hyperparameters, and randomness. Change any one and you get a different model. Most candidates can recite this; the interview signal comes from how you talk about each axis.

Code is the easy axis — git already solves it. The other three are the failure modes. You shipped a recommender that hit precision@10 = 0.41 in offline eval; the next sprint you cannot reproduce it because the upstream events_clean table got reprocessed and three columns shifted. Or you forgot to set random_state=42 on the train/test split and your CV score wandered by 1.5 points. Or somebody trained on lr=3e-4 but the wiki says lr=1e-3 because the README was edited later.

The cost compounds: every irreproducible result poisons the next decision. In a Notion-sized team, one bad rerun costs an afternoon. In a Tesla Autopilot-sized team, it costs a regulatory audit.
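
One way to make the four axes concrete is to fold them into a single fingerprint per run. This is a minimal sketch, not part of MLflow or DVC: run_fingerprint and its argument names are hypothetical, and it assumes you already have the git commit hash and a content hash of the training data in hand.

```python
import hashlib
import json


def run_fingerprint(code_commit: str, data_hash: str, params: dict, seed: int) -> str:
    """Return a short, stable ID covering code, data, hyperparameters, and seed."""
    payload = {
        "code_commit": code_commit,
        "data_hash": data_hash,
        "params": params,
        "seed": seed,
    }
    # sort_keys makes the JSON independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical values: same code, data, and seed, but a different learning rate.
baseline = run_fingerprint("a1b2c3d", "9f8e7d", {"lr": 3e-4, "max_depth": 6}, seed=42)
edited = run_fingerprint("a1b2c3d", "9f8e7d", {"lr": 1e-3, "max_depth": 6}, seed=42)
print(baseline != edited)  # True: changing one axis changes the fingerprint
```

If two runs share a fingerprint and still disagree on metrics, the leak is somewhere you are not yet recording, which is a useful thing to be able to say in an interview.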

MLflow: four components

MLflow ships as four pieces that you can adopt independently. The interview-friendly framing is to name all four, then explain which one earned its keep on your last project.

1. Tracking. A logging API plus a UI that records params, metrics, and artifacts per run.

import mlflow

mlflow.set_experiment("churn_xgb_v3")

with mlflow.start_run(run_name="lr-tuning-sweep-7"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("random_state", 42)
    mlflow.log_metric("auc", 0.92)
    mlflow.log_metric("log_loss", 0.21)
    mlflow.log_artifact("plots/confusion_matrix.png")
    # `model` is assumed to be an already-fitted scikit-learn estimator.
    mlflow.sklearn.log_model(model, artifact_path="model")

The UI gives you a sortable table of runs with metric columns, a parallel-coordinates plot for hyperparameter sweeps, and side-by-side run diffs. This is the piece that gets adopted first because the value shows up after run number two.

2. Projects. An MLproject YAML file that pins the entry point, the conda environment, and the parameter contract.

name: churn_model

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      lr: {type: float, default: 0.01}
      epochs: {type: int, default: 10}
    command: "python train.py --lr {lr} --epochs {epochs}"

You then invoke mlflow run . -P lr=0.01 -P epochs=20 and get the same environment every time. In practice most teams skip Projects in favor of Docker + Airflow, but it is worth naming on a senior loop.

3. Models. A framework-agnostic save format.

# Each flavor writes to its own directory; save_model raises if the path already exists.
mlflow.sklearn.save_model(sk_model, "out/sklearn_model")
mlflow.pytorch.save_model(torch_model, "out/pytorch_model")
mlflow.tensorflow.save_model(tf_model, "out/tensorflow_model")

The payoff is mlflow.pyfunc.load_model("out/sklearn_model"), or the same call against any of the other paths: a single load call that does not care whether the artifact was an sklearn pipeline, a PyTorch module, or an XGBoost booster. The pyfunc wrapper is what makes MLflow model serving and Databricks endpoints possible.

4. Model Registry. A versioned store with promotion stages.

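import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged by a finished run; <run_id> is a placeholder.
mlflow.register_model("runs:/<run_id>/model", "ChurnXGB")

# Promote a specific version once it has passed review.
client = MlflowClient()
client.transition_model_version_stage(
    name="ChurnXGB", version=3, stage="Production"
)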

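Stages are None, Staging, Production, and Archived. The clean pattern is: CI promotes a passing model to Staging, a human approves promotion to Production, and the serving layer always loads models:/ChurnXGB/Production so deployments are decoupled from training runs. One caveat worth naming on a senior loop: MLflow 2.9 deprecated stages in favor of model version aliases, so on newer servers the same pattern is expressed with client.set_registered_model_alias and a URI such as models:/ChurnXGB@champion.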

Sanity check: if your serving code references a run ID instead of a registry stage, you have skipped the registry step. Fix it before the next on-call rotation.
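
That sanity check is easy to automate at service startup. The sketch below is plain Python rather than an MLflow API: is_registry_uri and require_registry_uri are hypothetical helper names, and the only assumption about MLflow is its URI shapes (runs:/<run_id>/..., models:/<name>/<stage-or-version>, and models:/<name>@<alias>).

```python
def is_registry_uri(model_uri: str) -> bool:
    """True for models:/<name>/<stage-or-version> or models:/<name>@<alias>."""
    prefix = "models:/"
    if not model_uri.startswith(prefix):
        return False
    rest = model_uri[len(prefix):]
    if "@" in rest:
        name, _, ref = rest.partition("@")
    else:
        name, _, ref = rest.partition("/")
    # Both the model name and the stage, version, or alias must be present.
    return bool(name) and bool(ref)


def require_registry_uri(model_uri: str) -> str:
    """Raise if serving config points at a training run instead of the registry."""
    if not is_registry_uri(model_uri):
        raise ValueError(f"serving must load from the registry, got {model_uri!r}")
    return model_uri
```

Calling require_registry_uri on the configured model URI before loading turns a quiet config mistake into a failed deploy, which is the cheaper place to find it.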

DVC: data versioning and pipelines

DVC starts from the observation that git was never built for large binaries: GitHub rejects any single file above 100 MB, and a repo is miserable to clone above 10 GB. Image datasets, parquet feature tables, and trained model binaries all fall outside what git was built for. DVC sidesteps this by storing the heavy payload in S3, GCS, or Azure Blob, and keeping only a small .dvc pointer file in git.

dvc init                        # creates the .dvc/ directory
dvc remote add -d storage s3://my-team-dvc/store
dvc add data/train.parquet      # writes data/train.parquet.dvc
git add .dvc/config data/train.parquet.dvc data/.gitignore
git commit -m "track training data v1"
dvc push                        # blob goes to S3

Anyone who clones the repo gets the pointer; dvc pull fetches the actual data. Now a git checkout of last quarter's branch, followed by dvc checkout (or dvc pull if the blobs are not in your local cache), gives you last quarter's data too.
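
The pointer idea is small enough to sketch in plain Python. This is a toy illustration under stated assumptions, not how DVC is implemented: track, restore, and the .pointer.json format are hypothetical, SHA-256 stands in for the checksum, and a local directory stands in for the S3 remote. DVC's real pointer is a YAML .dvc file and its cache layout differs.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_path: Path, cache_dir: Path) -> Path:
    """Copy the payload into a content-addressed cache and write a small pointer."""
    checksum = file_sha256(data_path)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_path, cached)
    # The pointer is the only thing that would be committed to git.
    pointer = data_path.with_name(data_path.name + ".pointer.json")
    pointer.write_text(json.dumps({"sha256": checksum, "path": data_path.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Rebuild the data file next to its pointer from the cache."""
    meta = json.loads(pointer.read_text())
    cached = cache_dir / meta["sha256"][:2] / meta["sha256"][2:]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cached, target)
    return target
```

Because the cache key is the content hash, checking out an old pointer always resolves to the exact bytes that were tracked at that commit, no matter how many times the file was overwritten since.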

The other half of DVC is pipelines — a YAML DAG that knows what depends on what.

stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - data/raw.parquet
      - preprocess.py
    outs:
      - data/processed.parquet

  train:
    cmd: python train.py
    deps:
      - data/processed.parquet
      - train.py
    params:
      - lr
      - epochs
    outs:
      - models/model.pkl
    metrics:
      - metrics.json

dvc repro walks the DAG and only reruns stages whose inputs changed. This is the bit that converts a Jupyter notebook into something a teammate can actually reproduce. Combine it with dvc exp run and you get a lightweight experiment runner that records each variation as a git-trackable diff.
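
The skip-if-unchanged logic can be sketched in a few lines of plain Python. This is a toy model of the idea, not DVC's implementation: repro, deps_fingerprint, and the stages and lock dictionaries are hypothetical, stages are assumed to be listed in dependency order, and only file dependencies are hashed (the real tool also tracks commands, params, and outputs).

```python
import hashlib
from pathlib import Path
from typing import Iterable


def deps_fingerprint(deps: Iterable[Path]) -> str:
    """Combine the names and contents of all dependency files into one digest."""
    digest = hashlib.sha256()
    for dep in sorted(deps):
        digest.update(dep.name.encode("utf-8"))
        digest.update(dep.read_bytes())
    return digest.hexdigest()


def repro(stages: dict, lock: dict) -> list:
    """Run stages in order, skipping any whose dependencies match the lock.

    `stages` maps a stage name to {"deps": [Path, ...], "run": callable}.
    `lock` maps a stage name to the fingerprint recorded on the previous run
    and is updated in place. Returns the names of the stages that ran.
    """
    executed = []
    for name, stage in stages.items():
        current = deps_fingerprint(stage["deps"])
        if lock.get(name) == current:
            continue  # inputs unchanged since the last run, so skip
        stage["run"]()
        lock[name] = current
        executed.append(name)
    return executed
```

A downstream stage is fingerprinted only after its upstream stage has had the chance to run, so a change to the raw data propagates through the chain while an untouched chain costs nothing.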

MLflow vs DVC vs SageMaker Experiments vs Weights and Biases

A senior interviewer at Databricks or Amazon will rarely accept "we used MLflow" as a final answer. The follow-up is "why not SageMaker Experiments?" or "what would you switch to at 50 engineers?" The table below is the comparison you want loaded before that question lands.

| Capability | MLflow | DVC | SageMaker Experiments | Weights and Biases |
|---|---|---|---|---|
| Primary job | Experiment tracking and model registry | Data and pipeline versioning | Tracking inside AWS SageMaker | Cloud tracking with rich dashboards |
| Hosting model | Self-host or Databricks managed | Self-host, blob storage in S3/GCS | Fully managed, AWS only | SaaS, on-prem option for enterprise |
| Data versioning | Artifact logs only | Native, git-integrated pointers | Limited, via S3 paths | Artifacts API, not git-native |
| Pipeline DAG | Via Projects, rarely used | Native via dvc.yaml | Via SageMaker Pipelines | Via launch agents |
| Model registry | Yes, stage promotions | No | Yes, AWS-native | Yes, model artifacts |
| Hyperparameter sweeps | Manual or via Optuna | Via dvc exp run | Hyperparameter tuning jobs | Native sweeps with Bayesian search |
| Free tier reality | Free, you run the server | Free, you pay S3 | Pay per SageMaker hour | Free up to ~100 GB, then paid |
| Best fit | DS team that wants open-source plus optional managed | Teams with heavy data versioning needs | Shops already all-in on AWS | Deep learning teams who want pretty dashboards |

The honest read from teams shipping ML in production: pick MLflow plus DVC for an open-source stack, SageMaker Experiments plus Pipelines if you live on AWS and want one bill, or Weights and Biases when the DL researchers need the dashboards and the company will pay the SaaS bill. The point of memorizing this table is not to recite it — it is so you can answer "what would you change about your current setup?" with specifics instead of shrugs.

Common pitfalls

The reproducibility traps below are the ones that show up most often in interview post-mortems and code reviews. Each one has burned a real team.

Forgetting the random seed. The most embarrassing failure mode is shipping a model where random_state was never logged. The fix is one line — mlflow.log_param("seed", 42) — but you also need to set it everywhere randomness leaks in: the train/test split, the model itself, NumPy, Python's random, and the DataLoader for PyTorch jobs. Five places, one missed call, and the run is dead on arrival.
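
A common way to stop missing one of those calls is a single helper that seeds everything it can find. This is a minimal sketch: seed_everything is a hypothetical name, and NumPy and PyTorch are treated as optional so the function still runs when they are not installed.

```python
import os
import random


def seed_everything(seed: int) -> dict:
    """Seed every RNG that is importable here and report which ones were set."""
    seeded = {"python_random": True}
    random.seed(seed)
    # Only affects interpreters started after this point, not the current one.
    os.environ["PYTHONHASHSEED"] = str(seed)

    try:
        import numpy as np

        np.random.seed(seed)
        seeded["numpy"] = True
    except ImportError:
        seeded["numpy"] = False

    try:
        import torch

        torch.manual_seed(seed)
        seeded["torch"] = True
    except ImportError:
        seeded["torch"] = False

    return seeded
```

The helper does not cover everything: you still have to pass the same seed as random_state to the train/test split and the estimator, and give a PyTorch DataLoader its own seeded generator. Log the seed as a run parameter next to the call so the record and the behavior cannot drift apart.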

Pickling raw model objects instead of using mlflow.<framework>.log_model. Plain pickle.dump(model, ...) loses the conda environment, the input schema, and the framework version. When the load side has scikit-learn 1.4 and you trained on 1.2, the best case is a version warning nobody reads; the worst case is a silent prediction skew that surfaces three weeks later as a quality regression. Always log through the framework flavor so the loader gets the version pin.
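
To see what a bare pickle is missing, here is a toy version of the bookkeeping a model flavor does for you. It is a sketch under stated assumptions, not MLflow's format: save_with_versions and load_checked are hypothetical helpers, and the check is deliberately strict (any version difference raises), which is a design choice for the example rather than MLflow's behavior.

```python
import pickle
import platform
from importlib import metadata
from pathlib import Path


def installed_versions(packages) -> dict:
    """Map 'python' and each named package to its installed version (or None)."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def save_with_versions(model, path: Path, packages=()) -> None:
    """Pickle the model together with the versions it was trained under."""
    bundle = {"model": model, "versions": installed_versions(packages)}
    path.write_bytes(pickle.dumps(bundle))


def load_checked(path: Path):
    """Unpickle the bundle and refuse to return it if any recorded version differs."""
    bundle = pickle.loads(path.read_bytes())
    saved = bundle["versions"]
    current = installed_versions([name for name in saved if name != "python"])
    mismatched = {
        name: (saved[name], current[name])
        for name in saved
        if saved[name] != current[name]
    }
    if mismatched:
        raise RuntimeError(f"version mismatch (saved, current): {mismatched}")
    return bundle["model"]
```

In practice you would pass something like packages=("scikit-learn",) at save time. The point is only that the version record has to travel with the artifact; logging through a flavor gets you that without maintaining helpers like these.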

Checking large datasets into git. A 200 MB CSV in git history bloats the repo forever — git clone becomes a coffee break and CI runners run out of disk. DVC exists exactly for this; the fix is dvc add data/*.parquet followed by git add data/*.parquet.dvc and never the raw file. If you have already committed the big file, you also need git filter-repo or BFG to rewrite history.

Manually committing for every experiment. Some teams panic about reproducibility and start git commit-ing for each hyperparameter tweak. That is what MLflow's run_id is for — each run is its own atomic record without polluting git history. Reserve git commits for code changes that survive the experiment.

Not separating Staging from Production in the registry. Deploying "whatever model just trained" is how 3 AM rollbacks happen. The registry stages exist to enforce the gap: CI promotes to Staging, a human approves Production, and serving always reads the Production stage. Skipping this is a fast way to fail a senior MLE loop at any FAANG.

Running mlflow ui against a local store in a team setting. The local backend, whether it is the ./mlruns directory or a SQLite file, is great for solo work and useless for a team: two people cannot write to the same SQLite file over a shared drive without risking corruption. The production pattern is Postgres for the tracking server plus S3 for artifact storage, deployed once and pointed at by everyone's MLFLOW_TRACKING_URI. If you cannot stand up the server, Databricks managed MLflow is the fast path.

Skipping model signatures. mlflow.models.signature.infer_signature(X_train, y_pred) records the input and output schema with the model. Without it, a serving caller can send columns in the wrong order or with the wrong dtype, and the model may happily emit garbage instead of erroring. Signatures are five extra lines and they catch the entire class of upstream-schema-drift bugs.
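
What a signature buys you can be shown without MLflow at all. The sketch below is a toy schema check in plain Python: infer_schema and enforce_schema are hypothetical names, rows are plain dicts, and the rules are stricter than MLflow's own enforcement (for example, extra columns are rejected here).

```python
def infer_schema(rows: list) -> dict:
    """Record column names and Python type names from the first example row."""
    first = rows[0]
    return {column: type(value).__name__ for column, value in first.items()}


def enforce_schema(schema: dict, row: dict) -> dict:
    """Reject rows with missing, extra, or wrongly typed columns."""
    missing = sorted(set(schema) - set(row))
    extra = sorted(set(row) - set(schema))
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    for column, expected in schema.items():
        actual = type(row[column]).__name__
        if actual != expected:
            raise TypeError(f"column {column!r}: expected {expected}, got {actual}")
    # Return columns in schema order so positional models see a stable layout.
    return {column: row[column] for column in schema}
```

The useful property is that a bad request fails at the boundary with a message naming the column, instead of producing a plausible-looking score from the wrong inputs.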

FAQ

Is MLflow overkill for a solo project?

No, and that is part of why it has won. The local file backend takes one import and zero infrastructure, so the marginal cost over not tracking is roughly an hour. The upside shows up on run three when you cannot remember which preprocessing path produced the best score. For a team of two to ten, swap the local store for Postgres plus S3. Past that, managed MLflow on Databricks or AWS removes the ops cost entirely.

Does DVC replace git?

No, it complements it. Git tracks source code and .dvc pointer files; DVC tracks the actual data blobs that those pointers reference. A clean repo has both: git log shows code history, the .dvc and dvc.lock files committed alongside the code pin the data version at each commit, and git checkout <commit> && dvc pull reconstructs both halves of the state. If you find yourself running git lfs on parquet files, you have probably reinvented a worse DVC.

What is an MLflow model signature and why does it matter?

A signature is the input and output schema attached to a logged model — column names, dtypes, and shapes. When a downstream service loads the model, MLflow checks incoming requests against the signature and errors loudly on a mismatch. Without it, the loader silently accepts whatever it gets and your prediction quality drifts in ways that are hard to attribute. Use mlflow.models.signature.infer_signature(X_train, model.predict(X_train)) and pass it into log_model.

When should I pick Weights and Biases over MLflow?

When dashboards and built-in sweeps matter more than self-hosting. W&B has nicer visualizations out of the box, easier collaboration links, and Bayesian hyperparameter search wired into the agent. The trade-off is SaaS pricing past the free tier and weaker integration with the model-registry-to-serving handoff. Many deep learning teams use W&B for the research loop and MLflow for the production registry, which is fine.

Should I use SageMaker Experiments if my company is on AWS?

If you are already deep in SageMaker for training jobs and endpoints, yes — keeping the experiment metadata in the same control plane reduces glue code. If you are on AWS but training in Kubernetes or on EC2 with custom containers, MLflow self-hosted on the same VPC is usually easier to wire up and more portable if you ever leave AWS.

Is this content vendor-endorsed?

No. The patterns here are based on the MLflow 2.x and DVC 3.x documentation, plus what teams ship in practice. Tooling moves fast — always check the current docs before betting a production deployment on a specific API.