Hyperparameter tuning at a data science interview
Parameters vs hyperparameters
A senior DS interviewer at Stripe or DoorDash usually opens an ML round with the same warm-up: "walk me through how you'd tune an XGBoost model." It looks friendly. It is not. The real test is whether you know why grid search is wrong by default, whether you can name a realistic trial budget for a 5-hyperparameter model, and whether you understand that the validation split is what gets overfit, not just the model.
Start with the vocabulary. Parameters are what the model learns from data: weights in a neural net, coefficients in linear regression, the split thresholds inside a decision tree. Hyperparameters are what you set before training: learning rate, max_depth, hidden-layer count, the regularization strength alpha. The model never touches hyperparameters during .fit() — that is your job, and the interviewer is grading the search strategy you pick.
The failure mode without a plan is familiar. A candidate kicks off GridSearchCV on six parameters with five values each (15,625 combinations times 5 CV folds, or 78,125 fits) and walks away. Two weeks later the notebook is still running, the deadline is gone, the tech lead asks "why grid?" and the answer is silence.
Load-bearing trick: for any model with more than three meaningful hyperparameters, your default should be random search or Bayesian — not grid. Grid is a teaching aid, not a production technique.
Grid search
Grid search is the exhaustive sweep over a fixed Cartesian product. Easy to demo on a slide, almost never the right tool for a modern boosted tree or neural net.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumes X, y are already loaded; the estimator choice is illustrative.
estimator = RandomForestClassifier(random_state=42)

param_grid = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [100, 200, 500],
}
# 4 * 3 * 3 = 36 combinations, each fitted once per CV fold.
gs = GridSearchCV(estimator, param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
gs.fit(X, y)
print(gs.best_params_, gs.best_score_)
```

The arithmetic is the interview tell. That grid is 36 combinations times 5 folds, so 180 fits. Bump to five parameters with five values and you are at 3,125 combinations before cross-validation. On a 500k-row table with XGBoost at 200 trees, a single fit is roughly 30 seconds, so one pass over the grid is about 26 hours on a single machine, and 5-fold cross-validation pushes that to roughly 130 hours.
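The grid arithmetic is worth being able to reproduce on the spot. A minimal sketch follows; `grid_cost` is a hypothetical helper, and it assumes the 30-second figure is the cost of one fit on one fold.

```python
from math import prod


def grid_cost(values_per_param, cv_folds, seconds_per_fit):
    """Return (combinations, total_fits, wall_hours) for an exhaustive grid."""
    combos = prod(values_per_param)
    fits = combos * cv_folds
    hours = fits * seconds_per_fit / 3600
    return combos, fits, hours


# The 4 x 3 x 3 grid with 5-fold CV.
small = grid_cost([4, 3, 3], cv_folds=5, seconds_per_fit=30)
# Five hyperparameters with five values each, same CV and fit time.
large = grid_cost([5] * 5, cv_folds=5, seconds_per_fit=30)

print(small[0], small[1], f"{small[2]:.1f} h")  # 36 180 1.5 h
print(large[0], large[1], f"{large[2]:.1f} h")  # 3125 15625 130.2 h
```

Adding one more five-value axis multiplies every number in the second line by five, which is the exponential blowup in concrete terms.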
The deeper problem is geometry, not wall time. On most real datasets only two or three of your hyperparameters actually move the validation score, and grid spends most of its budget walking up and down the inert axes: every extra value on an axis that does not matter multiplies the number of fits without testing anything new on the axes that do. This is the result Bergstra and Bengio formalized in 2012.
Use grid when you have two or three hyperparameters, known good ranges, and a hard reason to want full coverage — a benchmark table for a paper, say. Otherwise pick something else.
Random search and why it usually wins
Random search samples points from a distribution over the hyperparameter space. Same compute budget as grid, dramatically better coverage on the axes that matter.
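```python
from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Assumes X, y are already loaded; the estimator choice is illustrative.
estimator = GradientBoostingClassifier(random_state=42)

param_dist = {
    "max_depth": randint(3, 15),             # integers 3..14 (upper bound exclusive)
    "min_samples_split": randint(2, 20),     # integers 2..19
    "learning_rate": loguniform(0.01, 0.3),  # log-uniform on [0.01, 0.3]
}
rs = RandomizedSearchCV(
    estimator,
    param_dist,
    n_iter=50,
    cv=5,
    scoring="roc_auc",
    random_state=42,
)
rs.fit(X, y)
print(rs.best_params_, rs.best_score_)
```

The intuition is sharper than the proof. With a 5x5 grid you evaluate 25 models but only 5 unique values per axis. With 25 random draws you get 25 unique values per axis (unevenly spaced, but distinct). If one axis truly drives the score and the others are flat, random samples the live axis 5x more densely for free.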
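The coverage argument can be checked with a few lines of standard-library Python. This is an illustrative sketch on a two-axis unit square, not a benchmark; the axis scaling and the seed are arbitrary choices.

```python
import itertools
import random

rng = random.Random(0)

# 5 x 5 grid over two axes scaled to [0, 1]: 25 trials.
axis_values = [i / 4 for i in range(5)]
grid_trials = list(itertools.product(axis_values, axis_values))

# 25 random trials over the same square.
random_trials = [(rng.random(), rng.random()) for _ in range(25)]

# Count how many distinct values each method tried on the first axis.
grid_distinct = len({x for x, _ in grid_trials})
random_distinct = len({x for x, _ in random_trials})

print(len(grid_trials), grid_distinct)      # 25 trials, 5 distinct x values
print(len(random_trials), random_distinct)  # 25 trials, one distinct x per trial
```

If the first axis is the only one that matters, the grid has effectively run five experiments five times each, while random search has run 25 different ones.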
The other reason this is the interview default is predictability under a budget. You can tell your manager "I will run 100 trials overnight" and actually hit that number. 50 to 200 iterations is the standard range: under 50 for cheap models with three parameters, over 200 only if a single fit is well under a minute.
Sanity check: if you cannot finish your search in under one wall-clock day on the hardware you actually have, your search space is wrong before your algorithm is wrong. Cut the ranges before you switch tools.
Bayesian optimization with Optuna
Bayesian methods use the trials already evaluated to predict where the next promising point is. The mental model is a surrogate function (Gaussian Process or Tree-structured Parzen Estimator) fit to the (hyperparams) → score pairs, and an acquisition function (Expected Improvement is classic) that proposes the next point.
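To make the acquisition step concrete, here is a sketch of the closed-form Expected Improvement for a maximized metric when the surrogate predicts a Gaussian with mean `mu` and standard deviation `sigma` at a candidate point. The function name and the example numbers are illustrative assumptions. This is the Gaussian Process flavor of the idea; TPE ranks candidates with a density ratio instead, so treat this as intuition rather than a description of any particular library's internals.

```python
from math import erf, exp, pi, sqrt


def expected_improvement(mu, sigma, best_so_far):
    """EI for maximization under a Gaussian prediction N(mu, sigma^2)."""
    if sigma <= 0:
        # No uncertainty: the improvement is known exactly.
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    pdf = exp(-0.5 * z * z) / sqrt(2 * pi)   # standard normal density
    cdf = 0.5 * (1 + erf(z / sqrt(2)))       # standard normal CDF
    return (mu - best_so_far) * cdf + sigma * pdf


best = 0.80  # best validation score observed so far (made-up number)

# A confident prediction slightly above the incumbent...
ei_confident = expected_improvement(mu=0.81, sigma=0.005, best_so_far=best)
# ...versus an uncertain prediction slightly below it.
ei_uncertain = expected_improvement(mu=0.79, sigma=0.05, best_so_far=best)

print(ei_confident < ei_uncertain)  # True
```

The uncertain candidate wins here even though its predicted mean is lower, which is the exploration half of the exploration/exploitation trade-off.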
In production, the tool is Optuna: TPE by default, parallel workers, SQLite or Postgres storage, pruners for early stopping.
```python
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier


def objective(trial):
    # Assumes X, y are already loaded in the enclosing scope.
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 15),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.5, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 2000),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
    }
    model = XGBClassifier(**params, tree_method="hist", n_jobs=-1)
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()


study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=42),
    storage="sqlite:///hp_study.db",
    study_name="xgb_v1",
    load_if_exists=True,  # reattach to the stored study instead of failing
)
# n_jobs=4 runs four trials concurrently; with parallel trials the sampler
# seed alone does not make the run reproducible.
study.optimize(objective, n_trials=100, n_jobs=4)
print(study.best_params, study.best_value)
```

Two notes that signal seniority. First, log-uniform sampling for learning rate is non-negotiable: a uniform draw over [0.001, 0.5] puts about 80% of trials above 0.1 and under 2% below 0.01, so the low end of the range is barely explored. Second, persisting the study with a storage URL means a killed notebook does not destroy 8 hours of work; you reattach with load_if_exists=True and pick up at trial 47.
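The effect of the sampling scale on learning rate is easy to measure. The sketch below uses only the standard library and the same [0.001, 0.5] range as the Optuna example; the sample size, seed, and 0.01 cutoff are arbitrary illustration choices.

```python
import math
import random

rng = random.Random(42)
LOW, HIGH = 1e-3, 0.5
N = 10_000

# Uniform on the raw scale.
linear = [rng.uniform(LOW, HIGH) for _ in range(N)]
# Uniform on the log scale, mapped back with exp (log-uniform).
log_draws = [math.exp(rng.uniform(math.log(LOW), math.log(HIGH))) for _ in range(N)]


def share_below(draws, cutoff):
    """Fraction of draws strictly below the cutoff."""
    return sum(d < cutoff for d in draws) / len(draws)


print(f"linear draws below 0.01: {share_below(linear, 0.01):.1%}")
print(f"log draws below 0.01:    {share_below(log_draws, 0.01):.1%}")
```

In expectation the linear sampler spends roughly 2% of its trials below 0.01, while the log-uniform sampler spends roughly 37% there, because each decade gets equal weight.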
The trade-off is surrogate overhead. When a single fit is under a few seconds, random with 200 trials often matches Bayesian with 50 in wall time. Bayesian shines when one trial is expensive — a deep network or anything with 5+ hyperparameters and a tight compute budget.
Hyperband and Successive Halving
The fourth family attacks the budget problem differently: spend less compute on bad candidates. Most hyperparameter combinations declare themselves bad within the first epoch or the first 50 trees — why train them for 500 trees to confirm?
Successive Halving is the simple version. Start n candidates with a tiny budget (one epoch, 50 trees), keep the top half by validation score, double the budget, repeat until one survives. Hyperband wraps a meta-loop around Successive Halving that runs several inner-halving configurations with different aggressiveness — this hedges against noisy early signal.
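The halving loop described above can be sketched in a few lines. Everything here is a toy under stated assumptions: `successive_halving` and `toy_score` are hypothetical names, the learning curve is a made-up saturating function, and real scores would be noisy, which is exactly the case Hyperband hedges against.

```python
import math
import random


def successive_halving(candidates, evaluate, min_budget=50):
    """Keep the top half each round, doubling the budget, until one survives."""
    survivors = list(candidates)
    budget = min_budget
    history = []  # (budget, number of candidates evaluated at that budget)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        history.append((budget, len(survivors)))
        survivors = ranked[: max(1, len(survivors) // 2)]
        budget *= 2
    return survivors[0], history


rng = random.Random(0)
candidates = [{"id": i, "quality": rng.random()} for i in range(16)]


def toy_score(candidate, budget):
    # Hypothetical learning curve: the score saturates toward the candidate's
    # latent quality as the budget (e.g. number of trees) grows.
    return candidate["quality"] * (1 - math.exp(-budget / 200))


winner, history = successive_halving(candidates, toy_score)
print(history)  # [(50, 16), (100, 8), (200, 4), (400, 2)]
```

Sixteen candidates cost 16×50 + 8×100 + 4×200 + 2×400 = 3,200 budget units here, against 6,400 for training all sixteen at the final budget of 400.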
In practice, you rarely run textbook Hyperband. You run Optuna with a pruner: MedianPruner stops a trial when its best intermediate score so far at step t is worse than the median of earlier trials' scores at the same step. The pruner only acts if the objective reports progress with trial.report(score, step) and checks trial.should_prune() inside the training loop.

```python
study = optuna.create_study(
    direction="maximize",
    # No pruning during the first 10 reported steps of each trial.
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
)
```

Pruning earns its keep when a single trial is long (deep nets, large transformer fine-tunes) and the validation curve is monotonic enough that early epochs predict late ones. It is mostly wasted on a 200-tree XGBoost classifier that finishes in 30 seconds.
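For intuition, the median rule itself fits in a dozen lines. This is a simplified standalone sketch for a maximized metric, not Optuna's implementation: `should_prune` is a hypothetical function, the curves are made-up numbers, and details such as a minimum number of startup trials are left out.

```python
from statistics import median


def should_prune(current_curve, finished_curves, n_warmup_steps=10):
    """Median rule: prune if the trial's best score so far is below the
    median of earlier trials' scores at the same step."""
    step = len(current_curve) - 1
    if step < n_warmup_steps:
        return False  # too early to judge
    peers = [curve[step] for curve in finished_curves if len(curve) > step]
    if not peers:
        return False  # nothing to compare against yet
    return max(current_curve) < median(peers)


finished = [
    [0.50, 0.60, 0.70],
    [0.55, 0.65, 0.75],
    [0.40, 0.50, 0.60],
]

print(should_prune([0.30, 0.40, 0.50], finished, n_warmup_steps=0))  # True
print(should_prune([0.60, 0.70, 0.80], finished, n_warmup_steps=0))  # False
print(should_prune([0.30, 0.40, 0.50], finished))                    # False
```

The last call returns False only because the default warm-up of 10 steps has not elapsed, which shows how the warm-up protects slow starters.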
Search method comparison
This is the table the interviewer wants on the whiteboard. Memorize the shape.
| Method | Compute per useful trial | Scales to 10+ HPs | Parallelizes well | Typical budget | Best fit |
|---|---|---|---|---|---|
| Grid search | Wasteful — explores inert axes | No, exponential blowup | Trivially | 2-3 HPs, full sweep | Benchmarks, teaching examples |
| Random search | Good baseline coverage | Yes, linearly | Trivially | 50-200 trials | Default for tabular ML |
| Bayesian (Optuna TPE) | Best per-trial efficiency | Yes, with care | Moderate — surrogate is sequential | 30-100 trials | Expensive fits, 5+ HPs |
| Hyperband / Pruning | Best wall-clock when trials are long | Yes | Yes | 100-500 candidates, most pruned | Deep nets, long-trained boosters |
Random search stays the safe default for tabular models where a fit is under a minute. Bayesian dominates when a single trial costs over five minutes and you have at least five hyperparameters in play. Hyperband wins when the validation curve is informative early — neural nets, large LightGBM runs with early_stopping_rounds, anything with epochs.
Practical tuning order for popular models
Tuning order matters more than the algorithm. The orderings below are what senior DS at Meta and Netflix actually use.
For XGBoost, LightGBM, and CatBoost: lock n_estimators at 500 to 1,000 with learning_rate at 0.05 to 0.1, everything else at defaults — that is your baseline. Tune max_depth in 3 to 10 with min_child_weight (XGB) or min_data_in_leaf (LGBM). Then subsample and colsample_bytree in 0.7 to 1.0 for overfit control. Then reg_alpha, reg_lambda. Final move: drop learning_rate to 0.01 to 0.03 and let n_estimators grow with early_stopping_rounds=50.
For Random Forest: n_estimators in 100 to 500 is monotonically better, just slower — no overfit risk from more trees. Tune max_depth, min_samples_split, min_samples_leaf against overfit. max_features='sqrt' for classification, or 0.3 to 0.7 of features for regression.
For logistic and linear regression: regularization strength C (or alpha), the penalty type (l1, l2, elasticnet), and class_weight='balanced' for imbalance. Five values of C on a log scale and a coin flip between l1 and l2 is often the whole search.
For neural networks: learning rate is king. Run an LR finder first, then tune batch_size in 32 to 512, weight_decay around 1e-4, dropout if applicable, and only then move the architecture. Tuning hidden-layer counts before learning rate is the rookie move that breaks an ML system design round.
Common pitfalls
Tuning on the test set. Hyperparameters are chosen on validation (or via cross-validation); the test set is touched exactly once, at the end, for the final number. Re-tune after seeing test performance and you have promoted the test set to a validation set — your generalization estimate is no longer honest. On small datasets, nested CV is the rigorous fix; otherwise a clean three-way split with a locked test holdout is enough.
Not using stratified CV for classification. A plain KFold on an imbalanced binary problem can land a fold with zero positives, making AUC undefined and the search unstable. StratifiedKFold preserves class proportions in each fold. The same logic applies to GroupKFold when rows are not independent — multiple events per user, for example.
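The zero-positives failure is easy to reproduce without any library. The sketch below is a toy with hypothetical helper names: it assumes a 1%-positive dataset whose rows happen to be sorted so the positives come first, and compares contiguous folds (what an unshuffled K-fold produces) with a simple round-robin stratification. It illustrates the idea behind StratifiedKFold rather than reproducing scikit-learn's algorithm.

```python
def contiguous_folds(n_rows, k):
    """Split row indices into k contiguous folds, like an unshuffled K-fold."""
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def stratified_folds(labels, k):
    """Deal each class's rows round-robin so every fold keeps the class mix."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        rows = [i for i, y in enumerate(labels) if y == cls]
        for j, row in enumerate(rows):
            folds[j % k].append(row)
    return folds


labels = [1] * 10 + [0] * 990  # 1% positives, sorted so positives come first

plain = [sum(labels[i] for i in fold) for fold in contiguous_folds(len(labels), 5)]
strat = [sum(labels[i] for i in fold) for fold in stratified_folds(labels, 5)]

print(plain)  # [10, 0, 0, 0, 0]
print(strat)  # [2, 2, 2, 2, 2]
```

Four of the five plain folds contain no positives at all, so AUC cannot be computed on them; the stratified folds each keep two.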
Tuning every hyperparameter at once. With ten knobs, even Bayesian struggles. The fix is staging: tune the two or three highest-impact hyperparameters first (learning rate and depth for trees, learning rate and weight decay for nets), freeze them, then move to the second tier. This also produces a tuning log that is much easier to defend than "I gave Optuna 500 trials and trusted it."
Skipping early stopping in boosters. early_stopping_rounds in XGBoost and LightGBM roughly halves compute and protects against overfit-by-trees. The cost is one eval_set argument. Not using it on a 2,000-tree booster is the kind of detail that makes a tech lead ask "have you actually shipped this?"
Ignoring reproducibility. Without a fixed random_state on the model, the CV splitter, and the sampler, two runs of the same search produce different "best" hyperparameters. Pin all three seeds, persist Optuna studies to disk, and log the git commit hash and library versions alongside best_params_ — when the model regresses in production, that log is the only way back.
Related reading
- Bayesian optimization at the data science interview
- Cross-validation strategies at the data science interview
- Decision trees at the data science interview
- Feature engineering at the data science interview
FAQ
How many trials are enough for random search?
Between 50 and 200 for most tabular problems. Choose inside that band by per-trial cost: if a fit takes under one minute, push toward 200 because the marginal cost is small. If a fit takes over ten minutes, stop at 50 and switch to Bayesian, where each trial does more work. A useful sanity check is to plot the running best score against trial number — if it has been flat for 30 trials, you are done.
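The plateau check can also be automated rather than read off a plot. A minimal sketch, assuming a maximized metric and a list of per-trial scores in the order they finished; the function names and the example scores are hypothetical.

```python
def trials_since_improvement(scores):
    """Number of trials since the running best (maximized) last improved."""
    best = float("-inf")
    last_improved = -1
    for i, score in enumerate(scores):
        if score > best:
            best, last_improved = score, i
    return len(scores) - 1 - last_improved


def has_plateaued(scores, patience=30):
    """True once the running best has been flat for `patience` trials."""
    return trials_since_improvement(scores) >= patience


# Three improving trials followed by 30 that never beat 0.80.
scores = [0.70, 0.75, 0.80] + [0.79] * 30
print(trials_since_improvement(scores), has_plateaued(scores))  # 30 True
```

The patience of 30 mirrors the rule of thumb above; a noisier metric may justify a larger value.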
Grid or random search for XGBoost?
Random or Optuna, essentially every time. XGBoost has at least five hyperparameters that materially move validation AUC — max_depth, learning_rate, min_child_weight, subsample, colsample_bytree — and an exhaustive grid is impossible at any useful resolution. The honest exception is a final-pass refit over just n_estimators and learning_rate, where a tight 4x4 grid is fine.
Should learning rate be tuned on a log scale?
Yes, always. Learning rate sensitivity is multiplicative: the gap between 0.001 and 0.01 matters as much as the gap between 0.01 and 0.1. A three-point grid like [0.001, 0.01, 0.1] is log-spaced but tests only three values across two orders of magnitude, and a uniform draw over the same range puts about 91% of its samples above 0.01. Use log=True in Optuna or loguniform from scipy.stats for RandomizedSearchCV. The same applies to weight_decay, reg_alpha, and reg_lambda.
Can I tune on the same data I evaluate on?
No. The split is train for parameters, validation for hyperparameters, test for the final number. Re-using test during tuning leaks information and inflates your production estimate — sometimes by 2 to 5 percentage points of AUC, enough to ship a model that quietly underperforms its forecast. On small data, use nested cross-validation.
Optuna or scikit-optimize?
Optuna. More ergonomic API, native pruning, persistent storage, parallel workers, and a much more active development cadence. Scikit-optimize still works, but the ecosystem has converged on Optuna for production tuning workflows.
Is this official guidance?
No. The post draws on Bergstra and Bengio (2012) for random search, Snoek et al. (2012) for Bayesian optimization, Li et al. (2017) for Hyperband, and the current Optuna and scikit-learn documentation. Specific number ranges are senior-DS rules of thumb — adjust them to your data and your hardware.