Feature engineering in the data science interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why interviewers keep asking

When a Stripe or DoorDash loop ends in a live whiteboard, the deciding question is rarely "which gradient boosting library do you prefer". It is "walk me through how you built features for this dataset". Feature engineering carries more signal about judgement than algorithm choice — picking XGBoost over LightGBM rarely costs 5 points of AUC, but target-encoding without out-of-fold splits loses the entire model.

The questions cluster: how to encode a city column with 40,000 levels, why to use cyclic features for hour-of-day, when scaling matters, and how to avoid leaking the label. Most have one right answer and three traps. This post walks through each cluster in the order a typical ML pipeline covers them.

Load-bearing rule: every transformer that learns from data (scalers, encoders, imputers) must be fit on the training fold only and then applied, unchanged, to validation and test. Breaking this rule is the single most common reason a candidate gets a polite rejection.

Scaling and normalization

Three scalers cover 95% of tabular work. StandardScaler centers each column to mean zero, unit variance — what linear models, neural nets, k-means, and PCA assume. MinMaxScaler squashes to [0, 1] for image pixels or sigmoid inputs. RobustScaler uses median and IQR, the right call when the column has heavy outliers.
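To make the difference concrete, here is a minimal sketch that fits the three scalers on a made-up income column containing one extreme outlier. The numbers and variable names are illustrative assumptions, not data from a real interview task.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical income column (in thousands): nine ordinary values, one outlier.
train = np.array([30, 35, 40, 45, 50, 55, 60, 65, 70, 1000], dtype=float).reshape(-1, 1)
test = np.array([[50.0], [80.0]])

scaled = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    scaler.fit(train)  # statistics are learned from the training data only
    scaled[name] = (scaler.transform(train), scaler.transform(test))
    print(name, np.round(scaled[name][1].ravel(), 3))
```

With the outlier present, MinMaxScaler squeezes every ordinary value into roughly the bottom 4% of the [0, 1] range, while RobustScaler keeps them spread around zero because the median and IQR barely move.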

Tree-based models (random forests, gradient boosting, CatBoost) are invariant to monotonic rescaling of a feature. A split on income > 70000 is identical to a split on z > 0.4 once income has been standardized. Mention this proactively; it shows you understand what the model does. The classic mistake is dropping a StandardScaler into a pipeline that ends in GradientBoostingRegressor.
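The invariance claim is easy to check empirically. The sketch below trains the same decision tree on raw and standardized copies of a synthetic dataset (the income and age columns are made up for illustration) and compares the predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200
# Made-up data: distinct, evenly spaced incomes and integer ages.
income = rng.permutation(n) * 500.0 + 20_000.0
age = rng.integers(18, 70, size=n).astype(float)
X_raw = np.column_stack([income, age])
y = 0.001 * income + 0.5 * age + rng.normal(0, 1, size=n)

X_std = StandardScaler().fit_transform(X_raw)

# Same hyperparameters and seed; only the feature scale differs.
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_raw, y)
tree_std = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_std, y)

pred_raw = tree_raw.predict(X_raw)
pred_std = tree_std.predict(X_std)
same = np.allclose(pred_raw, pred_std)
print("identical predictions:", same)
```

Standardizing preserves the ordering of every column, so both trees partition the rows the same way and only the numeric thresholds differ.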

Model family | Needs scaling? | Why
Linear / logistic regression | Yes | Coefficients become incomparable; L1/L2 penalties skewed
Neural networks | Yes | Optimizer struggles with mixed-magnitude inputs
k-NN, k-means, PCA | Yes | Distance and variance are scale-dependent
SVM (RBF) | Yes | Kernel bandwidth assumes comparable feature ranges
Random forest, gradient boosting | No | Split thresholds are rank-based
Naive Bayes (Gaussian) | Optional | Per-feature parameters absorb scale

Encoding categoricals

Categorical encoding is where most candidates lose interview points, because the right choice depends on cardinality, the downstream model, and whether the target is involved. The table below is what I expect a strong candidate to draw on a whiteboard within thirty seconds.

Method | Best for | Cardinality | Risk
One-hot | Linear, NN, small cardinality | < 50 levels | Sparse explosion at 10k+ levels
Label / ordinal | Trees with truly ordered categories | Any | Imposes false order on nominal data
Target (mean) | High-cardinality (city, zip, sku) | 100 – 1M | Leakage without out-of-fold splits
Frequency | Trees, when popularity correlates with target | Any | Two unrelated rare levels collide
Weight of evidence (WoE) | Binary classification, credit scoring | Moderate | Needs binning; unstable for rare classes
Hashing | Streaming, online learning | > 1M levels, unbounded | Random collisions, irreversible
Native (CatBoost / LightGBM) | Trees with mixed categorical/numeric | Any | Library-specific; verify the column dtype

One-hot turns "red" into [1, 0, 0]: the safe default for linear models under fifty levels. Label encoding maps "red" → 0, "blue" → 1, which is fine for trees but breaks linear models by inventing an order. Target encoding replaces each level with the conditional target mean; it is strong on high-cardinality columns like zip_code, but it leaks the label without out-of-fold splits. Frequency encoding swaps each level for its share of the training set. WoE maps each bin to log(P(bin | good) / P(bin | bad)), that is, the share of all goods that fall in the bin over the share of all bads, which is the canonical move for credit scoring. Hashing is the right answer for user-agent strings or URLs at scale.

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# One-hot — safe default for linear models, small cardinality
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=True)
X_ohe = ohe.fit_transform(train[["color"]])

# Frequency — handy for trees on high-cardinality columns
freq = train["city"].value_counts(normalize=True)
train["city_freq"] = train["city"].map(freq)
test["city_freq"] = test["city"].map(freq).fillna(0.0)

Sanity check: if your one-hot matrix has more columns than rows, you picked the wrong encoder. Move to target, frequency, or hashing.
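Weight of evidence is the encoder from the table above that candidates most often get wrong on the whiteboard, so here is a minimal sketch on made-up credit data. The column names, the good-over-bad sign convention, and the additive smoothing constant eps are all assumptions for illustration; some references flip the ratio.

```python
import numpy as np
import pandas as pd

# Made-up credit data: good = 1 means repaid, good = 0 means defaulted.
train = pd.DataFrame({
    "grade": ["A"] * 40 + ["B"] * 40,
    "good": [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30,
})

counts = train.groupby("grade")["good"].agg(n_good="sum", n_total="count")
counts["n_bad"] = counts["n_total"] - counts["n_good"]

eps = 0.5  # assumed smoothing so an empty cell never produces log(0)
share_good = (counts["n_good"] + eps) / (counts["n_good"].sum() + eps * len(counts))
share_bad = (counts["n_bad"] + eps) / (counts["n_bad"].sum() + eps * len(counts))
woe = np.log(share_good / share_bad)

# Levels never seen in training get 0.0, i.e. no evidence either way.
train["grade_woe"] = train["grade"].map(woe).fillna(0.0)
print(woe.round(3))
```

Like any target-derived encoder, the WoE table should be computed on the training fold only and then mapped onto validation and test.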

Datetime and cyclic features

Raw timestamps are useless to most models. First, extract calendar pieces — year, month, day, dayofweek, hour, is_weekend, is_holiday. Second, recognize that several are cyclic: hour 23 is one step from hour 0, but as integers they sit 23 units apart. Linear models and neural nets treat them as opposites; trees handle it via two-sided splits.

df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
df["dow_sin"]  = np.sin(2 * np.pi * df["dayofweek"] / 7)
df["dow_cos"]  = np.cos(2 * np.pi * df["dayofweek"] / 7)

Year is not cyclic. Month and day-of-week are. Day-of-year is cyclic if the phenomenon is seasonal, less so for a trending metric. Encoding a non-cyclic feature with sin/cos just adds noise.
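A quick numeric check shows what the sin/cos pair buys you. In this sketch, encode_hour is a hypothetical helper name; it maps hours onto the unit circle and compares distances between a few hours.

```python
import numpy as np

def encode_hour(hour):
    """Map hours in [0, 24) to points on the unit circle."""
    angle = 2 * np.pi * np.asarray(hour, dtype=float) / 24
    return np.column_stack([np.sin(angle), np.cos(angle)])

pts = encode_hour([23, 0, 1, 12])
d_23_0 = np.linalg.norm(pts[0] - pts[1])  # 23:00 vs 00:00
d_0_1 = np.linalg.norm(pts[1] - pts[2])   # 00:00 vs 01:00
d_0_12 = np.linalg.norm(pts[1] - pts[3])  # 00:00 vs 12:00

print(round(d_23_0, 3), round(d_0_1, 3), round(d_0_12, 3))
```

Hours 23 and 0 end up exactly as close as hours 0 and 1, and midnight sits at the maximum distance from noon, which is the geometry a linear model or neural net needs.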

Lag and rolling features

For time series, the lift comes from telling the model what the recent past looked like. Lag features shift backward, rolling features summarize a window. Both must respect time order — no random k-fold split.

df = df.sort_values(["user_id", "ts"])
by_user = df.groupby("user_id")["amount"]
df["lag_1"]          = by_user.shift(1)
df["lag_7"]          = by_user.shift(7)
# shift and roll inside each group so a window never crosses a user boundary
df["rolling_mean_7"] = by_user.transform(lambda s: s.shift(1).rolling(7).mean())
df["rolling_std_30"] = by_user.transform(lambda s: s.shift(1).rolling(30).std())

Notice the shift(1) before rolling: without it, the window includes the row you are predicting. That is leakage hidden in pandas idioms. The window must also be computed inside each group, otherwise one user's rows spill into the next user's history. Aggregations per group, such as average order value over 30 days per user or refund rate by merchant over the last week, are often among the strongest features you ship.
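A tiny worked example makes the effect of the shift visible. The frame below is made up, and the column names leaky_mean_2 and prev_mean_2 are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "ts": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01", "2024-01-02"]
    ),
    "amount": [10.0, 20.0, 30.0, 100.0, 200.0],
}).sort_values(["user_id", "ts"])

by_user = df.groupby("user_id")["amount"]
# Leaky: the window ends on the current row, so it contains the value to predict.
df["leaky_mean_2"] = by_user.transform(lambda s: s.rolling(2, min_periods=1).mean())
# Safe: shift first, so the window only covers strictly earlier rows of the same user.
df["prev_mean_2"] = by_user.transform(lambda s: s.shift(1).rolling(2, min_periods=1).mean())

print(df[["user_id", "amount", "leaky_mean_2", "prev_mean_2"]])
```

The first row of each user has no history, so prev_mean_2 is NaN there; that is the honest answer, and the model (or an imputer fit on train) should handle it explicitly.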

Target encoding and leakage

Target encoding is both the highest-lift trick for high-cardinality columns and the most common source of catastrophic leakage. The naive version df.groupby("city")["target"].transform("mean") is wrong because each row contributed its own label to the mean. On train it looks magical; on test it collapses.

The fix is out-of-fold encoding. Split into k folds, and for each fold compute the encoding using only the other k-1 folds.

from sklearn.model_selection import KFold

# X is the training frame and still holds the "target" column
oof = np.zeros(len(X))
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for tr, va in kf.split(X):
    fold = X.iloc[tr]
    means = fold.groupby("city")["target"].mean()
    # cities absent from the other folds fall back to those folds' global mean
    oof[va] = X.iloc[va]["city"].map(means).fillna(fold["target"].mean()).to_numpy()
X["city_te"] = oof

# For the test set, use the mean computed on the full training set
full_means = X.groupby("city")["target"].mean()
X_test["city_te"] = X_test["city"].map(full_means).fillna(X["target"].mean())
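To see how large the leak can be, the sketch below compares naive and out-of-fold encodings on synthetic data where the label is a coin flip, so the city column carries no real signal. The data, seed, and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Made-up data: 1,000 cities with two rows each and a random binary label.
df = pd.DataFrame({
    "city": np.repeat(np.arange(1000), 2),
    "target": rng.integers(0, 2, size=2000),
})

# Naive encoding: every row's own label is inside its city mean.
df["te_naive"] = df.groupby("city")["target"].transform("mean")

# Out-of-fold encoding: each row is encoded from folds that exclude it.
oof = np.full(len(df), np.nan)
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    fold = df.iloc[tr]
    means = fold.groupby("city")["target"].mean()
    oof[va] = df.iloc[va]["city"].map(means).fillna(fold["target"].mean()).to_numpy()
df["te_oof"] = oof

corr_naive = df["te_naive"].corr(df["target"])
corr_oof = df["te_oof"].corr(df["target"])
print(f"naive: {corr_naive:.2f}  out-of-fold: {corr_oof:.2f}")
```

The naive feature correlates strongly with a label it cannot possibly predict, while the out-of-fold version stays close to zero. That gap is exactly what a model would memorize on train and lose on test.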

Two follow-ups always come up. First, smoothing: a city with two observations gives a noisy mean. Blend with the global mean using (n * local + alpha * global) / (n + alpha); alpha = 10 to 30 is a sane band. Second, CatBoost does this automatically with ordered target encoding, one reason it wins on tabular contests with many categoricals.
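The smoothing formula can be sketched in a few lines. The helper name smoothed_target_means and the toy frame are hypothetical; in a real pipeline the same function would be called inside each training fold, not on the full frame.

```python
import numpy as np
import pandas as pd

def smoothed_target_means(frame, col, target, alpha=20.0):
    """Per-level target mean shrunk toward the global mean:
    (n * local + alpha * global) / (n + alpha)."""
    global_mean = frame[target].mean()
    stats = frame.groupby(col)[target].agg(["mean", "count"])
    return (stats["count"] * stats["mean"] + alpha * global_mean) / (stats["count"] + alpha)

# Made-up data: a city with 2 rows (both positive) and one with 98 rows.
train = pd.DataFrame({
    "city": ["tiny"] * 2 + ["big"] * 98,
    "target": [1, 1] + [1] * 48 + [0] * 50,
})
enc = smoothed_target_means(train, "city", "target", alpha=20.0)
print(enc.round(4))
```

The two-row city has a raw mean of 1.0 but lands at 12/22, about 0.545, after shrinkage toward the global mean of 0.5, while the 98-row city barely moves from its raw mean.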

Feature selection

Once you have 200 features, removing dead weight matters for training time, interpretability, and overfitting. Three families: filters (univariate — correlation, mutual information — fast but blind to interactions), wrappers (RFE, forward selection — accurate but slow), and embedded (L1, feature_importances_, SHAP — usually the right starting point).

The pragmatic recipe: drop near-zero variance columns, drop one of any pair with correlation above 0.95, train a baseline GBDT and inspect feature_importances_, then confirm with permutation importance — GBDT importance is biased toward high-cardinality features, while permutation scrambles a column and measures the honest AUC drop. Drop the bottom, refit, repeat.
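The last step of that recipe can be sketched as follows. The data is synthetic: one genuinely predictive column and one id-like column with a unique value per row; the model choice and all names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)                   # truly predictive
noise_id = rng.permutation(n).astype(float)   # unique per row, like a user id
X = np.column_stack([signal, noise_id])
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Scramble each column on held-out data and measure the drop in AUC.
perm = permutation_importance(
    model, X_va, y_va, scoring="roc_auc", n_repeats=10, random_state=0
)
print("built-in importance:   ", np.round(model.feature_importances_, 3))
print("permutation importance:", np.round(perm.importances_mean, 3))
```

Running the permutation step on a held-out split matters: on the training data, a column the model has memorized can still look important.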

Common pitfalls

The most expensive mistake is fitting any transformer on the union of train and test. A scaler fit on all data has seen the test mean and variance; an encoder has seen which levels appear in test. Fit on the training fold only, transform on every other partition. This applies to imputers, scalers, encoders, and any learned transform.

The second is target encoding without out-of-fold splits. The train feature looks magical, the test feature is nearly random. Write the fold loop yourself or use sklearn.preprocessing.TargetEncoder (scikit-learn 1.3+), whose fit_transform cross-fits internally with cv=5 by default. A related trap is computing aggregations like "user's mean order value" with rows from after the prediction timestamp; always filter to ts < prediction_ts first.

A third pitfall is building lag features with random k-fold cross-validation. A random split puts a January row in the validation fold while the training fold holds February rows whose lag features were built from that same January value. The model has effectively seen the answer, so CV looks great and production collapses. Use TimeSeriesSplit or a manual rolling-origin scheme for any temporal data.

The fourth is one-hot encoding high-cardinality columns. A merchant_id with 100k levels becomes 100k mostly-zero columns; logistic regression chokes, tree libraries crawl through the sparse matrix, and the model rarely beats a simple frequency or target encoder. Switch encoders by cardinality.

The fifth is trusting feature_importances_ blindly. Depending on the library, the default GBDT importance is based on split counts or on impurity gain, and both are biased toward high-cardinality features: a user-id column scores high because the tree can keep splitting on it, not because it generalizes. Use permutation importance or SHAP for any decision that costs money.

The sixth is ignoring missing values as a signal. A NaN in last_login_ts for a churn model is not noise — it usually means the user never logged in. Create an is_missing indicator before you impute, and let the model decide which carries more information.
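The indicator-then-impute pattern from the last pitfall looks like this in pandas. The column days_since_login is a hypothetical numeric stand-in for last_login_ts, and the median fill is just one reasonable choice.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"days_since_login": [1.0, 3.0, np.nan, 10.0, np.nan]})
test = pd.DataFrame({"days_since_login": [np.nan, 2.0]})

# The fill value is learned on train only, then reused everywhere.
fill_value = train["days_since_login"].median()

for part in (train, test):
    # Record missingness before imputing, otherwise the signal is erased.
    part["login_missing"] = part["days_since_login"].isna().astype(int)
    part["days_since_login"] = part["days_since_login"].fillna(fill_value)

print(train)
```

If you prefer to stay inside scikit-learn, SimpleImputer with add_indicator=True produces the same pair of columns and fits cleanly into a Pipeline.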

If you want to drill feature-engineering problems like these end to end, NAILDD is shipping with hundreds of DS interview questions across encoding, leakage, and time-aware splits.

FAQ

When does feature engineering not matter?

For pure deep learning on raw inputs — images, audio, free text — the network learns its own representation and hand-crafted features rarely help. On tabular data with mixed numeric and categorical columns, feature engineering is almost always the deciding factor; gradient-boosted trees on well-engineered features routinely beat neural networks on the same data, which is why Kaggle tabular contests are still won by feature work, not by architecture.

What is a feature store and when do I need one?

A feature store versions feature definitions and serves them at both training and inference time from the same code path. Feast (open source) and Tecton (managed) are two common options. You need one when more than one model in production shares features, or when you find drift between training and serving: the store guarantees that the logic that computed user_avg_order_value_30d for training is the same logic that serves it at request time.

How do I choose between target encoding and one-hot?

Cardinality decides. Under fifty levels, one-hot is the safe default for linear and NN. Between fifty and a few thousand, out-of-fold target encoding with smoothing usually wins for any model family. Above a few thousand — user IDs, URLs, merchant identifiers — target encoding still works but hashing or learned embeddings start being worth the complexity, especially in streaming systems.

Should I scale before or after the train/test split?

After. Always after. Fit on the training fold, transform on validation and test. Fitting on the full dataset leaks the test distribution into training, inflates the offline score, and underperforms in production. Wrap transforms in a scikit-learn Pipeline so the framework enforces the order.
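A minimal sketch of the Pipeline approach, on synthetic data with deliberately mismatched feature scales (the data and the choice of logistic regression are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Four features whose magnitudes differ by factors of ten.
X = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = (X[:, 0] + X[:, 1] / 10.0 + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Inside cross_val_score the scaler is refit on each training fold,
# so validation rows never influence the mean and variance.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(np.round(scores, 3))
```

The same pipeline object can later be fit once on the full training set and applied to the test set with a single predict call, which removes the chance of scaling the two partitions differently.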

How do I avoid leakage with rolling aggregations?

Sort by time, then shift before you roll, inside each group: df.groupby(key)["x"].transform(lambda s: s.shift(1).rolling(window).mean()). The shift(1) excludes the current row, and doing both steps inside transform keeps a window from spilling across group boundaries. If features depend on the label, build them out of fold the same way you would target-encode. Time-aware cross-validation surfaces this kind of leakage quickly; random k-fold hides it.
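For the time-aware cross-validation mentioned above, scikit-learn's TimeSeriesSplit is the usual starting point. This sketch assumes the rows are already sorted by time and uses a toy index of twelve rows.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 12  # stand-in for a frame that is already sorted by timestamp
splits = list(TimeSeriesSplit(n_splits=3).split(np.arange(n_rows)))

for train_idx, val_idx in splits:
    print("train", train_idx.tolist(), "-> validate", val_idx.tolist())
```

Every validation block comes strictly after its training block, so a lag or rolling feature that peeks at the future shows up as a drop in the score instead of being hidden by the split.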