Time series feature engineering for the data science interview
Why feature engineering still wins on time series
Walk into any data scientist forecasting loop at Stripe, Uber, or DoorDash and you will hear the same first question: "How would you engineer features for this series before you even pick a model?" The interviewer is not testing whether you know XGBoost; they want to see whether you can turn a raw timestamp column into something a gradient-boosted tree can actually learn from. Lag features, rolling statistics, and Fourier seasonality are the load-bearing primitives, and the candidates who land staff offers can sketch all three on the whiteboard inside five minutes.
The reason feature engineering dominates is that most production forecasters at scale are still tree ensembles, not deep nets. LightGBM and XGBoost do well in M5-style competitions because they can ingest hundreds of hand-crafted features cheaply, while a vanilla LSTM can learn seasonality from the raw sequence but never sees calendar and event signal unless you feed it in explicitly. Senior interviewers know this. They want to hear you say that a flat tree model with lag_7 + rolling_28 + day_of_week + is_holiday will usually beat a naive LSTM on retail demand, and then show the code.
Load-bearing trick: every time-aware feature must be computed using only data strictly before time t. Put the .shift(1) before any rolling aggregate, inside the per-entity group, and you will pass most of the leakage gotchas the interviewer throws at you.
The feature taxonomy interviewers expect
Before writing a single line of pandas, sketch the taxonomy. Interviewers grade structure: a candidate who lists six families gets credit even if the code is rough, while a candidate who jumps straight into rolling().mean() looks like they only know one trick. Memorize this table — it is the spine of every forecasting feature-engineering answer.
| Family | What it captures | Typical window | Leakage risk | Example feature |
|---|---|---|---|---|
| Lag | Direct past value | 1, 7, 14, 28, 365 | High if no shift | sales.shift(7) |
| Rolling | Local trend / volatility | 7, 14, 28, 90 days | High without shift before window | sales.shift(1).rolling(28).mean() |
| Fourier | Smooth periodic cycles | period = 7, 365.25 | Low | sin(2*pi*t/365.25) |
| Seasonality | Categorical position in cycle | day-of-week, month | Low | dayofweek, weekofyear |
| Calendar | Workday vs weekend, payday | n/a | Low | is_weekend, is_payday |
| Holiday | Country-specific events | ±3 day window | Low if forward-known | is_thanksgiving, days_to_xmas |
Notice that the first two families carry all the leakage risk and the bottom four are essentially safe. If you only have time for one sentence in the interview, it is this: lag and rolling must be shifted, everything else is just deterministic calendar math.
Lag features done right
Lags are the simplest and most under-explained feature class. The candidate reflex is to write df['lag_1'] = df['sales'].shift(1) and stop. The senior answer adds three things: multiple horizons matched to the seasonality, per-entity grouping, and a proof of no leakage.
```python
import pandas as pd

LAGS = [1, 7, 14, 28, 365]

def add_lag_features(df: pd.DataFrame, group_col: str, target: str) -> pd.DataFrame:
    # Sort first so row order inside each group is chronological.
    df = df.sort_values([group_col, "ts"]).copy()
    for lag in LAGS:
        # groupby().shift() moves values by row position within each entity.
        df[f"{target}_lag_{lag}"] = df.groupby(group_col)[target].shift(lag)
    return df
```

Three things to call out at the whiteboard. First, sort then group: groupby().shift() shifts by row position within each group, not by timestamp, so unsorted rows silently produce the wrong lag, and that is the most common leakage source in interviews. Second, choose lags that match the seasonalities of the business: retail wants 1, 7, 28, 365; ride-share wants 1, 24, 168 at hourly grain. Third, accept that lags create NaNs at the head of every series; either drop them or impute with the series mean (computed on training data only), and tell the interviewer which and why.
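The "proof of no leakage" can be as simple as an independent check on toy data. The sketch below is one way to do it, not a standard utility: `check_lag_no_leak` is a hypothetical helper, and the `sku`, `ts`, and `sales` column names are assumptions for illustration.

```python
import pandas as pd

def check_lag_no_leak(df, group_col, ts_col, target, lag):
    """Return True if the lag column matches a row-offset lookup inside each group."""
    ok = True
    for _, g in df.sort_values([group_col, ts_col]).groupby(group_col):
        values = g[target].to_numpy()
        lagged = g[f"{target}_lag_{lag}"].to_numpy()
        for i in range(len(g)):
            if i < lag:
                # The head of every series must be empty, never borrowed from another entity.
                ok &= bool(pd.isna(lagged[i]))
            else:
                ok &= bool(lagged[i] == values[i - lag])
    return ok

demo = pd.DataFrame({
    "sku": ["a", "a", "a", "b", "b", "b"],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "sales": [10.0, 11.0, 12.0, 50.0, 51.0, 52.0],
}).sort_values(["sku", "ts"])

# Grouped shift: each SKU starts with NaN.
demo["sales_lag_1"] = demo.groupby("sku")["sales"].shift(1)

# Ungrouped shift: the first row of "b" inherits the last value of "a".
leaky = demo.copy()
leaky["sales_lag_1"] = leaky["sales"].shift(1)

grouped_ok = check_lag_no_leak(demo, "sku", "ts", "sales", 1)
ungrouped_ok = check_lag_no_leak(leaky, "sku", "ts", "sales", 1)
```

Here `grouped_ok` comes out True and `ungrouped_ok` False, which is the kind of two-line evidence that is easy to narrate at a whiteboard.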
Rolling and expanding statistics
Rolling windows summarize recent dynamics — trend via mean, volatility via std, regime via max/min. The critical detail interviewers probe: shift before you roll, not after. A 7-day rolling mean computed on df['sales'] directly includes today's sales in the window, which is leakage for any model predicting today. The clean pattern is shift(1).rolling(7).
```python
def add_rolling_features(df, group_col, target, windows=(7, 14, 28)):
    # Sort so row order inside each group is chronological.
    df = df.sort_values([group_col, "ts"]).copy()
    g = df.groupby(group_col)[target]
    for w in windows:
        # Shift and roll inside transform so neither step crosses a group boundary.
        df[f"{target}_roll_mean_{w}"] = g.transform(lambda s: s.shift(1).rolling(w).mean())
        df[f"{target}_roll_std_{w}"] = g.transform(lambda s: s.shift(1).rolling(w).std())
        df[f"{target}_roll_max_{w}"] = g.transform(lambda s: s.shift(1).rolling(w).max())
    return df
```

Expanding windows, which run from the start of the series to t-1, are useful for slow-moving baselines and customer-level cumulative sums (lifetime orders, lifetime sessions). They are less informative for fast-changing series because old data dominates, but they are exactly what you want for new-vs-returning lift features or stable per-customer means. Mention both windows and ask the interviewer which one the business question favors; they will appreciate the framing question.
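An expanding feature follows the same shift-first pattern as the rolling ones. The sketch below is a minimal illustration; `add_expanding_mean` and the `customer`, `ts`, and `spend` columns are assumed names, not something from a particular codebase.

```python
import pandas as pd

def add_expanding_mean(df, group_col, ts_col, target):
    """Per-entity mean of all values strictly before the current row."""
    df = df.sort_values([group_col, ts_col]).copy()
    df[f"{target}_exp_mean"] = (
        df.groupby(group_col)[target]
        # shift(1) drops the current row from the window; expanding() grows from the start.
        .transform(lambda s: s.shift(1).expanding().mean())
    )
    return df

orders = pd.DataFrame({
    "customer": ["c1"] * 4 + ["c2"] * 2,
    "ts": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04",
        "2024-01-01", "2024-01-02",
    ]),
    "spend": [10.0, 20.0, 30.0, 40.0, 100.0, 200.0],
})
out = add_expanding_mean(orders, "customer", "ts", "spend")
```

The first row of every customer is NaN because there is no history yet; after that each row sees only earlier spend for the same customer.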
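Gotcha: on a single, regularly spaced series, rolling(7).mean().shift(1) and shift(1).rolling(7).mean() produce the same numbers, so the order alone is not what leaks. The leaks come from the variants that look similar: forgetting the shift entirely, applying a plain ungrouped shift to a grouped rolling result (which drags one entity's last value into the next entity's first row), or using a centered window. Shifting the raw series first, inside the group, is the convention that is easiest to audit, so make it a habit.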
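A toy example makes the grouped case concrete. This is only a sketch on made-up data with an assumed `sku` column and a window of 2 to keep the numbers readable.

```python
import pandas as pd

toy = pd.DataFrame({
    "sku": ["a"] * 4 + ["b"] * 4,
    "sales": [1.0, 2.0, 3.0, 4.0, 100.0, 200.0, 300.0, 400.0],
})
g = toy.groupby("sku")["sales"]

# Shift and roll both inside the group: "b" only ever sees values from "b".
safe = g.transform(lambda s: s.shift(1).rolling(2).mean())

# Roll inside the group, then shift the flat result: the shift ignores group boundaries.
rolled = g.transform(lambda s: s.rolling(2).mean())
leaky = rolled.shift(1)

# Row 4 is the first row of "b".
first_b_safe = safe.iloc[4]
first_b_leaky = leaky.iloc[4]
```

In the safe version the first row of "b" is NaN, while the flat shift hands it 3.5, the last rolling mean of "a".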
Seasonality, Fourier, and cyclic encoding
There are two ways to encode the position inside a cycle: categorical (day_of_week as 0-6) and cyclic (sin/cos pairs). Trees handle the categorical encoding fine; linear models, neural nets, and any model that treats numeric features as ordered need the sin/cos encoding so that "Sunday → Monday" is the same distance as "Wednesday → Thursday".
```python
import numpy as np

def add_cyclic(df, col, period):
    # Map a position in the cycle onto the unit circle so the two ends of the cycle meet.
    df[f"{col}_sin"] = np.sin(2 * np.pi * df[col] / period)
    df[f"{col}_cos"] = np.cos(2 * np.pi * df[col] / period)
    return df

df["hour"] = df["ts"].dt.hour
df["dayofweek"] = df["ts"].dt.dayofweek
df["dayofyear"] = df["ts"].dt.dayofyear
df = add_cyclic(df, "hour", 24)
df = add_cyclic(df, "dayofweek", 7)
df = add_cyclic(df, "dayofyear", 365.25)
```

For multi-period seasonality (a daily cycle inside a yearly cycle, common in energy demand), build one sin/cos family per period, and within each period stack harmonics at integer fractions of it: 365.25, 182.625, 121.75 and so on for the yearly cycle. Three harmonics often land within a percent or two of MAPE of what a full STL decomposition would give you, at a fraction of the inference cost, though the exact gap depends on the series. The Prophet library models seasonality with Fourier terms in the same way; if you mention that connection on the call you signal real production experience.
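Stacking harmonics is a small loop over k. The sketch below assumes a running integer time index `t` in days; `add_fourier_terms` and the `year`/`week` prefixes are illustrative names, and the orders (3 and 2) are arbitrary starting points rather than recommendations.

```python
import numpy as np
import pandas as pd

def add_fourier_terms(df, t_col, period, order, prefix):
    """Append sin/cos pairs for harmonics k = 1..order of a single period."""
    out = df.copy()
    t = out[t_col].to_numpy(dtype=float)
    for k in range(1, order + 1):
        # The k-th harmonic has period `period / k`.
        angle = 2.0 * np.pi * k * t / period
        out[f"{prefix}_sin_{k}"] = np.sin(angle)
        out[f"{prefix}_cos_{k}"] = np.cos(angle)
    return out

daily = pd.DataFrame({"ts": pd.date_range("2024-01-01", periods=730, freq="D")})
daily["t"] = np.arange(len(daily))  # days since the start of the series
daily = add_fourier_terms(daily, "t", period=365.25, order=3, prefix="year")
daily = add_fourier_terms(daily, "t", period=7, order=2, prefix="week")
```

Each period contributes 2 × order columns, so this example adds 6 yearly and 4 weekly features.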
Calendar and holiday features
Calendar features are deterministic and known into the future, which makes them safe and powerful. is_weekend, is_month_start, is_month_end, is_quarter_end, and days_to_payday all extend cleanly into the forecast horizon. Holiday features are the same idea with a country-specific calendar.
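```python
import holidays
import numpy as np
import pandas as pd

# Pass explicit years (plus one extra) so every date has a next holiday to look up.
years = range(int(df["date"].dt.year.min()), int(df["date"].dt.year.max()) + 2)
us_holidays = holidays.country_holidays("US", years=years)
holiday_dates = pd.to_datetime(sorted(us_holidays.keys()))

df["is_holiday"] = df["date"].dt.normalize().isin(holiday_dates)
df["is_weekend"] = df["dayofweek"].isin([5, 6])
# For each date, find the first holiday on or after it and count the days in between.
next_idx = np.searchsorted(holiday_dates.values, df["date"].values, side="left")
next_holiday = pd.Series(holiday_dates[next_idx].values, index=df.index)
df["days_to_holiday"] = (next_holiday - df["date"]).dt.days
```

Two upgrades take this from junior to staff. First, encode days_to_holiday and days_from_holiday as signed integers rather than a binary flag: pre-Christmas e-commerce demand ramps for roughly two weeks before Dec 25, and a flag only on Dec 25 misses the entire effect. Second, encode custom business events the same way: marketing campaigns, product launches, Super Bowl Sunday for food delivery, payroll dates for fintech.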
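One way to get a single signed feature is to measure the distance to the nearest event and clip it. The sketch below works on a plain list of event dates so it does not depend on any holiday library; `signed_days_to_nearest` is a hypothetical helper and the ±14 clip is an assumption.

```python
import numpy as np
import pandas as pd

def signed_days_to_nearest(dates, event_dates, clip=14):
    """Negative before the nearest event, 0 on the day, positive after; clipped to +/- clip."""
    d = pd.to_datetime(pd.Series(dates)).to_numpy().astype("datetime64[D]")
    e = np.sort(pd.to_datetime(pd.Series(event_dates)).to_numpy().astype("datetime64[D]"))
    # diff[i, j] = days from event j to date i (negative when the date comes first).
    diff = (d[:, None] - e[None, :]).astype(int)
    nearest = np.abs(diff).argmin(axis=1)
    signed = diff[np.arange(len(d)), nearest]
    return np.clip(signed, -clip, clip)

sample_dates = ["2024-12-20", "2024-12-25", "2024-12-27", "2024-11-01"]
offsets = signed_days_to_nearest(sample_dates, ["2024-12-25"])
```

The pairwise matrix is fine for a few dozen events per year; for a very long event list a sorted lookup would be the cheaper design.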
```python
df["campaign_active"] = df["date"].between("2026-05-01", "2026-05-15")
df["days_since_launch"] = (df["date"] - pd.Timestamp("2026-03-01")).dt.days
```

This is also why "we use Prophet" is not a complete answer: Prophet handles US holidays out of the box, but the campaign calendar lives in your marketing team's spreadsheet, and joining that in is the work.
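Joining a campaign calendar usually means turning start/end intervals into one row per active day and merging on date. The sketch below uses an invented two-row `campaigns` table and a hypothetical `expand_campaign_days` helper; it assumes campaigns do not overlap, since overlapping ones would duplicate rows after the merge.

```python
import pandas as pd

# Hypothetical marketing calendar, as it might arrive from a spreadsheet export.
campaigns = pd.DataFrame({
    "campaign": ["spring_promo", "summer_sale"],
    "start": pd.to_datetime(["2026-05-01", "2026-07-10"]),
    "end": pd.to_datetime(["2026-05-15", "2026-07-20"]),
})

def expand_campaign_days(campaigns):
    """One row per (date, campaign) for every day a campaign is live, inclusive of both ends."""
    rows = [
        pd.DataFrame({
            "date": pd.date_range(r.start, r.end, freq="D"),
            "campaign": r.campaign,
        })
        for r in campaigns.itertuples(index=False)
    ]
    return pd.concat(rows, ignore_index=True)

daily = pd.DataFrame({"date": pd.date_range("2026-04-29", "2026-05-17", freq="D")})
daily = daily.merge(expand_campaign_days(campaigns), on="date", how="left")
daily["campaign_active"] = daily["campaign"].notna()
```

Because the calendar is planned in advance, the same join extends into the forecast horizon without any leakage concern.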
Cross-series and hierarchical features
When individual series are sparse — a new SKU with two weeks of history, a new merchant on a payments platform — borrow strength from related series. The three patterns to memorize: hierarchical roll-up, geographic neighborhood, and cluster-mean.
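```python
# Hierarchical: category-level features as a cold-start signal
cat_daily = (
    df.groupby(["category", "date"], as_index=False)["sales"].sum()
    .rename(columns={"sales": "cat_sales"})
    .sort_values(["category", "date"])
)
# Lag on the one-row-per-day category table, before merging back onto SKU rows.
# (Assumes every category has a row for every date; otherwise reindex first.)
cat_daily["cat_sales_lag_7"] = cat_daily.groupby("category")["cat_sales"].shift(7)
# Only the lagged column is merged, so same-day category sales never reach the model.
df = df.merge(cat_daily.drop(columns="cat_sales"), on=["category", "date"], how="left")
```

For geography, average sales across the 5 nearest stores by lat/lon. For cluster-mean, run k-means on per-item profiles (price tier, category, lifecycle stage) and broadcast cluster averages back to each row. All three work in the spirit of empirical Bayes: when the item-level signal is noisy, you partially pool toward the group mean, which often cuts forecast variance on cold-start items by roughly 20-40% without hurting the warm ones.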
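Partial pooling can be written down as a weighted blend of the item mean and the group mean, with the weight driven by how much history the item has. The sketch below is a simple shrinkage formula, not a full empirical Bayes fit: `shrink_to_group_mean` is a hypothetical helper and `prior_strength` is an assumed tuning constant. In a real pipeline the means would be computed on the training window only.

```python
import pandas as pd

def shrink_to_group_mean(df, item_col, group_col, target, prior_strength=10.0):
    """Blend each item's mean with its group's mean, weighted by the item's row count."""
    item = df.groupby([group_col, item_col])[target].agg(["mean", "count"]).reset_index()
    group = df.groupby(group_col)[target].mean().rename("group_mean").reset_index()
    item = item.merge(group, on=group_col)
    # Few rows -> weight near 0 (lean on the group); many rows -> weight near 1.
    w = item["count"] / (item["count"] + prior_strength)
    item["shrunk_mean"] = w * item["mean"] + (1 - w) * item["group_mean"]
    return item[[group_col, item_col, "count", "shrunk_mean"]]

demo = pd.DataFrame({
    "category": ["snacks"] * 32,
    "sku": ["old"] * 30 + ["new"] * 2,
    "sales": [10.0] * 30 + [40.0] * 2,
})
pooled = shrink_to_group_mean(demo, "sku", "category", "sales", prior_strength=2.0)
```

The two-row "new" SKU is pulled most of the way toward the category mean, while the thirty-row "old" SKU barely moves.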
Common pitfalls
Leakage through unsorted groups is the single most common failure on interview whiteboards. A candidate writes df.groupby('sku')['sales'].shift(1) without first sorting by (sku, ts), and the shift mixes records across time. The fix is df.sort_values(['sku', 'ts']) before any lag or rolling call, every time. If your training MAPE is suspiciously low — under 3% on a noisy retail series — leakage is your first suspect.
Mismatched lag horizons are the second trap. If your forecast horizon is 28 days, every lag shorter than 28 is unusable in production because you don't yet have those values at inference time. Interviewers will ask "okay, you have lag_7, so how do you compute it 28 days ahead?" The honest answer is that you can't, unless you forecast recursively (predict day 8 with lag_7, then use that prediction as lag_7 for day 15). Recursive forecasting compounds error, so for long horizons many teams instead train direct multi-step models with only lag_28 and longer.
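For the direct approach, the filter is a one-liner: keep only lags that are at least as long as the horizon. The sketch below assumes lags are measured relative to the target date on a regular daily grid; `usable_lags` and `make_direct_frame` are illustrative names.

```python
import pandas as pd

def usable_lags(lags, horizon):
    """Lags (relative to the target date) that are already observed at forecast time."""
    return [lag for lag in lags if lag >= horizon]

def make_direct_frame(series, lags, horizon):
    """One row per target date, using only lags of at least `horizon` steps."""
    frame = pd.DataFrame({"y": series})
    for lag in usable_lags(lags, horizon):
        frame[f"lag_{lag}"] = series.shift(lag)
    # Rows at the head of the series have no full history yet.
    return frame.dropna()

s = pd.Series(
    range(100),
    index=pd.date_range("2024-01-01", periods=100, freq="D"),
    dtype=float,
)
frame = make_direct_frame(s, lags=[1, 7, 14, 28, 35], horizon=28)
```

With a 28-day horizon only lag_28 and lag_35 survive, which mirrors what is actually available when the forecast is issued.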
Over-engineered Fourier orders look smart but overfit. A 10th-order Fourier basis on annual seasonality has 20 features (sin/cos pairs) and almost always memorizes the training residuals. Cap at 3-4 harmonics per period and verify on a holdout that adding the 5th does not hurt out-of-sample MAPE. The interviewer wants to hear "I cross-validated the Fourier order with TimeSeriesSplit", not "I used 10 because more is better".
Treating holiday flags as binary misses most of the holiday effect. Demand for delivery on Thanksgiving itself crashes; demand the two days before Thanksgiving spikes. A single is_thanksgiving flag captures none of the surrounding ramp. Use days_to_holiday (signed integer, clipped to ±14) and let the tree split on the asymmetric window.
Aggregating cross-series features without lag re-introduces leakage at the group level. category_sales_today includes the SKU you are trying to predict, plus its siblings, all of which include today's value. Lag the aggregate by at least one period the same way you lag the raw series.
FAQ
Do I need lag features if I use an RNN or Transformer?
Technically no — recurrent and attention-based models learn temporal dependencies from the raw sequence. In practice yes, because explicit lag features sharpen the signal and dramatically reduce the data the network needs to converge. The M5 winners and most production forecasting stacks at retail-scale companies still use tree ensembles fed by hand-crafted lags. Treat deep learning on time series as a complement, not a replacement, until you have at least a year of high-frequency data per series.
How do I pick which lag horizons to include?
Start from the known seasonalities of the domain: daily (lag 1), weekly (lag 7 for daily data, lag 168 for hourly), monthly (lag 28-30), annual (lag 365). Then add a few odd lags — 14, 21 — only if a TimeSeriesSplit cross-validation shows they help. The trap is throwing in 30 lags hoping the model sorts it out; trees handle redundancy fine but you waste compute and risk leakage through one of them.
What's the difference between rolling and expanding windows in this context?
A rolling window has fixed length and slides forward (last 28 days), capturing recent local behavior. An expanding window grows from the start of the series to time t-1, capturing a long-run baseline. Rolling is what you want for fast-changing demand series; expanding is what you want for stable per-customer or per-account aggregates (lifetime spend, all-time visit count). Many production feature sets include both.
When should I use Fourier features instead of one-hot encoded day-of-week?
Use Fourier (sin/cos) features when the model is linear or neural, that is, anything that treats numeric inputs as ordered. Use categorical or one-hot encoding when the model is tree-based: LightGBM and XGBoost can split on the day-of-week code directly and usually gain little from sin/cos. A safe default is to include both and let regularization or feature importance drop the redundant ones; the storage cost is negligible.
How do I avoid leakage when I scale features?
Fit any scaler — StandardScaler, RobustScaler, target encoder — only on data strictly before the validation cutoff, then transform both train and validation. The cleanest pattern is a scikit-learn Pipeline wrapped in a TimeSeriesSplit cross-validator, which guarantees the scaler sees only past data on each fold. Global fit_transform on the full dataset before splitting is the most common leakage bug in feature-engineering interview takehomes.
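The Pipeline-plus-TimeSeriesSplit pattern looks roughly like this. It is a sketch on synthetic data: the Ridge model, the three random features, and the five splits are arbitrary assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 3))  # synthetic features, rows assumed to be in time order
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

model = Pipeline([
    ("scale", StandardScaler()),  # refit on the training part of every fold
    ("reg", Ridge(alpha=1.0)),
])
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")

# Every fold trains on indices that end before its validation indices begin.
folds_ordered = all(train.max() < test.min() for train, test in cv.split(X))
```

Because the scaler lives inside the pipeline, cross_val_score refits it per fold, so validation rows never influence the scaling statistics.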
Are lag features useful for irregular time series?
Less so. Lag features assume a regular grid; if your events are clickstream-style with millisecond timestamps, you need to resample to a fixed frequency first or switch to time-since-last-event features. The latter is its own family — seconds_since_last_login, events_in_last_5_min — and pairs well with rolling counts over fixed windows. Interviewers love to see candidates recognize that a clickstream is not the same problem as daily retail demand.
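Both event-style features can be built directly on the raw timestamps without resampling. The sketch below uses a made-up `events` table with `user` and `ts` columns; `count_prior_events` is a hypothetical helper and the 5-minute window is just the example from the text.

```python
import pandas as pd

events = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2", "u2"],
    "ts": pd.to_datetime([
        "2024-01-01 10:00:00", "2024-01-01 10:00:30", "2024-01-01 10:07:00",
        "2024-01-01 09:00:00", "2024-01-01 09:02:00",
    ]),
}).sort_values(["user", "ts"]).reset_index(drop=True)

# Seconds since the same user's previous event (NaN for each user's first event).
events["seconds_since_last"] = events.groupby("user")["ts"].diff().dt.total_seconds()

def count_prior_events(ts, window="5min"):
    """For each event, count earlier events inside the trailing window."""
    ones = pd.Series(1.0, index=pd.DatetimeIndex(ts))
    # closed="left" covers [t - window, t), so the current event is not counted.
    return ones.rolling(window, closed="left").sum().fillna(0.0).to_numpy()

counts = pd.Series(0.0, index=events.index)
for _, g in events.groupby("user"):
    counts.loc[g.index] = count_prior_events(g["ts"])
events["events_in_last_5_min"] = counts
```

The left-closed window plays the same role as shift(1) on a regular grid: the current event never counts toward its own feature.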