ARIMA in the data science interview
Why interviewers still ask about ARIMA
You opened the Zoom, the interviewer at Stripe or Uber smiled, and the third question landed exactly where you expected: "walk me through how ARIMA works and when you'd use it versus Prophet." This question is not nostalgia. Hiring managers ask about ARIMA because it is the cheapest possible way to test whether a candidate understands stationarity, lag structure, residual diagnostics, and the Box-Jenkins methodology — the same vocabulary you need for any forecasting system, classical or neural.
Senior loops at Netflix, DoorDash, and Airbnb push further. You will be asked to defend a model choice on a series with weekly seasonality plus a holiday spike, to explain why your auto_arima returned (2,1,2) rather than (1,1,1), and to read a residual plot out loud. The interviewer is not checking whether you can spell SARIMAX. They are checking whether you can avoid the three classic mistakes that crash forecasts in production: training on a non-stationary series, leaking the future into the past during cross-validation, and ignoring exogenous shocks that the model cannot see.
This guide is the recipe I would memorise if I had a DS interview in seven days. It is built around the questions that actually get asked: what is stationarity, how do you test for it, what do AR, MA, and I mean, how do ACF and PACF pin down the orders, when do you reach for SARIMA, and when is Prophet (or LightGBM with lag features) the more honest answer.
Stationarity and the ADF test
A series is stationary when its statistical properties — mean, variance, autocorrelation structure — do not change with time. ARIMA assumes stationarity on its differenced input. Train it on a trended or seasonal series without differencing and the model will fit the past beautifully and predict the future poorly.
The Augmented Dickey-Fuller (ADF) test is the workhorse for checking this in interviews. The null hypothesis is that the series has a unit root, i.e. is non-stationary. If p-value < 0.05, you reject H0 and call the series stationary. The catch interviewers love to surface: ADF tests for a specific kind of non-stationarity (unit root), not for variance shifts or seasonal trends.
```python
from statsmodels.tsa.stattools import adfuller

# ts: a 1-D array or pandas Series holding the series under test
result = adfuller(ts)
print(f"ADF statistic: {result[0]:.3f}, p-value: {result[1]:.4f}")
```

Pair ADF with the KPSS test, which flips the null: H0 is stationarity. If both tests agree, you have a strong signal. If they disagree, the direction matters. When ADF fails to reject the unit root but KPSS also fails to reject stationarity, the series is likely trend-stationary and you should detrend rather than difference. When ADF rejects the unit root but KPSS rejects stationarity, the series is likely difference-stationary and differencing is the fix. Plot the series alongside a rolling mean and rolling standard deviation as a sanity check; if both lines wander, your eyes already know the answer.
When the series fails the test, take a first difference: y_t - y_{t-1}. For many business series this is enough. Strong seasonality may require a seasonal difference at lag m (y_t - y_{t-m}). Avoid over-differencing — d = 2 or higher inflates variance and almost never improves forecasts.
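Both kinds of differencing are one-liners. The sketch below uses a made-up deterministic series (a linear trend plus a seven-day pattern, both invented for illustration) so the effect of each difference is easy to verify by hand.

```python
import numpy as np

t = np.arange(70)
weekly_pattern = np.array([5.0, 3.0, 0.0, -2.0, -4.0, 1.0, 6.0])
y = 2.0 * t + weekly_pattern[t % 7]    # linear trend plus a 7-day cycle

m = 7
first_diff = y[1:] - y[:-1]            # y_t - y_{t-1}: removes the trend, keeps the cycle
seasonal_diff = y[m:] - y[:-m]         # y_t - y_{t-m}: removes the cycle, leaves slope * m

print(first_diff[:7])
print(seasonal_diff[:7])               # every entry is 2.0 * 7 = 14.0
```

With pandas the same operations are `series.diff()` and `series.diff(m)`; the first `m` rows come back as NaN and need dropping before a stationarity test.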
AR, MA, and I components
ARIMA is shorthand for three building blocks: AR(p) autoregression, I(d) integration (differencing), and MA(q) moving average of past errors. The full notation ARIMA(p, d, q) is just the three orders stitched together.
| Component | What it captures | Typical orders |
|---|---|---|
| AR(p) | Dependence on past values | 0-3 |
| I(d) | Number of differencing passes | 0-2 |
| MA(q) | Dependence on past errors | 0-3 |
The AR(p) equation says today is a weighted sum of the previous p values plus noise:
```
y_t = c + phi_1 * y_{t-1} + phi_2 * y_{t-2} + ... + phi_p * y_{t-p} + eps_t
```

The MA(q) equation says today is a weighted sum of the previous q shocks plus noise:

```
y_t = c + eps_t + theta_1 * eps_{t-1} + ... + theta_q * eps_{t-q}
```

ARMA(p, q) is the combination. ARIMA(p, d, q) is ARMA fit on the d-th difference of the series. The reason interviewers ask you to write out the equations: they want to confirm you understand that AR coefficients act on observations while MA coefficients act on residuals, a distinction that matters when you read diagnostic plots.
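To see the observations-versus-shocks distinction concretely, feed a single unit shock through each recursion. This is a hand-rolled sketch (the helper `simulate_arma` is hypothetical and treats pre-sample values as zero), not how a library fits or simulates the model.

```python
import numpy as np


def simulate_arma(phi, theta, eps, c=0.0):
    """y_t = c + sum_i phi_i*y_{t-i} + eps_t + sum_j theta_j*eps_{t-j}.

    Values before the start of the sample are treated as zero.
    """
    y = np.zeros(len(eps))
    for t in range(len(eps)):
        ar_part = sum(p * y[t - i] for i, p in enumerate(phi, start=1) if t - i >= 0)
        ma_part = sum(q * eps[t - j] for j, q in enumerate(theta, start=1) if t - j >= 0)
        y[t] = c + ar_part + eps[t] + ma_part
    return y


eps = np.array([1.0, 0.0, 0.0, 0.0, 0.0])    # one unit shock at t = 0, then silence
ar1 = simulate_arma(phi=[0.5], theta=[], eps=eps)
ma1 = simulate_arma(phi=[], theta=[0.5], eps=eps)

print(ar1)   # 1, 0.5, 0.25, 0.125, 0.0625: the shock echoes through past values
print(ma1)   # 1, 0.5, 0, 0, 0: the shock is forgotten after q = 1 steps

# ARIMA(1, 1, 0) is the cumulative sum of the AR(1); differencing recovers it.
integrated = np.cumsum(ar1)
```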
Picking p, d, q with ACF and PACF
The Autocorrelation Function (ACF) shows the correlation of y_t with y_{t-k} for increasing lags k. The Partial Autocorrelation Function (PACF) shows the same correlation after removing the influence of intermediate lags. Together they fingerprint the process.
Load-bearing trick: PACF cuts off after lag p for a pure AR(p); ACF cuts off after lag q for a pure MA(q); both decay gradually for a mixed ARMA. Memorise this. It is the single most-asked ARIMA question in mid-level DS loops.
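The cutoff rule is easy to check on simulated data. The sketch below computes sample ACF and PACF with plain NumPy (the helpers `sample_acf` and `sample_pacf` are illustrative; the PACF uses the regression definition, which is one of several estimators) on a long AR(1) and MA(1), both with coefficient 0.7 chosen arbitrarily.

```python
import numpy as np


def sample_acf(y, nlags):
    """Sample autocorrelation at lags 0..nlags."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.dot(y, y)
    return np.array([np.dot(y[k:], y[:len(y) - k]) / denom for k in range(nlags + 1)])


def sample_pacf(y, nlags):
    """PACF at lag k = last coefficient of an OLS regression of y_t on its first k lags."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    out = [1.0]
    for k in range(1, nlags + 1):
        X = np.column_stack([y[k - i: len(y) - i] for i in range(1, k + 1)])
        coefs, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        out.append(coefs[-1])
    return np.array(out)


rng = np.random.default_rng(42)
n = 20_000
eps = rng.normal(size=n)

ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.7 * ar1[t - 1] + eps[t]     # AR(1) with phi = 0.7

ma1 = eps.copy()
ma1[1:] += 0.7 * eps[:-1]                  # MA(1) with theta = 0.7

acf_ar, pacf_ar = sample_acf(ar1, 3), sample_pacf(ar1, 3)
acf_ma, pacf_ma = sample_acf(ma1, 3), sample_pacf(ma1, 3)

print("AR(1) ACF :", np.round(acf_ar[1:], 2))    # decays roughly like 0.7**k
print("AR(1) PACF:", np.round(pacf_ar[1:], 2))   # near zero after lag 1
print("MA(1) ACF :", np.round(acf_ma[1:], 2))    # near zero after lag 1
print("MA(1) PACF:", np.round(pacf_ma[1:], 2))   # decays with alternating sign
```

In a real project `statsmodels.graphics.tsaplots.plot_acf` and `plot_pacf` draw the same quantities with confidence bands; the sample size here is deliberately large so the pattern is unmistakable.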
In practice almost no real series gives you a clean cutoff. You usually see ambiguous tails on both functions, which is why auto_arima from pmdarima exists. By default it picks d with a unit-root test, then walks the (p, q) space with a stepwise heuristic rather than an exhaustive grid, scoring each candidate by AIC or BIC, and returns the best fit it found.
```python
from pmdarima import auto_arima

model = auto_arima(
    ts,
    seasonal=False,
    stepwise=True,
    information_criterion="aic",
    suppress_warnings=True,
)
forecast = model.predict(n_periods=30)
```

When the interviewer asks how you actually pick orders in a real project, give the honest answer: look at ACF/PACF for intuition, run auto_arima for the grid, then validate the top three candidates on a held-out window and pick the one with the lowest out-of-sample RMSE or MAE, not the lowest AIC. AIC ranks candidates; backtesting ranks honest candidates.
SARIMA and SARIMAX
Most business series have seasonality — daily logins peak on Tuesdays, e-commerce orders spike on weekends, support tickets cluster on Monday mornings. SARIMA extends ARIMA with a seasonal block: SARIMA(p, d, q)(P, D, Q, m) where m is the seasonal period and (P, D, Q) are the same three orders applied at the seasonal lag.
| Cadence | Suggested m |
|---|---|
| Hourly with daily seasonality | 24 |
| Daily with weekly seasonality | 7 |
| Daily with yearly seasonality | 365 (use Fourier terms instead) |
| Monthly with yearly seasonality | 12 |
| Quarterly with yearly seasonality | 4 |
```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(
    ts,
    order=(1, 1, 1),               # non-seasonal (p, d, q)
    seasonal_order=(1, 1, 1, 12),  # seasonal (P, D, Q, m); m = 12 suits monthly data
).fit(disp=False)
```

SARIMAX is the same machine plus exogenous regressors. If you want to forecast website orders and you know the discount rate, ad spend, or a binary promo flag in advance, plug them in through the exog argument. Be honest with the interviewer about the limit: SARIMAX only helps if you genuinely know the future values of the exogenous variables; otherwise you are forecasting twice and compounding error.
Prophet and modern alternatives
Prophet, originally from Meta, decomposes a series additively:
```
y(t) = trend(t) + seasonality(t) + holidays(t) + eps(t)
```

The trend is piecewise linear or logistic with automatic changepoint detection. Seasonality is a Fourier series. Holidays are user-supplied calendars. The model is robust to missing values, handles outliers gracefully, and exposes its components for inspection.
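The "seasonality is a Fourier series" idea can be illustrated without Prophet at all. The sketch below fits a single linear trend plus Fourier terms by ordinary least squares on an invented noiseless series; it is a toy version of the additive structure only, and leaves out changepoints, holidays, and Prophet's Bayesian fitting entirely. The helper `fourier_terms` is hypothetical.

```python
import numpy as np


def fourier_terms(t, period, order):
    """Columns sin(2*pi*k*t/period), cos(2*pi*k*t/period) for k = 1..order."""
    cols = []
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        cols.extend([np.sin(angle), np.cos(angle)])
    return np.column_stack(cols)


t = np.arange(210, dtype=float)
# Invented series: level 50, slope 0.3, and a weekly shape built from two harmonics.
y = 50.0 + 0.3 * t + 4.0 * np.sin(2 * np.pi * t / 7) + 1.5 * np.cos(4 * np.pi * t / 7)

# Design matrix columns: [1, t, sin1, cos1, sin2, cos2, sin3, cos3]
X = np.column_stack([np.ones_like(t), t, fourier_terms(t, period=7, order=3)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

trend = X[:, :2] @ beta[:2]
seasonality = X[:, 2:] @ beta[2:]
print(np.round(beta, 3))   # recovers 50, 0.3, 4 (sin1) and 1.5 (cos2); the rest are ~0
```

The same Fourier columns are what you would pass as exogenous regressors when the seasonal period is too long for a SARIMA seasonal block.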
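```python
from prophet import Prophet

model = Prophet(yearly_seasonality=True, weekly_seasonality=True)
model.fit(df)  # df with columns 'ds' (date) and 'y' (value)

# Extend the training dates by the forecast horizon (30 daily periods here)
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
```

| Approach | Best fit | Watch out for |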
|---|---|---|
| ARIMA / SARIMA | One series, short history, clean seasonality | Non-stationarity, structural breaks |
| Prophet | Daily business series with holidays | Sub-daily data, sparse counts |
| LightGBM + lag features | Thousands of series, rich exogenous data | Cold start, no native uncertainty |
| N-BEATS / TFT | Large panels, GPU available | Compute cost, harder to debug |
| DeepAR | Probabilistic forecasts at scale | Library churn, needs lots of data |
darts and GluonTS are the two Python frameworks worth naming if the interviewer wants to know what you would actually try in production. For a handful of high-value series, ARIMA, Prophet, or ETS are still very hard to beat. For thousands of SKUs with rich features, gradient boosting on lag features is the new default at Amazon, DoorDash, and similar shops.
Common pitfalls
The most common mistake is fitting ARIMA on a non-stationary series without differencing. You will get a model that hugs the training set and drifts off the rails on the first forecast horizon. The fix is to run ADF (and KPSS), apply differencing until the series passes, and only then fit. If you find yourself differencing twice, stop and ask whether a log transform or a structural model is more honest.
A subtler trap is training on the full series and reporting in-sample fit. Time series do not tolerate random shuffling: you must use a strict expanding-window or sliding-window cross-validation that preserves order. Random k-fold on time series leaks the future into training and gives you a model that brags about RMSE in the notebook and embarrasses you in production. This is the single most common reason a model that "worked" in the candidate's portfolio collapses on a take-home extension.
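An expanding-window splitter is only a few lines. The generator below is a sketch with hypothetical names (`expanding_window_splits`, `initial_train`, `horizon`, `step`); it yields index arrays in which every test index is strictly later than every training index.

```python
import numpy as np


def expanding_window_splits(n_obs, initial_train, horizon, step=None):
    """Yield (train_idx, test_idx); the training window grows, the test window slides."""
    step = step or horizon
    train_end = initial_train
    while train_end + horizon <= n_obs:
        yield np.arange(train_end), np.arange(train_end, train_end + horizon)
        train_end += step


splits = list(expanding_window_splits(n_obs=100, initial_train=60, horizon=10))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx[0]}..{test_idx[-1]}")
```

scikit-learn's `TimeSeriesSplit` offers a similar ordered split if you would rather not maintain your own; for a sliding window, start the training range at `train_end - window` instead of zero.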
Ignoring seasonality turns an easy problem hard. If your daily series has a clear seven-day cycle, ARIMA(p, d, q) without a seasonal block will leave a periodic ripple in the residuals. Switch to SARIMA with m = 7, or add explicit seasonal dummies as exogenous regressors. The residual ACF plot is your friend here — if you see spikes at lags 7, 14, 21, you have seasonal autocorrelation that the model failed to absorb.
Another classic is testing residuals only with Ljung-Box. That test checks for residual autocorrelation, which is necessary but not sufficient. You also want to plot residuals over time, look at a Q-Q plot for normality, and check that variance does not grow with the fitted values. A clean Ljung-Box on heteroskedastic residuals is a false comfort.
Finally, comparing models only by AIC is a junior move. AIC is a relative ranking on the training set; it does not measure forecast quality. Always backtest with out-of-sample MAE, RMSE, or MAPE on a held-out window that mimics how the model will actually be deployed — same horizon, same refresh cadence, same exogenous availability.
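For reference, the three backtest metrics are short enough to write from memory. The helper name `forecast_errors` and the toy numbers are illustrative; skipping zero actuals in MAPE is one common convention, not the only one.

```python
import numpy as np


def forecast_errors(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    nonzero = actual != 0          # MAPE is undefined where the actual value is zero
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mape_pct": float(100.0 * np.mean(np.abs(err[nonzero] / actual[nonzero]))),
    }


scores = forecast_errors(actual=[100, 200, 400], predicted=[110, 180, 400])
print(scores)   # mae = 10.0, rmse = sqrt(500 / 3), mape_pct = 20 / 3
```

RMSE punishes the single 20-unit miss harder than MAE does, which is why the two can rank candidate models differently.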
FAQ
Is ARIMA still relevant in 2026 with deep learning forecasters available?
Yes, and interviewers know it. For a single business series with under a few thousand observations, ARIMA, SARIMA, and ETS routinely beat neural models because they have fewer knobs to tune and stronger inductive bias. Deep learning wins when you have thousands of related series and abundant exogenous features. Saying "ARIMA is obsolete" in an interview is a red flag; saying "I default to ARIMA or Prophet for single series and reach for gradient boosting or N-BEATS when I have a large panel" is the honest answer.
Can ARIMA do multi-step forecasts?
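Yes. ARIMA forecasts are recursive by construction: the model predicts t+1, feeds that prediction back in as if it were observed, predicts t+2, and so on. Libraries run this recursion for you, so asking for h steps in one call is still iterative forecasting under the hood. The direct strategy, where a separate model is trained for each horizon, belongs to regression-style forecasters rather than ARIMA. Uncertainty grows with the horizon because forecast errors compound. For long horizons, consider hybrid approaches: ARIMA for short-term plus an ETS or regression model for the trend at longer horizons.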
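For an AR(1) the recursion and the growing uncertainty fit in a few lines. This is a sketch with known parameters (the values of `c`, `phi`, and `sigma` are invented and parameter-estimation error is ignored), so the numbers can be checked by hand.

```python
import numpy as np


def ar1_recursive_forecast(last_value, c, phi, sigma, horizon):
    """Iterate y_hat_{t+h} = c + phi * y_hat_{t+h-1}; return point forecasts and their std."""
    means, stds = [], []
    prev, var = last_value, 0.0
    for _ in range(horizon):
        prev = c + phi * prev
        var = phi ** 2 * var + sigma ** 2    # forecast-error variance accumulates
        means.append(prev)
        stds.append(np.sqrt(var))
    return np.array(means), np.array(stds)


means, stds = ar1_recursive_forecast(last_value=10.0, c=1.0, phi=0.5, sigma=1.0, horizon=3)
print(means)   # 6.0, 4.0, 3.0
print(stds)    # 1.0, sqrt(1.25), sqrt(1.3125)
```

The point forecasts glide toward the long-run mean c / (1 - phi) = 2 while the interval widens, which is the behaviour to describe when asked why long-horizon ARIMA forecasts look flat.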
What are exogenous variables and when do they help?
Exogenous variables are external drivers — ad spend, weather, promo flags, macro indicators — that you feed alongside the target series. SARIMAX accepts them via the exog argument. They help when you have a believable causal link and you know future values of the exogenous variable. Forecasting orders using future ad spend that you control is fair; forecasting using future weather that you must also predict is a trap that doubles your error.
What about intermittent demand (lots of zeros)?
Standard ARIMA assumes a smooth Gaussian-like process and behaves badly on sparse series with many zeros, like spare-parts demand or rare events. Use Croston's method, TSB, or zero-inflated models instead. If you have rich features, gradient boosting with a Tweedie or zero-inflated objective often wins.
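Croston's method is simple enough to sketch from scratch: smooth the non-zero demand sizes and the gaps between demands separately, then forecast their ratio. The version below is a basic textbook variant with one smoothing constant and a first-observation initialisation (both assumptions); it omits the bias corrections that TSB and later refinements add.

```python
def croston(demand, alpha=0.1):
    """Per-period demand rate from Croston's method: smoothed size / smoothed interval."""
    size = None         # smoothed non-zero demand size
    interval = None     # smoothed number of periods between demands
    periods_since = 0
    for d in demand:
        periods_since += 1
        if d > 0:
            if size is None:
                # Initialise with the first observed demand and its waiting time.
                size, interval = float(d), float(periods_since)
            else:
                size += alpha * (d - size)
                interval += alpha * (periods_since - interval)
            periods_since = 0
    if size is None:
        return 0.0      # no demand ever observed
    return size / interval


regular = croston([0, 0, 3, 0, 0, 3, 0, 0, 3])
print(regular)                               # 3 units every 3 periods -> 1.0 per period
print(croston([2, 0, 4, 0], alpha=0.5))      # size 3.0, interval 1.5 -> 2.0
```

Note that the estimates only update in periods where demand occurs, so a product that has stopped selling keeps its old forecast; that stale-forecast problem is what TSB was designed to address.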
How do I detect structural breaks in the series?
Look at rolling mean and rolling variance plots, run a Chow test at suspected break dates, or use changepoint detection libraries like ruptures and Prophet's built-in changepoints. If a break is real and recent, retrain on post-break data only — extending the window to pre-break history will bias your forecast toward a regime that no longer exists.
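A Chow test at a suspected date compares one pooled regression against two separate ones. The sketch below computes only the F statistic for a linear-trend regression, using NumPy; the helper names and the synthetic level shift of 10 at index 100 are illustrative. It assumes independent errors, so on a strongly autocorrelated series the statistic is only a rough guide.

```python
import numpy as np


def _rss(x, y):
    """Residual sum of squares from regressing y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def chow_f_statistic(y, break_index):
    """Chow F for a linear-trend model split at break_index (k = 2 parameters per segment)."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    k = 2
    rss_pooled = _rss(x, y)
    rss_split = _rss(x[:break_index], y[:break_index]) + _rss(x[break_index:], y[break_index:])
    return ((rss_pooled - rss_split) / k) / (rss_split / (len(y) - 2 * k))


rng = np.random.default_rng(3)
noise = rng.normal(size=200)
no_break = 50.0 + noise
level_shift = no_break.copy()
level_shift[100:] += 10.0          # synthetic regime change halfway through

f_no = chow_f_statistic(no_break, 100)
f_shift = chow_f_statistic(level_shift, 100)
print(f"no break: F = {f_no:.2f}   level shift: F = {f_shift:.2f}")
```

To turn the statistic into a p-value, compare it against an F distribution with k and n - 2k degrees of freedom. The test also needs a candidate date up front; when you have none, a changepoint search is the better first step.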
Is this official guidance?
No. This guide is built on the classical Box-Jenkins framework ("Time Series Analysis: Forecasting and Control"), the statsmodels and pmdarima documentation, and Prophet's published methodology. Any errors are mine, not theirs.