XGBoost vs LightGBM vs CatBoost on the data science interview
Why this question keeps coming up
Gradient boosting on decision trees is still the leading model family for tabular data: on Kaggle, in production at Stripe, Uber, DoorDash, and inside almost every churn or fraud pipeline at Netflix and Airbnb. Even with deep learning eating most other domains, structured tables stay a boosting playground. That is why the question "XGBoost vs LightGBM vs CatBoost: when do you pick which?" appears in 4 out of 5 data science interviews where the candidate claims tabular ML experience.
Interviewers are not looking for a marketing comparison. They want to see that you can name the growth strategy (level-wise, leaf-wise, symmetric), the categorical handling (none, histogram bucketed, ordered target encoding), and one number per library where it actually wins. A candidate who answers "they are all gradient boosting, I just try all three and pick the best CV score" is not wrong — but they are not getting the offer either. The strong answer is structured around three axes: dataset size, categorical cardinality, and the cost of tuning time.
This guide gives you that structured answer plus the small set of parameters worth memorising before the loop.
XGBoost in one breath
XGBoost (Chen, 2014) was the library that made gradient boosting a household name. It introduced explicit L1/L2 regularisation in the objective, sparse-aware splits that handle missing values natively, cache-aware memory layout, and a robust distributed training mode. The growth strategy is level-wise — at each iteration the tree expands every node on the current depth before going deeper. That makes the tree shape predictable and the training stable, but occasionally suboptimal because the splits with the largest gain may sit on one branch only.
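To make the regularised objective and the sparse-aware split concrete, here is a minimal sketch of the second-order split gain with an L2 penalty, plus the "try missing values on both sides" rule. It assumes per-node gradient and Hessian sums are already computed, leaves out the L1 term, and uses hypothetical function names; it is an illustration of the idea, not XGBoost's actual code.

```python
def leaf_score(g, h, lam):
    """Structure score of a leaf with gradient sum g and Hessian sum h."""
    return g * g / (h + lam)


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction from splitting a node in two (L2 penalty lam, split cost gamma)."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma


def best_default_direction(g_left, h_left, g_right, h_right, g_miss, h_miss, lam=1.0):
    """Send the rows with a missing value left, then right, and keep the better gain."""
    gain_left = split_gain(g_left + g_miss, h_left + h_miss, g_right, h_right, lam)
    gain_right = split_gain(g_left, h_left, g_right + g_miss, h_right + h_miss, lam)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right
```

A larger `lam` shrinks every gain, which is how the L2 term discourages splits that are supported by only a little evidence.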
XGBoost is the safe middle ground. Pick it when the dataset is medium-sized (hundreds of thousands of rows to a few million), when your stack already has it, or when interpretability tooling matters — SHAP, feature importance, partial dependence are all rock-solid because XGBoost has the longest production history. This is also why most legacy churn and credit-risk models you inherit will be XGBoost. Native categorical support landed only recently and is still less polished than the other two libraries.
LightGBM in one breath
LightGBM (Microsoft, open-sourced in 2016) was engineered around one goal: be faster than XGBoost on large data without losing accuracy. The two big ideas are histogram-based splits, which bucketise each continuous feature into at most 255 bins by default so the split search runs over bins instead of raw values, and leaf-wise growth, which expands the single leaf with the largest gain regardless of depth. That makes deeper, more informative trees per iteration, but it also makes overfitting easier on small datasets.
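The histogram idea can be sketched in plain Python: accumulate gradient and Hessian sums per bin once, then scan the bin boundaries instead of every raw value. This sketch assumes equal-width bins and a simple L2-regularised gain for readability; LightGBM's real binning and gain computation are more elaborate, and the function names here are hypothetical.

```python
def build_histogram(values, grads, hess, n_bins):
    """Accumulate gradient and Hessian sums into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)
        g_hist[b] += g
        h_hist[b] += h
    return g_hist, h_hist, lo, width


def best_histogram_split(values, grads, hess, n_bins=255, lam=1.0):
    """Scan bin boundaries (not raw values) and return (threshold, gain)."""
    g_hist, h_hist, lo, width = build_histogram(values, grads, hess, n_bins)
    g_total, h_total = sum(g_hist), sum(h_hist)
    best_gain, best_threshold = 0.0, None
    g_left = h_left = 0.0
    for b in range(n_bins - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        if h_left == 0 or h_right == 0:
            continue  # one side is empty, not a real split
        gain = (g_left ** 2 / (h_left + lam)
                + g_right ** 2 / (h_right + lam)
                - g_total ** 2 / (h_total + lam))
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain
```

The scan costs one pass over the bins per feature, regardless of how many rows fed the histogram, which is where the speed-up on large data comes from.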
On top of that, LightGBM ships GOSS (Gradient-based One-Side Sampling), which keeps all large-gradient rows and subsamples the small-gradient tail, and EFB (Exclusive Feature Bundling), which bundles mutually exclusive sparse features into a single dense one. Together these can turn a 50-million-row training run from 4 hours into 25 minutes on the same hardware. Categorical features are supported via the categorical_feature argument; internally LightGBM sorts the categories by their accumulated gradient statistics and searches for the split over that sorted histogram, with smoothing to damp rare categories.
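GOSS is easier to remember once you have seen it as code. The sketch below follows the sampling rule as described in the LightGBM paper: keep the top fraction of rows by absolute gradient, sample a fraction of the full dataset from the remainder, and up-weight the sampled rows so the gradient sums stay roughly unbiased. The function name and the dictionary return shape are illustrative assumptions.

```python
import random


def goss_sample(grads, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by one GOSS pass."""
    n = len(grads)
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    # Small-gradient rows are amplified to compensate for the subsampling.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights
```

With `top_rate=0.2` and `other_rate=0.1`, each tree sees 30 percent of the rows, and the sampled small-gradient rows each count eight times.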
CatBoost in one breath
CatBoost (Yandex, 2017) bet the farm on one thing: handling categorical features without leaking the target. The trick is ordered target statistics, the encoding counterpart of CatBoost's ordered boosting: for each row, the target statistic used for encoding is computed only from rows that appear earlier in a random permutation. That removes the leakage path that breaks naive target encoding on validation. It also uses symmetric (oblivious) trees, where every node at a given depth uses the same split. Symmetric trees are weaker per node but vectorise beautifully, so inference is up to 8x faster than XGBoost at equivalent accuracy.
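The permutation trick fits in a dozen lines. The sketch below is a simplified, single-permutation version of the idea, with a hypothetical function name and a smoothing prior chosen for illustration; CatBoost itself uses several permutations and more machinery.

```python
import random


def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each row from the targets of rows that precede it in a random permutation."""
    n = len(categories)
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in perm:
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        # Only earlier rows contribute, so the row's own target is never used.
        encoded[i] = (s + prior * prior_weight) / (k + prior_weight)
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded
```

The property worth stating in an interview: changing the label of a row does not change that row's encoded value, which is exactly what naive target encoding fails to guarantee.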
The other win is defaults. CatBoost ships sensible learning rates, regularisation, and depth out of the box, so a 5-line script often gets close to, and sometimes beats, a tuned XGBoost. That matters more than people admit in an interview: time-to-baseline is a real production constraint.
Feature comparison table
Memorise this table before the screen. If you can reproduce four out of six rows under interview pressure, you are ahead of most candidates.
| Dimension | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Algorithm | Gradient boosting, level-wise | Gradient boosting, leaf-wise, histogram | Gradient boosting, ordered, symmetric trees |
| Categorical handling | Experimental native mode; usually one-hot or target encode upstream | Native, splits on per-category gradient statistics with smoothing | Ordered target encoding, leak-safe |
| Growth strategy | Level-wise (breadth-first, balanced) | Leaf-wise (best-first, can go deep) | Symmetric / oblivious (all nodes at a depth share split) |
| Training speed | Baseline | 2-10x faster on large data | Comparable to XGBoost, slower than LightGBM |
| GPU support | Yes, mature | Yes, mature | Yes, often the fastest on dense GPU |
| Regularisation | L1 + L2 in objective, min_child_weight | L1 + L2, min_data_in_leaf, GOSS | L2, ordered boosting, per-feature bayesian noise |
Load-bearing trick: the growth strategy is the single most decision-relevant property. Level-wise (XGBoost) overfits less on small data, leaf-wise (LightGBM) overfits faster on small data but dominates on large data, and symmetric (CatBoost) trades a bit of capacity per node for fast inference and stable training.
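Why symmetric trees are fast at inference is clearest in code: because every level shares one split, a prediction is a handful of comparisons packed into a leaf index, with no branching on tree structure. This is a minimal sketch with hypothetical names and a list-based tree layout, not CatBoost's internal representation.

```python
def oblivious_tree_predict(row, splits, leaf_values):
    """Evaluate one symmetric tree.

    splits: one (feature_index, threshold) pair per depth level.
    leaf_values: 2 ** len(splits) values, indexed by the comparison bits.
    """
    leaf = 0
    for feature, threshold in splits:
        # Each level contributes one bit: 1 if the row goes right, else 0.
        leaf = (leaf << 1) | int(row[feature] > threshold)
    return leaf_values[leaf]
```

For a depth-6 tree that is six comparisons and one table lookup per row, which is straightforward to vectorise across rows and trees.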
Categorical features done right
This is the part interviewers probe hardest, because it is where naive pipelines silently leak the target.
XGBoost has no fully native categorical mode (the experimental flag exists but few teams trust it in production). You have to encode upstream — one-hot for low cardinality, target encoding with out-of-fold smoothing for high cardinality.
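```python
import pandas as pd
import xgboost as xgb

# One-hot encode low-cardinality categoricals; split df into train/val afterwards.
df = pd.get_dummies(df, columns=["country", "device"])
# XGBoost 2.x expects early_stopping_rounds in the constructor; older releases took it in fit().
model = xgb.XGBClassifier(
    n_estimators=500, max_depth=6, learning_rate=0.05, early_stopping_rounds=50
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```

LightGBM accepts the categorical column directly and splits on it using smoothed per-category gradient statistics.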
```python
import lightgbm as lgb

# Categorical columns should be pandas "category" dtype (or integer-coded).
model = lgb.LGBMClassifier(n_estimators=500, num_leaves=63, learning_rate=0.05)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    categorical_feature=["country", "device"],
    callbacks=[lgb.early_stopping(50)],
)
```

CatBoost does the cleanest job: pass the column names, and ordered target encoding runs inside the trainer with no leakage.
```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=1000,
    depth=6,
    learning_rate=0.05,
    cat_features=["country", "device"],
    early_stopping_rounds=50,
    verbose=False,
)
model.fit(X_train, y_train, eval_set=(X_val, y_val))
```

For a column with 5,000+ unique values (think user_id, merchant_id, zip_code), CatBoost typically beats hand-rolled target encoding by 2 to 5 ROC-AUC points on the holdout. That is the number to drop in the interview if asked for a concrete win.
Tuning the parameters that matter
Across all three libraries the same five knobs do 90 percent of the work. Memorise their ranges and stop there for a first pass.
```
n_estimators       100 - 5000   (use early stopping, don't tune by hand)
learning_rate      0.01 - 0.1   (lower = more trees needed, more stable)
max_depth          3 - 10       (XGBoost, CatBoost)
num_leaves         15 - 127     (LightGBM equivalent of max_depth)
subsample          0.7 - 1.0    (stochastic, helps regularise)
colsample_bytree   0.7 - 1.0    (feature bagging per tree)
```

Library-specific extras worth knowing: reg_alpha and reg_lambda (XGBoost L1 / L2 in objective), min_data_in_leaf (LightGBM, raise to 100+ on small data to neutralise leaf-wise overfitting), l2_leaf_reg and bagging_temperature (CatBoost). Early stopping on a held-out validation set with early_stopping_rounds=50 is non-negotiable: without it n_estimators becomes a guessing game and your model will overfit by hundreds of trees.
Reach for Bayesian optimisation or Optuna for the final 1-2 percent of CV gain, but only after the manual sweep above stops improving.
Common pitfalls
When candidates fail this question, it is almost always one of the same five mistakes — and three of them happen in the categorical pipeline.
The first pitfall is using XGBoost with one-hot encoding on a 1,000-class column. The matrix explodes, training time doubles, and the trees waste depth splitting on sparsity. The fix is to either switch the library to CatBoost, or fold rare categories into an "other" bucket and target-encode the rest with out-of-fold smoothing. If interpretability matters, target encoding plus XGBoost is still cleaner than 1,000 dummy columns.
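The "fold rare categories, then target-encode out of fold" fix can be sketched without any library. The version below assumes the rows are already shuffled (folds are assigned by index modulo `n_folds`) and uses hypothetical function names and a simple additive smoothing; treat it as an illustration of the pattern rather than a production encoder.

```python
from collections import Counter


def fold_rare(categories, min_count=20, other="other"):
    """Replace categories seen fewer than min_count times with one bucket."""
    counts = Counter(categories)
    return [c if counts[c] >= min_count else other for c in categories]


def oof_target_encode(categories, targets, n_folds=5, smoothing=10.0):
    """Encode each row with statistics computed on the other folds only."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = Counter(), Counter()
        total, seen = 0.0, 0
        for i in range(n):
            if i % n_folds != fold:  # training part for this fold
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        prior = total / seen if seen else 0.0
        for i in range(n):
            if i % n_folds == fold:  # held-out part gets encoded
                c = categories[i]
                # Smoothing pulls rare categories toward the training-fold mean.
                encoded[i] = (sums[c] + smoothing * prior) / (counts[c] + smoothing)
    return encoded
```

At serving time the encoder would be refit once on the full training set and frozen, so the out-of-fold step applies to training rows only.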
The second is LightGBM on small data. Leaf-wise growth keeps drilling down on the highest-gain leaf, and on a few thousand rows it memorises noise within fifty iterations. The fix is either to switch to XGBoost or CatBoost, or to raise min_data_in_leaf to 100 or more and cap num_leaves at 31. Both reintroduce the kind of capacity ceiling that level-wise growth has for free.
The third is leaving n_estimators at the default of 100 (the scikit-learn wrapper default in XGBoost and LightGBM; CatBoost defaults to 1,000 iterations). That is often less than half of what gradient boosting needs to converge on a real dataset. Set it to 5,000, turn on early stopping, and let the validation set tell you when to halt. Equally damaging is the reverse: training 10,000 iterations with no early stopping, then deploying a model that overfit eight thousand trees ago.
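What early stopping does under the hood is simple enough to write out. The sketch below is library-agnostic: `fit_one_round` and `validation_loss` are hypothetical callables standing in for "add one tree" and "score the held-out set", and the patience rule mirrors the usual "no improvement for N rounds" convention.

```python
def train_with_early_stopping(fit_one_round, validation_loss,
                              max_rounds=5000, patience=50):
    """Boost until the validation loss has not improved for `patience` rounds."""
    best_loss, best_round = float("inf"), 0
    for round_idx in range(1, max_rounds + 1):
        fit_one_round()
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            break  # validation loss has stalled, stop adding trees
    return best_round, best_loss
```

The returned `best_round` is the number of trees to keep; everything trained after it only fits noise in the training set.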
The fourth is comparing libraries by training accuracy. All three will hit near-perfect training accuracy on any non-trivial dataset given enough trees — the comparison only makes sense on a held-out validation set or, better, stratified K-fold cross-validation with the same folds across libraries. Different folds give different winners, and that is enough noise to make a bad pick look correct.
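One way to guarantee identical folds across libraries is to compute the fold assignment once and pass the same indices to every model. Below is a minimal stratified assignment in plain Python; the function name is hypothetical, and in practice a library splitter with a fixed seed does the same job.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=42):
    """Return a fold id per row, keeping class proportions similar in every fold."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled rows of each class round-robin across the folds.
        for position, i in enumerate(indices):
            fold_of[i] = position % n_folds
    return fold_of
```

Store the resulting list next to the experiment config, then train XGBoost, LightGBM, and CatBoost on exactly those splits so score differences reflect the model rather than the partition.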
The fifth is mixing different categorical encodings between train and serve. If you target-encode at train time with the full dataset and at serve time with only the recent rows, the encoding drifts and the model decays silently. Pin the encoder, version it next to the model artefact, and write a test that compares encoded distributions between train and prod weekly.
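The weekly train-versus-prod check could be as small as a population stability index (PSI) over the encoded column. PSI is a swapped-in choice here, not something the pitfall prescribes; the function name, the equal-width binning on the training range, and any alert threshold you attach to it are assumptions for illustration.

```python
import math


def population_stability_index(train_values, prod_values, n_bins=10, eps=1e-6):
    """Compare two samples of an encoded feature; 0 means identical bin shares."""
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant column

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            b = int((v - lo) / width)
            counts[max(0, min(b, n_bins - 1))] += 1  # clip out-of-range values
        # eps keeps empty bins from producing log(0).
        return [max(c / len(values), eps) for c in counts]

    p, q = shares(train_values), shares(prod_values)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

A test would compute this on a stored sample of training encodings against the latest week of production encodings and fail when the value jumps well above its usual level.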
Related reading
- Decision trees on the data science interview
- Hyperparameter tuning on the data science interview
- Cross-validation strategies on the data science interview
- Bayesian optimisation interview
FAQ
Which one should I pick if I can only learn one library?
Learn LightGBM if you work on large tabular data, and CatBoost if your features are mostly categorical or you want strong defaults with minimal tuning. XGBoost is the most widely deployed and the one most legacy systems still run, so it is the safest single answer if you can only memorise one — interviewers will know it and your future colleagues will have written tooling around it.
Why is leaf-wise growth more accurate but riskier?
Leaf-wise growth expands the single leaf that reduces the loss the most, regardless of where it sits in the tree. That means the tree concentrates capacity exactly where the data wants it, which is more sample-efficient on large datasets. The risk is that on small or noisy datasets the highest-gain split often reflects noise, and leaf-wise keeps drilling into it. Level-wise growth (XGBoost) spreads the splits across the whole depth, which acts as implicit regularisation.
Does CatBoost really have no leakage in categorical encoding?
CatBoost's ordered target statistics compute the encoding for each row using only earlier rows in a random permutation, and the results are averaged across multiple permutations. That blocks the standard leakage path where the encoded value sees its own target. It is not magic: if you have post-event features (a column derived from data only available after the prediction time), no encoding will save you. But for standard categorical columns, CatBoost's encoding is the closest thing to leak-safe out of the box.
How much faster is LightGBM than XGBoost in practice?
On 1-10 million rows with modest feature counts, expect 2 to 5x faster training. On 50+ million rows and high cardinality, 5 to 10x is normal. The gap narrows on smaller datasets where the histogram setup cost dominates. On GPU the ranking can flip — XGBoost and CatBoost have very strong GPU paths, so benchmark before locking in a library for production.
Can I use these libraries for ranking or survival, not just classification and regression?
Yes. All three ship learning-to-rank objectives (rank:pairwise in XGBoost, lambdarank in LightGBM, YetiRank in CatBoost), and CatBoost has a strong ranking mode used in search. LightGBM and XGBoost both support custom objectives, so any twice-differentiable loss works, including survival analysis with Cox or AFT objectives, which XGBoost also ships built in. Just make sure your evaluation metric matches the objective; using accuracy on a ranking task is a classic interview trap.
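"Twice-differentiable" matters because a custom objective has to hand the library a first and a second derivative for every row. As a minimal sketch, here is that pair for binary log loss on raw scores; the function name is hypothetical, and the exact callback signature each library expects differs by library and version, so check the documentation before wiring it in.

```python
import math


def logloss_grad_hess(raw_scores, labels):
    """Per-row gradient and Hessian of binary log loss w.r.t. the raw score."""
    grads, hess = [], []
    for score, y in zip(raw_scores, labels):
        p = 1.0 / (1.0 + math.exp(-score))  # sigmoid turns the score into a probability
        grads.append(p - y)                 # first derivative
        hess.append(p * (1.0 - p))          # second derivative, always positive
    return grads, hess
```

Any loss for which you can write these two lists, such as a Cox partial likelihood or an asymmetric regression penalty, can be plugged in the same way.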
Is GPU training always worth it?
Not always. The break-even is around 1 million rows for most CPUs versus a modern GPU. Below that, the CPU implementations are often faster because the GPU launch overhead dominates. Above 10 million rows the GPU wins decisively, especially for CatBoost's dense oblivious trees and LightGBM's histogram path. If your dataset fits in RAM and you only train once a day, GPU is rarely the bottleneck — engineer time tuning the model wins more than hardware.