May 18, 2026·12 min read

Pandas performance optimization

Q: How do I profile pandas code properly?

Start with `%timeit` on individual operations to find the slow line, then use `%prun` or `line_profiler` to attribute the cost. For memory, `memory_profiler` and `df.info(memory_usage='deep')` are the two tools you need. The mistake to avoid is profiling on a tiny test frame — the constants dominate at small sizes and you draw the wrong conclusions. Profile on at least 1M rows.

Q: Is pyarrow-backed pandas worth turning on?

Yes, in 2026, for most analytics workloads. Pass `dtype_backend='pyarrow'` to `read_csv` or `read_parquet` and you get faster string operations, faster IO, and proper nullable types for free. The rough edges are around legacy code that assumes numpy semantics for `NaN` versus pyarrow's `null`; if you have a large existing codebase, migrate one IO call at a time rather than flipping the global.

Practice Python for data interviews

200+ pandas, numpy, and data-wrangling problems with explanations.

Join the waitlist

Contents:

Why pandas gets slow
Benchmark: vectorize vs apply vs loop vs numba
The vectorization toolkit
Memory: dtypes, categoricals, downcasting
When pandas is the wrong tool
Common pitfalls
Related reading
FAQ

Why pandas gets slow

Pandas is the default workhorse for tabular work in Python, and it is genuinely fast when you use it the way the C extensions want to be used. The trouble starts when you treat a DataFrame like a Python list — iterating rows, calling lambdas, building columns one cell at a time. On a 10-row toy frame nobody notices. On a 10M-row event log the same code goes from a sub-second numpy call to a 4-minute Python loop, and a junior starts emailing about "pandas being slow".

It almost never is. The hot path through pandas is vectorized numpy under the hood, and when you bypass it — iterrows, apply with a Python lambda, chained .loc writes — you collapse to interpreter speed. Three numbers matter most: how many rows you have, how many bytes per row, and how many Python function calls your transformation incurs. Get those right and a 10-minute job runs in 30 seconds without changing a single dependency.

This guide is the playbook you run when an interviewer points at a slow script and says "make this faster". It is also the checklist you run on your own code before you blame pandas, switch to polars, or fire up a Spark cluster you do not need.

Load-bearing trick: if you can express the transformation as a sequence of column-wise operations on numpy arrays, pandas is fast. If you cannot, pandas is the wrong abstraction and no tuning will save you.

Benchmark: vectorize vs apply vs loop vs numba

Before tuning anything, calibrate your intuition. The table below benchmarks the same task — compute c = a * b + log(a + 1) on a DataFrame with 10M float64 rows — across four implementations, measured on a Macbook M2 with pandas 2.2 and numba 0.59.

Implementation	Code shape	Wall time	Speedup vs loop
Pure Python `for` loop over `iterrows`	`for i, row in df.iterrows()`	312.0 s	1x baseline
`df.apply(fn, axis=1)`	Python callable per row	41.8 s	7.5x
Vectorized numpy	`df['a'] * df['b'] + np.log1p(df['a'])`	0.18 s	1,733x
Numba `@njit` over numpy arrays	jit-compiled kernel on `.to_numpy()`	0.11 s	2,836x

The shape of the table is the lesson. apply is not a fast escape from a for loop — it is a slightly nicer wrapper around the same per-row Python overhead, worth a single order of magnitude at best. Vectorization buys you three to four orders of magnitude for free. Numba on top adds a final 1.5–2x when the kernel is arithmetic-heavy and you can keep the data in raw numpy arrays.

Sanity check: if your "optimization" replaces a Python loop with apply and you are pleased about a 7x speedup, you left 200x on the table. Go back and vectorize.

The numbers shift by workload — strings and conditionals do not always vectorize as cleanly as arithmetic — but the ordering is stable across every benchmark I have run.

The vectorization toolkit

Vectorization in pandas means operating on whole Series at once, letting the underlying numpy or pyarrow kernel run in C without bouncing through the Python interpreter. The mental model: every time you write df['x'].something(...) and something is a built-in pandas method, you stay in C. Every time you write lambda x: ..., you do not.

The most common rewrites:

import numpy as np
import pandas as pd

# Arithmetic — already vectorized
df['total'] = df['price'] * df['qty']

# String ops — use the .str accessor, not apply
df['upper'] = df['name'].str.upper()
df['domain'] = df['email'].str.split('@').str[1]

# Conditional logic — np.where, np.select beat apply
df['tier'] = np.where(df['mrr'] >= 1000, 'enterprise', 'smb')

df['bucket'] = np.select(
    condlist=[df['age'] < 18, df['age'] < 35, df['age'] < 65],
    choicelist=['minor', 'young_adult', 'adult'],
    default='senior',
)

# Date math — .dt accessor
df['hour'] = df['ts'].dt.hour
df['weekday'] = df['ts'].dt.day_name()

When built-ins cannot express the logic, drop to numpy directly with df['col'].to_numpy() and write the kernel as a numpy expression. If that is still not enough, wrap the kernel in @numba.njit and pass the numpy array — the fourth row of the benchmark. The one thing you should almost never reach for is df.apply(fn, axis=1), because it forces pandas to construct a Python Series per row, the slowest object in the entire stack.

For groupby work, prefer the named aggregations API to apply:

# Fast — pandas dispatches each agg to a C kernel
df.groupby('country').agg(
    revenue=('amount', 'sum'),
    orders=('order_id', 'nunique'),
    avg_basket=('amount', 'mean'),
)

# Slow — Python function called once per group
df.groupby('country').apply(lambda g: custom_metric(g))

The agg form scales linearly with rows. The apply form scales with groups times Python overhead per group, which blows up above a few thousand groups.

Memory: dtypes, categoricals, downcasting

Performance and memory are the same problem in pandas. A DataFrame that fits in L3 cache runs operations at memory bandwidth; one that has to stream from main memory or swap is 10–100x slower regardless of how clever your code is. The first audit you run on a slow frame is df.memory_usage(deep=True).sum() / 1024**2 — if you see hundreds of MB on what should be a small frame, the dtypes are wrong.

The three reliable wins:

Optimization	Typical reduction	When to use
`object` to `category`	5–20x on low-cardinality strings	Columns where `nunique / len < 0.5`
`int64` to `int32` or `int16`	2–4x	Counts, ages, IDs that fit the smaller range
`float64` to `float32`	2x	Most analytics — ML features, ratios, money in cents

A practical sweep on a fresh frame:

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.select_dtypes(include='object'):
        if df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')

    for col in df.select_dtypes(include='integer'):
        df[col] = pd.to_numeric(df[col], downcast='integer')

    for col in df.select_dtypes(include='float'):
        df[col] = pd.to_numeric(df[col], downcast='float')

    return df

Two cautions. category is cheap to read but expensive to mutate — appending rows or merging frames with mismatched categories costs a re-encode. And float32 changes numerical precision; do not do this on financial columns that need exact arithmetic. Use Decimal or fixed-point integers (cents instead of dollars) for money.

The other free win is reading only the columns you need. pd.read_csv('big.csv', usecols=['user_id', 'amount', 'ts'], dtype={'user_id': 'int32', 'amount': 'float32'}) skips unused columns at parse time and applies the right dtypes immediately, often cutting load time by 5–10x. Better still, store source data as Parquet — column-pruning is native, dtypes survive the round-trip, reads are typically 3–5x faster than CSV.

Practice Python for data interviews

200+ pandas, numpy, and data-wrangling problems with explanations.

Join the waitlist

When pandas is the wrong tool

Pandas has a soft ceiling around single-machine RAM. If your working set fits in 30–50% of your free memory and your operations are vectorized, pandas is excellent. Once the working set crowds the rest of memory — or once you find yourself writing chunked iterators to avoid OOM — you are using the wrong tool.

The honest replacement matrix:

Tool	When it wins	Trade-off
polars	Single-machine, 10M–500M rows, mostly vectorized work	Different API; lazy mode requires rewriting expressions
DuckDB	You already think in SQL; ad-hoc analytics on Parquet	Worse for ML feature pipelines
numba / numpy	Heavy arithmetic kernels on numeric arrays	No DataFrame ergonomics
Dask	Frame is 2–10x larger than RAM, pandas API matters	Real distributed bugs; not a silver bullet
Spark	100M+ rows, cluster available, ETL more than analysis	Operational overhead, JVM tuning

A common pattern in 2026 stacks is bulk filter and aggregation in DuckDB or polars against Parquet on S3, then handing the smaller resulting frame to pandas for last-mile feature engineering and model handoff. The heavy lifting stays in a Rust or C++ engine; pandas is used only where its ecosystem (scikit-learn, statsmodels, matplotlib) actually pays off.

Gotcha: do not switch to polars or Spark just because pandas "feels slow". Profile first. Roughly 80% of "slow pandas" tickets I have triaged were a single apply or a chained-indexing write that disappeared with a one-line fix.

Common pitfalls

The most expensive pandas mistake is iterating rows with iterrows or itertuples when a vectorized expression exists. The benchmark puts a number on it — three orders of magnitude — but the social cost is worse, because the loop is usually buried inside a function called many times and the slowness gets blamed on data volume. The fix is to read the loop carefully, identify the per-row operation, and rewrite it as a Series expression.

A close second is chained indexing — writes of the form df[df['x'] > 0]['y'] = 1. This silently does nothing, because pandas returns a copy of the filtered view and assigns to it. The SettingWithCopyWarning exists for this case, but in practice teams suppress it. The correct form is df.loc[df['x'] > 0, 'y'] = 1, which fixes the bug and runs faster.

Another trap is object dtype on string columns. Reading a CSV by default gives you object-typed strings, essentially Python lists of strings where every operation pays Python overhead. Converting to category (low cardinality) or string[pyarrow] (high cardinality) restores C-speed string ops and cuts memory by 5–20x. Modern pandas 2.0+ makes this trivial: pd.read_csv(path, dtype_backend='pyarrow') gives you arrow-backed strings everywhere.

The fourth pitfall is abusing inplace=True as a perceived performance trick. It is not. In pandas 2.x most inplace=True operations are deprecated, and where they work they rarely save memory because the implementation still allocates intermediate buffers. Use df = df.method(...) and let the GC do its job.

The last is forgetting that joins are O(rows × keys) when the key is not unique on both sides. A merge between two 10M-row frames on a non-unique key can explode into a 100M-row intermediate, OOM-ing the process. Fixes: deduplicate keys before the merge, set the join key as the index on the smaller frame (enables a hash join), or switch to DuckDB which handles this case without materializing the explosion.

If you want to drill pandas and Python questions like this in a structured way, NAILDD is launching with hundreds of real interview-grade problems on exactly this pattern.

FAQ

Should I switch from pandas to polars for new projects?

For greenfield analytics work on single machines with frames above a few million rows, polars is usually the better default in 2026 — it is faster, the lazy engine catches optimization mistakes pandas would miss, and the API has fewer footguns. For ML feature engineering that feeds scikit-learn or Hugging Face transformers, the ecosystem gravity still favors pandas. Most teams standardize on polars for ETL and aggregation, then convert to pandas only at the model handoff.

Does `apply` ever beat vectorization?

Almost never, but there are honest exceptions. If your operation calls an external service per row (a model API, a geocoder), the bottleneck is network latency, not pandas, and apply with a thread pool via swifter or pandarallel is fine. The other exception is genuinely irreducible per-row logic — a stateful parser whose output depends on the previous row. For everything else, if you think you need apply, you probably need np.select or .str methods.

What about modin as a drop-in replacement?

Modin promises "change one import line and pandas gets faster". In practice it works for embarrassingly parallel operations on large frames and disappoints on harder cases (joins, multi-step pipelines, mixed dtypes). It is worth trying because the cost is one line, but do not budget a project around the speedup. If you need real parallelism, you almost always end up on polars or DuckDB instead.

How do I profile pandas code properly?

Start with %timeit on individual operations to find the slow line, then use %prun or line_profiler to attribute the cost. For memory, memory_profiler and df.info(memory_usage='deep') are the two tools you need. The mistake to avoid is profiling on a tiny test frame — the constants dominate at small sizes and you draw the wrong conclusions. Profile on at least 1M rows.

Is pyarrow-backed pandas worth turning on?

Yes, in 2026, for most analytics workloads. Pass dtype_backend='pyarrow' to read_csv or read_parquet and you get faster string operations, faster IO, and proper nullable types for free. The rough edges are around legacy code that assumes numpy semantics for NaN versus pyarrow's null; if you have a large existing codebase, migrate one IO call at a time rather than flipping the global.

When is numba actually worth the complexity?

When you have a numeric kernel that runs over millions of rows, cannot be expressed cleanly in numpy, and is called repeatedly. Examples: a custom rolling window with non-standard semantics, a path-dependent feature like cumulative drawdown with reset rules, a Monte Carlo simulation inside a feature pipeline. Numba's @njit compiles the function to native code on first call and runs at C speed thereafter. The cost is that it only supports a numeric subset of Python and debugging compiler errors is painful — for nightly production pipelines it pays off, for one-off scripts it usually does not.