Pandas memory optimization

Practice Python for data interviews
200+ pandas, numpy, and data-wrangling problems with explanations.
Join the waitlist

Why pandas eats memory

Pandas sits on top of NumPy arrays, and NumPy defaults to the widest sane type for every column it reads. The two numbers that matter are 8 bytes per int64 cell and roughly 50+ bytes per object cell when that object is a Python string. A 10M-row, 20-column CSV that should comfortably fit in 500 MB routinely balloons to 2-3 GB in memory because read_csv picks int64 for everything numeric and object for anything that looks like text. At 100M rows the kernel dies; at 1B rows you never even started.

The fix is not a clever library — it is choosing the narrowest correct dtype for each column and pushing that choice as early as possible, ideally into read_csv itself. Done correctly, the same DataFrame drops to a fraction of its original size with no loss of precision and, in some cases, faster group-by performance as a bonus.

The mental model: pandas reserves space for the worst case unless you tell it otherwise.

Load-bearing trick: specify dtype= inside read_csv rather than casting after the fact. Reading then casting allocates both the wide and narrow arrays at the same time — you peak at the bad number.

Diagnose first, optimize second

Before you change anything, measure. The number to trust is memory_usage(deep=True), because the shallow version reports 8 bytes per pointer for object columns and lies to you about strings.

import pandas as pd

df = pd.read_csv('events.csv')

print(df.info(memory_usage='deep'))
print(df.memory_usage(deep=True).sort_values(ascending=False).head(10))

In practice the top one to three columns hold 80% of total footprint. They are almost always either high-cardinality strings stored as object or int64 columns whose actual range fits in int16. Optimize those first; the rest is rounding error.

dtype reference and downcasting

Numeric downcasting is the cheapest win. Pick the smallest type that holds your actual range — not the theoretical maximum, the values you see in the data.

dtype Bytes per cell Range When to use
int8 1 -128 to 127 Flags, small enums, age, day-of-month
int16 2 -32,768 to 32,767 Year, small counts, ZIP-like codes
int32 4 ±2.1B User IDs under ~2B, row counts
int64 8 ±9.2 quintillion Timestamps in ns, very large IDs
float32 4 ~7 significant digits Prices, probabilities, scores
float64 8 ~15 significant digits Scientific precision only
category varies depends on cardinality Strings with <1% unique ratio
object (str) ~50+ Python string overhead Genuinely free-form text

A 100M-row column of int64 user IDs costs 800 MB. The same column as int32 costs 400 MB. As int16, when applicable, 200 MB. You did not lose any information; you stopped paying for headroom you were not using.

Manual cast:

df['user_id'] = df['user_id'].astype('int32')
df['age'] = df['age'].astype('int8')
df['amount'] = df['amount'].astype('float32')

Or let pandas pick the smallest type that fits:

df['amount'] = pd.to_numeric(df['amount'], downcast='integer')
df['price'] = pd.to_numeric(df['price'], downcast='float')

The downcast keyword scans the column once and chooses the minimum width. For a mixed DataFrame, loop over numeric columns:

for col in df.select_dtypes(include='int64').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')

for col in df.select_dtypes(include='float64').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')

This single loop typically reclaims 30-50% of a numeric-heavy DataFrame on its own.

Categoricals, nullable ints, and parquet

For strings with a fixed vocabulary — country, status, device_type, subscription_tier — convert to category. Pandas stores the unique values once and replaces each cell with an integer code, so a column of 10M rows drawn from 12 distinct countries collapses from roughly 500 MB to under 20 MB.

df['country'] = df['country'].astype('category')
df['status'] = df['status'].astype('category')

The rule of thumb is the 1% threshold: if a column has fewer unique values than 1% of its row count, make it categorical. Above that, the overhead of maintaining the lookup outweighs the savings. Also skip categoricals for columns you actively mutate as strings — .str.lower(), .str.split(...) — because those operations force a round trip back to object.

Sanity check: before converting to category, run df[col].nunique() / len(df). If the ratio is above 0.01, leave it alone or look at hashing instead.

Nullable integers solve a different but related problem. Vanilla int64 cannot hold NaN, so the moment a single null sneaks into a column pandas silently upcasts the whole thing to float64 — doubling memory on what should have been an int32. The fix is the capital-I family:

df['user_id'] = df['user_id'].astype('Int64')   # nullable, integer-typed
df['score'] = df['score'].astype('Float32')
df['is_active'] = df['is_active'].astype('boolean')

These types carry a separate mask for missing values and stay integer-shaped, which matters both for memory and for downstream joins that expect integer keys.

Parquet is the other big lever. CSV is text — every digit is a byte, every comma is a byte, every type has to be inferred. Parquet is columnar, typed, and compressed.

Format Disk size Read memory Read time Preserves dtypes
CSV (gzip) 1.2 GB 2.0 GB 45 s No
CSV (raw) 5.0 GB 2.0 GB 38 s No
Parquet (snappy) 1.0 GB 0.8 GB 6 s Yes
Parquet (zstd) 0.7 GB 0.8 GB 7 s Yes

Same 10M-row table across formats. Parquet is 5-7x faster to read, 2.5x smaller in memory, and round-trips your categoricals and nullable ints without you reciting the schema again.

df.to_parquet('events.parquet', compression='snappy')
df = pd.read_parquet('events.parquet', columns=['user_id', 'amount', 'event_ts'])

The columns= argument is the unsung hero: a parquet file with 50 columns from which you need 3 is read roughly 15x faster than a full scan, because the unused columns are never touched on disk.

Practice Python for data interviews
200+ pandas, numpy, and data-wrangling problems with explanations.
Join the waitlist

Chunked reads and benchmark results

When the source file is bigger than RAM no matter what dtypes you pick, switch to chunked reads. The point is to never hold the whole file in memory at once — you stream, filter or aggregate, and concatenate only what survives.

total = 0.0
for chunk in pd.read_csv('huge.csv', chunksize=200_000,
                         dtype={'user_id': 'int32', 'amount': 'float32'}):
    total += chunk.loc[chunk['amount'] > 0, 'amount'].sum()

Notice the dtypes are pushed into the iterator — every chunk lands narrow, not wide. For row-level filtering with no aggregation, collect the surviving rows and concatenate at the end:

keep = []
for chunk in pd.read_csv('huge.csv', chunksize=200_000,
                         dtype={'user_id': 'int32'}):
    keep.append(chunk[chunk['country'] == 'US'])
df = pd.concat(keep, ignore_index=True)

Here is a real before/after on a 10M-row events table I keep around for benchmarking. Columns: user_id, event_type, country, device, amount, ts, is_paid, plus 13 numeric features.

Stage Total memory int64 cols object cols float64 cols
Naive read_csv 2,140 MB 760 MB 980 MB 320 MB
After numeric downcast 1,360 MB 240 MB 980 MB 110 MB
After categoricals 410 MB 240 MB 30 MB 110 MB
read_csv with full dtype map 380 MB 220 MB 28 MB 105 MB
Same table from parquet 380 MB 220 MB 28 MB 105 MB

End to end: 2.14 GB down to 0.38 GB, an 82% reduction, with the bulk coming from the categoricals on event_type, country, and device and the int downcasting on the feature columns. The last two rows show that once your dtype map is right, the file format only affects read time, not steady-state memory.

If you want to drill pandas and SQL performance questions like this every day, NAILDD is launching with 500+ analytics interview problems across exactly these patterns.

Common pitfalls

The first mistake is casting after a full read instead of inside read_csv. The wide DataFrame still gets materialized once, so the peak memory is identical to the unoptimized case — you only saved the steady state. The fix is the dtype= dictionary on read_csv plus parse_dates= for timestamps. If you have to discover dtypes empirically, do it on a sampled nrows=100_000 read first, then apply the map to the full file.

A close second is silent float upcasting from a single NaN. You set user_id to int32, a downstream join introduces missing rows, and the column quietly becomes float64 — doubling its footprint and breaking any later code that expects integer keys. The fix is to use Int32 (capital I) for any integer column that may ever contain nulls, or to fill nulls before the cast with a sentinel value you treat carefully.

The third trap is over-eager categoricals. Converting an email or event_id column to category creates a lookup table nearly as large as the original column plus a code array on top — strictly worse than leaving it as object. The 1% unique ratio is the right cutoff. Email, URL, session ID, and free-text comment fields are not categorical no matter how tempting the conversion looks.

The fourth is reading every column when you need three. usecols=['user_id', 'amount', 'ts'] is the cheapest optimization in this entire guide and the most consistently forgotten. The same goes for parquet's columns= argument. If your downstream code only touches a handful of fields, the IO layer should never have read the rest.

The fifth is forgetting that copies hide in chained operations. A df[df.x > 0].assign(y=...) allocates a fresh DataFrame; doing that inside a loop and appending to a list keeps every intermediate alive until the loop ends. Either aggregate inline as you go, or write the survivors to disk and reload, so the GC has a chance to reclaim memory between iterations.

FAQ

How much memory do these optimizations actually save?

For a typical mixed-type analytics DataFrame, expect 50-80% reduction end to end. The split is roughly half from categoricals on low-cardinality string columns and half from numeric downcasting. If your data is dominated by genuinely unique strings (URLs, free text, hashed IDs) the savings drop to 20-30% because there is nothing to compress; in that case the answer is usually chunking or a switch of tool, not more dtype tweaking.

Does narrowing dtypes hurt computation speed?

For categoricals, group-by and sort get faster, often noticeably, because integer comparison beats string comparison. For narrow numeric types the picture is mixed: many NumPy paths are most optimized for the 64-bit width, so an int16 column can be a few percent slower per operation than int64. In practice the memory pressure savings dominate — the kernel not swapping is worth more than 5% per op — but if you are tuning a tight inner loop, profile both.

When should I just give up on pandas and switch tools?

When the working set is bigger than ~10 GB even after optimization, or when you have hit the wall on chunked reads because you need joins across chunks that do not fit. Polars is the easiest jump — similar API, lazy execution, much better memory behavior. DuckDB lets you query parquet and CSV files directly with SQL without ever materializing a DataFrame. Dask keeps the pandas API but distributes work across cores or machines. Spark is the answer once you are on a cluster.

How do I handle datetime columns efficiently?

Always pass parse_dates=[...] to read_csv so timestamps land as datetime64[ns] instead of strings; the difference is roughly 8 bytes per cell vs 50+ bytes. For column-by-column conversion of an existing DataFrame, pd.to_datetime(df['ts'], format='%Y-%m-%d') with an explicit format is dramatically faster than letting pandas infer. If you only need day granularity, downcast to datetime64[s] (pandas 2.0+) for further savings.

Do these tricks apply to MultiIndex DataFrames?

Yes — each level of a MultiIndex has its own dtype and can be cast independently. A common win is converting string-typed index levels to category exactly the way you would a column. Beware that some operations (notably unstack and certain joins) will widen index dtypes back to object as an intermediate step, so check df.index.dtypes after any reshape.