ClickHouse vs PostgreSQL for analysts

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why this comparison matters

PostgreSQL is the default OLTP engine at almost every Series A startup and the dialect you will hit in 80% of analyst interviews. ClickHouse is the default OLAP engine at companies that outgrew their nightly Postgres aggregation — Uber's logistics analytics, Cloudflare's request log warehouse, and a long list of late-stage startups running on billions of rows per day.

The probe at a staff interview is not "do you know SQL" — they assume that. It is whether you understand why SELECT SUM(amount) FROM events WHERE day = today() returns in 180 milliseconds on ClickHouse with 12 billion rows and times out on a Postgres replica with the same data. The answer is column storage, sparse indexes, and merge trees, and it shows up in the way you write queries and pick engines for dashboards.

The architectural split

PostgreSQL stores data row by row. Every row of an orders table lives in one contiguous block on disk, so a point lookup by primary key pulls all 50 columns in a single I/O. That layout, combined with MVCC and per-row locking, is exactly what an OLTP application needs to let thousands of concurrent users update their own orders cheaply.

ClickHouse stores data column by column. Each column lives in its own compressed parts, sorted by the table's ORDER BY key. A query like SUM(amount) WHERE day = '2026-05-17' reads two columns and ignores the other 48. With 10:1 compression on low-cardinality columns and SIMD aggregation over packed integers, the same machine that struggled with a 4-minute Postgres scan finishes in single-digit seconds. The trade-off is that updating one row means rewriting parts of multiple column files, which is why ClickHouse treats updates as asynchronous mutations.

Load-bearing trick: column storage is fast for analytics because you read 2 of 50 columns, not because the disk is magic. If your query touches every column (SELECT *), the column engine has no edge.

Feature comparison table

Dimension PostgreSQL ClickHouse
Storage layout Row-oriented (heap + indexes) Column-oriented (MergeTree parts)
Workload sweet spot OLTP, CRUD, mixed read/write OLAP, aggregations, time-series scans
Aggregation on 100M rows 30s to several minutes Sub-second to seconds
Single-row INSERT latency <1 ms, fully transactional Discouraged; batch inserts of 1k+ rows
UPDATE / DELETE First-class, row-level Async mutations, expensive
Point lookup by PK <1 ms with B-tree index Slower; sparse index, ~8k row granule
JOIN of two 1B-row tables Slow, often impractical Fast but memory-hungry; not idiomatic
Concurrency model MVCC, thousands of writers Append-heavy, single-writer per shard ideal
Schema flexibility ALTER TABLE is cheap ALTER on huge tables can be expensive
Approximate functions None native uniq, quantile, topK, uniqHLL12
Materialized views Manual REFRESH Auto-incremental on insert
Replication Streaming WAL replicas Built-in ReplicatedMergeTree + Keeper

The row worth memorizing is single-row INSERT: Postgres handles per-row writes all day long, ClickHouse accepts them but the merge tree generates a new part per insert and the compactor falls behind within minutes. The standard fix is to buffer writes — a Buffer table, a Kafka engine table, or an application-side batch of 10k to 100k rows per insert. Engineers who learn this the hard way usually learn it during their first on-call rotation.

Syntax differences that bite

The dialect drift between the two engines is small in volume but high in friction. The functions you reach for daily — date truncation, conditional aggregation, distinct counts, string aggregation — all have different names. Memorize the top ten and your code reviews will go faster.

-- PostgreSQL
SELECT DATE_TRUNC('month', created_at) AS month,
       COUNT(*) FILTER (WHERE status = 'paid')   AS paid_orders,
       SUM(amount) FILTER (WHERE status = 'paid') AS paid_revenue,
       COUNT(DISTINCT user_id)                   AS active_users
FROM orders
WHERE created_at >= NOW() - INTERVAL '90 days'
GROUP BY 1
ORDER BY 1;
-- ClickHouse equivalent
SELECT toStartOfMonth(created_at) AS month,
       countIf(status = 'paid')               AS paid_orders,
       sumIf(amount, status = 'paid')         AS paid_revenue,
       uniq(user_id)                          AS active_users
FROM orders
WHERE created_at >= now() - INTERVAL 90 DAY
GROUP BY month
ORDER BY month;

Three small changes, three gotchas. toStartOfMonth returns a Date rather than a Timestamp, which matters if you later join on the column. countIf and sumIf are conditional aggregates baked into the engine — they vectorize better than FILTER. uniq is an approximate count distinct backed by HyperLogLog with roughly 2% relative error; for exact answers you opt in with uniqExact, which is slower and memory-hungry.

-- Approximate vs exact distinct in ClickHouse
SELECT uniq(user_id)        AS approx_users,        -- HLL, fast, ~2% error
       uniqExact(user_id)   AS exact_users,         -- exact, slower
       uniqHLL12(user_id)   AS tunable_hll_users    -- 12-bit HLL, smaller memory
FROM events
WHERE event_date = today();

ClickHouse's array model is its other superpower for analysts. You collect values with groupArray, transform with arrayMap, filter with arrayFilter, explode with arrayJoin. Postgres has arrays too, but they are not idiomatic for analytics — Postgres analysts reach for STRING_AGG and lateral joins, ClickHouse analysts reach for groupArray and arrayJoin daily.

-- ClickHouse: build per-user event sequences and pick top 5 actions
SELECT user_id,
       arraySlice(arrayReverseSort(groupArray(event_type)), 1, 5) AS top_actions
FROM events
WHERE event_date >= today() - 30
GROUP BY user_id;
-- ClickHouse: explode an array column back into one row per element
SELECT user_id, arrayJoin(product_ids) AS product_id
FROM user_baskets
WHERE event_date = today();

Window functions look identical on the surface — ROW_NUMBER, RANK, LAG, LEAD, SUM(...) OVER. The trap is ROWS BETWEEN n PRECEDING AND CURRENT ROW: ClickHouse supports the common shapes, but for unusual frame definitions or session-style logic you often switch to array tricks. In practice, 90% of the window queries you wrote in Postgres work unchanged in ClickHouse.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

What interviewers actually ask

Five questions cover most of the territory at a senior data interview. The phrasing varies, the underlying ideas are stable.

Why is a columnar store faster for analytics? Analytical queries touch a small subset of columns. A 50-column events table aggregated on amount and day reads two files in column storage versus 50 in row storage. Add compression — 5:1 on high-cardinality columns, 30:1 on low-cardinality columns like country — and the I/O bill drops by an order of magnitude. SIMD aggregation multiplies the gap.

Why does ClickHouse discourage UPDATE and DELETE? Each mutation rewrites large chunks of the merge tree and runs as an async job, which means eventual consistency and contention with normal merges. The idiomatic patterns are ReplacingMergeTree (latest row per sort key wins), CollapsingMergeTree (paired insert with sign = -1), or VersionedCollapsingMergeTree for CDC pipelines.

When would you pick Postgres over ClickHouse on a large table? Three cases. Point lookup by primary key — Postgres B-trees return in under a millisecond, ClickHouse pays the 8,192-row granule cost. Row-level updates as core workload — orders, inventory. ACID multi-statement transactions, which ClickHouse does not support at the row level.

What is MergeTree? The default storage engine family. Data is written in immutable parts sorted by ORDER BY key; a background process merges small parts into larger ones, LSM-style. Variants to know: ReplacingMergeTree (deduplication), AggregatingMergeTree (incremental rollups), SummingMergeTree (sums on merge), ReplicatedMergeTree (multi-replica with Keeper).

Write a query that behaves differently in each engine. The classic: SELECT DISTINCT ON (user_id) user_id, amount FROM orders ORDER BY user_id, created_at DESC is Postgres-only. The ClickHouse equivalent is SELECT user_id, argMax(amount, created_at) FROM orders GROUP BY user_id — and it is faster on a 100M-row table because it skips the full sort.

PostgreSQL to ClickHouse cheat sheet

PostgreSQL ClickHouse
DATE_TRUNC('month', dt) toStartOfMonth(dt)
DATE_TRUNC('day', dt) toDate(dt)
EXTRACT(YEAR FROM dt) toYear(dt)
dt + INTERVAL '7 days' addDays(dt, 7) or dt + INTERVAL 7 DAY
AGE(dt1, dt2) dateDiff('day', dt2, dt1)
COUNT(*) FILTER (WHERE x) countIf(x)
SUM(a) FILTER (WHERE x) sumIf(a, x)
AVG(a) FILTER (WHERE x) avgIf(a, x)
COUNT(DISTINCT col) uniq(col) or uniqExact(col)
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col) quantile(0.5)(col)
STRING_AGG(col, ',') arrayStringConcat(groupArray(col), ',')
ARRAY_AGG(col) groupArray(col)
JSONB_EXTRACT_PATH_TEXT(j, 'k') JSONExtractString(j, 'k')
DISTINCT ON (col) argMax(other_col, order_col)
GENERATE_SERIES(1, 100) numbers(100)
UNNEST(arr) arrayJoin(arr)
NOW() now()
CURRENT_DATE today()
COALESCE(a, b) coalesce(a, b) or ifNull(a, b)
CASE WHEN ... THEN ... END identical

Pin this to your wall, you will look at it more often than you expect.

Common pitfalls

The most expensive mistake new ClickHouse users make is single-row INSERT loops. A naive ETL that reads from Kafka one message at a time creates one merge tree part per insert and surfaces as Too many parts (300). Merges are processing significantly slower than inserts within an hour. The fix is to batch 10,000 to 100,000 rows per insert, or put a Buffer table or Kafka engine table between producer and storage.

A close second is treating ClickHouse JOIN like Postgres JOIN. ClickHouse defaults to broadcast hash joins where the right-hand table is loaded into memory on every node — fine for a 10M-row dimension against a 10B-row fact, deadly if you swap the order. Put the smaller table on the right, use GLOBAL JOIN for distributed setups, or denormalize into a wide table with LowCardinality(String) for repeated values.

Time zone handling is another quiet trap. PostgreSQL's TIMESTAMP WITH TIME ZONE stores UTC and converts at display. ClickHouse's DateTime is timezone-naive by default; you declare DateTime('UTC') per column, and mixing columns with different declared timezones in a WHERE clause silently uses the server's setting. Standardize on DateTime('UTC') across the schema and convert at the dashboard layer.

A subtler pitfall is relying on uniq for billing-grade counts. uniq is HyperLogLog with about 2% relative error — perfect for product dashboards, dangerous for invoices. If finance asks "how many unique customers paid us last month," reach for uniqExact even though it is slower. Approximate is a feature, but the trade-off has to be explicit.

Schema design for ClickHouse demands more thought than Postgres because the ORDER BY key cannot change after creation without rewriting the table. Picking ORDER BY (user_id, event_time) versus ORDER BY (event_time, user_id) swings query speed by an order of magnitude depending on whether dashboards filter by user or date. Match the most common WHERE clauses on the largest tables and put low-cardinality columns first.

If you want a drill bench that includes both PostgreSQL and ClickHouse-style aggregation questions, NAILDD is launching with 500+ SQL problems graded against real interview rubrics.

FAQ

Does an analyst really need to know ClickHouse?

For junior roles, no — PostgreSQL covers the syntax bar at almost every interview. For mid and senior roles at companies past Series C, yes: at minimum the syntax differences (toStartOfMonth, countIf, uniq, argMax), the row-vs-column distinction, and why you batch inserts. Deep MergeTree knowledge shows up in staff-level system design rounds.

Can I write PostgreSQL syntax in a ClickHouse interview?

Usually yes, as long as you flag it. Most interviewers evaluate logic over dialect, so writing DATE_TRUNC in a ClickHouse shop will not fail you. Saying "I would write this as toStartOfMonth in ClickHouse, but the logic is the same" signals that you know the gap exists. The opposite — ClickHouse-only syntax in a Postgres shop — is riskier because the interviewer may not catch your intent.

Which engine should I learn first?

PostgreSQL, every time. Its syntax is the closest mainstream dialect to standard SQL, it is the default on every coding platform, and the concepts — joins, window functions, CTEs, query plans — transfer almost completely. Learn ClickHouse second, when you already have a Postgres reflex.

What about BigQuery, Snowflake, or Redshift?

All three are column-oriented analytical engines like ClickHouse, but they are managed cloud services with different pricing models and dialects. The mental model — columnar storage, vectorized aggregation, sparse indexes, approximate functions — transfers directly. If your target company runs on Snowflake, learn Snowflake syntax; the underlying intuition is the same.

How big does my data have to be before ClickHouse pays off?

Roughly, when a Postgres aggregation on the table you care about takes more than 30 seconds even with the right index. That typically lands between 100 million and 1 billion rows for an events table, depending on row width and how many columns the query touches. Below that threshold, a well-indexed Postgres with pg_stat_statements is almost always the right answer.