Materialized views on the DE interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

What a materialized view actually is

A regular SQL view is just a stored query — every read re-executes. A materialized view (MV) is the same query, except the result is physically stored on disk. Reads are milliseconds against a few thousand pre-aggregated rows instead of seconds against a ten-terabyte fact table. The trade-off: the result drifts out of sync with the source, so something has to refresh it.

At a Snowflake or Databricks DE interview the question lands on three axes: when do you replace a regular view with a materialized one, how do you refresh without locking out readers, and what differs between Postgres and ClickHouse MVs. The interviewer wants reasoning about read speed, freshness, and maintenance cost — and a refresh strategy that matches the business SLA.

The hidden cost of not using an MV shows up in compute bills. An analyst at Stripe pulling "revenue by day for the last 90 days" out of a 10TB orders fact table burns thirty seconds of warehouse time per query. A dashboard hitting that query a hundred times a day is roughly an hour of compute on a single tile. An MV pre-aggregates once, every read becomes 200 ms, and the bill drops by an order of magnitude.

Materialized views in Postgres and Greenplum

Postgres has the simplest possible materialized view implementation. You write the query once and the result is persisted. Reads hit the persisted relation directly; the optimizer treats it like an ordinary table, so you can index it the same way you would any other.

CREATE MATERIALIZED VIEW orders_daily AS
SELECT
    DATE_TRUNC('day', created_at) AS order_date,
    COUNT(*) AS orders_cnt,
    SUM(amount) AS revenue
FROM orders
GROUP BY 1;

CREATE INDEX ON orders_daily (order_date);

Refresh is an explicit, manual operation. There is no automatic propagation when the base table changes.

REFRESH MATERIALIZED VIEW orders_daily;
REFRESH MATERIALIZED VIEW CONCURRENTLY orders_daily;

The first form takes an ACCESS EXCLUSIVE lock on the view for the duration of the rebuild — every read and write against the MV blocks until the refresh finishes. The CONCURRENTLY variant writes the new result to a temp relation and swaps the heap with a brief lock at the very end. Concurrent refresh requires at least one UNIQUE index on the view so Postgres can diff old rows against new ones; without that index the refresh aborts.

The honest weakness of Postgres MVs is the lack of native incremental refresh. Every refresh is a full rebuild from the source query. On a two-billion-row fact table that means scanning two billion rows even if only ten thousand new ones arrived. The standard workaround is to drive refreshes from outside the database — a cron job, an Airflow DAG, or a dbt run scheduled at the cadence your freshness SLA can tolerate. Greenplum inherits the same model with the added wrinkle that refresh has to ship data across segment hosts, so daily-partitioned refresh is almost always preferable to a full rebuild.

Materialized views in ClickHouse

ClickHouse takes a different approach. An MV in ClickHouse is not a cached result set — it is a trigger on INSERT into the source that writes aggregated rows into a target table, usually a SummingMergeTree or AggregatingMergeTree that merges partial aggregates in the background.

CREATE TABLE orders (
    order_id UInt64,
    user_id UInt64,
    amount Decimal(18,2),
    created_at DateTime
) ENGINE = MergeTree ORDER BY created_at;

CREATE MATERIALIZED VIEW orders_daily_mv
ENGINE = SummingMergeTree
ORDER BY order_date
AS
SELECT
    toDate(created_at) AS order_date,
    count() AS orders_cnt,
    sum(amount) AS revenue
FROM orders
GROUP BY 1;

The mechanics matter because they explain every quirk that follows. When a batch of one million rows is inserted into orders, the MV's SELECT runs against that batch only — not the full source. The result is written to MV storage. SummingMergeTree merges partial sums for the same order_date during background merges, so the final state is the correct aggregate even though no single write touched every existing row.

The MV is genuinely incremental and effectively real-time — no cron, no orchestrator, no manual refresh. But it only fires on writes through the source table. Backfill historical data directly into the target and the MV does not catch up. ALTER TABLE ... DELETE or ALTER TABLE ... UPDATE rows in the source and the MV is not re-computed — mutations in ClickHouse are deliberately rare and the MV does not subscribe to them. Ten MVs on one source table means every INSERT runs ten SELECT statements before acknowledging the write, so insert throughput drops in proportion.

ClickHouse 23.x added REFRESHABLE materialized views, which look more like the Postgres model — a target table, a SELECT, and a schedule:

CREATE MATERIALIZED VIEW orders_daily
REFRESH EVERY 5 MINUTE
TO orders_daily_target
AS SELECT toDate(created_at) AS order_date,
          count()           AS orders_cnt,
          sum(amount)       AS revenue
FROM orders GROUP BY 1;

Refreshable MVs are full rebuilds on a timer — they trade incrementality for simplicity. The right pick when the source has mutations the trigger-based MV would miss, or when the aggregation cannot be expressed as a streaming sum.

Refresh strategies

There are four shapes of refresh you should be ready to name in an interview.

Full refresh rebuilds the entire result from scratch. Always correct, the only option for Postgres MVs out of the box. The cost scales with the total size of the source query, so it stops being viable once the refresh takes longer than the SLA window.

Incremental refresh updates only the partitions or rows that have changed. In Postgres you implement this yourself by deleting the affected partition and re-inserting it, usually orchestrated by dbt or Airflow:

WITH new_data AS (
    SELECT
        DATE_TRUNC('day', created_at) AS order_date,
        COUNT(*)                       AS orders_cnt,
        SUM(amount)                    AS revenue
    FROM orders
    WHERE created_at >= '{{ ds }}'::DATE
      AND created_at <  '{{ ds }}'::DATE + INTERVAL '1 day'
    GROUP BY 1
)
DELETE FROM orders_daily WHERE order_date = '{{ ds }}'::DATE;
INSERT INTO orders_daily SELECT * FROM new_data;

This is functionally a dbt incremental model with unique_key = order_date and a delete-insert strategy. Interviewers love to ask "so is this an MV or a dbt model?" The honest answer is "both — the durable artifact is a table, the orchestration is in dbt".

Triggered refresh fires on a source event. ClickHouse's MV on INSERT is the canonical example. CDC pipelines built on Debezium and Kafka sit in the same category.

Time-based refresh runs on a cron. Cheapest to operate, hardest to tune. The SLA decides the cadence: real-time dashboards demand streaming or trigger-based approaches, hourly business reporting tolerates a five-minute cron, finance reconciliation wants a single overnight run on a clearly defined cutoff.

A rough mapping that holds up in most interviews: seconds-level freshness implies ClickHouse triggered MVs, streaming engines, or Materialize; minute-level fits ClickHouse refreshable MVs or a fast Airflow DAG; hourly tolerates dbt incremental on a cron; daily can be served by a Postgres MV with an overnight refresh.

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

MV versus dbt incremental model

This comparison comes up at almost every data engineering interview because the two patterns overlap and the interviewer wants to know whether you understand the difference.

Reach for a materialized view when the source and the result live in the same database, when you want the engine itself to manage refresh, and when you do not have an external orchestrator. ClickHouse MVs are particularly compelling because incremental refresh is built into the engine — no Airflow needed for the aggregation itself.

Reach for a dbt incremental model when you need cross-database lineage, when the pipeline has multiple steps with tests and documentation, when you want CI around every transformation change, or when the source and target may eventually move to different systems. dbt is engine-agnostic — the same logic runs on Snowflake, BigQuery, Databricks, ClickHouse, or Postgres.

The modern stance at most companies is that dbt incremental tables replace Postgres MVs in production analytics. ClickHouse MVs are the exception because the incremental engine is part of the database itself.

Common pitfalls

Teams hit the same handful of mistakes when they first ship materialized views. Naming any three tells the panel you have shipped MVs to production rather than just read about them.

The most frequent failure is treating a Postgres MV as if it refreshes itself. There is no background process — create the MV at noon, the source updates at one o'clock, and the MV still reflects noon forever until something calls REFRESH MATERIALIZED VIEW. The fix is to put refresh on a schedule in the same merge request that creates the MV, owned by the same person who owns the dashboard on top.

Running REFRESH without CONCURRENTLY against a production-facing aggregate is the next trap. The plain refresh takes an exclusive lock for the entire rebuild duration, so any dashboard reading the MV returns errors or hangs. Add the UNIQUE index and always run refreshes with CONCURRENTLY, after testing the refresh time on a representative dataset.

ClickHouse MVs over sources that take UPDATE or DELETE mutations silently go stale. The trigger only fires on INSERT, so any change that bypasses an insert is invisible. If rows can be updated, switch the source engine to ReplacingMergeTree and design the MV to handle deduplication, or move the pipeline to a refreshable MV.

MV pile-up on a single source table is a throughput killer invisible until production traffic hits. Every INSERT has to execute every attached MV's SELECT before acknowledging. Three MVs is usually fine; ten can cut insert throughput by an order of magnitude. Audit attachments quarterly and consolidate aggregates where you can.

Forgetting that an MV is just a table is the cheapest mistake to fix and the most common in greenfield code. The planner treats it as a relation, and queries against it still need indexes for filters to be selective. A WHERE order_date >= now() - INTERVAL '7 days' against an unindexed MV with two years of daily rows is a full scan — exactly what the MV was supposed to prevent.

Backfilling around the source is the ClickHouse version of the same lesson. If historical data is loaded directly into the target, the source-side trigger never fires and the MV does not match the source. Always backfill through the source table and confirm row counts before declaring backfill complete.

If you want to drill data engineering questions like this every day, NAILDD is launching with hundreds of SQL and DE problems built exactly around materialized view and pipeline design scenarios.

FAQ

What is the practical difference between a view, a materialized view, and a table?

A view is a saved query with no stored data — every read re-executes. A materialized view is a saved query with a persisted result, which makes reads fast but introduces a freshness gap closed by refresh. A table is the canonical source with no derivation logic. Most production pipelines end in plain tables produced by dbt, with views layered on top for abstraction and MVs reserved for hot aggregates that need millisecond reads.

Should I use a ClickHouse MV or a dbt incremental model for the same aggregation?

Pick ClickHouse MVs when source, aggregation, and consumer all live inside ClickHouse and the source is append-only — incremental refresh is free and there is no orchestrator to operate. Pick dbt incremental when the pipeline has multiple stages with tests and documentation, when sources or targets may move across systems, or when the team's deployment workflow is already centered on dbt's CI. Many production stacks run both — ClickHouse MVs for the hottest real-time aggregates, dbt incremental for everything else.

How many MVs can I safely attach to one source table?

Three to five is a comfortable range with no measurable insert overhead. By ten, the cumulative cost of running every MV SELECT on every batch shows up in p99 insert latency and merge-backlog time. Monitor source-table insert latency and merge backlog as you add MVs, and consolidate aggregates into a single MV with broader granularity when you can.

Can a materialized view be built over a JOIN?

In Postgres yes, with the caveat that refresh re-runs the full join — a wide join on two large tables refreshes slowly. In ClickHouse a JOIN-based MV is treacherous because the MV only fires on inserts into the left-side source. The right side is read as-of-insert-time, so any later change is invisible and the aggregate silently drifts. In practice either denormalize the right side into the source before the MV, or use a refreshable MV that rebuilds the full join on a schedule.

Do ClickHouse materialized views need their own indexes?

ClickHouse does not have B-tree indexes the way Postgres does. The MergeTree family relies on a sparse primary index defined by the ORDER BY clause of the target table, plus optional skip indexes. When designing a ClickHouse MV, the ORDER BY of the target table is the most important performance choice — match the column ordering to the predicates and GROUP BY clauses downstream queries use most often.

Is this an official guide?

No. The material is based on Postgres and ClickHouse documentation, dbt's incremental model guides, and patterns from production data warehouse work. Treat it as interview prep, not a substitute for testing the mechanics on the database version your team runs.