Feature store in the data science interview
Why feature stores show up in DS and MLE loops
If you are interviewing for a senior data scientist or machine learning engineer role at Stripe, DoorDash, Uber, Airbnb, Netflix, or a comparable ML-heavy shop, expect the system design round to probe train/serve skew, one of the most common production failure modes for ML systems. The feature store is the architectural answer the industry converged on, and interviewers expect you to discuss it without hand-waving.
The question rarely arrives as "what is a feature store". It arrives as "your model has 0.92 offline AUC but online lift is zero; what do you check first?" or "design a fraud-scoring service that answers in under 50 ms p99 with features over 30 days of history". Both are really the same question: how do you guarantee the features the model saw in training are the same features it sees in production?
A clean, structured answer signals seniority. Mumbling "we just rerun the SQL" signals you have not run an ML model in production. The surface area is small, the trade-offs are crisp, and the three or four load-bearing concepts fit in a single interview answer.
What a feature store actually is
A feature store is a centralized service for four things: storage of features, versioning of definitions, time-travel queries that return what a value was at a historical moment, and dual-write serving so the same feature is available for batch training and millisecond inference.
Concept diagram — the version interviewers want on the whiteboard:
```text
Raw events ──► Feature pipelines ──► Feature Store ──► Training (batch)
                                            └────────► Serving (real-time)
```

The store is not a database in the conventional sense. It is a thin layer above one or more databases that enforces a single source of truth for feature definitions. The Postgres or Snowflake table underneath is an implementation detail; what matters is that `avg_orders_30d` exists in exactly one place and is computed by exactly one piece of code regardless of who consumes it. Most candidates get this wrong by describing a key-value cache and stopping. A pure cache does not solve train/serve skew, because the cache and the training pipeline can still compute the value differently.
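The "exactly one piece of code" idea can be sketched in plain Python. This is a minimal illustration, not any real feature store API: `build_training_rows` and `materialize_online` are hypothetical helpers, and the order tuples are made-up sample data. Both paths call the same `avg_orders_30d` function and differ only in which timestamp they pass.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)


def avg_orders_30d(orders, user_id, as_of_ts):
    """Single definition: mean order amount in the 30 days ending at as_of_ts."""
    amounts = [
        amount
        for uid, event_ts, amount in orders
        if uid == user_id and as_of_ts - WINDOW <= event_ts <= as_of_ts
    ]
    return sum(amounts) / len(amounts) if amounts else None


def build_training_rows(orders, labels):
    """Offline path (hypothetical): evaluate the definition as of each row's timestamp."""
    return [
        (user_id, ts, avg_orders_30d(orders, user_id, ts), label)
        for user_id, ts, label in labels
    ]


def materialize_online(orders, user_ids, now):
    """Online path (hypothetical): evaluate the same definition as of now."""
    return {uid: avg_orders_30d(orders, uid, now) for uid in user_ids}


# Made-up sample data: (user_id, event_ts, amount)
orders = [
    ("u1", datetime(2025, 9, 1), 100.0),
    ("u1", datetime(2025, 10, 20), 10.0),
    ("u1", datetime(2025, 11, 1), 30.0),
    ("u1", datetime(2025, 11, 10), 50.0),
]
labels = [("u1", datetime(2025, 11, 4), 1)]

training_rows = build_training_rows(orders, labels)
online_values = materialize_online(orders, ["u1"], datetime(2025, 11, 12))
```

The training row sees only the orders inside the window ending 2025-11-04, while the online value reflects the window ending at materialization time. The numbers differ, but the logic that produced them cannot.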
Offline vs online store
Every feature store has two halves — identical from the API, completely different from the infrastructure side.
| Aspect | Offline store | Online store |
|---|---|---|
| Use case | Batch training, backfills, analysis | Real-time inference, online scoring |
| Typical backend | Snowflake, BigQuery, S3 + Parquet, Delta Lake | Redis, DynamoDB, Cassandra, ScyllaDB |
| Latency target | seconds to minutes | under 10 ms p99 |
| Storage size | terabytes to petabytes | tens to hundreds of GB |
| History kept | full, with timestamps | current value only |
| Read pattern | scan or join, billions of rows | point lookup by entity key |
| Cost driver | storage + compute on warehouse | RAM and write QPS |
Offline answers "what was this user's average order value as of 2025-11-04 14:00 UTC". Online answers "what is it right now, in under ten milliseconds, while the request thread is blocked".
A feature that lives in both stores must be kept in sync by a pipeline that periodically materializes the offline computation into the online store. Materialization cadence is itself an interview-worthy decision: hourly is common, every five minutes is achievable, and true streaming materialization with Flink or Kafka Streams is rare and expensive.
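The cadence trade-off is easy to put numbers on. The arithmetic below is a rough sketch under two stated assumptions: a hypothetical one million entities, and a pipeline that rewrites every entity on every run (incremental materialization would write fewer keys).

```python
def materialization_runs_per_day(cadence_minutes):
    """Number of scheduled runs in 24 hours for a fixed cadence."""
    return (24 * 60) // cadence_minutes


def online_writes_per_day(num_entities, cadence_minutes):
    """Upper bound on writes: assumes every run rewrites every entity."""
    return num_entities * materialization_runs_per_day(cadence_minutes)


NUM_USERS = 1_000_000  # hypothetical entity count, for illustration only

hourly = online_writes_per_day(NUM_USERS, 60)
five_minute = online_writes_per_day(NUM_USERS, 5)
ratio = five_minute // hourly
```

Under these assumptions, moving from hourly to five-minute materialization multiplies online write volume by twelve, which is the number to weigh against the freshness gain.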
Load-bearing rule: the offline store keeps history so you can reconstruct what the model would have seen; the online store keeps the latest value so the model can be served quickly. If you confuse the two, you either burn money on RAM or you serve stale features.
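A toy in-memory version of the two halves makes the rule concrete. The class and function names here (`OfflineStore`, `OnlineStore`, `materialize`) are hypothetical and stand in for a warehouse and a key-value store; this is a sketch of the shape, not a real implementation.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


class OfflineStore:
    """Keeps full history: every (timestamp, value) per entity."""

    def __init__(self):
        self._history = defaultdict(list)  # entity_id -> [(ts, value)] sorted by ts

    def write(self, entity_id, ts, value):
        self._history[entity_id].append((ts, value))
        self._history[entity_id].sort(key=lambda row: row[0])

    def as_of(self, entity_id, ts):
        """Return the latest value written at or before ts, else None."""
        rows = self._history.get(entity_id, [])
        idx = bisect_right([row_ts for row_ts, _ in rows], ts)
        return rows[idx - 1][1] if idx else None

    def latest(self):
        """Latest (ts, value) per entity, used by materialization."""
        return {e: rows[-1] for e, rows in self._history.items() if rows}


class OnlineStore:
    """Keeps only the current value per entity for point lookups."""

    def __init__(self):
        self._current = {}

    def put(self, entity_id, ts, value):
        self._current[entity_id] = (ts, value)

    def get(self, entity_id):
        row = self._current.get(entity_id)
        return row[1] if row else None


def materialize(offline, online):
    """Copy the latest offline value of each entity into the online store."""
    for entity_id, (ts, value) in offline.latest().items():
        online.put(entity_id, ts, value)


offline = OfflineStore()
offline.write("u1", datetime(2025, 11, 4, 12), 20.0)
offline.write("u1", datetime(2025, 11, 4, 15), 25.0)

online = OnlineStore()
materialize(offline, online)

historical = offline.as_of("u1", datetime(2025, 11, 4, 14))  # value as of 14:00
current = online.get("u1")  # latest value only
```

The offline lookup answers the "as of 14:00" question from history, while the online store has already forgotten everything except the most recent write.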
Train/serve consistency
This is the section the interviewer is actually grading. Get it right and the loop softens; get it wrong and "fundamental understanding of production ML" goes on your debrief in red.
The problem in one sentence: during training you compute features with heavyweight SQL against a warehouse over weeks of history; during serving you need the same feature for one user in single-digit milliseconds. If the two paths use different code, the values drift, and the model that scored AUC 0.91 offline scores nothing useful online.
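One way to check for this drift, assuming the serving path logs the feature values it actually used, is to recompute the same features offline for the same entity and request timestamp and compare. The function name `skew_report` and the sample values are hypothetical.

```python
import math


def skew_report(offline_values, online_logged, rel_tol=1e-6):
    """Compare feature values keyed by (entity_id, request_ts).

    offline_values: values recomputed in the warehouse for each key.
    online_logged:  values the serving path logged at request time.
    """
    mismatches = []
    missing = []
    for key, online_value in online_logged.items():
        if key not in offline_values:
            missing.append(key)
            continue
        offline_value = offline_values[key]
        if not math.isclose(offline_value, online_value, rel_tol=rel_tol):
            mismatches.append((key, offline_value, online_value))
    compared = len(online_logged) - len(missing)
    rate = len(mismatches) / compared if compared else 0.0
    return {"mismatch_rate": rate, "mismatches": mismatches, "missing": missing}


# Made-up sample: u2 disagrees between paths, u3 was never recomputed offline.
offline_values = {("u1", "t1"): 10.0, ("u2", "t1"): 5.0}
online_logged = {("u1", "t1"): 10.0, ("u2", "t1"): 7.5, ("u3", "t1"): 1.0}

report = skew_report(offline_values, online_logged)
```

A non-zero mismatch rate on a feature is a direct pointer to where the two code paths diverge.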
Three concrete sources of skew show up in interviews:
- Definition skew: the training query and the serving code compute slightly different things. The fix is a single declarative feature definition that both paths consume.
- Temporal skew: the training data accidentally includes information from after the prediction timestamp. This is the classic leakage bug, and it is the reason point-in-time joins exist. A point-in-time join asks "what was the feature value as of the moment the prediction would have been made", not "as of right now" and not "as of when the label arrived".
- Freshness skew: the online value is stale because materialization is late or broken. The fix is staleness monitoring as a first-class SLO, with alarms when online lags offline by more than the agreed budget (typically 5 to 15 minutes for batch features, under 30 seconds for streaming).
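The staleness SLO reduces to a comparison between each feature's last successful materialization and its budget. A minimal sketch follows; `stale_features` is a hypothetical helper, `txn_count_1m` is a made-up streaming feature name, and the budgets are taken from the ranges quoted above.

```python
from datetime import datetime, timedelta, timezone


def stale_features(last_materialized, budgets, now):
    """Return {feature_name: lag} for features whose lag exceeds their budget."""
    breaches = {}
    for name, materialized_at in last_materialized.items():
        lag = now - materialized_at
        if lag > budgets[name]:
            breaches[name] = lag
    return breaches


now = datetime(2025, 11, 4, 14, 0, tzinfo=timezone.utc)

# When each feature was last written to the online store (sample values).
last_materialized = {
    "avg_orders_30d": now - timedelta(minutes=40),  # batch feature, running late
    "txn_count_1m": now - timedelta(seconds=10),    # streaming feature, healthy
}
budgets = {
    "avg_orders_30d": timedelta(minutes=15),
    "txn_count_1m": timedelta(seconds=30),
}

breaches = stale_features(last_materialized, budgets, now)
```

In a real system the returned breaches would feed an alerting pipeline; here they are simply returned so the check itself stays easy to reason about.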
Pseudo-Python for a definition both paths consume:
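```python
from datetime import timedelta

# `feature_view` is a hypothetical decorator standing in for whatever
# declarative API the feature store exposes; it is not a real library call.
@feature_view(
    entities=["user_id"],
    ttl=timedelta(days=1),
    online=True,
    offline=True,
)
def avg_orders_30d(user_id):
    # Average order amount per user over the 30 days ending at @as_of_ts.
    return """
        SELECT user_id, AVG(amount) AS value
        FROM orders
        WHERE event_ts BETWEEN @as_of_ts - INTERVAL 30 DAY AND @as_of_ts
        GROUP BY user_id
    """
```

Offline runs the SQL with `@as_of_ts` set to the prediction timestamp of each training row. Online materializes the same SQL on a schedule and stores the latest result in Redis. Same definition, two engines, one source of truth.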
Tools: Feast vs Tecton vs Hopsworks vs in-house
You will probably be asked which tool you would pick. No single right answer, but two wrong ones: "I would build it myself" when the scenario justifies an off-the-shelf system, and recommending Tecton for a team running one model.
| Tool | License | Strength | Weakness | Best fit |
|---|---|---|---|---|
| Feast | Open source (Apache 2.0) | Simple, declarative, plug-and-play backends | No compute layer — you bring the pipelines | Mid-size teams with existing warehouse and Redis |
| Tecton | Commercial | End-to-end: compute + storage + monitoring | Pricey, vendor lock-in | Series-C and up shops without an ML platform team |
| Hopsworks | Open source + enterprise | Strong on point-in-time correctness, on-prem friendly | Smaller community than Feast | EU shops with data-residency rules |
| SageMaker / Vertex | Cloud-managed | Tight integration with AWS / GCP ML stack | Hard to leave the cloud later | Teams already all-in on one cloud |
| In-house Postgres + Redis | Free, your time | Total control, no abstraction tax | You will rebuild point-in-time joins, badly | Startups with fewer than 5 models and a strong engineer |
Feast example — what most candidates have actually touched:
```python
from datetime import timedelta

from feast import BigQuerySource, Entity, FeatureView, Field
from feast.types import Float32

user = Entity(name="user_id")

avg_orders = FeatureView(
    name="user_avg_orders_30d",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[Field(name="value", dtype=Float32)],
    # Pre-aggregated table built by an upstream pipeline, not by Feast.
    source=BigQuerySource(table="metrics.user_avg_orders"),
    online=True,
)
```

The same definition is read by training (point-in-time join over history) and by serving (`store.get_online_features(...)` hitting Redis).
Gotcha: Feast does not compute features. The BigQuerySource is a pre-aggregated table somebody else built. Say "Feast handles the SQL aggregations" and interviewers will press you — Tecton handles that, Feast does not.
When you actually need one
A feature store is not free — it adds infrastructure, on-call surface area, and a learning curve. It starts paying off when at least three of these are true: two or more models share features, real-time predictions ship in the product, train/serve skew has already burned you, the ML team crosses five engineers, and the warehouse-to-Redis glue has grown into bespoke scripts nobody trusts.
It is overkill when there is one model, batch inference only, fewer than ten features, or you are pre-revenue and still validating the idea. In those cases a well-named SQL view plus a nightly export to Parquet is the right architecture, full stop.
Common pitfalls
Senior interviewers love these — they separate candidates who read the Feast docs from candidates who got paged at 3am for a stale feature.
The first pitfall is treating the feature store as a model quality tool. It is not. A feature store guarantees the model sees the same input in train and serve; it says nothing about whether those inputs are predictive. Teams that ship a feature store expecting accuracy to go up will be disappointed. Feature stores reduce skew, they do not improve features.
The second is forgetting TTL on the online store. Without time-to-live, every key written stays in Redis or DynamoDB forever, and within a quarter the store outgrows its memory budget. Set a TTL — typically 24 to 72 hours for user-level features — longer than the materialization cadence but shorter than the natural churn of the entity. This sounds boring until you get paged for "OOM on the feature-serving cluster" at 2am on a Saturday.
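Redis and DynamoDB both offer native key expiry, so in practice the TTL is a setting rather than code you write. The in-memory sketch below only illustrates the behavior that setting buys; `TTLOnlineStore` is a hypothetical class and the 48-hour TTL is an arbitrary value inside the range mentioned above.

```python
from datetime import datetime, timedelta


class TTLOnlineStore:
    """Toy online store where entries expire ttl after they were written."""

    def __init__(self, ttl):
        self._ttl = ttl
        self._rows = {}  # key -> (written_at, value)

    def put(self, key, value, now):
        self._rows[key] = (now, value)

    def get(self, key, now):
        """Return the value if present and not expired, else None."""
        row = self._rows.get(key)
        if row is None:
            return None
        written_at, value = row
        if now - written_at > self._ttl:
            del self._rows[key]  # lazily drop the expired entry
            return None
        return value

    def evict_expired(self, now):
        """Drop all expired entries and return how many were removed."""
        expired = [
            key
            for key, (written_at, _) in self._rows.items()
            if now - written_at > self._ttl
        ]
        for key in expired:
            del self._rows[key]
        return len(expired)

    def __len__(self):
        return len(self._rows)


store = TTLOnlineStore(ttl=timedelta(hours=48))
t0 = datetime(2025, 11, 1)
store.put("u1", 20.0, now=t0)                        # user who then goes quiet
store.put("u2", 35.0, now=t0 + timedelta(hours=40))  # user refreshed later

later = t0 + timedelta(hours=50)
evicted = store.evict_expired(later)
```

Without the expiry check, `u1` would sit in memory indefinitely even though nothing refreshes it, which is exactly how the store outgrows its budget.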
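The third is skipping point-in-time correctness during training data generation. Most tutorials write the join as `JOIN features ON user_id` and stop there, silently leaking future information into the training set. The model looks brilliant offline and fails on launch. Always make the join point-in-time: `JOIN features ON user_id AND feature_ts <= label_ts AND feature_ts > label_ts - ttl`, then keep only the most recent matching feature row for each label row. Here `label_ts` is the timestamp of the training row, meaning the moment the prediction would have been made, not the moment the outcome was observed. Every feature store worth using has a helper for this.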
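The join condition above can be sketched in plain Python with a binary search per label row. This is an illustration under assumptions, not a feature store helper: `point_in_time_join` is a hypothetical function and the rows are made-up sample data.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_join(labels, features, ttl):
    """Attach to each label row the latest feature value that satisfies
    label_ts - ttl < feature_ts <= label_ts, or None if there is none.

    labels:   [(user_id, label_ts, label)]
    features: [(user_id, feature_ts, value)]
    """
    by_user = {}
    for user_id, feature_ts, value in sorted(features, key=lambda r: (r[0], r[1])):
        by_user.setdefault(user_id, []).append((feature_ts, value))

    rows = []
    for user_id, label_ts, label in labels:
        history = by_user.get(user_id, [])
        timestamps = [ts for ts, _ in history]
        idx = bisect_right(timestamps, label_ts)  # rows with feature_ts <= label_ts
        value = None
        if idx:
            feature_ts, candidate = history[idx - 1]
            if feature_ts > label_ts - ttl:  # reject values older than the TTL
                value = candidate
        rows.append((user_id, label_ts, value, label))
    return rows


features = [
    ("u1", datetime(2025, 11, 1), 20.0),
    ("u1", datetime(2025, 11, 3), 25.0),
    ("u1", datetime(2025, 11, 6), 90.0),  # written after the label timestamp
    ("u2", datetime(2025, 10, 1), 5.0),   # older than the TTL allows
]
labels = [
    ("u1", datetime(2025, 11, 4), 1),
    ("u2", datetime(2025, 11, 4), 0),
]

training_set = point_in_time_join(labels, features, ttl=timedelta(days=2))
```

A naive join on `user_id` alone would have paired the `u1` label with the 90.0 value written two days later, which is the leak the timestamp conditions exist to prevent.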
The fourth is doing heavy computation in the online path. The online store is a lookup, not a compute layer. If a feature requires a 30-day window over a billion rows, the aggregation belongs in the offline materialization job and the result belongs in Redis. Running that SQL at request time will blow your p99 latency budget by three orders of magnitude and fail the system-design round.
The fifth is ignoring backfill before launch. A new feature does not exist historically; if you turn it on for serving and immediately start training on it, you have one day of data and a useless feature. Backfill the offline store across the relevant history, validate the distribution, enable online materialization, then let models consume it. Treat backfill as a deploy step, not an afterthought.
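The backfill-then-validate sequence can be sketched as two small functions. Everything here is illustrative: `backfill` and `validate` are hypothetical helpers, the coverage threshold and value range are placeholder assumptions, and `fake_feature` stands in for the real feature computation.

```python
from datetime import date, timedelta


def backfill(compute_feature, entity_ids, start, end):
    """Compute the feature as of each day in [start, end] for every entity."""
    rows = []
    day = start
    while day <= end:
        for entity_id in entity_ids:
            rows.append((entity_id, day, compute_feature(entity_id, day)))
        day += timedelta(days=1)
    return rows


def validate(rows, min_coverage=0.95, value_range=(0.0, 10_000.0)):
    """Gate before enabling online materialization.

    Checks that enough rows are non-null and that values fall in a sane range.
    Both thresholds are placeholder assumptions, not recommended defaults.
    """
    values = [value for _, _, value in rows]
    non_null = [value for value in values if value is not None]
    coverage = len(non_null) / len(values) if values else 0.0
    low, high = value_range
    in_range = all(low <= value <= high for value in non_null)
    return coverage >= min_coverage and in_range


def fake_feature(entity_id, as_of_day):
    """Stand-in for the real feature computation as of a given day."""
    return float(as_of_day.day)


history = backfill(fake_feature, ["u1", "u2"], date(2025, 10, 1), date(2025, 10, 7))
ready_for_online = validate(history)
```

The point of the gate is ordering: online materialization and model consumption happen only after `validate` passes on the backfilled history.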
FAQ
Is Feast production-ready, or should I just use Tecton?
Feast has been in production at Tubi, Robinhood, and a long tail of mid-size ML teams. It is production-ready for the case it targets: a team that already has a warehouse and a key-value store and needs a declarative layer to unify training and serving. It is not an ML platform — you still need Airflow or dbt or Flink to actually produce the feature values. Tecton bundles that compute layer and is the right pick when you do not have a platform team, but you pay for it. The decision is mostly about whether you already own the pipelines.
What is the difference between a feature store and a data warehouse?
A warehouse stores arbitrary data for arbitrary consumers. A feature store stores ML features specifically and adds two things the warehouse does not natively offer: a serving path with millisecond latency, and point-in-time-correct joins that prevent label leakage. Think of the feature store as a thin, opinionated layer above the warehouse. Most feature stores are physically backed by the warehouse for the offline half and by a key-value store for the online half.
Do I need a feature store if I only do batch inference?
Probably not. If every prediction is generated by a nightly job that reads the warehouse and writes a table, you already have a feature store — it is called your warehouse. Adding Feast or Tecton here buys you a declarative catalog and lineage, which is nice for governance but does not solve a real production problem. Wait until you ship a real-time use case before paying the complexity tax.
How does train/serve consistency actually fail in practice?
The most common pattern is two engineers, two implementations. The data scientist writes the training query in SQL against the warehouse. The backend engineer translates that into a Python function that reads recent events from Kafka, aggregates them in memory, and returns the value. The two implementations agree on the happy path but disagree on edge cases — null handling, timezone of the event timestamp, whether refunded transactions count, whether the 30-day window is rolling or anchored to UTC midnight. Each disagreement is a small bias that compounds. A feature store eliminates this by making both paths consume one definition.
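The rolling versus midnight-anchored disagreement is easy to reproduce. The sketch below uses made-up orders and two hypothetical window functions; a single order placed a few hours before the rolling boundary is enough to make the two implementations return different values for the same user at the same moment.

```python
from datetime import datetime, timedelta, timezone


def rolling_window_start(as_of):
    """Implementation A: window is exactly the last 30 * 24 hours."""
    return as_of - timedelta(days=30)


def midnight_anchored_start(as_of):
    """Implementation B: window starts at UTC midnight 30 days before today."""
    midnight = as_of.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=30)


def avg_amount(orders, start, end):
    """Mean amount of orders with start <= ts <= end, or None if empty."""
    amounts = [amount for ts, amount in orders if start <= ts <= end]
    return sum(amounts) / len(amounts) if amounts else None


as_of = datetime(2025, 11, 4, 14, 0, tzinfo=timezone.utc)
orders = [
    (datetime(2025, 10, 5, 9, 0, tzinfo=timezone.utc), 100.0),  # before rolling start
    (datetime(2025, 11, 1, 12, 0, tzinfo=timezone.utc), 20.0),
]

rolling_value = avg_amount(orders, rolling_window_start(as_of), as_of)
anchored_value = avg_amount(orders, midnight_anchored_start(as_of), as_of)
```

Both functions are defensible readings of "30-day window", which is why the mismatch survives code review and only shows up as a gap between offline and online metrics.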
Are these tools required for the interview, or can I just describe the concept?
You can pass most interviews by describing the concept: dual stores, single definition, point-in-time correctness, materialization. Naming Feast or Tecton signals familiarity but is not required. What is required is answering follow-ups — "what backend for online", "how to detect a stale feature", "cost trade-off of every five minutes vs hourly". Tools are vocabulary; architecture is the answer.