Federated learning for the Data Science interview
Why federated learning shows up in DS interviews
An interviewer at Google, Apple, or a healthcare startup asks you to whiteboard a recommendation system for a mobile keyboard, and halfway through drops the line: "By the way, we can't ship raw keystrokes to a server." That is the moment federated learning stops being an exotic paper and becomes the load-bearing trick of the whole answer. The model has to learn from millions of devices without the data ever leaving them.
The pattern repeats across hospital networks, banks running fraud detection across institutions, and any consumer app touching keystrokes, locations, or biometrics. Interviewers ask about FL because they want to know if you can reason about gradient aggregation, secure aggregation, and differential privacy without confusing them. They are also testing whether you understand the cost: federated training is slow, communication-bound, and statistically painful when client data is non-IID.
This post is the version of the topic that will get you through a senior DS loop at Apple, Google, or a regulated-industry startup. The bar is not memorizing the FedAvg pseudocode. The bar is explaining why every line of it exists, what breaks in practice, and what privacy guarantees the system actually provides.
The FedAvg algorithm in detail
FedAvg, introduced by McMahan et al. in 2017 at Google, is still the baseline every federated system is benchmarked against. The setup is a server holding a global model and a population of clients holding local data shards. In each round, the server samples a subset of clients, ships them the current weights, lets each one run a few local SGD epochs, and then averages the returned updates weighted by sample count.
1. Server initializes global weights w_0.
2. For each round t = 0, 1, 2, ...:
   a. Server samples K clients from the population.
   b. Server broadcasts w_t to those K clients.
   c. Each client i runs E local epochs of SGD on its n_i samples,
      producing local weights w_i_{t+1}.
   d. Each client sends w_i_{t+1} (or the delta) back to the server.
   e. Server aggregates:
      w_{t+1} = sum(n_i * w_i_{t+1}) / sum(n_i)
3. Repeat until convergence or budget is exhausted.

The two knobs that matter are E (local epochs per round) and K (clients per round). Push E up and you save bandwidth but drift each client further from the global optimum, which hurts convergence on non-IID data. Push K up and you reduce variance per round but each round becomes slower because the server waits for the slowest device.
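Steps 2c through 2e can be sketched in plain Python. This is a minimal illustration, not a production implementation: the toy model (a 1-D linear regression), the function names, and the choice to let every client participate in every round are all assumptions made for the example.

```python
def fedavg_aggregate(client_weights, client_sizes):
    """Step 2e: average client weight vectors, weighted by sample count n_i."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(n * w[j] for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def local_sgd(w, data, epochs, lr):
    """Step 2c: E epochs of per-sample SGD on squared error for y = w[0]*x + w[1]."""
    w = list(w)  # copy so the global weights are not mutated
    for _ in range(epochs):
        for x, y in data:
            err = w[0] * x + w[1] - y
            w[0] -= lr * err * x
            w[1] -= lr * err
    return w


def fedavg_round(global_w, clients, epochs=1, lr=0.05):
    """One round in which every client participates (K = number of clients)."""
    local = [local_sgd(global_w, data, epochs, lr) for data in clients]
    sizes = [len(data) for data in clients]
    return fedavg_aggregate(local, sizes)


# Two hypothetical clients whose data all comes from y = 2x + 1.
clients = [
    [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)],
    [(1.5, 4.0), (2.0, 5.0)],
]
w = [0.0, 0.0]
for _ in range(500):
    w = fedavg_round(w, clients)
```

Because the toy data is noise-free and consistent across clients, the global weights settle near slope 2 and intercept 1. The interesting behavior starts when client distributions disagree.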
Load-bearing trick: FedAvg is just weighted federated SGD with multiple local steps. Everything fancy on top — FedProx, SCAFFOLD, FedNova — is a correction term that fights client drift when local data distributions disagree.
A few variations worth naming if asked: FedProx adds a proximal term (mu/2) * ||w_i - w_t||^2 to the local loss so clients cannot drift too far. SCAFFOLD maintains control variates that estimate the direction of client drift and correct for it. FedNova normalizes updates by the number of local steps so heterogeneous clients do not over-contribute.
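The FedProx proximal term is easy to show in code: its gradient is mu * (w_i - w_t), which is added to the ordinary loss gradient at each local step. The sketch below assumes the same toy 1-D linear model as a stand-in for a real network; the function name and hyperparameter values are illustrative only.

```python
def fedprox_local_sgd(global_w, data, epochs, lr, mu):
    """Local SGD with a proximal pull toward the broadcast weights w_t.

    The local loss is squared error plus (mu / 2) * ||w - w_t||^2, so each
    step adds mu * (w - w_t) to the gradient. Setting mu = 0 gives plain
    FedAvg-style local SGD.
    """
    w = list(global_w)
    for _ in range(epochs):
        for x, y in data:
            err = w[0] * x + w[1] - y
            grad = [err * x, err]
            w = [
                wj - lr * (gj + mu * (wj - gw))
                for wj, gj, gw in zip(w, grad, global_w)
            ]
    return w


# Hypothetical client whose data follows y = 4x + 1.
data = [(1.0, 5.0), (2.0, 9.0)]
plain = fedprox_local_sgd([0.0, 0.0], data, epochs=2, lr=0.05, mu=0.0)
prox = fedprox_local_sgd([0.0, 0.0], data, epochs=2, lr=0.05, mu=1.0)
```

With the same data and step size, the `mu=1.0` run ends closer to the broadcast weights than the `mu=0.0` run, which is exactly the drift-limiting effect the proximal term is meant to have.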
Privacy techniques layered on top
Vanilla FedAvg leaks information. A model update is essentially a noisy summary of the client's local gradient, and recent attack literature shows you can reconstruct training images from gradients in some settings. So in any serious deployment, FL is paired with cryptographic and statistical privacy mechanisms.
| Technique | What it protects | Cost | Where it shines |
|---|---|---|---|
| Secure aggregation | Server cannot see individual updates, only the sum | Extra crypto round trips, modest compute | Cross-device with millions of phones |
| Differential privacy (client-level) | Bounds info leak about any single client | Accuracy drops, often 1-5 percentage points | Regulated data, keyboard models |
| Homomorphic encryption | Server computes on ciphertext directly | Extremely expensive, often 100-1000x | Small cross-silo, high-stakes data |
| Trusted execution environments | Hardware enclave isolates computation | Hardware lock-in, side-channel risk | Bank-to-bank, vendor-controlled silos |
Secure aggregation is the bread and butter of cross-device FL at Google. The protocol uses pairwise masks that cancel when the masked updates are summed, so the server sees the sum of updates but no individual contribution. Phones lose connectivity mid-round all the time, so dropout resilience is built into the protocol: the sum can still be recovered as long as a threshold of clients completes the round.
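The mask-cancellation idea can be shown with a toy example. This sketch is not a secure protocol: a seeded PRNG stands in for the secret each pair of clients would derive through key agreement, updates are assumed to be already quantized to non-negative integers, and dropout handling is omitted entirely. All names are hypothetical.

```python
import random

MODULUS = 2**32  # arithmetic is done modulo 2^32 so masks cancel exactly


def pairwise_mask(i, j, dim, round_seed):
    """Mask shared by clients i < j (a seeded PRNG stands in for a shared secret)."""
    rng = random.Random(f"{round_seed}:{i}:{j}")
    return [rng.randrange(MODULUS) for _ in range(dim)]


def mask_update(client_id, update, all_ids, round_seed):
    """Add the mask for every higher-id peer, subtract it for every lower-id peer."""
    masked = list(update)
    for other in all_ids:
        if other == client_id:
            continue
        lo, hi = min(client_id, other), max(client_id, other)
        mask = pairwise_mask(lo, hi, len(update), round_seed)
        sign = 1 if client_id == lo else -1
        masked = [(v + sign * m) % MODULUS for v, m in zip(masked, mask)]
    return masked


def server_sum(masked_updates):
    """The server only ever adds up masked vectors."""
    dim = len(masked_updates[0])
    return [sum(u[j] for u in masked_updates) % MODULUS for j in range(dim)]


ids = [0, 1, 2]
updates = {0: [1, 2, 3], 1: [10, 20, 30], 2: [100, 200, 300]}
masked = [mask_update(i, updates[i], ids, round_seed=7) for i in ids]
total = server_sum(masked)  # equals the element-wise sum of the raw updates
```

Each masked vector looks like random noise on its own, but every mask is added once and subtracted once across the cohort, so the sum comes out clean.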
Differential privacy typically comes in two flavors. Local DP adds noise on the client before the update leaves the device; the privacy guarantee is strong but accuracy suffers heavily. Central DP adds noise after aggregation; it requires trusting the aggregator, which is a stronger trust assumption, but gives much better accuracy. Production keyboard models usually run central DP with epsilon between 1 and 10 per training run, depending on how aggressively the legal team prices the privacy budget.
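A central-DP aggregation step usually has two parts: clip each client's update to a fixed L2 norm, then add Gaussian noise to the sum. The sketch below shows only that mechanism. It does not compute epsilon (that needs a privacy accountant), and the function names and the `noise_multiplier` parameter are assumptions for illustration.

```python
import math
import random


def clip_l2(update, clip_norm):
    """Scale the update down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def dp_average(updates, clip_norm, noise_multiplier, rng):
    """Clip every client update, sum, add Gaussian noise, then average.

    The noise standard deviation is noise_multiplier * clip_norm, which is
    the sensitivity of the clipped sum to any single client.
    """
    clipped = [clip_l2(u, clip_norm) for u in updates]
    sigma = noise_multiplier * clip_norm
    dim = len(updates[0])
    noisy_sum = [
        sum(u[j] for u in clipped) + rng.gauss(0.0, sigma) for j in range(dim)
    ]
    return [s / len(updates) for s in noisy_sum]


rng = random.Random(0)
updates = [[3.0, 4.0], [0.3, 0.4], [-0.6, 0.8]]
private_avg = dp_average(updates, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Clipping is what makes the noise scale meaningful: without a bound on each client's contribution, no finite amount of noise gives a guarantee.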
Cross-device vs cross-silo
The two FL deployment regimes look identical on the whiteboard and almost nothing alike in production. Interviewers love asking you to compare them because the answer reveals whether you have actually thought about scale.
In cross-device FL, you have millions to hundreds of millions of clients, each holding a tiny shard of data — a few hundred to a few thousand examples. Phones, watches, smart speakers. Connectivity is intermittent, devices are heterogeneous, and the population shifts as people charge their phones or switch them off. You sample maybe 100 to 1,000 clients per round out of the millions available. Examples in the wild: Google Gboard's next-word prediction, Apple's on-device personalization, Samsung's voice models.
In cross-silo FL, you have a handful of clients — maybe 5 to 50 organizations — each holding a large dataset. Hospitals collaborating on a tumor classifier. Banks pooling fraud signals without sharing customer rows. Clients are stable, well-resourced servers, almost always available, and every round can include every silo. The hard problems shift: governance, contractual trust, and IID violations because each organization has a biased view of the global population.
Sanity check: If the interviewer says "10 clients, each with millions of records," you are in cross-silo, and the cross-device machinery of client sampling and dropout-resilient secure aggregation is overkill. If they say "100 million clients, each with 200 records," you are in cross-device and every design decision is dictated by communication cost.
Production applications
Mobile keyboards are the canonical case. Gboard trains next-word and emoji prediction models on what users actually type without ever uploading the keystrokes. The model improves week over week, the keystrokes never leave the device, and Google publishes the privacy parameters.
Medical research is the other obvious fit. Several hospitals can train a tumor segmentation model collaboratively without sharing imaging data, which would otherwise be blocked by HIPAA, GDPR, and institutional review boards. The NVIDIA Clara FL platform and the Owkin consortium are both production examples.
Cross-bank fraud detection is a growing cross-silo use case. No bank wants to hand over transaction history, but each one wants the benefit of a model trained on the global fraud surface. FL plus differential privacy lets them pool gradient signal without pooling raw data.
IoT and edge is the long tail — wearables, smart cameras, industrial sensors. Anywhere bandwidth is expensive or the raw signal is sensitive (audio, video, location), federated learning beats centralizing the data.
Common pitfalls
The first pitfall is conflating federated learning with privacy. FL is a training paradigm; on its own it provides no formal privacy guarantee. Even a single model update leaks information about the underlying data, and several attack papers show you can reconstruct training samples from gradients in small-batch settings. The fix is to always pair FL with secure aggregation, differential privacy, or both, and to know which threat model each one defends against.
The second pitfall is ignoring non-IID data. Textbook FedAvg assumes clients have data drawn from the same distribution. In reality, each phone, each hospital, and each bank sees a heavily skewed slice. When data is non-IID, local SGD drifts away from the global optimum, and naive averaging produces a worse model than centralized training would. Senior interviewers expect you to bring this up unprompted and to name FedProx, SCAFFOLD, or personalized FL as mitigations.
A third trap is underestimating communication cost. Each round of FedAvg requires shipping a full model copy to every selected client and pulling updates back. For a 100M-parameter model in float32 and 1,000 clients per round, that is about 400 MB down and 400 MB up per client, or roughly 800 GB per round. Solutions are gradient compression, sparsification, and quantization, plus careful tuning of how many local epochs E you can squeeze in before drift dominates.
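Top-k sparsification is the simplest of these to sketch: the client sends only the k largest-magnitude entries of its update as (index, value) pairs, and the server rebuilds a dense vector with zeros elsewhere. The function names below are hypothetical, and the sketch leaves out the residual tracking that practical schemes typically add so dropped entries are not lost for good.

```python
def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries as sorted (index, value) pairs."""
    order = sorted(range(len(update)), key=lambda j: abs(update[j]), reverse=True)
    return sorted((j, update[j]) for j in order[:k])


def densify(pairs, dim):
    """Server side: rebuild a dense update, filling dropped entries with 0."""
    dense = [0.0] * dim
    for j, v in pairs:
        dense[j] = v
    return dense


update = [0.1, -2.0, 0.05, 1.5]
sparse = top_k_sparsify(update, k=2)   # two pairs instead of four floats
restored = densify(sparse, dim=len(update))
```

On a real model the ratio is far more dramatic: keeping 1% of entries cuts the upload by roughly two orders of magnitude, minus the overhead of sending indices.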
A fourth pitfall is assuming clients are honest. In cross-device, some fraction of phones will return garbage updates due to bugs, hardware faults, or malicious users trying to poison the model. Robust aggregation like median, trimmed mean, or Krum is needed in adversarial settings. Even then, the per-client norm clipping that differential privacy relies on is what bounds the worst-case damage a single client can inflict.
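Two of the robust aggregators named above fit in a few lines. This is a minimal sketch operating coordinate by coordinate on plain Python lists; the function names are illustrative, and Krum is left out because it needs pairwise distances between whole updates.

```python
import statistics


def coordinate_median(updates):
    """Take the median of each coordinate across clients."""
    return [statistics.median(col) for col in zip(*updates)]


def trimmed_mean(updates, trim):
    """Drop the `trim` smallest and `trim` largest values per coordinate, then average."""
    out = []
    for col in zip(*updates):
        kept = sorted(col)[trim:len(col) - trim]
        out.append(sum(kept) / len(kept))
    return out


# Three honest clients and one that returns a garbage update.
updates = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [100.0, -100.0]]
naive = [sum(col) / len(col) for col in zip(*updates)]
robust_median = coordinate_median(updates)
robust_trimmed = trimmed_mean(updates, trim=1)
```

The plain mean is dragged far off by the single bad client, while both robust aggregates stay close to the honest updates.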
The fifth, most quietly fatal pitfall is evaluating only on a central holdout. A model that averages well on a server-side test set can still perform terribly on the device it was trained for, because the global average smooths over personalization. Production FL teams ship a per-client evaluation and report distribution statistics — median accuracy, 10th percentile, 90th percentile — not just the mean.
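A per-client report of this kind can be sketched with the standard library. The accuracy values below are made up for illustration, and the report fields are an assumption about what such a summary might contain.

```python
import statistics


def per_client_report(accuracies):
    """Summarize the distribution of per-client accuracy, not just its mean."""
    deciles = statistics.quantiles(accuracies, n=10, method="inclusive")
    return {
        "mean": statistics.fmean(accuracies),
        "p10": deciles[0],
        "median": statistics.median(accuracies),
        "p90": deciles[8],
        "worst": min(accuracies),
    }


# Hypothetical per-client accuracies: most clients do well, one does not.
accuracies = [0.50, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99]
report = per_client_report(accuracies)
```

Here the mean sits below the median because of the single struggling client, which is the kind of tail a central holdout number hides.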
Related reading
- MLOps for the Data Science interview
- MLOps monitoring for DS interviews
- Bias and fairness for DS interviews
- ML latency optimization
- Bayesian optimization interview
FAQ
Is federated learning a privacy mechanism by itself?
No. FL only changes where computation happens — it moves training onto client devices instead of centralizing raw data. It says nothing about what an attacker can infer from the gradients themselves. To get a real privacy guarantee, you layer secure aggregation (so the server cannot see individual updates) and differential privacy (so the aggregate itself does not leak too much about any one client) on top. Treating FL as automatic privacy is one of the fastest ways to fail a senior interview question on this topic.
Why does FedAvg fail on non-IID data?
Each client runs multiple local SGD steps before reporting back. When their local data distribution differs from the global one, those local steps pull the weights toward a local optimum that is far from the global one. Averaging those drifted weights produces an update that is not the average gradient — it is a noisy compromise that can actually move the global model in the wrong direction. FedProx, SCAFFOLD, and FedNova all add correction terms that fight this drift in different ways. The empirical takeaway: more local epochs E is great when data is IID and dangerous when it is not.
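Client drift shows up even in a one-dimensional toy problem. The sketch below assumes two clients with equal sample counts and quadratic losses f_i(w) = 0.5 * a_i * (w - c_i)^2, so the true global optimum is sum(a_i * c_i) / sum(a_i). Everything here (the losses, the constants, the function names) is a made-up illustration, not a benchmark.

```python
def local_gd(w, a, c, steps, lr):
    """Run `steps` gradient steps on the local loss 0.5 * a * (w - c)^2."""
    for _ in range(steps):
        w -= lr * a * (w - c)
    return w


def run_fedavg(clients, local_steps, rounds=200, lr=0.1, w0=0.0):
    """FedAvg with equal client weights on scalar quadratic losses."""
    w = w0
    for _ in range(rounds):
        local = [local_gd(w, a, c, local_steps, lr) for a, c in clients]
        w = sum(local) / len(local)
    return w


# (curvature a_i, local optimum c_i): the clients disagree strongly.
clients = [(1.0, 0.0), (4.0, 10.0)]
global_opt = sum(a * c for a, c in clients) / sum(a for a, _ in clients)

one_step = run_fedavg(clients, local_steps=1)     # converges to global_opt
many_steps = run_fedavg(clients, local_steps=20)  # settles between 5 and 6
```

With one local step per round, FedAvg is ordinary gradient descent on the average loss and lands on the global optimum of 8. With 20 local steps, each client nearly reaches its own optimum before reporting, so the average is pulled toward the plain mean of the local optima (5) and stays stuck there no matter how many rounds you run.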
How does secure aggregation actually work?
The core construction uses pairwise masks generated from shared secrets between pairs of clients. Each client adds masks to its update such that, when all updates are summed, the masks cancel out to zero. The server sees the unmasked sum but cannot recover any individual contribution. Real protocols add dropout resilience via Shamir secret sharing, so the system still works if a fraction of clients disconnect mid-round. This is the protocol that powers Gboard at production scale.
What is the relationship between FL and differential privacy?
They are orthogonal but composable. FL controls where data lives. DP bounds how much information about any one record leaks into the model. You can run FL without DP (privacy guarantee depends on trust assumptions), DP without FL (centralized training with noisy gradients), or both together (the strongest combination, used by Google for Gboard). Production systems almost always combine them, with central DP applied to the aggregated update on the server for the best accuracy-privacy tradeoff.
What is the typical communication overhead of FL?
For a model with M parameters in float32 and K clients per round, each round costs roughly 2 * M * 4 bytes * K of communication (download plus upload). For a 50M-parameter model with 1,000 clients per round, that is about 400GB per round. Hundreds of rounds are typical to convergence. This is why gradient compression, quantization to int8 or smaller, and update sparsification are not optional in serious deployments — they are the difference between a system that trains in a week and one that never finishes.
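The per-round formula is easy to check in a few lines. The function name and the `bytes_per_param` default are assumptions for the example; the formula itself is the one stated above.

```python
def round_traffic_bytes(num_params, clients_per_round, bytes_per_param=4):
    """Total bytes moved in one round: download plus upload for every client."""
    return 2 * num_params * bytes_per_param * clients_per_round


float32_bytes = round_traffic_bytes(50_000_000, 1_000)
int8_bytes = round_traffic_bytes(50_000_000, 1_000, bytes_per_param=1)

float32_gb = float32_bytes / 1e9  # 400.0
int8_gb = int8_bytes / 1e9        # 100.0
```

Quantizing both directions to int8 cuts the 400 GB figure to 100 GB per round before any sparsification is applied.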
When should I not use federated learning?
When the data can legally and ethically be centralized, do not use FL. Centralized training is faster, more accurate, easier to debug, and supports richer architectures. FL is the right tool when you face a hard constraint — regulation, contract, or product positioning — that prevents you from collecting raw data. The rough cost of going federated is a 10-100x slowdown in iteration speed and a 1-5 percentage point accuracy hit, both of which are worth it when the alternative is shipping nothing at all.