Hallucinations and LLM evals in a Data Science interview
Why this comes up on DS interviews
Every team shipping an LLM feature in 2026 has the same scar tissue: the model sounded confident, and it was wrong. That is what hiring managers at OpenAI, Anthropic, Notion, and every Series B startup with a chatbot are screening for. They want to know whether you can detect, measure, and reduce hallucinations — not just define the word.
The question almost always arrives in two phases. First: "Your RAG pipeline gives wrong answers 8% of the time. How do you debug it?" Then: "How would you set up offline evals so this never regresses?" If your answer is "we'd ask users to thumbs-down," you are out. If you can talk through faithfulness vs answer relevancy, calibrated judges, and pairwise comparisons, you stay in the loop.
Senior candidates separate themselves here. Junior DS memorize that hallucinations exist. Senior DS tell you which decoding temperature they used and how they catch silent regressions when the base model is upgraded.
What a hallucination actually is
A hallucination is output that is fluent, syntactically clean, and assertively confident — but factually or contextually wrong. The fluency makes it dangerous. Users trust well-formed prose, and modern LLMs almost never produce ungrammatical text, so the usual surface cues for "this seems off" do not fire.
There are three flavors worth naming in an interview:
| Type | Example | How you usually catch it |
|---|---|---|
| Factual | "The Eiffel Tower is in London." | Cross-reference against a knowledge base |
| Source attribution | "According to RFC 9999, TLS requires..." (no such RFC) | Citation validators, link checkers |
| Coherent but wrong | Plausible reasoning chain, false conclusion | LLM-as-judge with retrieved ground truth |
Load-bearing distinction: factual hallucinations are about the world, source-attribution hallucinations are about made-up references, and coherent-but-wrong outputs slip past every shallow check because the text reads correct. You need different evals for each.
The interview trap here is conflating all three. If you tell the interviewer "we'd just add a fact-checker," they will ask which type, because a fact-checker that catches factual hallucinations may completely miss coherent-but-wrong reasoning.
Causes of hallucinations
The mechanical answer is that the model is doing what it was trained to do: predict the next token under a probability distribution. It is not lying; it has no internal flag for "I am unsure." But in an interview you need to be specific about which failure mode produces which symptom.
Out-of-distribution queries. The prompt covers a domain the training data barely touched. The model interpolates from related concepts and produces something that looks right but isn't.
Stale knowledge. Training cut-offs leave the model blind to anything after a certain date. Ask about an event from last month and you'll get confident fiction unless the system grounds the answer in fresh retrieved context.
Reasoning chain failures. Multi-step problems compound error at each hop. If errors are independent, a model that gets each sub-step right with 95% accuracy still only nails a 5-step chain about 77% of the time (0.95^5 ≈ 0.77).
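The compounding arithmetic is easy to check. The sketch below assumes each step fails independently, which is a simplification (real reasoning errors are often correlated), and the function name is illustrative.

```python
def chain_success_rate(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain is correct, assuming independent errors."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    return step_accuracy ** num_steps


# How quickly a 95%-per-step model degrades as the chain gets longer.
for steps in (1, 3, 5, 10):
    print(f"{steps:>2} steps: {chain_success_rate(0.95, steps):.3f}")
```

At ten steps the same per-step accuracy leaves roughly a 60% chance of a fully correct chain, which is why multi-hop tasks need their own eval slice.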
Decoding randomness. High temperature plus high top_p produces creative output and creative output is, by construction, less faithful. This is why production summarization runs at temperature 0.0–0.3 and creative writing runs at 0.7–1.0.
Insufficient retrieved context. In RAG, the retriever is often the real villain. If the top-K chunks don't contain the answer, the generator has no grounding and falls back on parametric memory — which is exactly when hallucinations spike.
Mitigation stack
There is no single fix. Real teams stack four to six interventions and measure each in isolation. Walking through the stack in order signals you've shipped this.
Retrieval-augmented generation (RAG) is the first lever. You retrieve relevant documents and instruct the model to ground its answer in them. The retriever does most of the work — a precise retriever beats a clever prompt every time. If retrieval recall is below 80%, no prompt engineering will save you.
Citations. Force the model to cite spans from the retrieved context. This is a verification primitive — it lets a downstream evaluator check whether the cited span actually supports the claim. Claude has native citation support; for other models, post-process with a validator.
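A post-processing validator can start very small. The sketch below only checks that each quoted span really appears in the chunk it points to; it does not check that the span supports the claim, which still needs a judge or a human. The citation format (`chunk_id` and `quote` keys) is an assumption for illustration, not any particular model's output schema.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def validate_citations(citations, chunks):
    """Flag citations whose quoted span is not found verbatim in the cited chunk.

    citations: list of dicts with hypothetical keys "chunk_id" and "quote".
    chunks: dict mapping chunk_id -> retrieved text.
    """
    results = []
    for citation in citations:
        chunk_text = chunks.get(citation["chunk_id"])
        quote = normalize(citation["quote"])
        # An unknown chunk id or an empty quote counts as unsupported.
        supported = bool(quote) and chunk_text is not None and quote in normalize(chunk_text)
        results.append({**citation, "supported": supported})
    return results


chunks = {"doc1": "TLS 1.3 is specified in  RFC 8446."}
citations = [
    {"chunk_id": "doc1", "quote": "specified in RFC 8446"},
    {"chunk_id": "doc1", "quote": "RFC 9999"},   # span not in the chunk
    {"chunk_id": "doc7", "quote": "TLS 1.3"},    # chunk was never retrieved
]
checked = validate_citations(citations, chunks)
print([c["supported"] for c in checked])
```

Exact-match checking is deliberately strict: a paraphrased "quote" fails, which is usually what you want from a verification primitive.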
Lower decoding temperature. For factual tasks, temperature near 0 cuts hallucination rate measurably. The trade-off is repetitive phrasing — usually worth it.
Chain-of-thought (CoT). Reasoning step-by-step improves accuracy on math, logic, and multi-hop QA. The catch: CoT also gives the model more room to compound errors on tasks where reasoning isn't needed. Test it; don't assume it helps.
Verification pass. Run a second model, often smaller and cheaper, to check the first model's answer against the retrieved context. This is the workhorse of every serious production stack and almost nobody mentions it in interviews.
Fine-tuning for honesty. Run RLHF with raters who reward "I don't know" over confident guessing. It is expensive, but it shifts behavior in a way prompting can't.
Constrained generation. When the output should be JSON or match a regex, enforce the schema at decode time. You can't hallucinate a field the constraint forbids.
LLM-as-judge
LLM-as-judge means using a strong model (typically GPT-class or Claude Opus) to grade the output of a smaller or older model. It's the dominant pattern for scalable offline evaluation because human labeling does not scale past a few thousand samples per week.
The pros are obvious: you can run thousands of grading calls overnight against a regression dataset and get a numeric score. The cons are subtler and more important in an interview.
| Concern | What happens | Mitigation |
|---|---|---|
| Family bias | GPT-4 prefers GPT-style output, Claude prefers Claude-style | Use multiple judges from different families, average |
| Position bias | Pairwise judges favor the first answer shown | Randomize order, run both A→B and B→A |
| Verbosity bias | Judges over-reward longer answers | Penalize length in the rubric, or use pairwise |
| Inconsistency | Same input grades differently across runs | Set judge temperature to 0, average ≥3 runs |
| Cost | Strong judges are expensive at scale | Sample to a stratified eval set, not the full traffic |
The single best practice is to prefer pairwise comparison with randomized order to absolute 1–5 scoring. Pairwise is more robust to rubric drift and gives you a directly interpretable win-rate signal.
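One way to handle position bias in a pairwise setup is to ask the judge in both orders and only count a win when the verdict survives the swap. The sketch below assumes a judge is any callable that returns "first" or "second"; the two toy judges are stand-ins for a real model call and exist only to make the example runnable.

```python
from collections import Counter
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    forward = judge(question, answer_a, answer_b)   # A is shown first
    backward = judge(question, answer_b, answer_a)  # B is shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # the verdict flipped with the order, so treat it as position bias


def summarize(judge: Judge, examples) -> Counter:
    """Count verdicts over (question, answer_a, answer_b) triples."""
    return Counter(pairwise_verdict(judge, q, a, b) for q, a, b in examples)


def position_biased_judge(question, first, second):
    return "first"  # always prefers whatever is shown first


def length_judge(question, first, second):
    return "first" if len(first) >= len(second) else "second"


examples = [("What is TLS?", "A transport security protocol.", "A protocol.")]
print(summarize(position_biased_judge, examples))
print(summarize(length_judge, examples))
```

A high tie count is itself a useful signal: it means the judge's preference depends more on ordering than on content.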
Sanity check: before you trust a judge, validate it against at least 200 human-labeled examples. If the judge–human agreement is below 70%, throw the judge out — you're measuring noise.
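The agreement check can be computed in a few lines. The sketch below reports raw agreement, which is what the 70% threshold above refers to, plus Cohen's kappa. Kappa is an addition here, not something the threshold is defined on; it corrects for the agreement you would expect by chance when one label dominates.

```python
from collections import Counter


def agreement_and_kappa(human_labels, judge_labels):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_freq = Counter(human_labels)
    judge_freq = Counter(judge_labels)
    # Agreement expected if both raters labeled at random with their own label frequencies.
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    return observed, kappa


human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] * 4 + ["pass"]
agreement, kappa = agreement_and_kappa(human, judge)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

In this toy data the judge agrees with humans 80% of the time, but kappa is noticeably lower because a fair share of that agreement would happen by chance.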
RAGAS for RAG systems
RAGAS is the de facto open-source framework for evaluating retrieval-augmented generation. It pins down four metrics that map onto the failure modes above.
Faithfulness measures whether every claim in the generated answer is supported by the retrieved context. Low faithfulness = the model is making things up despite having ground truth in front of it.
Answer relevancy measures whether the answer is on-topic for the question. High faithfulness with low relevancy means you're correctly summarizing the wrong document.
Context precision measures how relevant the retrieved chunks are at the top of the ranking. If your top-3 chunks are noise, precision will be low even if recall is fine.
Context recall measures whether the needed information was retrieved at all. This is the metric that catches a broken retriever — even a perfect generator can't ground in chunks it never received.
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# `dataset` is assumed to be prepared beforehand with the questions, generated
# answers, retrieved contexts, and reference answers that these metrics need
# (exact column names depend on the ragas version).
result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)
print(result)
```

In an interview, the sharp move is to map a symptom to a metric. "Users complain answers feel off-topic" → answer relevancy. "Answers cite chunks that don't actually contain the claim" → faithfulness. "Some questions are unanswerable even with the docs available" → context recall. That mapping is what gets you to the offer.
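To build intuition for the two retrieval-side metrics, here is a simplified stand-in that uses hand-labeled relevant chunk ids. This is not how RAGAS computes its scores (it relies on model judgments over the text); the function names and ids below are illustrative only.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant_ids) / len(top_k)


def recall(retrieved_ids, relevant_ids):
    """Fraction of the labeled-relevant chunks that were retrieved at all."""
    if not relevant_ids:
        raise ValueError("recall is undefined without any relevant chunks")
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)


retrieved = ["c7", "c2", "c9", "c4"]   # ranked retriever output
relevant = {"c2", "c4", "c5"}          # chunks a labeler marked as needed

print(f"precision@3 = {precision_at_k(retrieved, relevant, 3):.2f}")
print(f"recall      = {recall(retrieved, relevant):.2f}")
```

Here only one of the top three chunks is useful and one needed chunk was never retrieved, so both a ranking problem and a coverage problem show up before the generator is even involved.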
Standard benchmarks
You should know these by name and one-line purpose. Interviewers will not ask for deep internals, but they will absolutely ask which benchmark fits which use case.
| Benchmark | What it tests | When to cite |
|---|---|---|
| MMLU | Multi-choice knowledge across 57 subjects | General-purpose model quality |
| HellaSwag | Common-sense sentence completion | Reasoning, plausibility |
| HumanEval | Python code generation, unit-test pass-rate | Coding assistants |
| GSM8K | Grade-school math word problems | Multi-step arithmetic reasoning |
| MT-Bench | Open-ended chat, LLM-as-judge | Conversational quality |
| Chatbot Arena | Human pairwise comparisons, Elo | Real-world preference |
| TruthfulQA | Resistance to common misconceptions | Honesty / hallucination |
The honest answer for any production system is: public benchmarks are necessary but not sufficient. They get you to "this model is roughly in the right tier." For domain quality, you build a custom eval set of 500–2,000 examples from real user queries, label them once, and run it against every new model version.
Common pitfalls
The most common pitfall is treating a single number as the eval. Teams report "our faithfulness score is 0.84" and move on, but that aggregate hides the long tail where it matters most. The fix is to always report eval scores by slice — by query type, by retrieval difficulty, by user segment. A model with 0.84 average and 0.40 on the high-stakes slice is a worse model than one with 0.78 flat.
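Per-slice reporting needs nothing more than a group-by. The sketch below assumes each eval record carries a slice label and a score; the field names and the toy numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def scores_by_slice(records, slice_key="slice", score_key="score"):
    """Group eval records by slice and report the count and mean score per slice."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[slice_key]].append(record[score_key])
    return {name: {"n": len(scores), "mean": mean(scores)} for name, scores in grouped.items()}


records = [
    {"slice": "faq", "score": 0.90},
    {"slice": "faq", "score": 1.00},
    {"slice": "faq", "score": 0.95},
    {"slice": "faq", "score": 0.95},
    {"slice": "high_stakes", "score": 0.40},
    {"slice": "high_stakes", "score": 0.40},
]

overall = mean(r["score"] for r in records)
report = scores_by_slice(records)
print(f"overall: {overall:.2f}")
for name, stats in report.items():
    print(f"{name}: n={stats['n']} mean={stats['mean']:.2f}")
```

Reporting `n` next to each mean matters: a slice with a handful of examples can swing wildly between runs.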
Another trap is evaluator–generator overlap. If you use GPT-4 to judge GPT-4, you're measuring how well GPT-4 agrees with itself, not how good the answers are. The fix is to use a judge from a different model family, or — better — to rotate judges and report the agreement rate alongside the score. Pairwise A/B with a different-family judge is the gold standard.
A third pitfall is golden-set rot. The 500 examples you labeled in January reflect January's product, January's users, and January's failure modes. By Q3, your product surface has changed and the golden set no longer covers the actual failure distribution. Refresh at least quarterly, and version your eval sets the way you version your model weights.
The fourth pitfall is conflating online and offline signals. Offline evals (RAGAS, LLM-as-judge, human-labeled accuracy) measure correctness on a fixed dataset. Online signals (thumbs-down rate, retry rate, abandonment) measure user experience on live traffic. They correlate weakly. Strong offline numbers followed by a thumbs-down spike after launch mean your eval set is missing something real users care about.
A fifth pitfall, specific to RAG, is evaluating the generator without holding retrieval constant. If you change embeddings, chunking, and the prompt in one experiment, you cannot attribute the metric movement to any single change. Lock the retriever, vary the generator. Lock the generator, vary the retriever.
FAQ
Can we fully eliminate hallucinations?
No, and any candidate who claims otherwise is signaling they have not shipped this. You can drive the rate down substantially with the stack above — RAG, citations, low temperature, verification pass, fine-tuning for honesty — but the floor is not zero. Production teams measure the residual rate and route high-stakes queries to a more conservative pipeline (often "I don't know" with a deflection to human review) rather than chasing perfection.
How big should my eval set be?
For a custom domain eval, 500 examples is a defensible minimum and 2,000 is comfortable. Below 500 the confidence intervals on your metrics are too wide to detect meaningful regressions. Above 2,000 you hit diminishing returns and judge-API cost becomes the constraint. Stratify by query type and difficulty so you can read per-slice metrics, not just the mean.
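The sample-size argument can be made concrete with a back-of-the-envelope confidence interval. The sketch below treats the metric as a pass/fail proportion and uses the normal approximation, both of which are simplifying assumptions; graded or judge-scored metrics behave somewhat differently.

```python
import math


def ci_half_width(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for a pass rate measured on n examples."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n)


# How much a measured 0.84 pass rate can move from sampling noise alone.
for n in (100, 500, 2000):
    print(f"n={n:>4}: 0.84 ± {ci_half_width(0.84, n):.3f}")
```

Under these assumptions the interval is roughly ±0.07 at 100 examples, ±0.03 at 500, and ±0.016 at 2,000, so a two-point regression is invisible on a small set and only becomes detectable near the top of that range.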
Should I use LLM-as-judge or human raters?
Both. Humans are ground truth, and you need a few hundred human-labeled examples to confirm that your judge agrees with them at least 70% of the time. Once the judge is calibrated, run it at scale: judges scale, humans don't. Re-validate the judge every time you upgrade the judge model or change the rubric.
Is RAGAS the only RAG eval framework?
It's the most-used open-source one, but it's not the only option. TruLens, DeepEval, and Arize Phoenix all cover similar ground with different trade-offs. The metrics are largely the same — faithfulness, answer relevancy, context precision, context recall — what differs is the integration surface and the dashboard. Pick the one your team will actually look at.
How do I detect a silent regression when the base model is upgraded?
Run your custom eval set against the new model the same day you get access, and compare to the previous baseline. Look at per-slice metrics, not just the aggregate — a new model can be better on average and worse on your high-stakes slice. Also check distributional shifts: longer answers, more hedging, more refusals. If your refusal rate triples overnight, that's a regression even if accuracy held.
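A minimal version of that comparison is a per-slice diff against the stored baseline. The sketch below assumes you already have per-slice mean scores for both runs; the 0.05 tolerance and the slice names are hypothetical and should be set from your own run-to-run noise.

```python
def find_regressions(baseline, candidate, max_drop=0.05):
    """Return slices where the candidate scored more than max_drop below the baseline."""
    flagged = {}
    for slice_name, base_score in baseline.items():
        new_score = candidate.get(slice_name)
        if new_score is None:
            flagged[slice_name] = "missing in candidate run"
        elif base_score - new_score > max_drop:
            flagged[slice_name] = f"dropped {base_score - new_score:.2f}"
    return flagged


baseline = {"faq": 0.80, "billing": 0.85, "legal": 0.82}
candidate = {"faq": 0.90, "billing": 0.88, "legal": 0.60}

# The candidate's average is almost unchanged even though one slice collapsed.
print(sum(baseline.values()) / 3, sum(candidate.values()) / 3)
print(find_regressions(baseline, candidate))
```

The same diff works for the distributional checks: feed it refusal rate or mean answer length per slice (with the sign flipped where higher is worse) instead of accuracy.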
How does temperature interact with hallucination rate?
Higher temperature increases sampling diversity, which increases the chance of low-probability (and often wrong) tokens. For factual or RAG-grounded tasks, temperature 0.0–0.3 is standard. For creative work, 0.7–1.0. Low temperature gives repetitive phrasing across similar prompts, but it's usually the right call for production.
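The mechanism is visible in the softmax itself: dividing the logits by the temperature before normalizing sharpens or flattens the next-token distribution. The logits below are made up for illustration and do not come from any real model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; temperature 0 is greedy argmax")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for temperature in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass sits on the top token; as temperature rises, the lower-ranked tokens get sampled more often, which is where the extra diversity and the extra errors both come from.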