How to calculate Cohen's kappa in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

What Cohen's kappa is and why it matters

Your trust and safety lead at Airbnb pings you on a Monday morning asking whether the two content moderators on the new policy queue actually agree. They have a 90% raw agreement number from the dashboard and want a launch readiness review by Wednesday. The number looks great until you remember the queue is 85% safe and 15% needs action — even coin-flipping raters would land near 90% by chance.

Cohen's kappa fixes this. It measures inter-rater agreement between two annotators, then subtracts the agreement you would have seen by chance given each rater's marginal label distribution. The result lives in [-1, 1]: 0 means agreement as often as random labeling, 1 means perfect agreement, negative values mean systematic disagreement. In the moderation example, 90% raw agreement might translate to a kappa of 0.35 — "fair" at best, not the green light leadership thought they had.

Analysts at Meta, Snowflake, and DoorDash reach for kappa when validating any human-labeled dataset. Content moderation, intent classification, ad-policy review — anywhere a human label sits between raw data and a model, kappa is the receipt that says the labels are reproducible. The same calculation compares model predictions against a human gold standard, treating the model as a second rater.

The kappa formula

Cohen's kappa is defined as the chance-corrected agreement between two raters:

kappa = (p_o - p_e) / (1 - p_e)

p_o is the observed agreement — the fraction of items where both raters chose the same label. p_e is the expected agreement under independence, computed from each rater's marginal distribution. If rater A picks "approve" 70% of the time and rater B 65%, the chance they both pick it by accident is 0.70 * 0.65 = 0.455. Summing that product over every label gives p_e.

The denominator 1 - p_e rescales the metric so perfect agreement maps to 1.0 regardless of class balance. When p_e is close to 1 — both raters using the same label almost always — the denominator collapses toward 0 and kappa becomes unstable. This is the famous "kappa paradox": high raw agreement on an imbalanced label can still produce a low kappa, so always interpret it next to the marginal distribution.

Cohen's kappa in SQL

Assume a dual_ratings table with one row per item and columns for both raters' labels. The full calculation in a single query for a 30-day window in Postgres or Snowflake looks like this:

WITH agreements AS (
    SELECT
        item_id,
        rater_a_label,
        rater_b_label,
        CASE WHEN rater_a_label = rater_b_label THEN 1 ELSE 0 END AS agreed
    FROM dual_ratings
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
),
totals AS (
    SELECT COUNT(*) AS n FROM agreements
),
p_observed AS (
    SELECT SUM(agreed)::NUMERIC / NULLIF((SELECT n FROM totals), 0) AS p_o
    FROM agreements
),
marginals_a AS (
    SELECT
        rater_a_label AS lbl,
        COUNT(*)::NUMERIC / NULLIF((SELECT n FROM totals), 0) AS p_a
    FROM agreements
    GROUP BY rater_a_label
),
marginals_b AS (
    SELECT
        rater_b_label AS lbl,
        COUNT(*)::NUMERIC / NULLIF((SELECT n FROM totals), 0) AS p_b
    FROM agreements
    GROUP BY rater_b_label
),
p_expected AS (
    SELECT SUM(COALESCE(ma.p_a, 0) * COALESCE(mb.p_b, 0)) AS p_e
    FROM marginals_a ma
    FULL OUTER JOIN marginals_b mb USING (lbl)
)
SELECT
    po.p_o,
    pe.p_e,
    (po.p_o - pe.p_e) / NULLIF(1 - pe.p_e, 0) AS kappa
FROM p_observed po, p_expected pe;

A few details matter. The FULL OUTER JOIN between the marginal CTEs handles the case where one rater uses a label the other never picked — without it, that label is silently dropped from p_e and kappa creeps upward. The COALESCE wraps the missing side as zero. The NULLIF protects against the degenerate case where both raters always pick a single label, which makes p_e = 1 and kappa undefined.

When the table is large and windows shift constantly, materialize agreement and marginal aggregates as daily snapshots. Both p_o and p_e are linear in confusion-matrix cell counts, so pre-aggregation composes cleanly.

Multi-class kappa

The formula generalizes to any number of categorical classes. Each label contributes one term to p_e based on the product of the two marginals, and observed agreement only checks equality. The SQL does not change when the queue moves from binary "approve/reject" to a multi-class taxonomy of policy reasons.

What does change is how often you encounter labels one rater used and the other did not. With a binary task this is rare. With a 12-class intent taxonomy at Linear, both raters often skip several classes on a small batch. The FULL OUTER JOIN plus COALESCE pattern handles this, but list marginals side by side before trusting the kappa number:

SELECT
    COALESCE(ma.lbl, mb.lbl) AS label,
    COALESCE(ma.p_a, 0) AS p_a,
    COALESCE(mb.p_b, 0) AS p_b
FROM marginals_a ma
FULL OUTER JOIN marginals_b mb USING (lbl)
ORDER BY label;

If this shows asymmetric marginals — one rater never touching a class the other uses 20% of the time — the kappa still computes, but it speaks more to guideline drift than to reproducibility.

Weighted kappa for ordinal labels

When the label set is ordinal — a 1 to 5 rating, a severity tier from "low" to "critical" — plain Cohen's kappa treats "5 vs 4" and "5 vs 1" as identical disagreements. Weighted kappa fixes this by attaching a cost to each off-diagonal cell of the confusion matrix. Quadratic weights are the default: the penalty grows with the squared distance between rater labels, so a single-step disagreement is much cheaper than a four-step gap.

The weighted formulation replaces agreed with a per-pair weight and applies the same weight to the expected matrix. Materialize the joint frequency table, multiply each cell by its weight, then sum observed and expected separately:

WITH classes AS (
    SELECT DISTINCT label_id FROM rating_scale
),
joint AS (
    SELECT
        rater_a_label,
        rater_b_label,
        COUNT(*)::NUMERIC AS n_ab
    FROM dual_ratings
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY rater_a_label, rater_b_label
),
totals AS (
    SELECT SUM(n_ab) AS n FROM joint
),
weighted_observed AS (
    SELECT
        SUM(j.n_ab * (1 - POWER(j.rater_a_label - j.rater_b_label, 2)
                          / POWER(max_diff.d, 2))) / (SELECT n FROM totals) AS w_o
    FROM joint j
    CROSS JOIN (
        SELECT MAX(label_id) - MIN(label_id) AS d FROM classes
    ) max_diff
),
marginals_a AS (
    SELECT rater_a_label AS lbl, SUM(n_ab) AS n_a FROM joint GROUP BY rater_a_label
),
marginals_b AS (
    SELECT rater_b_label AS lbl, SUM(n_ab) AS n_b FROM joint GROUP BY rater_b_label
),
weighted_expected AS (
    SELECT
        SUM(ma.n_a * mb.n_b / (SELECT n FROM totals)
            * (1 - POWER(ma.lbl - mb.lbl, 2) / POWER(max_diff.d, 2)))
            / (SELECT n FROM totals) AS w_e
    FROM marginals_a ma
    CROSS JOIN marginals_b mb
    CROSS JOIN (SELECT MAX(label_id) - MIN(label_id) AS d FROM classes) max_diff
)
SELECT
    wo.w_o,
    we.w_e,
    (wo.w_o - we.w_e) / NULLIF(1 - we.w_e, 0) AS weighted_kappa
FROM weighted_observed wo, weighted_expected we;

The query stores label codes as integers because quadratic weights are defined on the integer rank, not the label name. If ratings are stored as strings like "low", "medium", "high", join through a rating_scale dimension that maps each string to an ordinal position first.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Interpreting the kappa value

The Landis and Koch (1977) bands are the conventional reference. Values below 0 indicate disagreement worse than chance, which usually points to a flipped label mapping. Values from 0.0 to 0.20 are slight, 0.21 to 0.40 fair, 0.41 to 0.60 moderate, 0.61 to 0.80 substantial, and 0.81 to 1.00 almost perfect. Production labeling pipelines at Stripe, Notion, and Vercel typically gate on kappa above 0.70 before a dataset enters a training set.

These bands are heuristics, not laws. The kappa paradox can pull a high-agreement label set into the "fair" band purely because of a skewed marginal. Always pair kappa with the marginal distribution from the multi-class section above. If marginals are wildly imbalanced, supplement with Krippendorff's alpha or a confusion matrix walk-through.

Common pitfalls

The first pitfall is reporting raw agreement instead of kappa on imbalanced data. A binary moderation queue that is 95% safe and 5% needs-action hits 90% raw agreement even with raters who never look at the screen. Kappa cuts through this by subtracting expected chance agreement. Replace any dashboard that ships with only AVG(agreed) before it becomes the gold standard nobody questions.

The second pitfall is computing kappa across raters who use disjoint label vocabularies. If rater A uses a four-class taxonomy and rater B uses a three-class one, the inner-join pattern silently drops the missing class and inflates kappa. The fix is the FULL OUTER JOIN with COALESCE shown above, plus a guideline conversation before the next batch.

The third pitfall is treating ordinal labels as categorical. A churn risk rating of 1 through 5 reads as a category in the database, but "5 vs 4" is clearly closer than "5 vs 1". Plain Cohen's kappa loses that information; weighted kappa with quadratic weights restores it. Check whether the label set has an ordering before defaulting to the unweighted version.

The fourth pitfall is computing kappa over windows too small to matter. The variance of kappa shrinks with the square root of items, so a 50-item batch can swing by 0.15 from sampling noise alone. Set a minimum-sample guardrail — at least 200 items per pair, ideally 500 — before publishing kappa to stakeholders.

The fifth pitfall is comparing kappa across datasets with different marginal distributions. Kappa is sensitive to prevalence. A model evaluated against human labels on a 50/50 dataset produces a different kappa than the same model on a 90/10 dataset, even if the error rate is identical. For cross-dataset comparison, switch to Krippendorff's alpha or report the full confusion matrix.

Optimization tips

Cohen's kappa decomposes into a small number of aggregates: count of agreements, totals, and one marginal sum per label per rater. All linear in the row count, so the bottleneck is scanning the labeling table, not the math. Cluster or partition the dual-ratings table by rating timestamp so the 30-day window only scans the recent partition.

For high-volume labeling at OpenAI or Anthropic, precompute the joint confusion matrix as a daily summary with one row per (rater_a_label, rater_b_label, date) and a count. The summary is at most K * K * days rows, fits in memory for any realistic taxonomy, and powers weighted kappa, marginal listings, and the confusion-matrix view without rescanning raw events.

When kappa is reported per segment — taxonomy version, reviewer team, content language — push segmentation into the aggregates rather than the final SELECT. If the warehouse lacks FULL OUTER JOIN on the marginals, fall back to a UNION ALL of the two marginal sets, then LEFT JOIN back to the full label dimension.

If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

Cohen's kappa vs raw agreement — which one should I report?

Both, but lead with kappa. Raw agreement is the headline stakeholders intuitively understand, but it misleads on any imbalanced label distribution. Kappa subtracts chance agreement and produces a comparable number across base rates. Show raw agreement, kappa, and the marginal distribution side by side — that triple gives reviewers enough context to spot the kappa paradox before launch.

What kappa value should I target before approving a labeled dataset?

For production labeling that trains downstream models, target kappa above 0.70. For exploratory batches validating a new guideline draft, above 0.50 is usually acceptable. Below 0.40, treat the labeling task as not yet reproducible. These thresholds come from the Landis and Koch bands.

My kappa is negative. What does that mean?

Negative kappa means raters disagree more often than chance would predict. In practice this almost always traces to a flipped label mapping — one rater using a guideline that swaps two labels, or an off-by-one in the export label dictionary. Pull a sample of disagreements and look for systematic patterns. Genuine negative kappa from honest raters is rare.

Cohen's kappa vs Fleiss' kappa — when do I need which one?

Cohen's kappa handles exactly two raters. Fleiss' kappa generalizes to three or more raters where each item is labeled by an arbitrary subset of the pool. Fleiss' kappa averages pairwise agreement across all rater pairs per item, and the SQL is more involved because aggregation happens at the item level before global averaging. If your setup routes each item to a fixed pair, Cohen's kappa is correct. Otherwise switch to Fleiss'.

Should I use weighted or unweighted kappa for an ordinal scale?

Use weighted kappa with quadratic weights for any ordinal scale: ratings of 1 to 5, severity tiers, graded relevance. Plain Cohen's kappa treats every disagreement equally, losing the ordering structure and making a near-miss rater look as bad as a wild-miss rater. Quadratic weights correspond to a squared-distance penalty that aligns with how downstream models treat numeric error.

Can I use kappa to compare a model against human labels?

Yes — treat the model as the second rater. Compute kappa between the gold-standard column and the model's predicted-label column on the same evaluation set, using the same SQL pipeline. The kappa trend over time becomes the canary for drift between model and labeling guideline.