How to calculate device fingerprinting in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

What device fingerprinting actually is

A device fingerprint is a stable-enough hash of the signals a browser or app leaks on every request. The point is to recognise the same machine across sessions, even when the user clears cookies, signs out, or creates a new account. If somebody at DoorDash deletes their account and registers a fresh one to harvest the $10 first-order coupon again, a working fingerprint catches them before the discount fires.

The interview version of this question shows up in fraud and trust-and-safety screens at Uber, Stripe, Airbnb, and any consumer marketplace where the unit economics depend on a real new user not being the same person three accounts deep. You usually get a session events table and a one-line ask: "find devices behind more than one account in the last 30 days." The trap is that interviewers want to see you build the hash deterministically and then reason about false positives — not just GROUP BY user_agent.

The most useful framing in a real job is that fingerprinting is a join key, not a verdict. You build the hash, attach it to every session, and then let downstream rules — velocity, refund history, IP reputation — decide what to do. Treating the hash itself as proof of abuse is how you end up banning a family that shares one laptop.

Which signals go into the hash

The standard client-side signals are user-agent, screen resolution, timezone, system language, canvas hash, WebGL renderer string, audio context fingerprint, and the list of installed fonts. Each one alone is weak — millions of users run Chrome at 1920x1080 in UTC — but combined they sit at roughly 18-22 bits of entropy on the open web, which is enough to recognise the same browser across a quiet weekend.

Server-side, you get IP, ASN, and geo, but treat them as soft signals. IPs rotate, mobile carriers NAT thousands of phones behind one address, and any motivated abuser is already on a residential VPN. Use them as features in a downstream model, not as part of the fingerprint hash itself, otherwise the hash becomes unstable the moment a user switches from home Wi-Fi to LTE.

A common production split is to store one strict fingerprint that includes canvas and WebGL, and one loose fingerprint that omits them. The strict one is for high-confidence matches; the loose one is for "probably the same person, worth a soft prompt." Building both costs you one extra column in the events table.

The SQL formula

Hash the concatenation of the signals you trust:

SELECT
    user_id,
    session_id,
    MD5(
        COALESCE(user_agent, '') || '|' ||
        COALESCE(screen_resolution, '') || '|' ||
        COALESCE(timezone, '') || '|' ||
        COALESCE(language, '') || '|' ||
        COALESCE(canvas_hash, '')
    ) AS device_fingerprint
FROM session_events
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days';

MD5 is fine here. You are matching, not signing — collision resistance against a determined attacker is irrelevant when the attacker controls the inputs anyway. If your security team asks for SHA-256, give them SHA-256; the cost is a few milliseconds on a Snowflake warehouse and zero analytical difference.

The COALESCE calls matter more than they look. If canvas_hash is null for 12% of your traffic and you concatenate without coalescing, those 12% all hash to the same value in some dialects and to different values in others. That single inconsistency is the most common reason a fingerprint pipeline looks fine on a dashboard and then misses obvious dupes in production.

The separator (| in the example) needs to be a character that cannot appear inside any of the fields. If user-agent strings ever contain | in your data — they sometimes do, especially mobile SDK builds — switch to a delimiter like \x1f (unit separator) or hash each field independently and concatenate the digests.

Multiple accounts on one device

The standard "find abuse" query groups by fingerprint and counts distinct accounts:

WITH fingerprints AS (
    SELECT
        user_id,
        MD5(
            COALESCE(user_agent, '') || '|' ||
            COALESCE(screen_resolution, '') || '|' ||
            COALESCE(timezone, '')
        ) AS fp
    FROM session_events
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
    fp AS device_fingerprint,
    COUNT(DISTINCT user_id) AS accounts,
    ARRAY_AGG(DISTINCT user_id) AS user_ids
FROM fingerprints
GROUP BY fp
HAVING COUNT(DISTINCT user_id) > 3
ORDER BY accounts DESC;

The threshold of 3 is a starting point, not a law. For a delivery app where households share one phone, the natural baseline is 2-4 accounts per device per quarter; setting the alarm at 3 will bury your review queue in legitimate roommates. For a B2B SaaS where every seat is supposed to be one person, 2 distinct accounts on one fingerprint is already a signal. Calibrate by sampling 50 flagged rows and asking ops to label them.

A useful refinement is to weight by recency. A device that ran 7 accounts a year ago and 1 today is yesterday's problem; a device that ran 1 account last month and 5 this week is the live abuse. Add a MAX(created_at) to the aggregate and sort by the freshness of the cluster, not just the count.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Anomaly detection for bot spikes

A sudden jump in unique fingerprints is almost always a campaign — a referral abuse ring, a credential-stuffing run, or a scraper that forgot to reuse cookies:

SELECT
    DATE_TRUNC('hour', created_at)::TIMESTAMP AS hour,
    COUNT(DISTINCT device_fingerprint) AS unique_devices,
    COUNT(*) AS sessions,
    COUNT(*)::NUMERIC / NULLIF(COUNT(DISTINCT device_fingerprint), 0) AS sessions_per_device
FROM session_events
WHERE created_at >= CURRENT_DATE - INTERVAL '2 days'
GROUP BY DATE_TRUNC('hour', created_at)
ORDER BY hour;

Two patterns to watch for. A spike in unique_devices with sessions_per_device near 1.0 means thousands of fresh fingerprints each hitting the site once — almost always a scripted attack with randomised user-agents. A spike in sessions_per_device with flat unique devices is the opposite: one fingerprint hammering the site, usually a misconfigured cron or a single abuser refreshing.

If you want to make this production-grade, replace the human eyeball with a rolling z-score on unique_devices per hour against the prior 14 days for the same hour-of-week. Anything above 3 standard deviations gets a Slack ping; anything above 5 pages the on-call.

Common pitfalls

The first trap is assuming fingerprints are stable forever. Browser updates change the user-agent string every six weeks, GPU driver updates shift the WebGL renderer, and iOS point releases rewrite the canvas hash. A realistic stability window is 30 to 90 days for most users; design your join logic to tolerate the same device appearing under two fingerprints across that boundary.

The second is ignoring noise floor. A fingerprint of "Chrome on Windows 10 at 1920x1080 in UTC with English language" describes roughly forty million people. If you build your hash only from those four fields, every "match" is statistical garbage. Either include canvas and WebGL — which require client-side JavaScript collection — or accept that the hash is a coarse cohort key and not a device identifier.

The third is leaning entirely on server-side data. IP, ASN, and geo feel solid because they live in your logs without any client cooperation, but they are the easiest signals to manipulate. A motivated abuser running on a residential proxy network rotates IPs every request; your JS-side fingerprint is what actually catches them. Build the client-side collector first, treat server-side fields as supporting features.

The fourth is forgetting privacy compliance. Under GDPR and CCPA, device fingerprinting is treated as personal data in most cases and needs an explicit, granular consent banner — not a buried line in a 40-page policy. Run the design past legal before you ship the collector, especially if you serve traffic from the EU or California. Getting this wrong is a regulatory fine, not a polite warning email.

The fifth is auto-blocking on fingerprint count alone. The legitimate cases — a family on one laptop, an office with shared workstations, two siblings playing the same mobile game — look identical to a refund abuser in the data. Route flagged fingerprints into a human review queue or a soft challenge (CAPTCHA, email verification), and reserve hard blocks for cases where fingerprint matches another strong signal like a chargeback or a known-bad payment instrument.

Optimization tips

The fingerprint column is a perfect candidate for a clustering key on Snowflake or BigQuery. If your session events table has 8 billion rows and you query by device_fingerprint regularly, clustering by (device_fingerprint, created_at) will cut pruning costs by an order of magnitude. On Postgres or Redshift, the equivalent is a B-tree index on (device_fingerprint, created_at) and a SET STATISTICS 1000 if cardinality is high.

Pre-compute the hash at ingest, not at query time. Calling MD5 inside a 30-day aggregation over a billion rows costs real compute; calling it once when the event lands costs nothing. Add a materialised device_fingerprint column to the events table, update your ETL to populate it, and your downstream queries become simple GROUP BY operations instead of expensive scalar UDF passes.

For the multi-account detection query specifically, a daily incremental materialised view of (fingerprint, user_id, last_seen) lets you answer "show me all multi-account fingerprints" in under a second, instead of rescanning 30 days of raw events every time the trust-and-safety team opens the dashboard. The refresh cost is one merge per day.

If you want to drill fraud SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.

FAQ

What is the minimum set of fields for a useful fingerprint?

User-agent, screen resolution, and timezone is the absolute floor — together they give you maybe 12 bits of entropy, enough to distinguish broad cohorts but not individual machines. Adding canvas hash and audio context pushes you to roughly 18-22 bits, which is where the fingerprint starts being useful for catching dedicated abusers. Below the floor, you are essentially clustering by browser family, not identifying devices.

Is MD5 strong enough or do I need SHA-256?

For the matching use case described here, MD5 is more than enough. You are not protecting the hash against an adversary who wants to forge collisions — you are clustering rows that should be the same. The cryptographic weaknesses in MD5 (chosen-prefix collisions, etc.) are irrelevant when the inputs come from your own pipeline. Use SHA-256 if your security team requires it for policy reasons; the analytical answer does not change.

What about privacy law — can I even do this in 2026?

Yes, but with consent. GDPR (EU), CCPA/CPRA (California), and the post-2024 wave of state-level US privacy laws all treat device fingerprinting as identifying processing that needs an opt-in or a clear opt-out path. The practical pattern is a banner that distinguishes "strictly necessary" from "fraud prevention" from "analytics" and lets the user decline the fingerprint while still using the product. Talk to legal before launch — the fines run into the millions for repeat violations.

Can I match the same user across devices using a fingerprint?

No, and that is the point. The fingerprint identifies a device, not a person. Cross-device matching has to go through a stable identifier the user supplies themselves: an email address at signup, a phone number for OTP, a logged-in session token. Fingerprint plus login gives you a device-to-user join; fingerprint alone gives you device-to-device. Treat them as separate dimensions in your model.

Is FingerprintJS worth the subscription or should I build my own?

The hosted vendors (FingerprintJS Pro, ThreatMetrix, Sift) run roughly $30k to $200k a year depending on volume, and they give you a maintained collector, anti-evasion logic against headless Chrome, and a managed identity graph. Building in-house is cheaper in cash but expensive in engineering time — expect 1-2 engineer-quarters to reach feature parity on collection, and ongoing maintenance every time browsers change their APIs. The right call is usually: buy the collector, build the SQL layer yourself.

How do I handle mobile apps where there is no browser fingerprint?

Mobile apps expose a different set of signals: device model, OS version, screen density, locale, carrier, advertising identifier (IDFA/AAID where consent allows), and platform-specific install IDs. The SQL pattern is identical — concatenate, hash, group — but the input columns are different. On iOS post-ATT, you also need a fallback for users who declined tracking, usually a server-generated install ID that survives reinstalls within the same iCloud account.