Same Result, 100x Slower: The SQL Anti-Join at 500 Million Rows
"Find the pages that have no likes." It sounds like a one-line SQL question โ the kind you'd get in a five-minute interview screen. And on a toy table, every reasonable way of writing it returns the same answer instantly.
But how you write it stops being a stylistic choice the moment the tables get big. I took this exact problem โ DataLemur's "Page With No Likes" โ and ran five equivalent-looking queries against 500 million rows on Amazon RDS PostgreSQL. The spread was not subtle:
3 minutes 44 seconds to over 6 hours โ for queries that all return the same rows.
Here's what happened, and more importantly, why.
The problem
Two tables. pages lists every page; page_likes records who liked what.
CREATE TABLE pages (
page_id INT NOT NULL PRIMARY KEY,
page_name VARCHAR(20)
);
CREATE TABLE page_likes (
user_id INT NOT NULL,
page_id INT NOT NULL
);
CREATE INDEX page_likes_page_id ON page_likes (page_id);
We want every page_id in pages that never appears in page_likes. In relational terms this is an anti-join โ "rows in A with no match in B" โ and SQL gives you at least five ways to spell it.
The test rig: each table loaded with 500M rows on a db.m5.xlarge instance (4 vCPUs, 16 GB RAM). page_likes.page_id is a random value in [1, 500M], so โ by the coupon-collector argument โ roughly (1 โ 1/e) โ 37% of pages get no likes at all. The correct answer is about 184 million pages. Every query below returns that same number; only the clock changes.
The five contenders
1. EXCEPT โ the set-difference operator.
SELECT page_id FROM pages
EXCEPT
SELECT page_id FROM page_likes;
2. LEFT JOIN โฆ IS NULL โ join everything, keep the misses.
SELECT p.page_id
FROM pages p
LEFT JOIN page_likes pl ON p.page_id = pl.page_id
WHERE pl.page_id IS NULL;
3. NOT EXISTS โ a correlated existence check.
SELECT p.page_id
FROM pages p
WHERE NOT EXISTS (
SELECT 1 FROM page_likes pl
WHERE pl.page_id = p.page_id
);
4. NOT EXISTS with LIMIT 1 โ the same, with a "helpful" LIMIT 1 in the subquery.
5. NOT IN โ the one everyone reaches for first.
SELECT page_id FROM pages
WHERE page_id NOT IN (SELECT page_id FROM page_likes);
The results
| Query | Time |
|---|---|
LEFT JOIN โฆ IS NULL |
3m 44s |
NOT EXISTS |
3m 45s |
NOT EXISTS + LIMIT 1 |
3m 45s |
EXCEPT |
14m 12s |
NOT IN |
> 6 hours (killed, never finished) |
Three of them tie. One is ~4ร slower. One is a different category of catastrophe. The times only make sense once you look at the plan PostgreSQL builds for each.
Why the fast three are fast (and identical)
LEFT JOIN โฆ IS NULL and NOT EXISTS produce the exact same execution plan: a hash anti-join. Postgres builds a hash table of page_likes.page_id once, streams pages past it, and emits each page that finds no match. That's a single pass over each table โ about as good as it gets. The two queries reading within a second of each other isn't luck; the planner literally compiles them to the same operator.
And NOT EXISTS โฆ LIMIT 1? That LIMIT 1 does nothing. EXISTS is a semi-join โ it already stops at the first matching row by definition. Adding LIMIT 1 inside it is redundant; the planner ignores it, which is why the time is identical to plain NOT EXISTS. If you've been sprinkling LIMIT 1 into EXISTS subqueries thinking it's an optimization: it isn't. It's a no-op. (Harmless, but drop it โ it just adds noise.)
Takeaway #1: NOT EXISTS and LEFT JOIN โฆ IS NULL are your anti-join workhorses. Pick whichever reads more clearly to you โ the database sees them the same way.
Why EXCEPT is ~4ร slower
EXCEPT isn't wrong โ it returns the right rows โ but it's doing more work than you asked for. Set operators in SQL are distinct by default: EXCEPT must deduplicate both inputs before subtracting them. So Postgres sorts or hash-aggregates all 500M page_likes rows down to distinct values, dedups pages, and only then computes the difference.
The anti-join skips all of that. It never materializes a deduplicated set โ it just probes a hash table and moves on. EXCEPT pays for a guarantee (distinct results) you didn't need here, and at half a billion rows that guarantee costs an extra ten minutes.
Takeaway #2: reach for EXCEPT when you genuinely want set-difference-with-dedup semantics. For a plain anti-join, it's the wrong tool โ you're paying for deduplication you don't need.
Why NOT IN is a time bomb โ twice over
NOT IN with a subquery is the intuitive way to write this, and it's the one to avoid. It fails on two independent fronts.
1. Correctness โ the NULL trap. x NOT IN (โฆ) uses SQL's three-valued logic. If the subquery returns even one NULL, the whole expression can never be TRUE, and your query silently returns zero rows. We dodged it here only because page_likes.page_id is NOT NULL. Change that column to nullable, or point NOT IN at a column that can be null, and the query quietly produces wrong answers with no error. That's the worst kind of bug.
2. Performance โ the planner can't anti-join it. Precisely because of those NULL semantics, PostgreSQL will not rewrite NOT IN (subquery) into a hash anti-join the way it does for NOT EXISTS. Instead it evaluates the subquery as a SubPlan. If the distinct values fit in work_mem, it can hash them โ but 500M values don't come close to fitting. So it falls back to the pathological case: for each of the 500M outer pages, it re-scans the materialized subquery looking for the value. That's roughly O(n ร m) โ hundreds of millions times hundreds of millions of comparisons. Which is why it ran past six hours and got killed, while its cousins finished in under four minutes.
Takeaway #3: NOT IN (subquery) is both a correctness hazard (NULLs) and a performance cliff (no anti-join transform). Prefer NOT EXISTS, which handles NULLs correctly and optimizes well. Save NOT IN for small, hardcoded, non-null lists like WHERE status NOT IN ('done','cancelled').
What about the index?
There's an index on page_likes(page_id), and indexing your join key is always good advice. But be clear-eyed about when it pays off. At this scale โ 500M probing 500M, with ~37% of pages unmatched โ the planner chose a hash anti-join, which builds a hash table and largely ignores the index. An index shines for a selective anti-join (a small outer set, or when a nested-loop with index lookups beats hashing). Here, the win came from the query shape, not the index. The lesson isn't "indexes don't matter" โ it's "the biggest lever was choosing a formulation the planner can turn into an anti-join in the first place."
The real lesson
On the DataLemur sandbox, all five queries return instantly and the "right" answer is whichever you can type fastest. That's the trap. The habits you form on toy data are the habits that ship to production, and production has 500 million rows.
Same result set. Same tables. A 100ร difference โ 3m 44s versus did not finish โ decided entirely by which keyword you reached for. When you write an anti-join:
- Default to
NOT EXISTS(orLEFT JOIN โฆ IS NULL) โ correct with NULLs, optimizes to a hash anti-join. - Skip the
LIMIT 1insideEXISTSโ it does nothing. - Use
EXCEPTonly when you actually want distinct set-difference semantics. - Never point
NOT INat a subquery on a large or nullable column. - And whenever performance matters, run
EXPLAIN (ANALYZE, BUFFERS)yourself โ read the plan, don't guess from the syntax.
Numbers here are single wall-clock runs on one db.m5.xlarge against PostgreSQL on RDS; exact times depend on version, work_mem, cache warmth, and data distribution. The ranking, though, is structural โ it follows from how the planner treats each construct, so it holds well beyond this one benchmark. The data generator and benchmark notebooks are on GitHub.
0 Comments
Sign in to join the discussion.