7 ClickHouse JOIN Hacks That Aren’t Slow

The Old Advice About ClickHouse Joins Is Dangerous Today

If you search for ClickHouse JOIN advice online, you will find articles from 2020 telling you to denormalize everything. They say the engine cannot handle joins. They warn that only a hash join exists. They claim any join with a right table larger than available memory will crash. Every single one of those statements was accurate in 2020. Not one of them is true today.

Between 2022 and early 2026, the ClickHouse team rebuilt the entire join subsystem from the ground up. Six join algorithms ship by default. A cost-based optimizer reorders joins automatically using column statistics. Runtime bloom filters prune fact tables at the storage layer. Disk spilling prevents out-of-memory crashes. None of these features sit behind experimental flags. They are all active and available in standard releases. Understanding how these changes affect ClickHouse join performance is essential for anyone building analytical workloads on the platform.

Hack 1: Grace Hash Join — Disk Spilling Eliminates OOM Crashes

Before pull request #38191 landed, a join where the right table exceeded available memory would crash the query with an out-of-memory error. This was the single most upvoted join issue in ClickHouse history, open since June 2020. Users had no escape route other than buying more RAM or denormalizing their schema.

How Grace Hash Join Works

Grace hash join partitions both input tables into buckets using a secondary hash function. It processes one bucket pair at a time and writes inactive buckets to disk. This means memory usage stays bounded by the largest single bucket, not by the total size of the right table. After the in-memory buckets are processed, it reads the spilled buckets back from disk, one pair at a time, and completes the remaining matches.

The trade-off is straightforward. Disk I/O is slower than memory access, so the query takes longer. But it finishes. Before this algorithm existed, the same query would fail entirely. For organizations running ClickHouse on commodity hardware with limited RAM, this single change transformed ClickHouse join performance from fragile to reliable.

When to Rely on Grace Hash Join

ClickHouse selects grace hash join automatically when the estimated size of the right table exceeds the available memory budget. No configuration changes are required. The algorithm reached general availability in version 24.3. If you are running any release older than that, upgrading alone gives you crash-proof joins on large datasets.
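
You can also pin the algorithm manually, which is handy for testing spill behavior before trusting automatic selection. A minimal sketch, using hypothetical orders and customers tables; join_algorithm and grace_hash_join_initial_buckets are ordinary session settings:

-- Force grace hash join for one query (table and column names are illustrative).
SELECT o.order_id, c.region
FROM orders AS o
INNER JOIN customers AS c ON o.customer_id = c.customer_id
SETTINGS join_algorithm = 'grace_hash',
         grace_hash_join_initial_buckets = 8; -- more buckets: less memory per pass, more disk I/O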

Hack 2: Full Sorting Merge — Memory-Bounded Alternative for Pre-Sorted Data

Hash join algorithms consume memory proportional to the cardinality of the right table. Full sorting merge takes a different approach. It bounds memory by the sort buffer size rather than by the number of unique keys. This makes it a strong choice when one or both sides of the join arrive already sorted.

Why Sort-Merge Can Outperform Hash Join

When data arrives from a MergeTree table in primary key order, the sorting phase of sort-merge may require zero additional work. The algorithm then merges both sorted streams in a single pass using a small buffer. In benchmarks against pre-sorted data, full sorting merge often completes faster than parallel hash join while consuming a fraction of the memory.

Pull request #35796 introduced this algorithm. ClickHouse selects it automatically when the optimizer determines that sort costs are low relative to hash table construction costs. Users do not need to specify anything in their query syntax. The engine evaluates the trade-off at planning time.
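
You can also request the algorithm explicitly to compare it against hash join on your own data. A minimal sketch, assuming hypothetical events and sessions tables that both declare session_id as their MergeTree ORDER BY key:

-- Both tables are already sorted by session_id on disk, so the sort phase is nearly free.
SELECT e.event_type, s.started_at
FROM events AS e
INNER JOIN sessions AS s ON e.session_id = s.session_id
SETTINGS join_algorithm = 'full_sorting_merge';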

Hack 3: Parallel Hash Join — Near-Linear Scaling Across CPU Cores

The original hash join algorithm in ClickHouse was single-threaded for hash table construction. Parallel hash join, introduced in pull request #70788 and made default in version 24.12, distributes the work across all available CPU cores. The speedup scales almost linearly with core count on modern hardware.

Zero Configuration Required

Parallel hash join activates automatically. There are no settings to toggle, no thresholds to tune, and no experimental flags to enable. On a sixteen-core server, the hash table construction phase can run close to sixteen times faster than the old single-threaded approach. For queries joining moderate-sized tables that fit in memory, this is often the fastest algorithm available.

The engine still falls back to the single-threaded variant for very small tables where parallelization overhead would outweigh the benefit. The decision is entirely automatic and based on estimated row counts.
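
To confirm which algorithm a query actually uses and how many streams build the hash table, EXPLAIN PIPELINE prints the processor graph. A sketch with hypothetical page_views and users tables (the exact output format varies by version):

-- Inspect the parallelism of the build phase.
EXPLAIN PIPELINE
SELECT count()
FROM page_views AS v
INNER JOIN users AS u ON v.user_id = u.id
SETTINGS join_algorithm = 'parallel_hash', max_threads = 16;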

Hack 4: Equivalence-Set Predicate Pushdown — Up to 180x Speedup

Pull request #61216 introduced equivalence-set predicate pushdown, which is arguably the highest-impact join optimization in the entire four-year rebuild. The idea is simple. When two tables join on t1.id = t2.id and a filter exists as WHERE t1.id = 5, that filter is logically equivalent to t2.id = 5. The optimizer recognizes this and pushes the filter to both sides before the join executes.

Why This Produces Dramatic Speedups

Pushing a filter to the right side of a join means fewer rows participate in the hash table. Smaller hash tables mean less memory pressure and faster probe times. In benchmarks, this optimization achieved up to 180 times faster execution on star-schema workloads. The improvement is most pronounced when fact tables contain billions of rows and dimension filters reduce the active set to a tiny fraction.

A 2026 release extended this capability further in pull request #96596. The optimizer now uses a disjoint-set union data structure to track transitive equalities across chains of inner joins. A filter on t1.id gets pushed to t2, t3, and beyond automatically.
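
A minimal sketch of the transitive case, with hypothetical tables t1, t2, and t3; running the query under EXPLAIN should reveal the derived filters when the optimization fires:

-- The filter is written against t1 only.
SELECT t1.v, t2.v, t3.v
FROM t1
INNER JOIN t2 ON t1.id = t2.id
INNER JOIN t3 ON t2.id = t3.id
WHERE t1.id = 5;
-- The equivalence set {t1.id, t2.id, t3.id} lets the optimizer derive
-- t2.id = 5 and t3.id = 5 and apply both before either join runs.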

Hack 5: Automatic OUTER to INNER Conversion — From 32 Seconds to 6 Milliseconds

Outer joins are structurally expensive. They preserve all rows from one side, which prevents several downstream optimizations. Pull request #62907 introduced automatic detection of cases where outer semantics are unnecessary. When a LEFT JOIN includes a post-join filter such as WHERE right_col IS NOT NULL, the outer join becomes semantically identical to an inner join.
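
The pattern looks like this. A sketch with hypothetical facts and dims tables; the filter on d.name discards exactly the NULL-extended rows the LEFT JOIN would have preserved, so the planner can treat the query as an inner join:

-- Outer join whose filter removes every NULL-extended row anyway.
SELECT f.amount, d.name
FROM facts AS f
LEFT JOIN dims AS d ON f.dim_id = d.id
WHERE d.name IS NOT NULL;
-- The optimizer can execute this as:
--   FROM facts AS f INNER JOIN dims AS d ON f.dim_id = d.id WHERE d.name IS NOT NULL
-- which unlocks the pushdown and reordering that outer joins block.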

The Performance Impact Is Staggering

One benchmarked query in the ClickHouse test suite dropped from 32 seconds to 0.006 seconds after this conversion. The reason is twofold. First, converting to an inner join immediately halves the number of rows that need processing in many cases. Second, inner joins unlock predicate pushdown and join reordering that are structurally impossible for outer joins. The optimizer can then apply equivalence-set pushdown and cost-based reordering, compounding the savings.

This conversion happens automatically during query planning. No query hints or syntax changes are needed. If your workload includes outer joins with filters that effectively nullify the outer semantics, upgrading to a version that includes this optimization will improve ClickHouse join performance without any code changes.

Hack 6: Right-Side and OR-Condition Pushdown — 27x Improvement

Predicate pushdown initially worked only for AND conditions on the left side of a join. Two further extensions closed important gaps. Pull request #50532 added right-side predicate pushdown, which achieved a 27 times improvement on applicable workloads. Pull request #84735 added OR-condition pushdown, allowing filters combined with OR to propagate to both sides of a join.

Why OR Conditions Matter

Queries with OR filters in WHERE clauses were historically difficult for ClickHouse to optimize. The engine could push AND conditions to both sides, but OR conditions stayed on the probe side, forcing full scans. With OR-condition pushdown, a filter such as WHERE t1.category = 'electronics' OR t1.category = 'apparel' can propagate through the join, and when the filtered column is also the join key, it reaches both sides, reducing the data that arrives at the join operator.
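
A minimal sketch, with hypothetical orders and shipments tables joined on order_id and a disjunctive filter on that key:

SELECT count()
FROM orders AS o
INNER JOIN shipments AS sh ON o.order_id = sh.order_id
WHERE o.order_id = 1001 OR o.order_id = 2002;
-- With OR-condition pushdown, the disjunction is applied to the scans of both
-- orders and shipments instead of being evaluated only on the probe side.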

These optimizations require no configuration. They activate automatically when the planner detects applicable patterns. For workloads with complex filter logic across joined tables, the cumulative effect of these pushdown rules can reduce query latency by an order of magnitude or more.

Hack 7: Cost-Based JOIN Reordering with Column Statistics — 1,450x Speedup

Before version 25.9, ClickHouse executed joins in the exact order specified by the query syntax. If a user wrote a query joining a large fact table to a small dimension table first, the engine followed that order even if a different order would produce smaller intermediate results. Pull request #86822 changed this entirely by introducing a greedy join reordering algorithm that uses column statistics.

How the Optimizer Chooses Join Order

The greedy algorithm evaluates estimated cardinalities for each possible join pair and selects the order that minimizes intermediate result sizes. Column statistics, which became automatic in version 25.10, provide the cardinality estimates needed for accurate cost calculations. The optimizer considers not just table sizes but also filter selectivity and key distribution.

Version 25.12 added an even more powerful option. Pull request #91002 introduced DPsize, a dynamic programming algorithm that performs exhaustive search on complex join graphs. While greedy reordering handles most practical cases efficiently, DPsize finds optimal plans for queries with five or more tables where greedy might miss a better ordering.
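
You can inspect the order the optimizer actually chose with EXPLAIN. A sketch over a hypothetical star schema; the query is written fact-table first, but the plan may start from the most selective dimension instead:

EXPLAIN PLAN
SELECT count()
FROM sales AS s
INNER JOIN products AS p ON s.product_id = p.id
INNER JOIN stores AS st ON s.store_id = st.id
INNER JOIN dates AS d ON s.date_id = d.id
WHERE p.category = 'apparel' AND st.country = 'DE';
-- The plan output shows which join runs first, so you can verify that
-- reordering picked a smaller build side than the written order implies.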

Measured Results from TPC-H SF100

The benchmark results speak clearly. On the TPC-H SF100 dataset, cost-based join reordering delivered a 1,450 times speedup and a 25 times memory reduction compared to syntax-order execution. These numbers come from the official ClickHouse performance testing suite. The improvement is not theoretical. It is reproducible on standard benchmark hardware.

Runtime hash table sizes from previous executions feed back into future plans through the statistics system. This means the optimizer gets smarter over time as it observes actual data distributions. The feedback loop is automatic and continuous.

Runtime Bloom Filters: The Unsung Hero of Star Schema Joins

While not one of the seven numbered hacks, runtime bloom filters deserve mention because they amplify all the other optimizations. Since version 25.10, ClickHouse builds bloom filters from build-side join keys and pushes them to the probe-side table scan. Non-matching rows are discarded at the storage layer before they ever reach the join operator.

The version 25.10 release blog reported a 2.1 times overall query speedup and a 7 times memory reduction on star-schema workloads from bloom filters alone. Filters can also be pushed into PREWHERE for even earlier pruning during the scan phase. This means the storage engine skips entire granules of data that cannot possibly produce join matches.
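
A typical star-schema query that benefits, sketched with hypothetical sales and products tables; the selective dimension filter produces a small build side whose keys seed the runtime filter:

SELECT count()
FROM sales AS s
INNER JOIN products AS p ON s.product_id = p.id
WHERE p.category = 'electronics';
-- The filtered products rows form the build side. Their id values populate a
-- bloom filter on s.product_id, so granules of sales that cannot contain a
-- match are skipped at the storage layer before the join operator runs.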

Bloom filters are enabled by default. No configuration changes are needed. If you are running ClickHouse 25.10 or later, every join query already benefits from this optimization.

What This Means for Your ClickHouse Workloads

The advice to avoid joins in ClickHouse was reasonable in 2020. It is obsolete today. The engine now selects from six algorithms automatically, spills to disk instead of crashing, pushes predicates across join boundaries using equivalence sets, converts outer joins to inner joins when filters allow it, and reorders joins based on actual column statistics. All of these features are active by default.

If you are still denormalizing schemas specifically to avoid join performance problems, it may be time to revisit that decision. Normalized schemas with well-designed join queries can now outperform denormalized tables in many scenarios, especially when fact tables are large and dimension tables change frequently. The rebuild of the join subsystem between 2022 and 2026 changed the fundamental capabilities of the engine. The old rules of thumb no longer apply.
