
ClickHouse: Columnar OLAP for Real-Time Analytics on 10 Billion Rows

8 min read
Lucio Durán
Engineering Manager & AI Solutions Architect

Columnar Storage and Analytical Query Performance

In a row-oriented database like PostgreSQL, data is stored row by row:

Row 1: [user_id=123, event="click", timestamp=2026-01-15, page="/home", duration=3.2, country="US", device="mobile", ...]
Row 2: [user_id=456, event="view", timestamp=2026-01-15, page="/product", duration=8.1, country="DE", device="desktop", ...]

When you run SELECT country, COUNT(*) FROM events WHERE event = 'click' GROUP BY country, PostgreSQL reads every row from disk, including user_id, page, duration, device — columns you don't need. On a table with 50 columns, you're reading 50x more data than necessary.

ClickHouse stores data column by column:

user_id column: [123, 456, 789, 101, ...]
event column: ["click", "view", "click", "purchase", ...]
timestamp column: [2026-01-15, 2026-01-15, 2026-01-16, ...]
country column: ["US", "DE", "US", "JP", ...]

That same query only reads the event and country columns. On a table with 10 billion rows and 50 columns, that's ~96% less I/O. But it gets better — each column is independently compressed. The event column might have only 20 distinct values across 10 billion rows, which compresses from 80GB (raw strings) to ~500MB with LZ4 + dictionary encoding.
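You don't have to take compression ratios on faith — ClickHouse exposes per-column storage stats in the system.columns table. A quick check (assuming a table named events in the current database, as in the schema below):

```sql
-- Inspect on-disk size and compression ratio per column
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'events'
ORDER BY data_compressed_bytes DESC;
```

Low-cardinality string columns typically show the most dramatic ratios, which is exactly what makes the LowCardinality type worthwhile.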

MergeTree: The Engine That Makes It Work

Every ClickHouse table is backed by a table engine. MergeTree is the primary engine for analytical workloads:

CREATE TABLE events (
 tenant_id UInt32,
 event_date Date,
 event_time DateTime64(3),
 event_type LowCardinality(String),
 user_id UInt64,
 session_id String,
 page_url String,
 country LowCardinality(FixedString(2)),
 device_type LowCardinality(String),
 duration_ms UInt32,
 properties Map(String, String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (tenant_id, event_date, event_type, user_id)
TTL event_date + INTERVAL 13 MONTH
SETTINGS index_granularity = 8192;

Key design decisions here:

PARTITION BY: Monthly partitions by event date. Partitions are the unit of data management — you can drop, move, or detach entire months efficiently. Don't over-partition: daily partitions on a table with 100 tenants create far too many small data parts.

ORDER BY: This is the primary key and the most important decision. ClickHouse stores data sorted by this key and builds a sparse index (one entry per index_granularity rows) for data skipping. Queries that filter on the leading columns of the ORDER BY key skip entire blocks of data without reading them.

tenant_id goes first because every query in a multi-tenant system filters by tenant. Then event_date for time-range queries, event_type for category filtering, and user_id for user-level drilldowns. This order means:

  • WHERE tenant_id = 42 AND event_date = '2026-01-15' → skips 99.9% of data blocks
  • WHERE tenant_id = 42 AND event_type = 'click' → skips 99% of data blocks
  • WHERE user_id = 12345 → full scan (user_id is last in key, can't skip)
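You can verify these skip rates rather than guessing: EXPLAIN with the indexes = 1 setting reports how many partitions and index granules a query selects out of the total.

```sql
-- Show which partitions and sparse-index granules a query actually touches
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE tenant_id = 42 AND event_date = '2026-01-15';
```

If the output shows nearly all granules selected, the query isn't benefiting from the ORDER BY key and will scan far more data than it needs.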

TTL: Automatic data expiration. After 13 months, rows are deleted during background merges. No cron jobs. No manual cleanup.
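Partitions can also be managed by hand when TTL isn't enough. A sketch, assuming the monthly toYYYYMM partitioning above (the partition ID '202501' is illustrative):

```sql
-- See active parts and sizes per monthly partition
SELECT
    partition,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'events' AND active
GROUP BY partition
ORDER BY partition;

-- Drop an entire month in one cheap metadata operation
ALTER TABLE events DROP PARTITION '202501';
```

Dropping a partition is near-instant regardless of size, which is why partitioning by the retention dimension pays off.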

Materialized Views: Pre-Aggregation at Ingest Time

Raw event tables are great for ad-hoc exploration, but production dashboards need consistent sub-second response times. Materialized views in ClickHouse aren't what you might expect from PostgreSQL — they're incrementally updated at insert time, not periodically refreshed.

-- Materialized view for daily event counts per tenant
CREATE MATERIALIZED VIEW events_daily_mv
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (tenant_id, event_date, event_type, country)
AS SELECT
 tenant_id,
 event_date,
 event_type,
 country,
 count() AS event_count,
 uniqState(user_id) AS unique_users,
 sumState(duration_ms) AS total_duration_ms
FROM events
GROUP BY tenant_id, event_date, event_type, country;

When you INSERT INTO events, ClickHouse automatically transforms the new rows through the materialized view's SELECT and inserts the aggregated results into the events_daily_mv table. The SummingMergeTree engine merges rows with the same ORDER BY key during background merges, summing the numeric columns.

The -State suffix functions (uniqState, sumState) store intermediate aggregation states. You query them with the corresponding -Merge functions:

SELECT
 event_date,
 event_type,
 sum(event_count) AS events,
 uniqMerge(unique_users) AS users,
 sumMerge(total_duration_ms) / sum(event_count) AS avg_duration_ms
FROM events_daily_mv
WHERE tenant_id = 42
 AND event_date BETWEEN '2026-01-01' AND '2026-01-31'
GROUP BY event_date, event_type
ORDER BY event_date;

This query scans the pre-aggregated table (~10,000 rows for a month) instead of the raw events table (~500 million rows for the same tenant and period). Response time: 15ms vs 1,800ms.
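One caveat worth knowing: a materialized view only processes rows inserted after it is created. Pre-existing data has to be backfilled manually into the view's target table — a sketch, assuming the schema above and a creation date of 2026-01-01 (the cutoff date is illustrative):

```sql
-- Backfill historical rows into the MV's target table
INSERT INTO events_daily_mv
SELECT
    tenant_id,
    event_date,
    event_type,
    country,
    count() AS event_count,
    uniqState(user_id) AS unique_users,
    sumState(duration_ms) AS total_duration_ms
FROM events
WHERE event_date < '2026-01-01'  -- only rows from before the MV existed
GROUP BY tenant_id, event_date, event_type, country;
```

Forgetting this step is the most common reason dashboard numbers look wrong right after an MV is deployed.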

A typical setup includes four materialized views on the events table:

  1. Daily counts by tenant/event_type/country (for dashboards)
  2. Hourly session metrics (for real-time monitoring)
  3. Weekly user retention cohorts (for product analytics)
  4. Daily funnel stage counts (for conversion tracking)

Total storage overhead for all four MV tables: ~3% of the raw events table. The query performance improvement is 50-200x for dashboard queries.

Query Optimization on 10B Rows

Here are the real optimization techniques that matter at scale:

1. Filter on ORDER BY columns

-- Fast: tenant_id and event_date are first two ORDER BY columns
SELECT event_type, count()
FROM events
WHERE tenant_id = 42 AND event_date = '2026-03-15'
GROUP BY event_type;
-- Scans: ~2M rows out of 10B (0.02%)

-- Slow: country is not in ORDER BY key
SELECT country, count()
FROM events
WHERE country = 'US'
GROUP BY country;
-- Scans: all 10B rows
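When a non-key column like country is filtered often but doesn't belong in the ORDER BY, a data-skipping index is a middle ground. A sketch — the index type and granularity here are illustrative and should be tuned for your data:

```sql
-- Secondary (data-skipping) index on a non-key, low-cardinality column
ALTER TABLE events ADD INDEX country_idx country TYPE set(100) GRANULARITY 4;

-- Build the index for existing data (new inserts are indexed automatically)
ALTER TABLE events MATERIALIZE INDEX country_idx;
```

Skip indexes prune granules rather than locating rows, so they help most when the filtered value is absent from large contiguous regions of the table.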

2. Use PREWHERE for secondary filters

SELECT user_id, count() AS events
FROM events
PREWHERE tenant_id = 42 AND event_date >= '2026-01-01'
WHERE event_type = 'purchase' AND duration_ms > 5000
GROUP BY user_id
ORDER BY events DESC
LIMIT 100;

PREWHERE reads only the columns needed for the condition first, filters rows, then reads the remaining columns only for matching rows. ClickHouse automatically rewrites WHERE to PREWHERE for leading ORDER BY columns, but for non-key column filters, explicit PREWHERE can reduce I/O significantly.

3. Approximate functions for large-scale analytics

-- Exact unique count: 4.2 seconds on 10B rows
SELECT uniqExact(user_id) FROM events WHERE tenant_id = 42;

-- Approximate unique count (2% error): 0.3 seconds
SELECT uniq(user_id) FROM events WHERE tenant_id = 42;

-- Approximate quantiles: much faster than exact
SELECT
 quantilesTimingWeighted(0.5, 0.9, 0.95, 0.99)(duration_ms, 1)
FROM events
WHERE tenant_id = 42 AND event_date = today();

For dashboards where ±2% accuracy is fine, uniq() (HyperLogLog-based) is 14x faster than uniqExact(). The business stakeholders never noticed the difference.

4. Sampling for interactive exploration

CREATE TABLE events_sampled (
 -- same columns as events
)
ENGINE = MergeTree()
ORDER BY (tenant_id, event_date, sipHash64(session_id))
SAMPLE BY sipHash64(session_id);

-- Query 10% of data for interactive exploration
SELECT event_type, count() * 10 AS estimated_count
FROM events_sampled
SAMPLE 0.1
WHERE tenant_id = 42
GROUP BY event_type;

Sampling is deterministic (same sample every time for the same data) and honest about the trade-off: 10x speedup for 10% sample with ~3% error for aggregations on large datasets.

ClickHouse vs DuckDB vs Druid: Comparative Analysis

Having evaluated production workloads on all three, here's where each shines:

| Criteria | ClickHouse | DuckDB | Druid |
|----------|-----------|--------|-------|
| Data scale | 100GB - PB | 1GB - 100GB | 100GB - PB |
| Concurrent users | Dozens to hundreds | Single user | Hundreds to thousands |
| Ingestion | Batch (high throughput) | Batch/file | Real-time streaming |
| Query latency (10B rows) | 200ms - 5s | N/A (local only) | 100ms - 2s |
| Operational complexity | Medium | None (embedded) | High |
| SQL support | Full analytical SQL | Full SQL + extensions | Limited SQL |
| Best for | Analytics backend API | Data science, local analysis | Real-time dashboards |

DuckDB is remarkable for what it does — analytical queries on local files with zero infrastructure. It's excellent for prototyping queries before deploying to ClickHouse, for ad-hoc analysis on CSV/Parquet exports, and in CI pipelines for data validation. But it's not a server. You can't point 50 dashboard users at a DuckDB instance.

Druid is purpose-built for real-time analytics with sub-second queries on very high-cardinality dimensions. It handles concurrent queries better than ClickHouse at extreme scale (thousands of concurrent users). But Druid's operational complexity is significantly higher — it has six node types (Historical, MiddleManager, Broker, Router, Coordinator, Overlord), each needing separate capacity planning. ClickHouse has one node type.

ClickHouse sits in the sweet spot for most analytics backends: handles production-scale data (10B+ row tables), sub-second queries for dashboard use cases, reasonable operational complexity, and full SQL support. The main limitation is that single-row inserts are expensive — you need to batch.
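If batching on the client side is awkward, recent ClickHouse versions can also batch on the server via asynchronous inserts — a sketch (the sample row values are illustrative):

```sql
-- Let the server accumulate small inserts into batches before writing parts
INSERT INTO events
SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (42, '2026-01-15', now64(3), 'click', 12345, 'sess-1', '/home',
        'US', 'mobile', 320, map('ref', 'email'));
```

With wait_for_async_insert = 0 the insert acknowledges before the data is flushed, trading a small durability window for throughput — the same trade-off a Buffer table makes.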

The Production Architecture

A representative production setup:

Kafka ──► Buffer Table ──► Events Table ──► Materialized Views
                                │
                ┌───────────────┼───────────────┐
                │               │               │
            Dashboard          API           Ad-hoc
             Queries         Queries        Queries

ClickHouse cluster: 3 nodes, each with 32 cores, 128GB RAM, 4TB NVMe SSD. ReplicatedMergeTree with 2 replicas for fault tolerance.

-- Buffer table absorbs high-frequency inserts
CREATE TABLE events_buffer AS events
ENGINE = Buffer(currentDatabase(), events, 16, 10, 100, 10000, 100000, 1000000, 10000000);

The Buffer table accumulates inserts in memory and flushes to the underlying events table in batches. This converts our Kafka consumer's per-message inserts into efficient batch inserts of 10,000-100,000 rows.
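An alternative to an external Kafka consumer plus a Buffer table is ClickHouse's native Kafka table engine paired with a materialized view. A sketch — the broker address, topic, consumer group, and message format here are placeholder assumptions, and most columns are elided for brevity:

```sql
-- Kafka engine table consumes messages directly from the topic
CREATE TABLE events_kafka (
    tenant_id UInt32,
    event_date Date,
    event_type String,
    user_id UInt64
    -- remaining event columns elided
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',      -- placeholder broker
         kafka_topic_list  = 'events',          -- placeholder topic
         kafka_group_name  = 'clickhouse-ingest',
         kafka_format      = 'JSONEachRow';

-- MV moves consumed rows into the MergeTree table in batches
CREATE MATERIALIZED VIEW events_kafka_mv TO events AS
SELECT tenant_id, event_date, event_type, user_id
FROM events_kafka;
```

This keeps ingestion inside ClickHouse at the cost of coupling consumer offsets to the database; the external-consumer approach in the diagram above gives more control over retries and backpressure.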

Ingest rate: ~120,000 events/second sustained. Query load: ~200 queries/second across all dashboard users. P99 query latency for dashboard queries (hitting materialized views): 180ms. P99 for ad-hoc queries on raw events (full month, single tenant): 3.2 seconds.

The whole thing runs on three servers that cost about $2,400/month. The equivalent PostgreSQL setup (which couldn't even serve the queries in time) was running on an r6g.8xlarge ($2,800/month) and still timing out.

ClickHouse isn't magic. It's engineering trade-offs made explicit: you give up transactions, row-level updates, and ACID guarantees. In return, you get analytical query performance that's genuinely 100-1000x faster than general-purpose databases. For analytics workloads, that trade-off is obvious once you've seen the numbers.

Tags: clickhouse, olap, analytics, columnar-database, mergetree, materialized-views, real-time-analytics, duckdb, druid
