SPDK + NVMe: Building a User-Space Storage Engine That Hits 10M IOPS
Why the Kernel Is the Problem
The Linux kernel NVMe driver is excellent general-purpose software. It handles device discovery, namespace management, error recovery, power management, and fair scheduling across multiple processes. But all of that generality costs cycles.
Here's what happens on a single 4KB read through the kernel:
1. Application calls read() or io_uring_submit()
2. System call transition (user → kernel context switch): ~1-2 microseconds
3. VFS layer lookup and permission checks
4. Block layer: merge, schedule, create bio/request structures
5. NVMe driver: map to NVMe submission queue entry, ring doorbell
6. Wait for completion interrupt
7. Interrupt handler: process completion queue, wake up waiter
8. Context switch back to user space
Steps 2-4 and 6-8 are pure overhead. The actual NVMe command submission is a single 64-byte write to a submission queue. The completion is a 16-byte read from a completion queue. Everything else is the kernel earning its keep for features you may not need.
Measured with perf on a tuned system, a single 4KB read through the kernel takes 8-12 microseconds end-to-end. The NVMe device itself completes the operation in 6-8 microseconds (Optane) or 80-100 microseconds (NAND flash). For Optane, the kernel overhead roughly doubles the latency. For flash, it adds 10-15%. But the real killer is throughput: per-I/O interrupt handling and context switching impose a fixed CPU cost that caps IOPS long before the devices themselves saturate.
SPDK's Architecture: Cutting Out the Middleman
SPDK (Storage Performance Development Kit) takes a radical approach: move the entire NVMe driver into user space and eliminate interrupts entirely.
The architecture has three pillars:
1. User-space drivers: SPDK unbinds NVMe devices from the kernel driver and binds them to uio_pci_generic or vfio-pci. This gives the application direct access to the device's PCIe BAR registers (the doorbell registers for submission/completion queues) and the ability to set up DMA mappings without kernel involvement.
2. Polled I/O: Instead of waiting for completion interrupts, SPDK dedicates CPU cores to continuously polling the completion queue. This trades CPU cycles for latency — you burn a core at 100% utilization, but completions are processed within nanoseconds of arriving instead of waiting for interrupt coalescing and handler scheduling.
3. Lock-free design: Each poller thread owns its I/O channels exclusively. No locks, no contention, no cache line bouncing between cores. The submission queue, completion queue, and all associated data structures are thread-local.
Here's the minimal SPDK NVMe application structure:
#include "spdk/stdinc.h"
#include "spdk/nvme.h"
#include "spdk/env.h"
struct worker_ctx {
struct spdk_nvme_ctrlr *ctrlr;
struct spdk_nvme_ns *ns;
struct spdk_nvme_qpair *qpair;
uint64_t io_completed;
uint64_t io_submitted;
char *buf;
};
static void
read_complete(void *arg, const struct spdk_nvme_cpl *completion)
{
struct worker_ctx *ctx = arg;
ctx->io_completed++;
if (spdk_nvme_cpl_is_error(completion)) {
fprintf(stderr, "I/O error: sct=%x, sc=%x\n",
completion->status.sct, completion->status.sc);
return;
}
    /* Resubmit immediately for sustained throughput.
     * Note: rand() spans only ~2^31 values, so a large namespace is
     * only partially covered; use a 64-bit PRNG in production. */
    uint64_t lba = rand() % spdk_nvme_ns_get_num_sectors(ctx->ns);
int rc = spdk_nvme_ns_cmd_read(ctx->ns, ctx->qpair,
ctx->buf, lba, 1,
read_complete, ctx, 0);
if (rc == 0) {
ctx->io_submitted++;
}
}
static volatile bool g_shutdown; /* set from a signal handler to stop pollers */
static void
worker_poll(void *arg)
{
struct worker_ctx *ctx = arg;
/* This is the hot loop — runs on a dedicated core */
while (!g_shutdown) {
/* Process completions — non-blocking, returns immediately */
spdk_nvme_qpair_process_completions(ctx->qpair, 0);
}
}
The critical function is spdk_nvme_qpair_process_completions(). It directly reads the completion queue entries from the DMA-mapped memory region. No system call. No interrupt. No context switch. Just a memory read, a comparison against the phase tag, and a callback invocation.
Zero-Copy DMA: Where the Real Magic Is
In the kernel path, data follows this journey for a read:
NVMe SSD → DMA to kernel buffer → copy to user buffer → application
That copy from kernel to user space is a memcpy of your I/O size. At 4KB it's cheap. At 128KB (common for sequential workloads) it's burning memory bandwidth. At high IOPS, the aggregate copy bandwidth becomes significant.
SPDK eliminates this by mapping application buffers directly for DMA:
/* Allocate DMA-capable buffer — physically contiguous, hugepage-backed */
ctx->buf = spdk_dma_zmalloc(4096, 4096, NULL);
if (!ctx->buf) {
fprintf(stderr, "Failed to allocate DMA buffer\n");
return -ENOMEM;
}
/* The NVMe device DMAs directly into this buffer.
* No kernel involvement. No copy. */
int rc = spdk_nvme_ns_cmd_read(ctx->ns, ctx->qpair,
ctx->buf, /* DMA target */
lba, 1, /* LBA and sector count */
read_complete, ctx, 0);
The spdk_dma_zmalloc() function allocates from the hugepage pool, ensures physical contiguity (critical for DMA scatter-gather), and registers the memory region with the IOMMU. When the NVMe device completes the read, the data lands directly in your application's buffer. Zero copies. Zero kernel involvement.
For my storage engine, I pre-allocated a pool of 16,384 DMA buffers at startup:
#define BUFFER_POOL_SIZE 16384
#define BUFFER_SIZE 4096
struct buffer_pool {
char *buffers[BUFFER_POOL_SIZE];
uint32_t free_stack[BUFFER_POOL_SIZE];
uint32_t top;
};
static int
init_buffer_pool(struct buffer_pool *pool)
{
for (int i = 0; i < BUFFER_POOL_SIZE; i++) {
pool->buffers[i] = spdk_dma_zmalloc(BUFFER_SIZE, BUFFER_SIZE, NULL);
if (!pool->buffers[i]) {
return -ENOMEM;
}
pool->free_stack[i] = i;
}
pool->top = BUFFER_POOL_SIZE;
return 0;
}
static inline char *
pool_get(struct buffer_pool *pool)
{
if (pool->top == 0) return NULL;
return pool->buffers[pool->free_stack[--pool->top]];
}
static inline void
pool_put(struct buffer_pool *pool, char *buf)
{
/* Find index — in production, store index alongside buffer */
for (int i = 0; i < BUFFER_POOL_SIZE; i++) {
if (pool->buffers[i] == buf) {
pool->free_stack[pool->top++] = i;
return;
}
}
}
The pool is per-thread (remember, no sharing between poller threads), so pool_get and pool_put are simple stack operations with no synchronization.
The 10M IOPS Build
Here's the architecture that got us to 10 million IOPS:
┌─────────────┐
│ Application │
│ Thread │
│ (dispatch) │
└──────┬──────┘
│ SPDK ring buffers
┌────────────┼────────────┐
│ │ │
┌─────▼────┐ ┌────▼─────┐ ┌────▼─────┐
│ Poller 0 │ │ Poller 1 │ │ Poller 2 │ ... (8 pollers)
│ Core 2 │ │ Core 3 │ │ Core 4 │
└─────┬────┘ └────┬─────┘ └────┬─────┘
│ │ │
┌─────▼────┐ ┌────▼─────┐ ┌────▼─────┐
│ NVMe 0 │ │ NVMe 1 │ │ NVMe 2 │ ... (8 SSDs)
│ QP: 128 │ │ QP: 128 │ │ QP: 128 │
└──────────┘ └──────────┘ └──────────┘
Each poller thread is pinned to a dedicated CPU core and owns one NVMe device exclusively. The queue pair depth is 128 — enough to keep the device saturated without excessive memory usage.
The key tuning parameters:
struct spdk_nvme_io_qpair_opts qpair_opts;
spdk_nvme_ctrlr_get_default_io_qpair_opts(ctrlr, &qpair_opts, sizeof(qpair_opts));
qpair_opts.io_queue_size = 128; /* Queue depth */
qpair_opts.io_queue_requests = 256; /* Pre-allocated request pool */
qpair_opts.delay_cmd_submit = true; /* Batch doorbell writes */
struct spdk_nvme_qpair *qpair =
spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, &qpair_opts, sizeof(qpair_opts));
The delay_cmd_submit flag is crucial. Without it, every spdk_nvme_ns_cmd_read() call writes to the submission queue doorbell register (a PCIe MMIO write). Doorbell writes are expensive — each one is a posted PCIe transaction that costs ~200-500ns. With batching enabled, SPDK accumulates submissions and rings the doorbell once per poll cycle, amortizing the cost across multiple I/Os.
Benchmark Results
Hardware: Dual Intel Xeon 8380 (80 cores total), 512GB DDR4-3200, 8x Intel P5800X 800GB Optane SSDs, PCIe Gen4 x4 per drive.
| Configuration | Random 4KB Read IOPS | Avg Latency | P99 Latency |
|--------------|---------------------|-------------|-------------|
| Kernel io_uring (8 drives) | 2.1M | 48 us | 120 us |
| Kernel io_uring + polling | 3.4M | 28 us | 65 us |
| SPDK (8 pollers, 8 drives) | 10.8M | 7.2 us | 12 us |
| SPDK (tuned, batched) | 11.4M | 6.8 us | 10.5 us |
The latency numbers tell the real story. P99 latency dropped from 120 microseconds (kernel) to 10.5 microseconds (SPDK tuned). That's an 11x improvement at the tail. For a time-series database doing point queries, that tail latency directly translates to query response time.
The CPU cost is real though. Each poller thread consumes 100% of its core. That's 8 cores dedicated purely to I/O processing. With the kernel driver, those cores would be available for application logic (though they'd spend significant time in interrupt handlers and context switches). The trade-off is explicit: you're buying latency and throughput with CPU cores.
Operational Challenges
Error handling: The kernel NVMe driver handles transient errors, controller resets, and namespace changes gracefully. In SPDK, you handle all of that yourself. A controller reset means draining all in-flight I/Os, re-establishing the admin queue, re-creating I/O queue pairs, and re-submitting pending operations. I wrote about 2,000 lines of error recovery code.
static void
handle_controller_reset(struct worker_ctx *ctx)
{
/* Drain in-flight I/Os — they won't complete */
spdk_nvme_qpair_process_completions(ctx->qpair, 0);
/* Free the old qpair */
spdk_nvme_ctrlr_free_io_qpair(ctx->qpair);
/* Reset the controller */
int rc = spdk_nvme_ctrlr_reset(ctx->ctrlr);
if (rc) {
fprintf(stderr, "Controller reset failed: %d\n", rc);
/* At this point, the device is gone. Failover. */
trigger_device_failover(ctx);
return;
}
    /* Re-create the qpair; qpair_opts is the same options struct used
     * at initial setup (kept in scope or stored in the worker context) */
    ctx->qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctx->ctrlr, &qpair_opts,
                                                sizeof(qpair_opts));
if (!ctx->qpair) {
trigger_device_failover(ctx);
return;
}
/* Re-submit pending operations from the retry queue */
resubmit_pending_ios(ctx);
}
Hot-plug: Production storage systems need to handle drive failures and replacements. SPDK has hot-plug detection, but integrating it with your application's data placement and replication logic is entirely on you. I spent three weeks on hot-plug handling alone.
Observability: No /proc/diskstats. No iostat. No blktrace. You build your own metrics collection within the SPDK application. I export Prometheus metrics from each poller thread via a lightweight HTTP handler running on a separate core:
struct io_stats {
_Atomic uint64_t reads_completed;
_Atomic uint64_t writes_completed;
_Atomic uint64_t read_bytes;
_Atomic uint64_t write_bytes;
_Atomic uint64_t read_latency_sum_ns;
_Atomic uint64_t write_latency_sum_ns;
uint64_t latency_histogram[32]; /* power-of-2 buckets */
};
Memory management: SPDK requires hugepages. In production, you need to configure hugepages at boot time via kernel parameters, not at runtime. Runtime hugepage allocation is unreliable on systems that have been running for a while due to memory fragmentation:
# /etc/default/grub
GRUB_CMDLINE_LINUX="default_hugepagesz=2M hugepagesz=2M hugepages=4096 intel_iommu=on iommu=pt"
That's 8GB of hugepages reserved at boot. For our 8-drive setup with deep queue depths, we actually needed 16GB.
When to Use SPDK (and When Not To)
Use SPDK when:
- You need single-digit microsecond I/O latency (Optane, persistent memory)
- You need to saturate multiple NVMe devices (4+ drives)
- You're building a storage engine, database, or caching layer
- You can dedicate CPU cores to I/O processing
- You have the engineering capacity to handle error recovery and ops tooling
Don't use SPDK when:
- Your I/O latency is dominated by NAND flash latency (100us+) and the kernel overhead is noise
- You need standard filesystem semantics (POSIX, directory trees)
- You have one or two NVMe drives — the kernel driver with io_uring is probably fine
- You don't want to manage hugepages, driver bindings, and custom monitoring
The kernel NVMe driver with io_uring and polling mode gets you surprisingly far. On our hardware, it achieved 3.4M IOPS — more than enough for most workloads. SPDK only makes sense when you need every last IOPS the hardware can deliver and you're willing to pay the complexity tax.
For that time-series database customer, the 11.4M IOPS meant they could run their ingest pipeline and query engine on the same hardware that previously needed three separate clusters. The hardware savings paid for six months of my consulting time in the first quarter. That's the kind of math that makes the complexity worth it.