GPU Rendering Pipeline: Blend Modes, Porter-Duff Compositing, and Tile-Based Rendering
The Browser Rendering Pipeline After Layout
Most diagrams of the browser rendering pipeline show something like: DOM → Style → Layout → Paint → Composite. That's not wrong, but it's misleading because it suggests a clean linear flow. In reality, the last two stages are where all the complexity lives, and they interact in non-obvious ways.
Here's what actually happens after layout:
1. Paint record generation (main thread)
- Walk the layout tree in stacking order
- Generate a display list of paint operations (draw rect, draw text, etc.)
- Group operations into paint layers based on compositing triggers
2. Rasterization (compositor thread + raster workers)
- Divide each paint layer into tiles (256x256 px is a common size; the exact size depends on the raster mode and viewport)
- Schedule tile rasterization on worker threads or GPU
- Skia converts paint ops to GPU commands (or CPU software rendering)
3. Compositing (compositor thread → GPU process)
- Arrange rasterized tile textures according to layer tree
- Apply transforms, clip, opacity, filters, blend modes
- Submit draw calls to the GPU for final framebuffer output
4. Display (GPU → display controller)
- Swap buffers or present
- VSync synchronization
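To make the tiling step in stage 2 concrete, here is a minimal sketch of how a layer could be cut into fixed-size tiles. The function name `tileLayer` and the returned object shape are my own assumptions for illustration; this is not how Chrome's compositor is actually written.

```javascript
// Hypothetical helper: split a layer into fixed-size tiles.
// Tiles on the right and bottom edges are clipped to the layer bounds.
function tileLayer(layerWidth, layerHeight, tileSize = 256) {
  const tiles = [];
  for (let y = 0; y < layerHeight; y += tileSize) {
    for (let x = 0; x < layerWidth; x += tileSize) {
      tiles.push({
        x,
        y,
        width: Math.min(tileSize, layerWidth - x),
        height: Math.min(tileSize, layerHeight - y),
      });
    }
  }
  return tiles;
}

const tiles = tileLayer(1920, 1080);
console.log(tiles.length); // 8 columns x 5 rows = 40 tiles
```

Each tile can then be rasterized independently, which is what lets the raster workers run in parallel.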
The critical insight: rasterization and compositing are different GPU operations. Rasterization converts vector content (paths, text, boxes) into pixel textures. Compositing combines those textures into the final image. They can happen on different threads, different GPU queues, and even at different frame rates.
Blend Modes: The Math Behind the Magic
Blend modes are defined by the W3C Compositing and Blending specification, which itself builds on the Porter-Duff compositing algebra from 1984. Every blend operation has two components:
- Blending function B(Cb, Cs): How source and destination colors combine
- Compositing operator: How source and destination alpha channels interact
// General compositing formula (Co is the premultiplied result; Cs and Cb are non-premultiplied):
// Co = αs × Fa × Cs + αb × Fb × Cb
//
// Where Fa and Fb depend on the Porter-Duff operator:
// source-over: Fa = 1, Fb = 1 - αs
// source-in: Fa = αb, Fb = 0
// source-out: Fa = 1 - αb, Fb = 0
// etc.
// For blending with source-over compositing:
// Co = αs × (1 - αb) × Cs + αs × αb × B(Cb, Cs) + (1 - αs) × αb × Cb
// Common blend mode functions:
// Multiply: B(Cb, Cs) = Cb × Cs
// Screen: B(Cb, Cs) = Cb + Cs - Cb × Cs
// Overlay: B(Cb, Cs) = HardLight(Cs, Cb) // yes, it's reversed
// Soft-Light: B(Cb, Cs) = complex piecewise function...
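The source-over blending formula above translates almost directly into code. The following is a minimal per-channel sketch, assuming non-premultiplied inputs in the range [0, 1]; the names `blendFns` and `compositeChannel` are mine, chosen for illustration.

```javascript
// Blend functions B(Cb, Cs) for a single color channel in [0, 1].
const blendFns = {
  normal: (cb, cs) => cs,
  multiply: (cb, cs) => cb * cs,
  screen: (cb, cs) => cb + cs - cb * cs,
};

// Source-over compositing combined with a blend function.
// cs/alphaS: source color and alpha; cb/alphaB: backdrop color and alpha.
// Returns the premultiplied result color `co` and the result alpha `ao`.
function compositeChannel(cs, alphaS, cb, alphaB, mode) {
  const B = blendFns[mode];
  const co =
    alphaS * (1 - alphaB) * cs +    // source where there is no backdrop
    alphaS * alphaB * B(cb, cs) +   // blended region where both overlap
    (1 - alphaS) * alphaB * cb;     // backdrop showing through the source
  const ao = alphaS + alphaB * (1 - alphaS);
  return { co, ao };
}

// Opaque 50% gray multiplied onto opaque 50% gray gives 25% gray.
console.log(compositeChannel(0.5, 1, 0.5, 1, "multiply"));
```

With `mode` set to `"normal"` and an opaque backdrop, this reduces to ordinary alpha blending, which is a useful sanity check.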
At the GPU level, "simple" blend modes that map to the fixed-function blend unit are essentially free: the blend unit applies them as fragments are written, with no extra shader work. The standard source-over (normal alpha compositing) is a single blending state configuration:
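// OpenGL equivalent of source-over compositing
// (for non-premultiplied source colors; with premultiplied colors the RGB source factor is GL_ONE)
glEnable(GL_BLEND);
glBlendFuncSeparate(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA,
                    GL_ONE, GL_ONE_MINUS_SRC_ALPHA);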
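But here's where it gets expensive. Most advanced blend modes (overlay, color-dodge, soft-light, and multiply once partial alpha is involved) cannot be expressed as fixed-function blend equations. The hardware blend unit only supports a limited set of operations (add, subtract, reverse-subtract, min, max) with configurable source/destination factors. Screen is a notable exception: with premultiplied colors it reduces to src × (1 - dst) + dst, which the blend unit can compute directly.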
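To see why the blend unit is so restrictive, it helps to emulate it. The sketch below models the "add" equation for a single channel as `src × srcFactor + dst × dstFactor`. The factor names are my own stand-ins for constants like GL_ONE and GL_ONE_MINUS_SRC_ALPHA, and the colors are assumed to be premultiplied.

```javascript
// Hypothetical emulation of the fixed-function blend unit for one channel.
// Each factor is picked from a small fixed menu; no arbitrary math is allowed.
const factors = {
  zero: () => 0,
  one: () => 1,
  srcAlpha: (src) => src.a,
  oneMinusSrcAlpha: (src) => 1 - src.a,
};

function blendUnit(src, dst, srcFactor, dstFactor) {
  const sf = factors[srcFactor](src, dst);
  const df = factors[dstFactor](src, dst);
  return src.c * sf + dst.c * df; // exactly two terms, nothing more
}

// Source-over with a premultiplied source: factors (one, oneMinusSrcAlpha).
const src = { c: 0.25, a: 0.5 }; // color 0.5 at 50% alpha, premultiplied
const dst = { c: 0.8, a: 1.0 };
const out = blendUnit(src, dst, "one", "oneMinusSrcAlpha");
console.log(out); // roughly 0.65
```

Source-over fits because it has two terms. Multiply under partial alpha needs three (see the source-over blending formula earlier), so there is no pair of factors that reproduces it in general.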
To handle advanced blend modes, the GPU (or the browser) has three options:
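Option 1: KHR_blend_equation_advanced (GPU extension)
- An OpenGL / OpenGL ES extension; Vulkan exposes a counterpart as VK_EXT_blend_operation_advanced
- Support varies by vendor and driver, so it cannot be assumed
- Adds hardware blend modes for multiply, screen, overlay, etc.
- Requires a blend barrier (glBlendBarrierKHR) between overlapping draws unless the coherent variant is available
- Fast, but barrier overhead can add up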
Option 2: Dual-source blending trick (limited modes)
- Use fragment shader to output two colors
- Configure blend unit to combine them
- Only works for modes expressible with two linear terms
Option 3: Framebuffer readback in shader
- Make the current framebuffer color readable by the fragment shader (on desktop, typically by copying the affected region to a texture first)
- Compute the blend result manually
- Write back the blended color
- SLOW: breaks the GPU's parallel pipeline model
That SVG animation I mentioned? It was hitting option 3. The mix-blend-mode: multiply on a 1920x1080 element was causing the fragment shader to read back the framebuffer at every pixel, serializing what should have been a parallel operation. Moving the blend mode to a smaller, pre-composited element cut the cost by 98%.
Tile-Based Rendering: Why Your Phone's GPU Is Weird
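Desktop GPUs (NVIDIA, AMD) have traditionally used an Immediate Mode Renderer (IMR) architecture: each triangle is processed completely through the pipeline and its fragments are written directly to the framebuffer in off-chip memory. (Recent desktop parts add some tiled caching on top, but the basic model is still immediate.)
Mobile GPUs (Qualcomm Adreno, ARM Mali, Apple GPU) use tile-based rendering, commonly lumped together as Tile-Based Deferred Rendering (TBDR), although strictly only PowerVR-derived designs such as Apple's defer shading until visibility is known. The viewport is divided into small tiles (typically 16x16 or 32x32 pixels), all geometry is binned into tiles, and each tile is rendered entirely in fast on-chip memory before being written out once.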
IMR (Desktop):
  For each triangle:
    Rasterize → shade fragments → write to framebuffer (off-chip)
  Bandwidth: HIGH (constant framebuffer read/write)
  Parallelism: HIGH (triangles are independent)
TBDR (Mobile):
  Pass 1 - Binning:
    For each triangle:
      Determine which tiles it touches → add to tile's bin
  Pass 2 - Rendering:
    For each tile:
      Load tile into on-chip memory
      For each triangle in this tile's bin:
        Rasterize → shade fragments → write to tile memory (on-chip)
      Write finished tile to framebuffer (off-chip)
  Bandwidth: LOW (one write per tile, not per fragment)
  Latency: HIGHER (extra binning pass)
This matters for browser rendering because blend modes and transparency behave differently on tile-based GPUs. Since all fragments in a tile are processed together, the GPU can resolve blending in on-chip memory without expensive framebuffer readbacks; extensions such as EXT_shader_framebuffer_fetch expose the tile's current color directly to the fragment shader. This is why mix-blend-mode can be faster on mobile than desktop in some cases, which is counterintuitive but true.
However, tile-based rendering hates one thing: overdraw with complex shaders. Every fragment in a tile gets shaded, even if it's eventually occluded by a later fragment. Mali's Forward Pixel Kill and Apple's Hidden Surface Removal try to mitigate this, but complex layer stacks with blending defeat these optimizations because the occluded pixels actually contribute to the final color.
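The binning pass from the TBDR outline above can be sketched on the CPU. This is a deliberately conservative version that bins by bounding box, which is an assumption on my part; real hardware uses tighter coverage tests, and the function name `binTriangles` is hypothetical.

```javascript
// Hypothetical binning pass: assign each triangle to every tile its bounding box overlaps.
function binTriangles(triangles, viewportWidth, viewportHeight, tileSize = 32) {
  const cols = Math.ceil(viewportWidth / tileSize);
  const rows = Math.ceil(viewportHeight / tileSize);
  const bins = Array.from({ length: cols * rows }, () => []);

  triangles.forEach((tri, index) => {
    const xs = tri.map((p) => p.x);
    const ys = tri.map((p) => p.y);
    // Convert the bounding box to tile coordinates, clamped to the viewport.
    const minCol = Math.max(0, Math.floor(Math.min(...xs) / tileSize));
    const maxCol = Math.min(cols - 1, Math.floor(Math.max(...xs) / tileSize));
    const minRow = Math.max(0, Math.floor(Math.min(...ys) / tileSize));
    const maxRow = Math.min(rows - 1, Math.floor(Math.max(...ys) / tileSize));
    for (let row = minRow; row <= maxRow; row++) {
      for (let col = minCol; col <= maxCol; col++) {
        bins[row * cols + col].push(index); // store the triangle index in the tile's bin
      }
    }
  });
  return { bins, cols, rows };
}

// One triangle spanning (10,10)-(50,50) in a 128x128 viewport with 32px tiles.
const result = binTriangles(
  [[{ x: 10, y: 10 }, { x: 50, y: 10 }, { x: 10, y: 50 }]],
  128,
  128
);
console.log(result.bins.filter((bin) => bin.length > 0).length); // 4 tiles touched
```

The rendering pass then walks each bin in order, which is why a large blended element that touches many tiles shows up in many bins.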
Skia: The Browser's 2D Rendering Engine
Chrome, Android, Flutter, and Firefox (partially) all use Skia for 2D rendering. Skia's GPU backend (Ganesh, being replaced by Graphite) translates high-level drawing commands into GPU API calls.
Here's a simplified view of how Skia processes a blend mode draw:
// Simplified pseudocode of the decision Skia makes for a draw with a blend mode.
// Names are illustrative and do not match Skia's source one-to-one.
void GrOpsRenderPass::draw(const GrProgramInfo& programInfo,
                           const GrMesh& mesh) {
    const auto& xp = programInfo.pipeline().getXferProcessor();
    // 1. Check if blend mode is "simple" (can use HW blend)
    if (xp.hasHWBlendEquation()) {
        // Configure fixed-function blend state
        this->setHWBlendState(programInfo.pipeline());
        this->issueDrawCall(mesh);
    }
    // 2. Check for KHR_blend_equation_advanced
    else if (fGpu->caps()->advancedBlendEquationSupport()) {
        this->setAdvancedBlendEquation(xp);
        glBlendBarrierKHR();  // Needed unless the coherent variant is available
        this->issueDrawCall(mesh);
    }
    // 3. Fallback: read dst in shader
    else {
        // Bind current framebuffer contents as texture input to shader
        this->bindDstTexture(programInfo);
        // Fragment shader reads dst, computes blend, writes result
        this->issueDrawCall(mesh);
    }
}
Skia's Graphite backend (the successor to Ganesh) takes a fundamentally different approach: it builds a render graph of the entire frame before submitting any GPU work. This lets it:
- Batch draw calls more aggressively
- Eliminate redundant render target switches
- Schedule GPU work more efficiently on modern APIs (Vulkan, Metal, and WebGPU via Dawn)
I benchmarked Graphite vs Ganesh on a complex compositing workload (50 overlapping layers with various blend modes, 1080p):
Ganesh (GL backend): 8.2 ms/frame (~122 fps)
Graphite (Vulkan): 3.1 ms/frame (~323 fps)
Graphite (Metal M2): 2.4 ms/frame (~417 fps)
The render graph approach is particularly effective for blend modes because it can group draws that use the same blend mode into a single render pass wherever paint order allows, reducing state changes.
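As a rough illustration of the batching idea (not Graphite's actual algorithm, which I have not reproduced here), the sketch below merges consecutive draws that share a blend mode. It only merges adjacent draws, so paint order is preserved. The `batchByBlendMode` name and the draw object shape are assumptions.

```javascript
// Hypothetical batching pass: merge runs of consecutive draws with the same blend mode.
// Blending is order-dependent, so draws are never reordered, only grouped.
function batchByBlendMode(draws) {
  const batches = [];
  for (const draw of draws) {
    const last = batches[batches.length - 1];
    if (last && last.blendMode === draw.blendMode) {
      last.draws.push(draw); // same state as the previous draw: extend the batch
    } else {
      batches.push({ blendMode: draw.blendMode, draws: [draw] }); // state change
    }
  }
  return batches;
}

const batches = batchByBlendMode([
  { id: 1, blendMode: "normal" },
  { id: 2, blendMode: "normal" },
  { id: 3, blendMode: "multiply" },
  { id: 4, blendMode: "multiply" },
  { id: 5, blendMode: "normal" },
]);
console.log(batches.length); // 3
```

A real render graph can go further by proving that two draws do not overlap and reordering them, but that requires bounds information this sketch does not model.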
Vello: The Future Is Compute Shaders
Vello (formerly piet-gpu) takes a radical approach: do everything in compute shaders. No rasterization pipeline, no fixed-function blending. The entire 2D rendering pipeline — path flattening, stroke expansion, tiling, fine rasterization, compositing — runs as a series of compute shader dispatches.
// Vello's pipeline stages (simplified):
// 1. Path encoding: Bézier curves → line segments (flatten)
// 2. Binning: assign path segments to tiles
// 3. Coarse rasterization: build per-tile command lists
// 4. Fine rasterization: execute per-tile commands, output pixels
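// The fine rasterization kernel handles blending inline:
// (WGSL-style pseudocode; the command loop and helpers are not valid WGSL.
//  Assumes 16x16 tiles with one workgroup per tile.)
@compute @workgroup_size(16, 16, 1)
fn fine(@builtin(global_invocation_id) gid: vec3<u32>,
        @builtin(workgroup_id) wg_id: vec3<u32>) {
    let pixel_xy = gid.xy;       // one invocation per pixel
    let tile_xy = wg_id.xy;      // one workgroup per tile
    var rgba = vec4<f32>(0.0);   // per-pixel accumulator, never leaves the shader
    for (cmd in tile_commands[tile_xy]) {
        switch cmd.tag {
            CMD_FILL: {
                let src = compute_coverage(cmd, pixel_xy);
                rgba = blend(src, rgba, cmd.blend_mode);
            }
            CMD_STROKE: { /* similar */ }
            CMD_IMAGE: { /* texture sample + blend */ }
        }
    }
    textureStore(output, pixel_xy, rgba);
}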
The advantage: since Vello controls the entire pipeline, blend modes are just different code paths in the fine rasterization shader. There's no "fast path" vs "slow path": every blend mode reads and writes the same in-shader accumulator, so the cost difference between modes comes down to a few ALU instructions. And because everything runs in compute, there's no state change overhead from switching blend modes between draw calls.
My benchmarks comparing Vello to Skia/Ganesh on path-heavy SVG content (10,000 paths with various blend modes):
Skia Ganesh (GL): 12.4 ms
Skia Graphite (Vk): 5.8 ms
Vello (Vk compute): 2.1 ms
Vello is still experimental and doesn't handle all edge cases (text rendering, complex clip stacks), but for path and blend-heavy content it's already significantly faster.
Practical GPU Profiling for Web Content
When something renders slowly, here's my profiling workflow:
Step 1: Layer inspection
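// In Chrome DevTools Console:
// Log every element whose computed style can promote it to its own compositing layer.
// (To see the resulting layers, enable "Layer borders" in the DevTools Rendering tab.)
document.querySelectorAll('*').forEach(el => {
  const style = getComputedStyle(el);
  if (style.willChange !== 'auto' ||
      style.transform !== 'none' ||
      style.opacity !== '1' ||
      style.mixBlendMode !== 'normal' ||
      style.filter !== 'none') {
    // getAttribute('class') also works for SVG elements, where className is not a string
    console.log(el.tagName, el.getAttribute('class'), {
      willChange: style.willChange,
      transform: style.transform,
      opacity: style.opacity,
      mixBlendMode: style.mixBlendMode,
      filter: style.filter
    });
  }
});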
Step 2: GPU trace capture
# Chrome GPU tracing
google-chrome --enable-gpu-benchmarking \
--enable-tracing=gpu,viz,cc \
--trace-startup-file=gpu_trace.json \
"https://your-page.com"
# Analyze with chrome://tracing or Perfetto
# Look for:
# - "RasterTask" durations > 4ms (slow tiles)
# - "DrawQuad" with blend_mode != NORMAL
# - "GLRenderer::DrawContentQuad" for readback indicators
#   (event names vary by Chrome version; newer builds composite through SkiaRenderer)
Step 3: Framebuffer bandwidth estimation
// Estimate compositing cost based on layer sizes and blend modes.
// Returns bandwidth in GB/s, assuming 4 bytes/pixel at 60 fps.
const EXPENSIVE_BLENDS = ['hue', 'saturation', 'color', 'luminosity'];

function estimateCompositingCost(layers) {
  let totalPixels = 0;
  let readbackPixels = 0;
  layers.forEach(layer => {
    const pixels = layer.width * layer.height;
    totalPixels += pixels;
    // Non-separable blend modes may trigger readback
    if (EXPENSIVE_BLENDS.includes(layer.blendMode)) {
      readbackPixels += pixels;
    }
  });
  const bandwidthGB = (totalPixels * 4 * 60) / 1e9;        // GB/s written
  const readbackGB = (readbackPixels * 4 * 2 * 60) / 1e9;  // GB/s, 2x for read+write
  return { bandwidthGB, readbackGB };
}
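To put numbers on this, here is the same arithmetic worked for a single full-screen 1080p layer. The helper `layerBandwidthGBps` is a hypothetical standalone version of the calculation above, and `passes` is my own parameter: 1 for a plain write, 2 for a read plus a write.

```javascript
// Bandwidth in GB/s for touching every pixel of one layer once per frame.
// passes = 1 for a plain write, 2 for read + write (the readback path).
function layerBandwidthGBps(width, height, { fps = 60, bytesPerPixel = 4, passes = 1 } = {}) {
  return (width * height * bytesPerPixel * passes * fps) / 1e9;
}

const writeOnly = layerBandwidthGBps(1920, 1080);                   // about 0.50 GB/s
const withReadback = layerBandwidthGBps(1920, 1080, { passes: 2 }); // about 1.00 GB/s
console.log(writeOnly.toFixed(2), withReadback.toFixed(2));
```

Half a gigabyte per second for one layer sounds small, but it scales linearly with layer count, and mobile memory buses are shared with the CPU.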
The Compositing Trap
Here's what bit me on that original SVG animation, and what I now see everywhere: developers apply blend modes without realizing the downstream GPU cost. A mix-blend-mode: multiply on a large element doesn't just change how colors combine — it can force the creation of an intermediate render target (offscreen buffer), a framebuffer readback per pixel, and the inability to merge that layer with adjacent layers.
The rule of thumb I use now:
- Separable blend modes (multiply, screen, darken, lighten): Usually handled by a hardware blend path or a cheap shader. Fine to use on reasonably-sized elements.
- Non-separable blend modes (hue, saturation, color, luminosity): Often fall back to expensive shader paths that read the destination. Avoid on large areas.
- Any blend mode on a large element: Forces an intermediate texture allocation proportional to the element's pixel area. Prefer applying blend modes to smaller, pre-composited elements.
- Nested blend modes: Each level of nesting can create an additional intermediate texture. The memory cost multiplies.
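The memory side of the last two rules is easy to estimate. This sketch assumes 4 bytes per pixel and one full-size intermediate texture per nesting level, which is a simplification; `intermediateTextureMiB` is a hypothetical helper, not a browser API.

```javascript
// Rough memory cost of the intermediate textures a blended element may need.
// Assumes RGBA8 (4 bytes/pixel) and one element-sized texture per nesting level.
function intermediateTextureMiB(width, height, nestingDepth = 1, bytesPerPixel = 4) {
  return (width * height * bytesPerPixel * nestingDepth) / (1024 * 1024);
}

const fullScreen = intermediateTextureMiB(1920, 1080);       // about 7.9 MiB
const nestedThree = intermediateTextureMiB(1920, 1080, 3);   // about 23.7 MiB
const smallElement = intermediateTextureMiB(400, 300);       // under 0.5 MiB
console.log(fullScreen.toFixed(1), nestedThree.toFixed(1), smallElement.toFixed(2));
```

The gap between the full-screen and small-element cases is the whole argument for applying blend modes to smaller, pre-composited elements.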
The browser rendering pipeline is one of the most sophisticated pieces of real-time graphics software running on consumer hardware. Understanding it at the GPU level has made me a fundamentally better web developer — not because I'm doing GPU programming, but because I can now predict what's going to be fast and what's going to be painfully slow. And prediction beats profiling when you're making design decisions.