Temporal.io: Replacing Message Queues with Durable Execution for Complex Orchestration
Durable Execution Defined
The core idea behind Temporal is simple to state and profound in practice: your code is persisted. Not just the data, not just the state — the actual execution progress of your function is recorded so that if the process crashes, it can resume exactly where it left off.
Here's a Temporal workflow in TypeScript:
import { proxyActivities, ApplicationFailure } from '@temporalio/workflow';
import type * as activities from './activities';

const { chargePayment, reserveInventory, shipOrder, sendConfirmation, refundPayment, releaseInventory } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: '30s',
    retry: {
      maximumAttempts: 3,
      backoffCoefficient: 2,
      initialInterval: '1s',
    },
  });
export async function processOrder(order: OrderInput): Promise<OrderResult> {
  // Step 1: Reserve inventory
  const reservation = await reserveInventory(order.items);

  // Step 2: Charge payment
  let paymentId: string;
  try {
    paymentId = await chargePayment(order.paymentMethod, order.total);
  } catch (err) {
    // Compensation: release inventory if payment fails
    await releaseInventory(reservation.id);
    throw ApplicationFailure.nonRetryable(
      `Payment failed: ${err instanceof Error ? err.message : String(err)}`
    );
  }

  // Step 3: Ship the order
  let shipmentId: string;
  try {
    shipmentId = await shipOrder(reservation.id, order.shippingAddress);
  } catch (err) {
    // Compensation: refund payment and release inventory
    await refundPayment(paymentId);
    await releaseInventory(reservation.id);
    throw ApplicationFailure.nonRetryable(
      `Shipping failed: ${err instanceof Error ? err.message : String(err)}`
    );
  }

  // Step 4: Send confirmation
  await sendConfirmation(order.email, {
    orderId: order.id,
    paymentId,
    shipmentId,
    items: order.items,
  });

  return { orderId: order.id, paymentId, shipmentId, status: 'completed' };
}
Look at this code. It's just a regular async function with try-catch blocks. No state machines. No serialization of intermediate states to a database. No retry logic wrapped around every external call. No dead-letter queue management.
But here's what Temporal does behind the scenes:
- When reserveInventory is called, Temporal records a "ScheduleActivityTask" event in the workflow's event history
- When the activity completes, an "ActivityTaskCompleted" event is recorded with the result
- If the worker process crashes between steps 2 and 3 (after payment, before shipping), a new worker picks up the workflow
- Temporal replays the workflow function from the beginning, but instead of re-executing reserveInventory, it reads the recorded result from the event history
- The replay reaches the point of failure and continues executing the remaining steps
This is what "durable execution" means. The workflow function appears to run continuously, but it's actually a series of recorded steps that can be replayed on any worker. The await on each activity is a persistence point.
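To make the replay mechanism concrete, here is a toy model of the idea in plain TypeScript. This is not Temporal's implementation — `ReplayContext` and its `history` array are illustrative stand-ins for the event history — but it shows why a replayed workflow gets the same results without re-running side effects:

```typescript
// Toy model of replay — illustrative only, not Temporal's actual code.
// `history` stands in for the recorded event history.
type ActivityFn = () => Promise<string>;

class ReplayContext {
  private step = 0;
  constructor(private history: string[]) {}

  // On replay, recorded results are returned without re-executing the
  // activity; only steps past the end of the history actually run.
  async execute(fn: ActivityFn): Promise<string> {
    if (this.step < this.history.length) {
      return this.history[this.step++]; // replayed from history
    }
    const result = await fn(); // first execution: run for real
    this.history.push(result);
    this.step++;
    return result;
  }
}

async function demo(): Promise<void> {
  let sideEffects = 0;
  const charge = async () => { sideEffects++; return 'payment-1'; };

  // First run: the activity really executes.
  const ctx1 = new ReplayContext([]);
  await ctx1.execute(charge);

  // Replay with recorded history: same result, no re-execution.
  const ctx2 = new ReplayContext(['payment-1']);
  const replayed = await ctx2.execute(charge);
  console.log(replayed, sideEffects); // payment-1 1 — charged only once
}

demo();
```

The real system records far more (timeouts, retries, timers, signals), but the core invariant is the same: an awaited activity is either executed and recorded, or read back from the record.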
The Determinism Constraint
There's a fundamental constraint that makes this work: workflow functions must be deterministic. Given the same inputs and the same activity results, the workflow must make the same sequence of decisions.
This means you cannot do the following inside a workflow function:
// ALL OF THESE WILL BREAK REPLAY

// 1. Non-deterministic time
const now = Date.now(); // Different on replay!
// Use instead: the deterministic time source the SDK provides

// 2. Random numbers
const id = Math.random().toString(36); // Different on replay!
// Use instead: uuid4() from '@temporalio/workflow'

// 3. Direct I/O
const data = await fetch('https://api.example.com/data'); // Side effect!
// Use instead: an activity

// 4. Non-deterministic iteration
const keys = Object.keys(someMap); // Order not guaranteed!
// Use instead: sorted keys or arrays
This constraint initially feels restrictive. After extended use, it reveals itself as a feature. It forces a clean separation between decisions (workflow) and effects (activities). The workflow is pure logic. Activities are where all the messy real-world interaction happens.
Activities are regular functions that can do anything — HTTP calls, database queries, file I/O, third-party API calls. They run on workers, and Temporal handles retries, timeouts, and heartbeating:
// activities.ts — these run on workers, NOT in the workflow sandbox
import { Context } from '@temporalio/activity';

export async function chargePayment(
  method: PaymentMethod,
  amount: number
): Promise<string> {
  // Heartbeat for long operations — lets Temporal know we're alive
  Context.current().heartbeat('charging');
  const result = await stripe.charges.create({
    amount: Math.round(amount * 100),
    currency: 'usd',
    source: method.token,
  });
  return result.id;
}

export async function shipOrder(
  reservationId: string,
  address: ShippingAddress
): Promise<string> {
  Context.current().heartbeat('creating shipment');
  const shipment = await shippingProvider.createShipment({
    reservationId,
    address,
    service: 'standard',
  });
  return shipment.trackingNumber;
}
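One detail worth noting: those heartbeats only matter if the activity options set a heartbeat timeout, since that is what tells the server how long to wait between heartbeats before declaring the worker dead. A plausible configuration sketch (the timeout values here are illustrative, not recommendations):

```typescript
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

// heartbeatTimeout makes the heartbeats meaningful: if a worker stops
// heartbeating for 15s, the activity is considered failed and retried,
// potentially on another worker — long before startToCloseTimeout fires.
const { chargePayment, shipOrder } = proxyActivities<typeof activities>({
  startToCloseTimeout: '5m', // generous overall budget for slow providers
  heartbeatTimeout: '15s',   // detect a crashed worker quickly
  retry: { maximumAttempts: 3 },
});
```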
Saga Pattern: Compensation Done Right
The order processing workflow above implements the saga pattern — a sequence of transactions with compensating actions if any step fails. In a typical RabbitMQ-based system, the saga coordinator is its own service with its own database, tracking the state of each saga instance.
With Temporal, the saga pattern is just try-catch blocks. But for more complex workflows with many steps, a helper that accumulates compensations is useful:
class SagaCompensation {
  private compensations: Array<() => Promise<void>> = [];

  addCompensation(fn: () => Promise<void>): void {
    this.compensations.unshift(fn); // LIFO — compensate in reverse order
  }

  async compensate(): Promise<void> {
    for (const compensation of this.compensations) {
      try {
        await compensation();
      } catch (err) {
        // Log but continue — partial compensation is better than none
        console.error('Compensation failed:', err);
      }
    }
  }
}

export async function processComplexOrder(order: ComplexOrder): Promise<void> {
  const saga = new SagaCompensation();
  try {
    // Step 1
    const reservation = await reserveInventory(order.items);
    saga.addCompensation(() => releaseInventory(reservation.id));

    // Step 2
    const paymentId = await chargePayment(order.payment, order.total);
    saga.addCompensation(() => refundPayment(paymentId));

    // Step 3
    const discountApplied = await applyLoyaltyDiscount(order.customerId, order.total);
    saga.addCompensation(() => revertLoyaltyDiscount(order.customerId, discountApplied));

    // Step 4
    const shipmentId = await shipOrder(reservation.id, order.address);
    saga.addCompensation(() => cancelShipment(shipmentId));

    // Step 5
    await updateOrderStatus(order.id, 'completed');
    await sendConfirmation(order.email, { reservation, paymentId, shipmentId });
  } catch (err) {
    // Run all compensations in reverse order
    await saga.compensate();
    await updateOrderStatus(order.id, 'failed');
    throw err;
  }
}
This is the same pattern a 3,000-line saga coordinator implements, in about 50 lines. And it's reliable — if a worker crashes during compensation, Temporal replays the workflow and the compensation continues from where it left off.
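Because the helper is plain TypeScript with no Temporal dependency, its LIFO behavior is easy to check in isolation. A standalone sketch (the class reproduced here, with activity names as labels):

```typescript
// Standalone check that compensations run in reverse (LIFO) order.
class SagaCompensation {
  private compensations: Array<() => Promise<void>> = [];

  addCompensation(fn: () => Promise<void>): void {
    this.compensations.unshift(fn); // newest first
  }

  async compensate(): Promise<void> {
    for (const fn of this.compensations) {
      try {
        await fn();
      } catch (err) {
        // Partial compensation is better than none
        console.error('Compensation failed:', err);
      }
    }
  }
}

async function demo(): Promise<string[]> {
  const ran: string[] = [];
  const saga = new SagaCompensation();
  saga.addCompensation(async () => { ran.push('releaseInventory'); });
  saga.addCompensation(async () => { ran.push('refundPayment'); });
  saga.addCompensation(async () => { ran.push('cancelShipment'); });
  await saga.compensate();
  return ran; // ['cancelShipment', 'refundPayment', 'releaseInventory']
}
```

Running compensations in reverse of the order they were registered mirrors how nested resources are torn down: the last thing acquired is the first thing undone.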
Signals, Queries, and Human-in-the-Loop
Temporal workflows can receive external signals (events that modify workflow state) and respond to queries (read-only inspection of workflow state). This is incredibly powerful for workflows that need human approval or external input:
import { defineSignal, defineQuery, setHandler, condition } from '@temporalio/workflow';

const approveSignal = defineSignal<[{ approvedBy: string; notes: string }]>('approve');
const rejectSignal = defineSignal<[{ rejectedBy: string; reason: string }]>('reject');
const statusQuery = defineQuery<OrderStatus>('status');

export async function orderWithApproval(order: OrderInput): Promise<OrderResult> {
  let status: OrderStatus = 'pending_approval';
  let approval: { approvedBy: string; notes: string } | null = null;
  let rejection: { rejectedBy: string; reason: string } | null = null;

  // Set up signal handlers
  setHandler(approveSignal, (data) => {
    approval = data;
    status = 'approved';
  });
  setHandler(rejectSignal, (data) => {
    rejection = data;
    status = 'rejected';
  });
  setHandler(statusQuery, () => status);

  // Wait for approval or rejection, with timeout
  const approved = await condition(
    () => approval !== null || rejection !== null,
    '24h' // Timeout after 24 hours
  );

  if (!approved || rejection) {
    status = 'rejected';
    await notifyRejection(order.email, rejection?.reason ?? 'Approval timeout');
    return { orderId: order.id, status: 'rejected' };
  }

  // Proceed with order processing
  status = 'processing';
  // ... rest of the order workflow
}
From the outside, you signal the workflow:
const handle = client.workflow.getHandle('order-12345');
await handle.signal(approveSignal, {
  approvedBy: 'manager@company.com',
  notes: 'Approved for VIP customer',
});
The workflow can sit waiting for approval for hours, days, or weeks. The worker doesn't keep a connection open. It doesn't poll. The workflow state is persisted in Temporal's database. When the signal arrives, Temporal schedules the workflow for execution, a worker picks it up, replays the history, processes the signal, and continues.
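The contract of condition(predicate, timeout) is worth internalizing: it resolves true when the predicate becomes true, or false if the timeout elapses first. Here is a toy polling model of that contract in plain TypeScript — NOT how the SDK implements it (real workflows are event-driven and never busy-poll), just an executable statement of the semantics:

```typescript
// Toy model of condition(predicate, timeout): true if the predicate
// becomes true before the deadline, false on timeout. Illustrative only.
async function conditionModel(
  predicate: () => boolean,
  timeoutMs: number,
  pollMs = 10
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (predicate()) return true;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return predicate(); // one final check at the deadline
}

async function demo(): Promise<[boolean, boolean]> {
  // Case 1: the "signal" arrives before the timeout.
  let approved = false;
  setTimeout(() => { approved = true; }, 30);
  const gotSignal = await conditionModel(() => approved, 200);

  // Case 2: nothing arrives — the wait times out.
  const timedOut = await conditionModel(() => false, 50);
  return [gotSignal, timedOut]; // [true, false]
}
```

In a real workflow the false branch is where you handle the "nobody approved within 24 hours" path, as orderWithApproval does above.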
Production Architecture
A production setup runs Temporal Server on Kubernetes with a PostgreSQL backend:
# Simplified Temporal deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-server
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: temporal-server
          image: temporalio/server:1.25.2
          env:
            - name: DB
              value: postgresql
            - name: DB_PORT
              value: "5432"
            - name: POSTGRES_HOST
              value: temporal-postgres.db.svc.cluster.local
            - name: NUM_HISTORY_SHARDS
              value: "512"
          ports:
            - containerPort: 7233 # gRPC frontend
            - containerPort: 7234 # History service
            - containerPort: 7235 # Matching service
            - containerPort: 7239 # Metrics (Prometheus)
Workers run as separate deployments:
// worker.ts
import { Worker } from '@temporalio/worker';
import * as activities from './activities';

async function run() {
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities,
    taskQueue: 'order-processing',
    maxConcurrentActivityTaskExecutions: 100,
    maxConcurrentWorkflowTaskExecutions: 50,
    stickyQueueScheduleToStartTimeout: '10s',
  });
  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});
Key tuning parameters:
- maxConcurrentActivityTaskExecutions: How many activities can run simultaneously on this worker. Too high and you'll overwhelm downstream services; too low and you're not using the worker's capacity.
- stickyQueueScheduleToStartTimeout: How long Temporal tries to route workflow tasks back to the same worker (for cache efficiency). The default is fine for most cases.
- NUM_HISTORY_SHARDS: The number of history shards on the server side. More shards mean better parallelism for high-throughput scenarios. This is set at cluster creation and can't be changed afterward.
Components Replaced and Measured Results
Before Temporal:
- 7 RabbitMQ queues, 4 dead-letter queues
- 12 consumer services (Node.js)
- 1 saga coordinator service (3,000 lines)
- 1 PostgreSQL database for saga state
- Custom retry logic in every consumer
- Manual compensation scripts for stuck sagas
After Temporal:
- 1 Temporal cluster (3 nodes)
- 3 worker deployments (order, payment, shipping)
- ~800 lines of workflow code
- ~1,200 lines of activity code
- Zero custom retry logic
- Zero manual compensation scripts
The operational impact:
- Stuck/orphaned sagas: Went from ~15/week to zero. Temporal doesn't lose track of workflows.
- Double-charge incidents: Went from ~3/month to zero. The saga pattern with durable execution actually guarantees compensation runs.
- Mean time to diagnose issues: Dropped from 2-3 hours (correlating logs across 12 services) to 5-10 minutes (look at the workflow event history in the Temporal UI).
- On-call pages: Dropped by 70%. Most of the old alerts were for stuck consumers or queue backlogs.
The event history in Temporal's UI is genuinely transformative for debugging. Every activity call, every result, every retry, every signal — it's all there in chronological order. When a customer reports an issue, the workflow can be looked up by order ID, showing exactly what happened, when, and why.
When Not to Use Temporal
Temporal is not the right tool for everything:
Simple request-response APIs: If your endpoint calls a database and returns a result, you don't need a workflow engine. The overhead of recording events and replaying history is wasted.
High-frequency, low-latency operations: Temporal adds ~20-50ms overhead per workflow task. For operations that need sub-millisecond latency, use direct service calls.
Pure event streaming: If you need to broadcast events to many consumers (fan-out), Kafka is better. Temporal is for orchestrating a specific sequence of steps, not broadcasting.
Workflows with huge payloads: Temporal stores activity inputs and outputs in its event history. If your activities pass around 10MB blobs, your history will be enormous. Pass references (S3 keys, database IDs) instead of data.
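The pass-by-reference pattern looks like this in practice. A sketch, with an in-memory Map standing in for S3 or a database (blobStore, generateReport, and emailReport are illustrative names, not a real API):

```typescript
// Sketch of pass-by-reference: activities exchange a small key through
// the workflow instead of the blob itself, so only the reference lands
// in the event history.
const blobStore = new Map<string, Uint8Array>(); // stand-in for S3

interface ReportRef {
  key: string;       // e.g. an S3 object key
  sizeBytes: number;
}

// Activity: writes the large payload to storage, returns only a reference.
async function generateReport(orderId: string): Promise<ReportRef> {
  const blob = new Uint8Array(10 * 1024 * 1024); // a 10MB report
  const key = `reports/${orderId}`;
  blobStore.set(key, blob);
  return { key, sizeBytes: blob.byteLength }; // tiny: only this is recorded
}

// Activity: resolves the reference when it actually needs the bytes.
async function emailReport(ref: ReportRef): Promise<number> {
  const blob = blobStore.get(ref.key);
  if (!blob) throw new Error(`missing blob: ${ref.key}`);
  return blob.byteLength;
}
```

The workflow only ever awaits `generateReport` and passes the returned `ReportRef` to `emailReport`; the 10MB payload never transits the workflow or its history.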
The sweet spot is any multi-step process that spans multiple services, needs retries and compensation, and where correctness matters more than raw throughput. Order processing, payment flows, user onboarding sequences, data pipeline orchestration, approval workflows — these are where Temporal shines.
For a typical order processing pipeline, migration takes about six weeks including testing. More time is typically spent decommissioning old infrastructure than building new workflows. That reveals where the complexity actually lives.