Temporal.io: Replacing Message Queues with Durable Execution for Complex Orchestration
Durable Execution Defined
The core idea behind Temporal is simple to state and profound in practice: your code is persisted. Not just the data, not just the state — the actual execution progress of your function is recorded so that if the process crashes, it can resume exactly where it left off.
Here's a Temporal workflow in TypeScript:
import { proxyActivities, ApplicationFailure } from '@temporalio/workflow';
import type * as activities from './activities';

const { chargePayment, reserveInventory, shipOrder, sendConfirmation, refundPayment, releaseInventory } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: '30s',
    retry: {
      maximumAttempts: 3,
      backoffCoefficient: 2,
      initialInterval: '1s',
    },
  });
export async function processOrder(order: OrderInput): Promise<OrderResult> {
  // Step 1: Reserve inventory
  const reservation = await reserveInventory(order.items);

  // Step 2: Charge payment
  let paymentId: string;
  try {
    paymentId = await chargePayment(order.paymentMethod, order.total);
  } catch (err) {
    // Compensation: release inventory if payment fails
    await releaseInventory(reservation.id);
    throw ApplicationFailure.nonRetryable(
      `Payment failed: ${err instanceof Error ? err.message : String(err)}`
    );
  }

  // Step 3: Ship the order
  let shipmentId: string;
  try {
    shipmentId = await shipOrder(reservation.id, order.shippingAddress);
  } catch (err) {
    // Compensation: refund payment and release inventory
    await refundPayment(paymentId);
    await releaseInventory(reservation.id);
    throw ApplicationFailure.nonRetryable(
      `Shipping failed: ${err instanceof Error ? err.message : String(err)}`
    );
  }

  // Step 4: Send confirmation
  await sendConfirmation(order.email, {
    orderId: order.id,
    paymentId,
    shipmentId,
    items: order.items,
  });

  return { orderId: order.id, paymentId, shipmentId, status: 'completed' };
}
Look at this code. It's just a regular async function with try-catch blocks. No state machines. No serialization of intermediate states to a database. No retry logic wrapped around every external call. No dead-letter queue management.
But here's what Temporal does behind the scenes:
- When reserveInventory is called, Temporal records a "ScheduleActivityTask" event in the workflow's event history
- When the activity completes, an "ActivityTaskCompleted" event is recorded with the result
- If the worker process crashes between steps 2 and 3 (after payment, before shipping), a new worker picks up the workflow
- Temporal replays the workflow function from the beginning, but instead of re-executing reserveInventory, it reads the recorded result from the event history
- The replay reaches the point of failure and continues executing the remaining steps
This is what "durable execution" means. The workflow function appears to run continuously, but it's actually a series of recorded steps that can be replayed on any worker. The await on each activity is a persistence point.
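To make the replay mechanism concrete, here is a toy model of the idea in plain TypeScript. This is not Temporal's implementation — `ReplayContext` and its `history` array are illustrative stand-ins for the event history — but it shows why a replayed workflow gets the same results without re-running side effects:

```typescript
// Toy model of replay — illustrative only, not Temporal's actual code.
// `history` stands in for the recorded event history.
type ActivityFn = () => Promise<string>;

class ReplayContext {
  private step = 0;
  constructor(private history: string[]) {}

  // On replay, recorded results are returned without re-executing the
  // activity; only steps past the end of the history actually run.
  async execute(fn: ActivityFn): Promise<string> {
    if (this.step < this.history.length) {
      return this.history[this.step++]; // replayed from history
    }
    const result = await fn(); // first execution: run for real
    this.history.push(result);
    this.step++;
    return result;
  }
}

async function demo(): Promise<void> {
  let sideEffects = 0;
  const charge = async () => { sideEffects++; return 'payment-1'; };

  // First run: the activity really executes.
  const ctx1 = new ReplayContext([]);
  await ctx1.execute(charge);

  // Replay with recorded history: same result, no re-execution.
  const ctx2 = new ReplayContext(['payment-1']);
  const replayed = await ctx2.execute(charge);
  console.log(replayed, sideEffects); // payment-1 1 — charged only once
}

demo();
```

The real system records far more (timeouts, retries, timers, signals), but the core invariant is the same: an awaited activity is either executed and recorded, or read back from the record.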
The Determinism Constraint
There's a fundamental constraint that makes this work: workflow functions must be deterministic. Given the same inputs and the same activity results, the workflow must make the same sequence of decisions.
This means you cannot do the following inside a workflow function:
// ALL OF THESE WILL BREAK REPLAY

// 1. Non-deterministic time
const now = Date.now(); // Different on replay!
// Use instead: the deterministic time source the SDK provides

// 2. Random numbers
const id = Math.random().toString(36); // Different on replay!
// Use instead: uuid4() from '@temporalio/workflow'

// 3. Direct I/O
const data = await fetch('https://api.example.com/data'); // Side effect!
// Use instead: an activity

// 4. Non-deterministic iteration
const keys = Object.keys(someMap); // Order not guaranteed!
// Use instead: sorted keys or arrays
This constraint initially feels restrictive. After extended use, it reveals itself as a feature. It forces a clean separation between decisions (workflow) and effects (activities). The workflow is pure logic. Activities are where all the messy real-world interaction happens.
Activities are regular functions that can do anything — HTTP calls, database queries, file I/O, third-party API calls. They run on workers, and Temporal handles retries, timeouts, and heartbeating:
// activities.ts — these run on workers, NOT in the workflow sandbox
import { Context } from '@temporalio/activity';

export async function chargePayment(
  method: PaymentMethod,
  amount: number
): Promise<string> {
  // Heartbeat for long operations — lets Temporal know we're alive
  Context.current().heartbeat('charging');
  const result = await stripe.charges.create({
    amount: Math.round(amount * 100),
    currency: 'usd',
    source: method.token,
  });
  return result.id;
}

export async function shipOrder(
  reservationId: string,
  address: ShippingAddress
): Promise<string> {
  Context.current().heartbeat('creating shipment');
  const shipment = await shippingProvider.createShipment({
    reservationId,
    address,
    service: 'standard',
  });
  return shipment.trackingNumber;
}
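One detail worth noting: those heartbeats only matter if the activity options set a heartbeat timeout, since that is what tells the server how long to wait between heartbeats before declaring the worker dead. A plausible configuration sketch (the timeout values here are illustrative, not recommendations):

```typescript
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';

// heartbeatTimeout makes the heartbeats meaningful: if a worker stops
// heartbeating for 15s, the activity is considered failed and retried,
// potentially on another worker — long before startToCloseTimeout fires.
const { chargePayment, shipOrder } = proxyActivities<typeof activities>({
  startToCloseTimeout: '5m', // generous overall budget for slow providers
  heartbeatTimeout: '15s',   // detect a crashed worker quickly
  retry: { maximumAttempts: 3 },
});
```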
Saga Pattern: Compensation Done Right
The order processing workflow above implements the saga pattern — a sequence of transactions with compensating actions if any step fails. In a typical RabbitMQ-based system, the saga coordinator is its own service with its own database, tracking the state of each saga instance.
With Temporal, the saga pattern is just try-catch blocks. But for more complex workflows with many steps, a helper that accumulates compensations is useful:
class SagaCompensation {
  private compensations: Array<() => Promise<void>> = [];

  addCompensation(fn: () => Promise<void>): void {
    this.compensations.unshift(fn); // LIFO — compensate in reverse order
  }

  async compensate(): Promise<void> {
    for (const compensation of this.compensations) {
      try {
        await compensation();
      } catch (err) {
        // Log but continue — partial compensation is better than none
        console.error('Compensation failed:', err);
      }
    }
  }
}

export async function processComplexOrder(order: ComplexOrder): Promise<void> {
  const saga = new SagaCompensation();
  try {
    // Step 1
    const reservation = await reserveInventory(order.items);
    saga.addCompensation(() => releaseInventory(reservation.id));

    // Step 2
    const paymentId = await chargePayment(order.payment, order.total);
    saga.addCompensation(() => refundPayment(paymentId));

    // Step 3
    const discountApplied = await applyLoyaltyDiscount(order.customerId, order.total);
    saga.addCompensation(() => revertLoyaltyDiscount(order.customerId, discountApplied));

    // Step 4
    const shipmentId = await shipOrder(reservation.id, order.address);
    saga.addCompensation(() => cancelShipment(shipmentId));

    // Step 5
    await updateOrderStatus(order.id, 'completed');
    await sendConfirmation(order.email, { reservation, paymentId, shipmentId });
  } catch (err) {
    // Run all compensations in reverse order
    await saga.compensate();
    await updateOrderStatus(order.id, 'failed');
    throw err;
  }
}
This is the same pattern a 3,000-line saga coordinator implements, in about 50 lines. And it's reliable — if a worker crashes during compensation, Temporal replays the workflow and the compensation continues from where it left off.
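Because the helper is plain TypeScript with no Temporal dependency, its LIFO behavior is easy to check in isolation. A standalone sketch (the class reproduced here, with activity names as labels):

```typescript
// Standalone check that compensations run in reverse (LIFO) order.
class SagaCompensation {
  private compensations: Array<() => Promise<void>> = [];

  addCompensation(fn: () => Promise<void>): void {
    this.compensations.unshift(fn); // newest first
  }

  async compensate(): Promise<void> {
    for (const fn of this.compensations) {
      try {
        await fn();
      } catch (err) {
        // Partial compensation is better than none
        console.error('Compensation failed:', err);
      }
    }
  }
}

async function demo(): Promise<string[]> {
  const ran: string[] = [];
  const saga = new SagaCompensation();
  saga.addCompensation(async () => { ran.push('releaseInventory'); });
  saga.addCompensation(async () => { ran.push('refundPayment'); });
  saga.addCompensation(async () => { ran.push('cancelShipment'); });
  await saga.compensate();
  return ran; // ['cancelShipment', 'refundPayment', 'releaseInventory']
}
```

Running compensations in reverse of the order they were registered mirrors how nested resources are torn down: the last thing acquired is the first thing undone.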
Signals, Queries, and Human-in-the-Loop
Temporal workflows can receive external signals (events that modify workflow state) and respond to queries (read-only inspection of workflow state). This is incredibly powerful for workflows that need human approval or external input:
import { defineSignal, defineQuery, setHandler, condition } from '@temporalio/workflow';

const approveSignal = defineSignal<[{ approvedBy: string; notes: string }]>('approve');
const rejectSignal = defineSignal<[{ rejectedBy: string; reason: string }]>('reject');
const statusQuery = defineQuery<OrderStatus>('status');

export async function orderWithApproval(order: OrderInput): Promise<OrderResult> {
  let status: OrderStatus = 'pending_approval';
  let approval: { approvedBy: string; notes: string } | null = null;
  let rejection: { rejectedBy: string; reason: string } | null = null;

  // Set up signal handlers
  setHandler(approveSignal, (data) => {
    approval = data;
    status = 'approved';
  });
  setHandler(rejectSignal, (data) => {
    rejection = data;
    status = 'rejected';
  });
  setHandler(statusQuery, () => status);

  // Wait for approval or rejection, with timeout
  const approved = await condition(
    () => approval !== null || rejection !== null,
    '24h' // Timeout after 24 hours
  );

  if (!approved || rejection) {
    status = 'rejected';
    await notifyRejection(order.email, rejection?.reason ?? 'Approval timeout');
    return { orderId: order.id, status: 'rejected' };
  }

  // Proceed with order processing
  status = 'processing';
  // ... rest of the order workflow
}
From the outside, you signal the workflow:
const handle = client.workflow.getHandle('order-12345');
await handle.signal(approveSignal, {
  approvedBy: 'manager@company.com',
  notes: 'Approved for VIP customer',
});
The workflow can sit waiting for approval for hours, days, or weeks. The worker doesn't keep a connection open. It doesn't poll. The workflow state is persisted in Temporal's database. When the signal arrives, Temporal schedules the workflow for execution, a worker picks it up, replays the history, processes the signal, and continues.
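The contract of condition(predicate, timeout) is worth internalizing: it resolves true when the predicate becomes true, or false if the timeout elapses first. Here is a toy polling model of that contract in plain TypeScript — NOT how the SDK implements it (real workflows are event-driven and never busy-poll), just an executable statement of the semantics:

```typescript
// Toy model of condition(predicate, timeout): true if the predicate
// becomes true before the deadline, false on timeout. Illustrative only.
async function conditionModel(
  predicate: () => boolean,
  timeoutMs: number,
  pollMs = 10
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (predicate()) return true;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  return predicate(); // one final check at the deadline
}

async function demo(): Promise<[boolean, boolean]> {
  // Case 1: the "signal" arrives before the timeout.
  let approved = false;
  setTimeout(() => { approved = true; }, 30);
  const gotSignal = await conditionModel(() => approved, 200);

  // Case 2: nothing arrives — the wait times out.
  const timedOut = await conditionModel(() => false, 50);
  return [gotSignal, timedOut]; // [true, false]
}
```

In a real workflow the false branch is where you handle the "nobody approved within 24 hours" path, as orderWithApproval does above.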
Production Architecture
A production setup runs Temporal Server on Kubernetes with a PostgreSQL backend:
# Simplified Temporal deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: temporal-server
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: temporal-server
          image: temporalio/server:1.25.2
          env:
            - name: DB
              value: postgresql
            - name: DB_PORT
              value: "5432"
            - name: POSTGRES_HOST
              value: temporal-postgres.db.svc.cluster.local
            - name: NUM_HISTORY_SHARDS
              value: "512"
          ports:
            - containerPort: 7233 # gRPC frontend
            - containerPort: 7234 # History service
            - containerPort: 7235 # Matching service
            - containerPort: 7239 # Metrics (Prometheus)
Workers run as separate deployments:
// worker.ts
import { Worker } from '@temporalio/worker';
import * as activities from './activities';

async function run() {
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'),
    activities,
    taskQueue: 'order-processing',
    maxConcurrentActivityTaskExecutions: 100,
    maxConcurrentWorkflowTaskExecutions: 50,
    stickyQueueScheduleToStartTimeout: '10s',
  });
  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});
Key tuning parameters:
- maxConcurrentActivityTaskExecutions: How many activities can run simultaneously on this worker. Too high and you'll overwhelm downstream services; too low and you're not using the worker's capacity.
- stickyQueueScheduleToStartTimeout: How long Temporal tries to route workflow tasks back to the same worker (for cache efficiency). The default is fine for most cases.
- NUM_HISTORY_SHARDS: The number of history shards on the server side. More shards mean better parallelism for high-throughput scenarios. This is set at cluster creation and can't be changed afterward.
Components Replaced and Measured Results
Before Temporal:
- 7 RabbitMQ queues, 4 dead-letter queues
- 12 consumer services (Node.js)
- 1 saga coordinator service (3,000 lines)
- 1 PostgreSQL database for saga state
- Custom retry logic in every consumer
- Manual compensation scripts for stuck sagas
After Temporal:
- 1 Temporal cluster (3 nodes)
- 3 worker deployments (order, payment, shipping)
- ~800 lines of workflow code
- ~1,200 lines of activity code
- Zero custom retry logic
- Zero manual compensation scripts
The operational impact:
- Stuck/orphaned sagas: Went from ~15/week to zero. Temporal doesn't lose track of workflows.
- Double-charge incidents: Went from ~3/month to zero. The saga pattern with durable execution actually guarantees compensation runs.
- Mean time to diagnose issues: Dropped from 2-3 hours (correlating logs across 12 services) to 5-10 minutes (look at the workflow event history in the Temporal UI).
- On-call pages: Dropped by 70%. Most of the old alerts were for stuck consumers or queue backlogs.
The event history in Temporal's UI is genuinely transformative for debugging. Every activity call, every result, every retry, every signal — it's all there in chronological order. When a customer reports an issue, the workflow can be looked up by order ID, showing exactly what happened, when, and why.
When Not to Use Temporal
Temporal is not the right tool for everything:
Simple request-response APIs: If your endpoint calls a database and returns a result, you don't need a workflow engine. The overhead of recording events and replaying history is wasted.
High-frequency, low-latency operations: Temporal adds ~20-50ms overhead per workflow task. For operations that need sub-millisecond latency, use direct service calls.
Pure event streaming: If you need to broadcast events to many consumers (fan-out), Kafka is better. Temporal is for orchestrating a specific sequence of steps, not broadcasting.
Workflows with huge payloads: Temporal stores activity inputs and outputs in its event history. If your activities pass around 10MB blobs, your history will be enormous. Pass references (S3 keys, database IDs) instead of data.
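The pass-by-reference pattern looks like this in practice. A sketch, with an in-memory Map standing in for S3 or a database (blobStore, generateReport, and emailReport are illustrative names, not a real API):

```typescript
// Sketch of pass-by-reference: activities exchange a small key through
// the workflow instead of the blob itself, so only the reference lands
// in the event history.
const blobStore = new Map<string, Uint8Array>(); // stand-in for S3

interface ReportRef {
  key: string;       // e.g. an S3 object key
  sizeBytes: number;
}

// Activity: writes the large payload to storage, returns only a reference.
async function generateReport(orderId: string): Promise<ReportRef> {
  const blob = new Uint8Array(10 * 1024 * 1024); // a 10MB report
  const key = `reports/${orderId}`;
  blobStore.set(key, blob);
  return { key, sizeBytes: blob.byteLength }; // tiny: only this is recorded
}

// Activity: resolves the reference when it actually needs the bytes.
async function emailReport(ref: ReportRef): Promise<number> {
  const blob = blobStore.get(ref.key);
  if (!blob) throw new Error(`missing blob: ${ref.key}`);
  return blob.byteLength;
}
```

The workflow only ever awaits `generateReport` and passes the returned `ReportRef` to `emailReport`; the 10MB payload never transits the workflow or its history.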
The sweet spot is any multi-step process that spans multiple services, needs retries and compensation, and where correctness matters more than raw throughput. Order processing, payment flows, user onboarding sequences, data pipeline orchestration, approval workflows — these are where Temporal shines.
For a typical order processing pipeline, migration takes about six weeks including testing. More time is typically spent decommissioning old infrastructure than building new workflows. That reveals where the complexity actually lives.