
TLA+ for Formal Verification: Catching Distributed Systems Bugs Before Writing a Single Line of Code

Lucio Durán
Engineering Manager & AI Solutions Architect

TLA+ Fundamentals

TLA+ (Temporal Logic of Actions) is Leslie Lamport's formal specification language. You don't write code in TLA+ — you write a mathematical model of your system as a state machine. Then you run the TLC model checker, which exhaustively explores every possible state your system can reach and verifies that your invariants hold in all of them.

The key insight: distributed systems bugs aren't in the code. They're in the design. If your protocol is wrong, no amount of careful coding or testing will save you. TLA+ catches design bugs before you write a single line of implementation code.

A minimal TLA+ specification:

---- MODULE SimpleCounter ----
EXTENDS Integers

VARIABLE count

Init == count = 0

Increment == count < 10 /\ count' = count + 1

Next == Increment

Spec == Init /\ [][Next]_count

CountInvariant == count >= 0 /\ count <= 10

====

This defines a counter that starts at 0 and increments up to 10, with an invariant asserting it always stays between 0 and 10. Trivial, but it shows the structure:

  • Init: The initial state
  • Next: All possible transitions (actions)
  • Spec: The system specification (init, then repeatedly take Next steps)
  • CountInvariant: A property that must hold in every reachable state
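What TLC does for this spec can be sketched in a few lines of Python. This is not TLA+ tooling, just a toy breadth-first explorer (all names here are mine) that visits every reachable state and checks the invariant in each:

```python
from collections import deque

def check(init, next_states, invariant):
    """Breadth-first search over every reachable state, checking the invariant."""
    seen = {init}
    frontier = deque([init])
    while frontier:
        state = frontier.popleft()
        assert invariant(state), f"Invariant violated in state {state}"
        for succ in next_states(state):
            if succ not in seen:
                seen.add(succ)
                frontier.append(succ)
    return seen

# SimpleCounter: Init == count = 0, Increment == count < 10 /\ count' = count + 1
reachable = check(
    init=0,
    next_states=lambda count: [count + 1] if count < 10 else [],
    invariant=lambda count: 0 <= count <= 10,
)
print(sorted(reachable))  # the 11 states TLC would visit: 0 through 10
```

TLC does essentially this, except a state is a tuple of all variables and the successor relation comes from the Next action.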

PlusCal: TLA+ for Humans

Raw TLA+ is mathematical notation. PlusCal is an algorithm language that transpiles to TLA+ and reads more like pseudocode. For most specifications, start with PlusCal.

Here's a mutual exclusion algorithm in PlusCal:

---- MODULE MutualExclusion ----
EXTENDS Integers, TLC

CONSTANT NumProcesses

(*--algorithm mutex
variables
 lock = "free";

process proc \in 1..NumProcesses
variables hasLock = FALSE;
begin
 AcquireLock:
 await lock = "free";
 lock := self;
 hasLock := TRUE;

 CriticalSection:
 \* Do work in critical section
 skip;

 ReleaseLock:
 lock := "free";
 hasLock := FALSE;

 goto AcquireLock;
end process;
end algorithm; *)

\* The TLA+ translation, which defines the pc variable used below,
\* would appear here; it is omitted for brevity.

\* Mutual exclusion invariant: no two processes at the CriticalSection
\* label at once
MutualExclusion ==
 \A p1, p2 \in 1..NumProcesses :
  (p1 /= p2) =>
   ~(pc[p1] = "CriticalSection" /\ pc[p2] = "CriticalSection")

====

Run TLC with NumProcesses = 3 and it will explore every possible interleaving of the three processes. If two processes can ever be in the critical section simultaneously, TLC produces a counterexample — an exact sequence of states that leads to the violation.

The await keyword is where the magic happens. It means "this step can only execute when the condition is true." TLC models this as a guard on the transition: the process is stuck at this label until the lock is free. Combined with the implicit non-deterministic interleaving of all processes, this captures the scheduling behavior of real concurrent execution.
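To see why exhaustive interleaving matters, here is a toy Python re-creation of the mutex model (my own sketch, not the PlusCal translation): each process's acquire step is atomic and guarded, mirroring await, and a breadth-first search walks every interleaving while checking mutual exclusion:

```python
from collections import deque

N = 3  # NumProcesses

def steps(state):
    """Enabled transitions of each process: models non-deterministic interleaving."""
    lock, pcs = state
    for i in range(N):
        with_pc = lambda label: pcs[:i] + (label,) + pcs[i + 1:]
        if pcs[i] == "AcquireLock" and lock == "free":  # await lock = "free"
            yield (i, with_pc("CriticalSection"))        # lock := self
        elif pcs[i] == "CriticalSection":                # skip (do work)
            yield (lock, with_pc("ReleaseLock"))
        elif pcs[i] == "ReleaseLock":                    # lock := "free"
            yield ("free", with_pc("AcquireLock"))       # goto AcquireLock

init = ("free", ("AcquireLock",) * N)
seen, frontier = {init}, deque([init])
while frontier:
    state = frontier.popleft()
    _, pcs = state
    # MutualExclusion: at most one process at the CriticalSection label
    assert sum(pc == "CriticalSection" for pc in pcs) <= 1, state
    for succ in steps(state):
        if succ not in seen:
            seen.add(succ)
            frontier.append(succ)
print(f"{len(seen)} distinct states explored; mutual exclusion holds in all")
```

If the acquire step were split into a non-atomic check-then-set, the same search would find a state with two processes in the critical section, which is exactly the kind of counterexample TLC reports.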

The Real Kafka Example

The following specification found a real bug. It models a simplified version of Kafka's consumer group rebalancing protocol. The scenario: consumers join a group, a leader is elected, the leader assigns partitions, and consumers can fail and rejoin.

---- MODULE KafkaRebalance ----
EXTENDS Integers, Sequences, FiniteSets, TLC

CONSTANTS
 Consumers, \* Set of consumer IDs, e.g., {"c1", "c2", "c3"}
 Partitions, \* Set of partition IDs, e.g., {1, 2, 3, 4}
 MaxGenerations \* Bound on generation numbers for model checking

VARIABLES
 groupState, \* "empty" | "rebalancing" | "stable"
 generation, \* Current group generation number
 members, \* Set of consumers currently in the group
 leader, \* Current leader consumer (or "none")
 assignment, \* Function: partition -> consumer (or "unassigned")
 consumerState, \* Function: consumer -> "idle" | "joining" | "active" | "failed"
 pendingJoins, \* Set of consumers waiting to join
 pendingSync \* Set of consumers waiting for sync response

vars == <<groupState, generation, members, leader, assignment,
 consumerState, pendingJoins, pendingSync>>

TypeInvariant ==
 /\ groupState \in {"empty", "rebalancing", "stable"}
 /\ generation \in 0..MaxGenerations
 /\ members \subseteq Consumers
 /\ leader \in Consumers \union {"none"}
 /\ \A p \in Partitions : assignment[p] \in Consumers \union {"unassigned"}
 /\ \A c \in Consumers : consumerState[c] \in
 {"idle", "joining", "active", "failed"}

Init ==
 /\ groupState = "empty"
 /\ generation = 0
 /\ members = {}
 /\ leader = "none"
 /\ assignment = [p \in Partitions |-> "unassigned"]
 /\ consumerState = [c \in Consumers |-> "idle"]
 /\ pendingJoins = {}
 /\ pendingSync = {}

\* A consumer requests to join the group
JoinGroup(c) ==
 /\ consumerState[c] \in {"idle", "failed"}
 /\ generation < MaxGenerations
 /\ pendingJoins' = pendingJoins \union {c}
 /\ consumerState' = [consumerState EXCEPT ![c] = "joining"]
 /\ UNCHANGED <<groupState, generation, members, leader,
 assignment, pendingSync>>

\* Coordinator processes pending joins (triggers rebalance)
TriggerRebalance ==
 /\ pendingJoins /= {}
 /\ groupState' = "rebalancing"
 /\ generation' = generation + 1
 /\ members' = (members \ {c \in members :
 consumerState[c] = "failed"}) \union pendingJoins
 /\ leader' = IF members' /= {} THEN CHOOSE c \in members' : TRUE
 ELSE "none"
 /\ pendingSync' = members'
 /\ pendingJoins' = {}
 /\ UNCHANGED <<assignment, consumerState>>

\* Leader computes and sends partition assignment
LeaderAssign ==
 /\ groupState = "rebalancing"
 /\ leader /= "none"
 /\ consumerState[leader] = "joining"
 \* Simple round-robin assignment. Note: members is unchanged by this
 \* action, so we read the unprimed variable, not members'
 /\ LET memberSeq == SetToSeq(members)
        assignFn == [p \in Partitions |->
                      IF members /= {}
                      THEN memberSeq[((p - 1) % Len(memberSeq)) + 1]
                      ELSE "unassigned"]
    IN assignment' = assignFn
 /\ consumerState' = [c \in Consumers |->
      IF c \in pendingSync THEN "active" ELSE consumerState[c]]
 /\ groupState' = "stable"
 /\ pendingSync' = {}
 /\ UNCHANGED <<generation, members, leader, pendingJoins>>

\* A consumer fails (crashes)
ConsumerFail(c) ==
 /\ consumerState[c] \in {"joining", "active"}
 /\ consumerState' = [consumerState EXCEPT ![c] = "failed"]
 /\ members' = members \ {c}
 \* BUG WAS HERE: assignment not immediately cleared for failed consumer
 \* Original (buggy): UNCHANGED <<assignment>>
 \* Fixed: clear assignments for the failed consumer
 /\ assignment' = [p \in Partitions |->
 IF assignment[p] = c THEN "unassigned" ELSE assignment[p]]
 /\ IF leader = c
 THEN leader' = IF members' /= {} THEN CHOOSE m \in members' : TRUE
 ELSE "none"
 ELSE UNCHANGED leader
 /\ UNCHANGED <<groupState, generation, pendingJoins, pendingSync>>

Next ==
 \/ \E c \in Consumers : JoinGroup(c)
 \/ TriggerRebalance
 \/ LeaderAssign
 \/ \E c \in Consumers : ConsumerFail(c)

Spec == Init /\ [][Next]_vars

\* SAFETY: No partition assigned to a failed consumer
NoFailedAssignment ==
 \A p \in Partitions :
 assignment[p] /= "unassigned" =>
 consumerState[assignment[p]] /= "failed"

\* SAFETY: No partition assigned to two different consumers.
\* Because assignment is a function from partitions to consumers, each
\* partition has at most one assignee by construction, so this holds
\* trivially in this model. (The invariant that actually caught the
\* bug is NoFailedAssignment, above.)
UniqueAssignment ==
 \A p \in Partitions :
  \A c1, c2 \in Consumers :
   (assignment[p] = c1 /\ assignment[p] = c2) => c1 = c2

\* SAFETY: All assigned consumers are members of the group
AssignedAreMembers ==
 \A p \in Partitions :
 assignment[p] /= "unassigned" => assignment[p] \in members

\* Helper: order a finite set as a sequence (used by LeaderAssign).
\* In a checkable module this definition must appear before its first
\* use, i.e., above LeaderAssign; TLA+ resolves names top-down.
SetToSeq(S) ==
 CHOOSE seq \in [1..Cardinality(S) -> S] :
  \A i, j \in 1..Cardinality(S) : i /= j => seq[i] /= seq[j]

====
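The round-robin formula in LeaderAssign is easy to sanity-check on its own. A Python equivalent (partitions are 1-based, as in the spec; the function name is mine):

```python
def round_robin(partitions, member_seq):
    """assignFn from LeaderAssign: memberSeq[((p - 1) % Len(memberSeq)) + 1],
    shifted to Python's 0-based indexing."""
    return {p: member_seq[(p - 1) % len(member_seq)] for p in partitions}

print(round_robin([1, 2, 3, 4], ["c1", "c2"]))
# {1: 'c1', 2: 'c2', 3: 'c1', 4: 'c2'}
```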

The Bug TLC Found

When running this with the original UNCHANGED <<assignment>> in the ConsumerFail action, TLC found a violation of NoFailedAssignment in 6 states:

Error: Invariant NoFailedAssignment is violated.

Error trace:
1: <Init>
 groupState = "empty", generation = 0, members = {}

2: JoinGroup("c1")
 consumerState["c1"] = "joining", pendingJoins = {"c1"}

3: JoinGroup("c2")
 consumerState["c2"] = "joining", pendingJoins = {"c1", "c2"}

4: TriggerRebalance
 generation = 1, members = {"c1", "c2"}, leader = "c1"

5: LeaderAssign
 assignment = [1 |-> "c1", 2 |-> "c2", 3 |-> "c1", 4 |-> "c2"]
 groupState = "stable"

6: ConsumerFail("c2")
 consumerState["c2"] = "failed", members = {"c1"}
 assignment = [1 |-> "c1", 2 |-> "c2", 3 |-> "c1", 4 |-> "c2"]
 ^^^^^^^^^^^^^^^^^^^^^^^^ c2 is failed but still assigned partitions 2 and 4!

The fix is obvious once you see it: when a consumer fails, immediately clear its partition assignments. But in the real implementation, the assignment clearing happened during the next rebalance, not at failure detection time. This meant there was a window where a new consumer could join, trigger a rebalance, and the leader's assignment algorithm could see the stale assignment and make conflicting decisions.

This type of window can exist in production systems for months. TLC finds it in 0.3 seconds.
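The window is easy to reproduce outside TLC. This Python toy (my own sketch; the field names mirror the spec, not Kafka's actual implementation) replays states 5 and 6 of the trace with and without the fix:

```python
from copy import deepcopy

def consumer_fail(state, c, clear_assignment):
    """The ConsumerFail action; clear_assignment=False reproduces the bug."""
    state = deepcopy(state)
    state["consumerState"][c] = "failed"
    state["members"].discard(c)
    if clear_assignment:  # the fix: release the failed consumer's partitions
        for p, owner in state["assignment"].items():
            if owner == c:
                state["assignment"][p] = "unassigned"
    return state

def failed_assignments(state):
    """Partitions violating NoFailedAssignment."""
    return [p for p, c in state["assignment"].items()
            if c != "unassigned" and state["consumerState"][c] == "failed"]

# State 5 of the TLC trace: after TriggerRebalance and LeaderAssign
stable = {
    "members": {"c1", "c2"},
    "consumerState": {"c1": "active", "c2": "active"},
    "assignment": {1: "c1", 2: "c2", 3: "c1", 4: "c2"},
}

buggy = consumer_fail(stable, "c2", clear_assignment=False)
fixed = consumer_fail(stable, "c2", clear_assignment=True)
print("buggy:", failed_assignments(buggy))   # c2 still owns partitions 2 and 4
print("fixed:", failed_assignments(fixed))   # nothing assigned to a failed consumer
```

The difference from TLC: this replays one hand-picked sequence, while TLC tries every sequence, which is why it finds the window even when nobody thought to test it.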

Running the Model Checker

Here's how to actually run TLC:

# Install the TLA+ Toolbox or use the CLI
# Download tla2tools.jar from GitHub

# Create a config file (KafkaRebalance.cfg)
cat > KafkaRebalance.cfg << 'EOF'
CONSTANTS
 Consumers = {"c1", "c2", "c3"}
 Partitions = {1, 2, 3, 4}
 MaxGenerations = 3

INIT Init
NEXT Next

INVARIANTS
 TypeInvariant
 NoFailedAssignment
 AssignedAreMembers
EOF

# Run TLC
java -jar tla2tools.jar -config KafkaRebalance.cfg KafkaRebalance.tla

# Output:
# Model checking completed. No errors found.
# 847,293 states generated, 234,521 distinct states found.
# Finished in 00:00:04 (4 seconds)

847,293 states generated, 234,521 of them distinct. Every possible interleaving of consumer joins, failures, rebalances, and leader assignments with 3 consumers, 4 partitions, and up to 3 generations, explored exhaustively in 4 seconds. No randomness, no "hopefully we hit the bug": mathematical certainty that the invariants hold within the model's bounds.

Temporal Properties: Liveness

Safety properties ("bad things never happen") catch a lot of bugs, but liveness properties ("good things eventually happen") catch a different class:

\* LIVENESS: Every partition is eventually assigned to an active consumer
\* (assuming consumers eventually join and rebalances complete)
EventuallyAssigned ==
 \A p \in Partitions :
 <>(assignment[p] /= "unassigned" /\
 consumerState[assignment[p]] = "active")

\* LIVENESS: The group eventually becomes stable after a rebalance
EventuallyStable ==
 groupState = "rebalancing" ~> groupState = "stable"

The <> operator means "eventually" (at some point in the future). The ~> operator means "leads to" (whenever the left side is true, the right side is eventually true). Checking these requires TLC to examine temporal properties, which is slower but catches liveness bugs like permanent blocking or starvation. Note that liveness checks are only meaningful with fairness assumptions in the spec (e.g., Spec == Init /\ [][Next]_vars /\ WF_vars(Next)); without them, a behavior that stutters forever trivially violates any "eventually" property.
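Over a single finite behavior, these operators have simple meanings. Here is a Python sketch of ~> checked against one recorded trace (my illustration; TLC checks all behaviors of the spec, not one trace):

```python
def leads_to(trace, p, q):
    """trace satisfies p ~> q: whenever p holds, q holds at that point or later."""
    return all(any(q(later) for later in trace[i:])
               for i, state in enumerate(trace) if p(state))

# A behavior from the rebalance spec, reduced to its groupState values
trace = ["empty", "rebalancing", "rebalancing", "stable", "rebalancing", "stable"]
is_rebalancing = lambda s: s == "rebalancing"
is_stable = lambda s: s == "stable"
print(leads_to(trace, is_rebalancing, is_stable))   # True: every rebalance completes

stuck = ["empty", "rebalancing", "rebalancing"]     # a rebalance that never finishes
print(leads_to(stuck, is_rebalancing, is_stable))   # False
```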

Practical Patterns Worth Reusing

Modeling Network Partitions

\* Add a variable for network connectivity
VARIABLE connected \* Function: {node1, node2} -> BOOLEAN

\* Action: network partition occurs between two nodes
NetworkPartition(n1, n2) ==
 /\ connected' = [connected EXCEPT ![{n1, n2}] = FALSE]
 /\ UNCHANGED <<otherVars>>

\* Action: network heals
NetworkHeal(n1, n2) ==
 /\ connected' = [connected EXCEPT ![{n1, n2}] = TRUE]
 /\ UNCHANGED <<otherVars>>

\* Guard: message send only succeeds if network is connected
SendMessage(from, to, msg) ==
 /\ connected[{from, to}]
 /\ \* ... rest of send logic
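The same pattern in Python terms (a sketch of mine; frozenset keys stand in for the unordered {n1, n2} pairs of the TLA+ function):

```python
# Connectivity as a map over unordered node pairs, like the connected variable
connected = {frozenset(pair): True
             for pair in [("a", "b"), ("b", "c"), ("a", "c")]}

def network_partition(n1, n2):
    """NetworkPartition action: the link between n1 and n2 goes down."""
    connected[frozenset((n1, n2))] = False

def can_send(frm, to):
    """Guard from SendMessage: delivery only succeeds while the link is up."""
    return connected[frozenset((frm, to))]

network_partition("a", "b")
# a-b is now down in both directions; b-c is still up
print(can_send("a", "b"), can_send("b", "a"), can_send("b", "c"))
```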

Modeling Message Loss and Reordering

VARIABLE messages \* Set of in-flight messages (unordered by design)

\* Sending adds to the set (not a queue — models reordering)
Send(msg) ==
 /\ messages' = messages \union {msg}

\* Receiving removes from the set (non-deterministic choice models reordering)
Receive(msg) ==
 /\ msg \in messages
 /\ messages' = messages \ {msg}
 /\ \* ... plus whatever receiver state the delivery updates

\* Messages can also be lost: the same removal, but with no delivery effect
LoseMessage(msg) ==
 /\ msg \in messages
 /\ messages' = messages \ {msg}
 /\ UNCHANGED <<otherVars>> \* dropped without being delivered
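Why a set rather than a queue captures reordering is easy to see in miniature. This Python sketch (mine, not part of any spec) enumerates every delivery order the set model permits:

```python
def delivery_orders(messages):
    """All orders in which an unordered in-flight set can be received."""
    if not messages:
        return [[]]
    orders = []
    for msg in sorted(messages):          # the non-deterministic Receive choice
        rest = messages - {msg}
        orders += [[msg] + tail for tail in delivery_orders(rest)]
    return orders

print(delivery_orders({"m1", "m2", "m3"}))
# 3! = 6 possible orders; the model checker explores every one
```

A queue would admit exactly one order per send sequence; the set admits all of them, which is the point.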

When Not to Use TLA+

TLA+ is overkill for most software. Use it when:

  1. Correctness is critical and testing is insufficient. Distributed consensus protocols, lock-free data structures, transaction isolation levels.
  2. The interaction space is too large for testing. 5 nodes, each with 4 states, 3 types of messages, failure at any point — that's billions of interleavings.
  3. The cost of a bug is high. Data corruption, financial loss, safety-critical systems.

Don't use it for:

  • CRUD applications
  • Simple request-response services
  • Anything where the hard part is the UI or business logic rather than the concurrency protocol

The learning curve is real: roughly 3 weeks of dedicated study to go from zero to productive. But given TLC's track record of catching bugs in protocols that seemed obviously correct, modeling every non-trivial distributed protocol before implementation is a sound investment. A spec usually takes 2-3 days; finding and fixing a production concurrency bug takes 2-3 weeks. The math is straightforward.

Tags: tla-plus, formal-verification, distributed-systems, model-checking, temporal-logic, kafka, consensus
