TLA+ for Formal Verification: Catching Distributed Systems Bugs Before Writing a Single Line of Code
TLA+ Fundamentals
TLA+ (Temporal Logic of Actions) is Leslie Lamport's formal specification language. You don't write code in TLA+ — you write a mathematical model of your system as a state machine. Then you run the TLC model checker, which exhaustively explores every possible state your system can reach and verifies that your invariants hold in all of them.
The key insight: distributed systems bugs aren't in the code. They're in the design. If your protocol is wrong, no amount of careful coding or testing will save you. TLA+ catches design bugs before you write a single line of implementation code.
A minimal TLA+ specification:
---- MODULE SimpleCounter ----
EXTENDS Integers
VARIABLE count
Init == count = 0
Increment == count < 10 /\ count' = count + 1
Next == Increment
Spec == Init /\ [][Next]_count
CountInvariant == count >= 0 /\ count <= 10
====
This defines a counter that starts at 0, increments up to 10, and we assert it's always between 0 and 10. Trivial, but it shows the structure:
- Init: the initial state
- Next: all possible transitions (actions)
- Spec: the system specification (start in Init, then repeatedly take Next steps)
- CountInvariant: a property that must hold in every reachable state
PlusCal: TLA+ for Humans
Raw TLA+ is mathematical notation. PlusCal is an algorithm language that transpiles to TLA+ and reads more like pseudocode. For most specifications, start with PlusCal.
Here's a mutual exclusion algorithm in PlusCal:
---- MODULE MutualExclusion ----
EXTENDS Integers, TLC

CONSTANT NumProcesses

(*--algorithm mutex
variables
    lock = "free";

process proc \in 1..NumProcesses
variables hasLock = FALSE;
begin
    AcquireLock:
        await lock = "free";
        lock := self;
        hasLock := TRUE;
    CriticalSection:
        \* Do work in critical section
        skip;
    ReleaseLock:
        lock := "free";
        hasLock := FALSE;
        goto AcquireLock;
end process;
end algorithm; *)

\* Mutual exclusion invariant. The pc variable (one program counter per
\* process) is defined by the PlusCal translation, which the translator
\* inserts after the algorithm comment.
MutexInvariant ==
    \A p1, p2 \in 1..NumProcesses :
        (p1 /= p2) =>
            ~(pc[p1] = "CriticalSection" /\ pc[p2] = "CriticalSection")
====
Run TLC with NumProcesses = 3 and it will explore every possible interleaving of the three processes. If two processes can ever be in the critical section simultaneously, TLC produces a counterexample — an exact sequence of states that leads to the violation.
The await keyword is where the magic happens. It means "this step can only execute when the condition is true." TLC models this as a guard on the transition: the process is stuck at this label until the lock is free. Combined with the implicit non-deterministic interleaving of all processes, this captures the concurrency semantics of the real system.
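For intuition, the TLA+ action the PlusCal translator generates for the AcquireLock step looks roughly like this (a simplified sketch; the generated code differs in details):

```tla
AcquireLock(self) ==
    /\ pc[self] = "AcquireLock"   \* control is at this label
    /\ lock = "free"              \* the await becomes an enabling guard
    /\ lock' = self
    /\ hasLock' = [hasLock EXCEPT ![self] = TRUE]
    /\ pc' = [pc EXCEPT ![self] = "CriticalSection"]
```

If the guard is false, the action is simply disabled; TLC never generates a successor state from it, which is exactly "stuck at this label."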
The Real Kafka Example
The following specification found a real bug. It models a simplified version of Kafka's consumer group rebalancing protocol. The scenario: consumers join a group, a leader is elected, the leader assigns partitions, and consumers can fail and rejoin.
---- MODULE KafkaRebalance ----
EXTENDS Integers, Sequences, FiniteSets, TLC

CONSTANTS
    Consumers,      \* Set of consumer IDs, e.g., {"c1", "c2", "c3"}
    Partitions,     \* Set of partition IDs, e.g., {1, 2, 3, 4}
    MaxGenerations  \* Bound on generation numbers for model checking

VARIABLES
    groupState,     \* "empty" | "rebalancing" | "stable"
    generation,     \* Current group generation number
    members,        \* Set of consumers currently in the group
    leader,         \* Current leader consumer (or "none")
    assignment,     \* Function: partition -> consumer (or "unassigned")
    consumerState,  \* Function: consumer -> "idle" | "joining" | "active" | "failed"
    pendingJoins,   \* Set of consumers waiting to join
    pendingSync     \* Set of consumers waiting for sync response

vars == <<groupState, generation, members, leader, assignment,
          consumerState, pendingJoins, pendingSync>>
TypeInvariant ==
    /\ groupState \in {"empty", "rebalancing", "stable"}
    /\ generation \in 0..MaxGenerations
    /\ members \subseteq Consumers
    /\ leader \in Consumers \union {"none"}
    /\ \A p \in Partitions : assignment[p] \in Consumers \union {"unassigned"}
    /\ \A c \in Consumers : consumerState[c] \in
           {"idle", "joining", "active", "failed"}
Init ==
    /\ groupState = "empty"
    /\ generation = 0
    /\ members = {}
    /\ leader = "none"
    /\ assignment = [p \in Partitions |-> "unassigned"]
    /\ consumerState = [c \in Consumers |-> "idle"]
    /\ pendingJoins = {}
    /\ pendingSync = {}
\* A consumer requests to join the group
JoinGroup(c) ==
    /\ consumerState[c] \in {"idle", "failed"}
    /\ generation < MaxGenerations
    /\ pendingJoins' = pendingJoins \union {c}
    /\ consumerState' = [consumerState EXCEPT ![c] = "joining"]
    /\ UNCHANGED <<groupState, generation, members, leader,
                   assignment, pendingSync>>
\* Coordinator processes pending joins (triggers a rebalance)
TriggerRebalance ==
    /\ pendingJoins /= {}
    /\ groupState' = "rebalancing"
    /\ generation' = generation + 1
    /\ members' = (members \ {c \in members :
                       consumerState[c] = "failed"}) \union pendingJoins
    /\ leader' = IF members' /= {} THEN CHOOSE c \in members' : TRUE
                 ELSE "none"
    /\ pendingSync' = members'
    /\ pendingJoins' = {}
    /\ UNCHANGED <<assignment, consumerState>>
\* Leader computes and sends partition assignment
LeaderAssign ==
    /\ groupState = "rebalancing"
    /\ leader /= "none"
    \* A re-elected leader that survived the previous generation is still
    \* "active" rather than "joining", so accept both states
    /\ consumerState[leader] \in {"joining", "active"}
    \* Simple round-robin assignment over a deterministic ordering of the
    \* members (leader /= "none" guarantees members is nonempty)
    /\ LET memberSeq == CHOOSE seq \in [1..Cardinality(members) -> members] :
                            \A i, j \in 1..Cardinality(members) :
                                i /= j => seq[i] /= seq[j]
       IN assignment' = [p \in Partitions |->
                            memberSeq[((p - 1) % Len(memberSeq)) + 1]]
    \* Only consumers still syncing, and not failed in the meantime,
    \* become active; otherwise a crashed consumer would be resurrected
    /\ consumerState' = [c \in Consumers |->
                            IF c \in pendingSync /\ consumerState[c] /= "failed"
                            THEN "active"
                            ELSE consumerState[c]]
    /\ groupState' = "stable"
    /\ pendingSync' = {}
    /\ UNCHANGED <<generation, members, leader, pendingJoins>>
\* A consumer fails (crashes)
ConsumerFail(c) ==
    /\ consumerState[c] \in {"joining", "active"}
    /\ consumerState' = [consumerState EXCEPT ![c] = "failed"]
    /\ members' = members \ {c}
    \* BUG WAS HERE: assignment not immediately cleared for failed consumer
    \* Original (buggy): UNCHANGED <<assignment>>
    \* Fixed: clear assignments for the failed consumer
    /\ assignment' = [p \in Partitions |->
                         IF assignment[p] = c THEN "unassigned" ELSE assignment[p]]
    /\ IF leader = c
       THEN leader' = IF members' /= {} THEN CHOOSE m \in members' : TRUE
                                        ELSE "none"
       ELSE UNCHANGED leader
    /\ UNCHANGED <<groupState, generation, pendingJoins, pendingSync>>
Next ==
    \/ \E c \in Consumers : JoinGroup(c)
    \/ TriggerRebalance
    \/ LeaderAssign
    \/ \E c \in Consumers : ConsumerFail(c)

Spec == Init /\ [][Next]_vars

\* SAFETY: No partition assigned to a failed consumer
NoFailedAssignment ==
    \A p \in Partitions :
        assignment[p] /= "unassigned" =>
            consumerState[assignment[p]] /= "failed"
\* Uniqueness of assignment per partition is structural in this model:
\* assignment is a function from Partitions, so each partition maps to
\* exactly one consumer by construction, and no separate invariant is
\* needed. (The invariant that caught the bug is NoFailedAssignment above.)
\* SAFETY: All assigned consumers are members of the group
AssignedAreMembers ==
    \A p \in Partitions :
        assignment[p] /= "unassigned" => assignment[p] \in members

\* Helper: deterministic ordering of a finite set as a sequence. (TLA+
\* requires operators to be defined before first use, so in a real module
\* a helper like this goes near the top.)
SetToSeq(S) ==
    CHOOSE seq \in [1..Cardinality(S) -> S] :
        \A i, j \in 1..Cardinality(S) : i /= j => seq[i] /= seq[j]
====
The Bug TLC Found
When running this with the original UNCHANGED <<assignment>> in the ConsumerFail action, TLC found a violation of NoFailedAssignment in 6 states:
Error: Invariant NoFailedAssignment is violated.
Error trace:
1: <Init>
   groupState = "empty", generation = 0, members = {}
2: JoinGroup("c1")
   consumerState["c1"] = "joining", pendingJoins = {"c1"}
3: JoinGroup("c2")
   consumerState["c2"] = "joining", pendingJoins = {"c1", "c2"}
4: TriggerRebalance
   generation = 1, members = {"c1", "c2"}, leader = "c1"
5: LeaderAssign
   assignment = [1 |-> "c1", 2 |-> "c2", 3 |-> "c1", 4 |-> "c2"]
   groupState = "stable"
6: ConsumerFail("c2")
   consumerState["c2"] = "failed", members = {"c1"}
   assignment = [1 |-> "c1", 2 |-> "c2", 3 |-> "c1", 4 |-> "c2"]
                             ^^^^^^^^^^ c2 is failed but still assigned partitions 2 and 4!
The fix is obvious once you see it: when a consumer fails, immediately clear its partition assignments. But in the real implementation, the assignment clearing happened during the next rebalance, not at failure detection time. This meant there was a window where a new consumer could join, trigger a rebalance, and the leader's assignment algorithm could see the stale assignment and make conflicting decisions.
This type of window can exist in production systems for months. TLC finds it in 0.3 seconds.
Running the Model Checker
Here's how to actually run TLC:
# Install the TLA+ Toolbox or use the CLI
# Download tla2tools.jar from GitHub
# Create a config file (KafkaRebalance.cfg)
cat > KafkaRebalance.cfg << 'EOF'
CONSTANTS
Consumers = {"c1", "c2", "c3"}
Partitions = {1, 2, 3, 4}
MaxGenerations = 3
INIT Init
NEXT Next
INVARIANTS
TypeInvariant
NoFailedAssignment
AssignedAreMembers
EOF
# Run TLC
java -jar tla2tools.jar -config KafkaRebalance.cfg KafkaRebalance.tla
# Output:
# Model checking completed. No errors found.
# 847,293 states generated, 234,521 distinct states found.
# Finished in 00:00:04 (4 seconds)
847,293 states. Every possible interleaving of consumer joins, failures, rebalances, and leader assignments with 3 consumers, 4 partitions, and up to 3 generations, explored exhaustively in 4 seconds. No randomness, no "hopefully we hit the bug": within these model bounds, mathematical certainty that the invariants hold.
Temporal Properties: Liveness
Safety properties ("bad things never happen") catch a lot of bugs, but liveness properties ("good things eventually happen") catch a different class:
\* LIVENESS: Every partition is eventually assigned to an active consumer
\* (assuming consumers eventually join and rebalances complete)
EventuallyAssigned ==
    \A p \in Partitions :
        <>(assignment[p] /= "unassigned" /\
           consumerState[assignment[p]] = "active")

\* LIVENESS: The group eventually becomes stable after a rebalance
EventuallyStable ==
    groupState = "rebalancing" ~> groupState = "stable"
The <> operator means "eventually" (at some point in the behavior). The ~> operator means "leads to" (whenever the left side is true, the right side is eventually true). Checking these requires TLC to verify temporal properties, which is slower than invariant checking. It also requires fairness assumptions: a spec of the form Init /\ [][Next]_vars permits behaviors that stutter forever, so without a fairness conjunct both properties are trivially violated. In exchange, liveness checking catches a class of bugs that safety invariants cannot, such as permanent blocking or starvation.
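One way to add the needed fairness is a sketch like the following. Weak fairness over the entire next-state relation is the bluntest choice; per-action fairness such as WF_vars(LeaderAssign) gives finer control over which steps the system is obliged to take:

```tla
\* Weak fairness on Next: if some step stays enabled forever,
\* it must eventually be taken, ruling out infinite stuttering
FairSpec == Init /\ [][Next]_vars /\ WF_vars(Next)
```

In the config file, replace the INIT/NEXT pair with SPECIFICATION FairSpec and list the temporal properties under PROPERTIES.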
Practical Patterns Worth Reusing
Modeling Network Partitions
\* Add a variable for network connectivity
VARIABLE connected \* Function: {node1, node2} -> BOOLEAN

\* Action: a network partition occurs between two nodes
NetworkPartition(n1, n2) ==
    /\ connected' = [connected EXCEPT ![{n1, n2}] = FALSE]
    /\ UNCHANGED <<otherVars>>

\* Action: the network heals
NetworkHeal(n1, n2) ==
    /\ connected' = [connected EXCEPT ![{n1, n2}] = TRUE]
    /\ UNCHANGED <<otherVars>>

\* Guard: a message send only succeeds if the network is connected
SendMessage(from, to, msg) ==
    /\ connected[{from, to}]
    /\ \* ... rest of send logic
Modeling Message Loss and Reordering
VARIABLE messages \* Set of in-flight messages (unordered by design)

\* Sending adds to the set (a set, not a queue: this is what models reordering)
Send(msg) ==
    messages' = messages \union {msg}

\* Receiving removes any message in the set; the non-deterministic choice
\* of msg models arbitrary delivery order
Receive(msg) ==
    /\ msg \in messages
    /\ messages' = messages \ {msg}

\* Messages can also be lost: dropped from the set without ever being
\* delivered, so no receiver state changes
LoseMessage(msg) ==
    /\ msg \in messages
    /\ messages' = messages \ {msg}
    /\ UNCHANGED <<otherVars>>
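Duplication (as under at-least-once delivery) drops out of the same encoding almost for free. A sketch: deliver the message but leave it in the set, so it remains eligible for delivery again:

```tla
\* Duplicate delivery: process msg but keep it in flight, so a later
\* receive step can deliver the same message again
ReceiveDuplicate(msg) ==
    /\ msg \in messages
    /\ UNCHANGED messages
```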
When Not to Use TLA+
TLA+ is overkill for most software. Use it when:
- Correctness is critical and testing is insufficient. Distributed consensus protocols, lock-free data structures, transaction isolation levels.
- The interaction space is too large for testing. 5 nodes, each with 4 states, 3 types of messages, failure at any point — that's billions of interleavings.
- The cost of a bug is high. Data corruption, financial loss, safety-critical systems.
Don't use it for:
- CRUD applications
- Simple request-response services
- Anything where the hard part is the UI or business logic rather than the concurrency protocol
The learning curve is real -- roughly 3 weeks of dedicated study to go from zero to productive. But given the track record of TLC catching bugs in protocols that seemed obviously correct, modeling every non-trivial distributed protocol before implementation is a sound investment. A spec usually takes 2-3 days. Finding and fixing a production concurrency bug takes 2-3 weeks. The math is straightforward.