Leader Election
The previous stages built a solid single-node store. In this stage, you’ll extend it into a 5-node cluster using Raft leader election, so the cluster can survive node failures and always agree on who’s in charge.
Cluster Formation
The test harness starts each node as a Docker container on a shared network and sets two environment variables:
ADDR=10.0.42.101:8080
PEERS=10.0.42.102:8080,10.0.42.103:8080,10.0.42.104:8080,10.0.42.105:8080

ADDR is this node’s own address and ID. PEERS is a comma-separated list of the other nodes’ addresses. Each node listens on port 8080. The cluster is static: membership doesn’t change in this stage.
When PEERS is empty or unset, your node should still work on its own. You can verify this with clstr test --so-far.
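As a minimal sketch of reading this configuration, assuming your node is written in Python: the function name `load_cluster_config` and the optional `env` parameter are illustrative choices, not part of the harness.

```python
import os


def load_cluster_config(env=None):
    """Return (addr, peers) read from the ADDR and PEERS variables.

    peers is an empty list when PEERS is empty or unset, which is the
    single-node case.
    """
    env = os.environ if env is None else env
    addr = env.get("ADDR", "")
    raw_peers = env.get("PEERS", "")
    # Drop empty fragments so "" and trailing commas yield no phantom peers.
    peers = [p.strip() for p in raw_peers.split(",") if p.strip()]
    return addr, peers
```

Passing a plain dict as `env` makes the function easy to exercise without touching the real environment.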
Leader Election
All nodes start as followers. If a follower doesn’t receive heartbeats within the election timeout (randomized between 500-1,000ms), it becomes a candidate and starts an election.
This is more generous than the Raft paper’s suggested 150-300ms election timeout. Each node runs in a Docker container on a shared host, which adds container network latency and scheduling jitter on top of any GC or runtime pauses in your implementation. Higher values give varying implementations room to handle these issues without spurious elections.
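One way to pick and check the randomized timeout is sketched below. The helper names and the use of seconds as the time unit are assumptions made for illustration; `now` and `last_heartbeat_at` are expected to come from a monotonic clock such as `time.monotonic()`.

```python
import random

ELECTION_TIMEOUT_MIN_S = 0.5  # 500ms, the lower bound from this stage
ELECTION_TIMEOUT_MAX_S = 1.0  # 1,000ms, the upper bound from this stage


def next_election_timeout(rng=random):
    """Pick a fresh random timeout, in seconds, within the allowed range."""
    return rng.uniform(ELECTION_TIMEOUT_MIN_S, ELECTION_TIMEOUT_MAX_S)


def election_due(last_heartbeat_at, timeout_s, now):
    """True once no heartbeat has arrived for at least timeout_s seconds."""
    return now - last_heartbeat_at >= timeout_s
```

Drawing a new timeout each time the timer resets, rather than once at startup, keeps nodes from repeatedly timing out together.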
Candidates request votes from other nodes. A candidate becomes leader if it receives votes from a majority ($\lfloor N/2 \rfloor + 1$, where $N$ is the cluster size, e.g., 3 votes in a 5-node cluster). Each node grants at most one vote per term.
Terms act as a logical clock. When a node discovers a higher term, it immediately updates its term and reverts to follower.
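The majority rule and the term check can be sketched as follows. `NodeState` and `observe_term` are hypothetical names; clearing the recorded vote when the term advances follows from the one-vote-per-term rule.

```python
def majority(cluster_size):
    """Smallest number of votes that is more than half the cluster."""
    return cluster_size // 2 + 1


class NodeState:
    """Illustrative in-memory election state for one node."""

    def __init__(self):
        self.term = 0
        self.role = "follower"
        self.voted_for = None

    def observe_term(self, remote_term):
        """Adopt a higher term and revert to follower.

        Returns True when the state changed, so the caller knows it
        must persist the new term before replying.
        """
        if remote_term > self.term:
            self.term = remote_term
            self.role = "follower"
            self.voted_for = None  # the vote belongs to the old term
            return True
        return False
```

Calling `observe_term` on every incoming request and every response keeps a stale leader or candidate from lingering after the cluster has moved on.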
Heartbeats
Leaders send AppendEntries RPC heartbeats every 100ms to maintain authority. The entries array is empty in this stage.
If a follower doesn’t receive heartbeats within the election timeout, it starts a new election.
If a leader fails to receive acknowledgments from a quorum of nodes within 500ms (the minimum election timeout), it steps down to follower and clears its known leader, rather than continuing to serve requests it cannot commit.
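A possible shape for the leader's quorum check is shown below. It assumes the leader records the time of the last successful heartbeat acknowledgment per peer and counts itself toward the quorum; both the bookkeeping and the name `has_quorum` are illustrative.

```python
QUORUM_TIMEOUT_S = 0.5  # 500ms, the minimum election timeout


def has_quorum(last_ack_at, cluster_size, now, window_s=QUORUM_TIMEOUT_S):
    """Check whether a majority has acknowledged the leader recently.

    last_ack_at maps peer address -> time of the last successful
    heartbeat acknowledgment, on the same clock as now.
    """
    recent = sum(1 for t in last_ack_at.values() if now - t <= window_s)
    # Assumption: the leader counts itself as one reachable node.
    return recent + 1 >= cluster_size // 2 + 1
```

When this returns False, the leader would set its role to follower and its known leader to null.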
Client Requests
Leaders handle the same key-value GET, PUT, and DELETE requests from earlier stages.
Followers redirect all requests to the leader with 307 Temporary Redirect and a Location header:
HTTP/1.1 307 Temporary Redirect
Location: http://10.0.42.101:8080/kv/mykey

Nodes that don’t yet know the leader return 503 Service Unavailable.
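The routing decision can be sketched independently of any HTTP framework. `route_request` is a hypothetical helper that returns `None` when the node should serve the request itself; how you wire it into your server is up to you.

```python
def route_request(role, leader, path):
    """Decide how a node answers a client key-value request.

    Returns None when this node is the leader and should handle the
    request, otherwise a (status, headers) pair to send back.
    """
    if role == "leader":
        return None
    if leader is None:
        # No known leader yet: the client should retry later.
        return 503, {}
    return 307, {"Location": f"http://{leader}{path}"}
```

For example, `route_request("follower", "10.0.42.101:8080", "/kv/mykey")` yields the 307 response shown above.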
Storage
Persist currentTerm and votedFor to disk before responding to any RPC that changes them. Use fsync to ensure durability. Without this, a node that restarts could grant a second vote in the same term, or accept RPCs with a stale term, violating Raft’s safety guarantees.
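A minimal sketch of durable persistence, assuming a POSIX filesystem and a JSON state file: the file layout and the names `save_state` and `load_state` are illustrative. It writes to a temporary file, fsyncs it, renames it over the old file, then fsyncs the directory so the rename also survives a crash.

```python
import json
import os


def save_state(path, current_term, voted_for):
    """Durably write currentTerm and votedFor before replying to an RPC."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"currentTerm": current_term, "votedFor": voted_for}, f)
        f.flush()
        os.fsync(f.fileno())  # force file contents to disk
    os.replace(tmp, path)  # atomic swap on POSIX
    # fsync the containing directory so the rename itself is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def load_state(path):
    """Return (currentTerm, votedFor), or (0, None) on first start."""
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except FileNotFoundError:
        return 0, None
    return data["currentTerm"], data["votedFor"]
```

On startup, `load_state` restores the term and vote so a restarted node cannot vote twice in the same term.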
POST /raft/request-vote
Invoked by candidates to gather votes during leader election.
POST /raft/request-vote
Content-Type: application/json

{
  "term": 3,
  "candidate-id": "10.0.42.102:8080",
  "last-log-index": 0,
  "last-log-term": 0
}
----
200
Content-Type: application/json

{
  "term": 3,
  "vote-granted": true
}

- term: candidate’s current term
- candidate-id: the candidate’s own address (from ADDR)
- last-log-index / last-log-term: set to 0 in this stage; no log entries yet
- vote-granted: whether the vote was granted
- response term: the responder’s current term, so a stale candidate can update itself
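The vote decision for this stage might look like the sketch below. It assumes node state is kept in a plain dict with the hypothetical keys `term`, `voted_for`, and `role`, and it skips the log comparison because there are no log entries yet.

```python
def handle_request_vote(state, request):
    """Apply a request-vote RPC to state and return the response body."""
    term = request["term"]
    candidate = request["candidate-id"]
    if term > state["term"]:
        # A higher term always wins: adopt it and become a follower.
        state["term"] = term
        state["voted_for"] = None
        state["role"] = "follower"
    granted = False
    if term == state["term"] and state["voted_for"] in (None, candidate):
        # At most one vote per term; repeating the same vote is allowed.
        state["voted_for"] = candidate
        granted = True
    # Persist term and voted_for here, before the response is sent.
    return {"term": state["term"], "vote-granted": granted}
```

Allowing a repeated vote for the same candidate makes the handler safe when a candidate retries a request whose response was lost.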
POST /raft/append-entries
Used for heartbeats (empty entries) to maintain leader authority.
POST /raft/append-entries
Content-Type: application/json

{
  "term": 3,
  "leader-id": "10.0.42.101:8080",
  "prev-log-index": 0,
  "prev-log-term": 0,
  "entries": [],
  "leader-commit": 0
}
----
200
Content-Type: application/json

{
  "term": 3,
  "success": true
}

- leader-id: the leader’s own address (from ADDR), so followers can redirect clients
- entries: empty in this stage; will carry log entries in later stages
- prev-log-index / prev-log-term: set to 0 in this stage; no log entries yet
- success: true if the follower’s log matched prev-log-index and prev-log-term
- response term: the responder’s current term, so a stale leader can step down
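A heartbeat handler for this stage could be sketched as follows. The state dict and its keys (`term`, `voted_for`, `role`, `leader`, `last_heartbeat_at`) are assumptions for illustration, and the log-matching check is omitted because the log is empty in this stage.

```python
def handle_append_entries(state, request, now):
    """Apply a heartbeat to state and return the response body."""
    term = request["term"]
    if term < state["term"]:
        # Stale leader: reject and report our term so it steps down.
        return {"term": state["term"], "success": False}
    if term > state["term"]:
        state["term"] = term
        state["voted_for"] = None  # persist before replying
    # A valid heartbeat means someone else leads this term.
    state["role"] = "follower"
    state["leader"] = request["leader-id"]
    state["last_heartbeat_at"] = now  # resets the election timer
    return {"term": state["term"], "success": True}
```

Recording `leader-id` here is what lets a follower build the Location header when it redirects clients.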
GET /cluster/info
Returns the node’s current cluster state.
GET /cluster/info
----
200
Content-Type: application/json

{
  "id": "10.0.42.101:8080",
  "role": "leader",
  "term": 3,
  "leader": "10.0.42.101:8080",
  "peers": ["10.0.42.102:8080", "10.0.42.103:8080", "10.0.42.104:8080", "10.0.42.105:8080"]
}

- id: this node’s own address (from ADDR)
- role: this node’s current role, one of leader, candidate, or follower
- term: current term; starts at 0 before any election
- leader: the known leader’s address, or null before an election and after leader failure
- peers: all cluster members except this node, sorted lexicographically
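Building this response body is mostly a matter of sorting the peers and letting the JSON encoder turn a missing leader into `null`. The function name `cluster_info` is an illustrative choice.

```python
import json


def cluster_info(addr, peers, role, term, leader):
    """Build the /cluster/info response body as a dict."""
    return {
        "id": addr,
        "role": role,
        "term": term,
        "leader": leader,  # None serializes to JSON null
        "peers": sorted(peers),  # lexicographic order on the address strings
    }
```

Passing the result through `json.dumps` gives the body to send with `Content-Type: application/json`.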
Testing
Your server will be started as a 5-node cluster, with its own address in ADDR and the other four nodes’ addresses in PEERS. The tests will verify leader election behavior:
$ clstr test leader-election
Testing leader-election: Cluster Elects and Maintains Leader

✓ /cluster/info Returns Cluster State (3ms)
✓ Leader Election Completes (506ms)
✓ Exactly One Leader Per Term (3.04s)
✓ Leader Maintains Authority via Heartbeats (3.02s)
✓ Followers Redirect to Leader (2ms)
✓ Leader Handles KV Operations (15ms)
✓ New Leader Elected After Leader Crash (1.57s)
✓ Partition Enforces Quorum (6.24s)
✓ Leaderless Nodes Return 503 (2ms)
✓ Cluster Converges After Partition Heals (3.07s)
✓ Slow Leader Steps Down and Cluster Re-elects (911ms)
✓ Election Completes Under Packet Loss (1.07s)

PASSED ✓

Run clstr next to advance to the next stage.

If a test fails, clstr shows what went wrong:
$ clstr test leader-election
Testing leader-election: Cluster Elects and Maintains Leader

✓ /cluster/info Returns Cluster State (3ms)
✓ Leader Election Completes (506ms)
✓ Exactly One Leader Per Term (3.04s)
✓ Leader Maintains Authority via Heartbeats (3.02s)
✓ Followers Redirect to Leader (2ms)
✓ Leader Handles KV Operations (15ms)
✓ New Leader Elected After Leader Crash (1.57s)
✗ Partition Enforces Quorum (3.04s)

  GET - 1 of 2 nodes passed (expected all 2)
  http://n1:8080/cluster/info → 200
  Expected field "role": one of [follower, candidate]
  {
    "id": "10.0.42.101:8080",
    "leader": "10.0.42.101:8080",
    "peers": [
      "10.0.42.102:8080",
      "10.0.42.103:8080",
      "10.0.42.104:8080",
      "10.0.42.105:8080"
    ],
    "role": "leader",
    "term": 5
  }

  The minority partition [n1, n2] must not elect a leader. A candidate needs votes from at least 3 nodes; with only n1 and n2 reachable, no election can succeed.

FAILED ✗

Read the guide: https://clstr.io/kv-store/leader-election

Network Partitions
Partition tests cut traffic between groups of nodes using iptables rules inside each container (see the test environment guide). When a node is partitioned, RPCs to it will time out or fail at the TCP level.
Network Impairments
Some tests use tc netem to degrade the network rather than partition it (see the test environment guide). The slow leader test delays the leader’s outgoing traffic past the election timeout, causing followers to time out and hold a new election. The packet loss test introduces 20% random packet loss across all nodes while the leader restarts, verifying that candidates can still collect a majority of votes.
Debugging
Each node’s output is captured throughout the run, including across restarts and partitions (see log capture). Use clstr logs to see all nodes interleaved, or pass node names to filter:
$ clstr logs n1 n4
[n4] +0.000s [START]
[n1] +0.000s [START]
[n4] +0.142s Node started addr=10.0.42.104:8080
[n1] +0.194s Node started addr=10.0.42.101:8080
[**] +0.458s [CLUSTER READY]
[**] +0.458s [TEST: /cluster/info Returns Cluster State]
[**] +0.462s [TEST: Leader Election Completes]
[n4] +0.729s Election timeout, starting election for term 1
[n4] +0.731s Received vote from 10.0.42.101:8080 (2/5)
[n4] +0.732s Received vote from 10.0.42.102:8080 (3/5)
[n4] +0.732s Became leader in term 1
[n1] +0.733s Voted for 10.0.42.104:8080 in term 1
[n1] +0.968s Received heartbeat from 10.0.42.104:8080, following in term 1
[**] +0.969s [TEST: Exactly One Leader Per Term]
[**] +4.014s [TEST: Leader Maintains Authority via Heartbeats]
[**] +7.038s [TEST: Followers Redirect to Leader]
[**] +7.041s [TEST: Leader Handles KV Operations]
[**] +7.057s [TEST: New Leader Elected After Leader Crash]
[n1] +7.059s [KILL]
[n1] +7.948s [START]
[n1] +8.114s Node started addr=10.0.42.101:8080
[n1] +8.142s Received heartbeat from 10.0.42.104:8080, following in term 1
[**] +8.628s [TEST: Partition Enforces Quorum]
[n1] +8.629s [PARTITIONED FROM: n3, n4, n5]
[n4] +8.628s [PARTITIONED FROM: n1, n2]
[n1] +9.299s Election timeout, starting election for term 2
[n1] +9.301s Received vote from 10.0.42.102:8080 (2/5)
[n4] +9.268s Heartbeat to 10.0.42.101:8080 failed
[n4] +9.268s Heartbeat to 10.0.42.102:8080 failed
[n4] +9.268s Maintaining leadership in term 1, majority reachable
[n1] +10.423s Election timeout, starting election for term 3
[n1] +10.425s Received vote from 10.0.42.102:8080 (2/5)
[**] +14.866s [TEST: Leaderless Nodes Return 503]
[**] +14.869s [TEST: Cluster Converges After Partition Heals]
[n1] +14.870s [PARTITION HEALED]
[n4] +14.870s [PARTITION HEALED]
[n1] +15.123s Vote request from 10.0.42.102:8080 for term 12
[n4] +15.124s Voted for 10.0.42.102:8080 in term 12
[n1] +15.125s Voted for 10.0.42.102:8080 in term 12
[n4] +15.125s Received heartbeat from 10.0.42.102:8080, following in term 12
[n1] +15.125s Received heartbeat from 10.0.42.102:8080, following in term 12

Resources
- Raft Visualization by The Secret Lives of Data
- The Raft Consensus Algorithm
- Distributed Systems 6.2: Raft by Martin Kleppmann
- Students’ Guide to Raft by Jon Gjengset
- Database Internals Chapter 10: Leader Election by Alex Petrov