
Leader Election

The previous stages built a solid single-node store. In this stage, you’ll extend it into a 5-node cluster using Raft leader election, so the cluster can survive node failures and always agree on who’s in charge.

The test harness starts each node as a Docker container on a shared network and sets two environment variables:

ADDR=10.0.42.101:8080
PEERS=10.0.42.102:8080,10.0.42.103:8080,10.0.42.104:8080,10.0.42.105:8080

ADDR is this node’s own address and ID. PEERS is a comma-separated list of the other nodes’ addresses. Each node listens on port 8080. The cluster is static: membership doesn’t change in this stage.

When PEERS is empty or unset, your node should still work on its own. You can verify this with clstr test --so-far.
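As a rough illustration of the configuration described above, the sketch below reads ADDR and PEERS and treats an empty or unset PEERS as single-node mode. The function name `load_cluster_config` is hypothetical, and falling back to an empty string when ADDR is missing is an assumption, not something the harness specifies.

```python
import os


def load_cluster_config(env=os.environ):
    """Return (addr, peers) parsed from the ADDR and PEERS variables."""
    # Assumption: a missing ADDR falls back to "" rather than raising.
    addr = env.get("ADDR", "")
    raw_peers = env.get("PEERS", "")
    # Empty or unset PEERS yields an empty list, i.e. single-node mode.
    # Sorting here keeps the list ready for /cluster/info, which expects
    # peers in lexicographic order.
    peers = sorted(p.strip() for p in raw_peers.split(",") if p.strip())
    return addr, peers
```

With no peers, the cluster size is 1, so the node is its own majority and can serve requests alone.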

All nodes start as followers. If a follower doesn’t receive a heartbeat within its election timeout (randomized between 500 and 1,000 ms), it becomes a candidate and starts an election.

This is more generous than the Raft paper’s suggested 150-300ms election timeout. Each node runs in a Docker container on a shared host, which adds container network latency and scheduling jitter on top of any GC or runtime pauses in your implementation. Higher values give varying implementations room to handle these issues without spurious elections.
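One possible way to pick the randomized timeout is sketched below. The constant and function names are hypothetical; only the 500 to 1,000 ms range comes from this guide. Drawing a fresh value every time the timer is reset, rather than once at startup, is the usual way to keep nodes from repeatedly timing out together.

```python
import random

ELECTION_TIMEOUT_MIN = 0.5  # seconds (500 ms)
ELECTION_TIMEOUT_MAX = 1.0  # seconds (1,000 ms)


def next_election_timeout(rng=random):
    """Return a fresh election timeout in seconds, uniformly distributed."""
    return rng.uniform(ELECTION_TIMEOUT_MIN, ELECTION_TIMEOUT_MAX)
```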

Candidates request votes from other nodes. A candidate becomes leader if it receives votes from a majority: $\lceil\frac{n+1}{2}\rceil$, where $n$ is the cluster size (e.g., 3 votes in a 5-node cluster). Each node grants at most one vote per term.
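The majority rule can be written down directly. In this sketch the function names are hypothetical, and `votes_received` is assumed to include the candidate’s vote for itself, as in standard Raft.

```python
import math


def majority(cluster_size):
    """Smallest number of votes that is more than half the cluster."""
    return math.ceil((cluster_size + 1) / 2)


def wins_election(votes_received, cluster_size):
    # votes_received counts the candidate's own vote.
    return votes_received >= majority(cluster_size)
```

For a 5-node cluster, `majority(5)` is 3, so a candidate needs two peers to vote for it in addition to itself.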

Terms act as a logical clock. When a node discovers a higher term, it immediately updates its term and reverts to follower.
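A minimal sketch of the higher-term rule follows. `NodeState` and `observe_term` are hypothetical names. Clearing the recorded vote on a term change is standard Raft behavior, since a vote belongs to a single term. A real node would also persist the new term before replying to the RPC that carried it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeState:
    current_term: int = 0
    voted_for: Optional[str] = None
    role: str = "follower"


def observe_term(state, term):
    """Apply the higher-term rule; return True if the node stepped down."""
    if term > state.current_term:
        state.current_term = term
        state.voted_for = None  # no vote cast yet in the new term
        state.role = "follower"
        return True
    return False
```

Calling this on the term of every incoming request and every RPC response covers both leaders and candidates that have fallen behind.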

Leaders send AppendEntries RPC heartbeats every 100ms to maintain authority. The entries array is empty in this stage.

If a follower doesn’t receive heartbeats within the election timeout, it starts a new election.

If a leader fails to receive acknowledgments from a quorum of nodes within 500ms (the minimum election timeout), it steps down to follower and clears its known leader, rather than continuing to serve requests it cannot commit.
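The quorum check could look something like the sketch below. It assumes the leader records the time of each peer’s last successful heartbeat reply and counts itself toward the quorum. Both are illustrative choices, and `has_quorum` and `last_ack` are hypothetical names.

```python
QUORUM_WINDOW = 0.5  # seconds, the minimum election timeout


def has_quorum(last_ack, cluster_size, now, window=QUORUM_WINDOW):
    """Return True if a majority of the cluster has acknowledged recently.

    last_ack maps peer address -> time of its last successful reply,
    on the same clock as `now` (for example time.monotonic()).
    """
    recent = sum(1 for t in last_ack.values() if now - t <= window)
    # Assumption: the leader counts itself as one acknowledged node.
    return recent + 1 >= cluster_size // 2 + 1
```

A leader would run this check periodically and, when it returns False, switch its role to follower and set its known leader to null.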

Leaders handle the same key-value GET, PUT, and DELETE requests from earlier stages.

Followers redirect all requests to the leader with 307 Temporary Redirect and a Location header:

HTTP/1.1 307 Temporary Redirect
Location: http://10.0.42.101:8080/kv/mykey

Nodes that don’t yet know the leader return 503 Service Unavailable.
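The routing rules above fit in a few lines. This sketch uses a hypothetical helper, `route_client_request`, that returns `None` when the node is the leader and should serve the request itself, and otherwise returns a status code with headers.

```python
def route_client_request(role, leader, path):
    """Decide how a node answers a client request for `path`.

    Returns None if this node is the leader, else (status, headers).
    """
    if role == "leader":
        return None
    if leader is None:
        # No known leader yet (or the leader was lost): 503.
        return 503, {}
    # Redirect to the same path on the known leader.
    return 307, {"Location": f"http://{leader}{path}"}
```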

Persist currentTerm and votedFor to disk before responding to any RPC that changes them. Use fsync to ensure durability. Without this, a node that restarts could grant a second vote in the same term, or accept RPCs with a stale term, violating Raft’s safety guarantees.
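One common way to get this durability is to write to a temporary file, fsync it, and rename it over the old file. The sketch below assumes a POSIX filesystem (as in a Linux container) and a JSON file layout. The file format and the names `save_state` and `load_state` are illustrative, not something the tests require.

```python
import json
import os


def save_state(path, current_term, voted_for):
    """Durably write currentTerm and votedFor, replacing any old file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"currentTerm": current_term, "votedFor": voted_for}, f)
        f.flush()
        os.fsync(f.fileno())  # force file contents to disk
    os.replace(tmp, path)  # atomic rename on POSIX
    # fsync the directory so the rename itself survives a crash.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def load_state(path):
    """Return (currentTerm, votedFor), or (0, None) on first start."""
    try:
        with open(path) as f:
            data = json.load(f)
        return data["currentTerm"], data["votedFor"]
    except FileNotFoundError:
        return 0, None
```

The RPC handler would call `save_state` after updating its in-memory state and before sending the response.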

The RequestVote RPC is invoked by candidates to gather votes during leader election.

POST /raft/request-vote
Content-Type: application/json
{
"term": 3,
"candidate-id": "10.0.42.102:8080",
"last-log-index": 0,
"last-log-term": 0
}
----
200
Content-Type: application/json
{
"term": 3,
"vote-granted": true
}
  • term: candidate’s current term
  • candidate-id: the candidate’s own address (from ADDR)
  • last-log-index / last-log-term: set to 0 in this stage; no log entries yet
  • vote-granted: whether the vote was granted
  • response term: the responder’s current term, so a stale candidate can update itself
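A receiver for this RPC might be sketched as follows. The state is kept in a plain dict with hypothetical keys. The log up-to-date comparison from the Raft paper is left out because there are no log entries in this stage, and persistence is only marked with a comment.

```python
def handle_request_vote(state, req):
    """Process a RequestVote body and return the JSON response body.

    state is a dict with "current_term", "voted_for", and "role".
    """
    if req["term"] > state["current_term"]:
        # Higher term seen: adopt it and revert to follower.
        state["current_term"] = req["term"]
        state["voted_for"] = None
        state["role"] = "follower"
    # Grant only for the current term, and only if no vote has gone to
    # a different candidate in this term.
    granted = (
        req["term"] == state["current_term"]
        and state["voted_for"] in (None, req["candidate-id"])
    )
    if granted:
        state["voted_for"] = req["candidate-id"]
    # A real node persists current_term and voted_for here, before replying.
    return {"term": state["current_term"], "vote-granted": granted}
```

Granting again to the same candidate in the same term is safe and lets a candidate retry a request whose response was lost.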

The AppendEntries RPC is used for heartbeats (empty entries) to maintain leader authority.

POST /raft/append-entries
Content-Type: application/json
{
"term": 3,
"leader-id": "10.0.42.101:8080",
"prev-log-index": 0,
"prev-log-term": 0,
"entries": [],
"leader-commit": 0
}
----
200
Content-Type: application/json
{
"term": 3,
"success": true
}
  • leader-id: the leader’s own address (from ADDR), so followers can redirect clients
  • entries: empty in this stage; will carry log entries in later stages
  • prev-log-index / prev-log-term: set to 0 in this stage; no log entries yet
  • success: true if the follower’s log matched prev-log-index and prev-log-term
  • response term: the responder’s current term, so a stale leader can step down
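For this stage, a heartbeat receiver can be quite small. The sketch below uses a plain dict with hypothetical keys and takes the current time as a parameter so that the election timer can be reset. With no log entries, the prev-log check passes trivially, so any heartbeat from a current leader succeeds.

```python
def handle_append_entries(state, req, now):
    """Process a heartbeat and return the JSON response body.

    state is a dict with "current_term", "voted_for", "role", "leader",
    and "last_heartbeat"; `now` is the receiver's clock reading.
    """
    if req["term"] < state["current_term"]:
        # Stale leader: reject and report our term so it can step down.
        return {"term": state["current_term"], "success": False}
    if req["term"] > state["current_term"]:
        state["current_term"] = req["term"]
        state["voted_for"] = None
        # A real node persists the new term here, before replying.
    # A valid leader exists for this term: follow it.
    state["role"] = "follower"
    state["leader"] = req["leader-id"]
    state["last_heartbeat"] = now  # resets the election timer
    return {"term": state["current_term"], "success": True}
```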

The cluster info endpoint returns the node’s current cluster state.

GET /cluster/info
----
200
Content-Type: application/json
{
"id": "10.0.42.101:8080",
"role": "leader",
"term": 3,
"leader": "10.0.42.101:8080",
"peers": ["10.0.42.102:8080", "10.0.42.103:8080", "10.0.42.104:8080", "10.0.42.105:8080"]
}
  • id: this node’s own address (from ADDR)
  • role: this node’s current role, one of leader, candidate, or follower
  • term: current term; starts at 0 before any election
  • leader: the known leader’s address, or null before an election and after leader failure
  • peers: all cluster members except this node, sorted lexicographically
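Building the response body is mostly a matter of getting the field names, the null leader, and the peer ordering right. The helper below is a hypothetical sketch of that.

```python
import json


def cluster_info(addr, role, term, leader, peers):
    """Serialize this node's view of the cluster as a JSON string."""
    return json.dumps({
        "id": addr,
        "role": role,  # "leader", "candidate", or "follower"
        "term": term,
        "leader": leader,  # None is serialized as JSON null
        "peers": sorted(peers),  # lexicographic order
    })
```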

Your server will be started as a 5-node cluster, with its own address in ADDR and the other four nodes’ addresses in PEERS. The tests will verify leader election behavior:

$ clstr test leader-election
Testing leader-election: Cluster Elects and Maintains Leader
✓ /cluster/info Returns Cluster State (3ms)
✓ Leader Election Completes (506ms)
✓ Exactly One Leader Per Term (3.04s)
✓ Leader Maintains Authority via Heartbeats (3.02s)
✓ Followers Redirect to Leader (2ms)
✓ Leader Handles KV Operations (15ms)
✓ New Leader Elected After Leader Crash (1.57s)
✓ Partition Enforces Quorum (6.24s)
✓ Leaderless Nodes Return 503 (2ms)
✓ Cluster Converges After Partition Heals (3.07s)
✓ Slow Leader Steps Down and Cluster Re-elects (911ms)
✓ Election Completes Under Packet Loss (1.07s)
PASSED ✓
Run clstr next to advance to the next stage.

If a test fails, clstr shows what went wrong:

$ clstr test leader-election
Testing leader-election: Cluster Elects and Maintains Leader
✓ /cluster/info Returns Cluster State (3ms)
✓ Leader Election Completes (506ms)
✓ Exactly One Leader Per Term (3.04s)
✓ Leader Maintains Authority via Heartbeats (3.02s)
✓ Followers Redirect to Leader (2ms)
✓ Leader Handles KV Operations (15ms)
✓ New Leader Elected After Leader Crash (1.57s)
✗ Partition Enforces Quorum (3.04s)
GET - 1 of 2 nodes passed (expected all 2)
http://n1:8080/cluster/info → 200
Expected field "role": one of [follower, candidate]
{
"id": "10.0.42.101:8080",
"leader": "10.0.42.101:8080",
"peers": [
"10.0.42.102:8080",
"10.0.42.103:8080",
"10.0.42.104:8080",
"10.0.42.105:8080"
],
"role": "leader",
"term": 5
}
The minority partition [n1, n2] must not elect a leader.
A candidate needs votes from at least 3 nodes; with only n1 and n2 reachable, no election can succeed.
FAILED ✗
Read the guide: https://clstr.io/kv-store/leader-election

Partition tests cut traffic between groups of nodes using iptables rules inside each container (see the test environment guide). When a node is partitioned, RPCs to it will time out or fail at the TCP level.

Some tests use tc netem to degrade the network rather than partition it (see the test environment guide). The slow leader test delays the leader’s outgoing traffic past the election timeout, causing followers to time out and hold a new election. The packet loss test introduces 20% random packet loss across all nodes while the leader restarts, verifying that candidates can still collect a majority of votes.

Each node’s output is captured throughout the run, including across restarts and partitions (see log capture). Use clstr logs to see all nodes interleaved, or pass node names to filter:

$ clstr logs n1 n4
[n4] +0.000s [START]
[n1] +0.000s [START]
[n4] +0.142s Node started addr=10.0.42.104:8080
[n1] +0.194s Node started addr=10.0.42.101:8080
[**] +0.458s [CLUSTER READY]
[**] +0.458s [TEST: /cluster/info Returns Cluster State]
[**] +0.462s [TEST: Leader Election Completes]
[n4] +0.729s Election timeout, starting election for term 1
[n4] +0.731s Received vote from 10.0.42.101:8080 (2/5)
[n4] +0.732s Received vote from 10.0.42.102:8080 (3/5)
[n4] +0.732s Became leader in term 1
[n1] +0.733s Voted for 10.0.42.104:8080 in term 1
[n1] +0.968s Received heartbeat from 10.0.42.104:8080, following in term 1
[**] +0.969s [TEST: Exactly One Leader Per Term]
[**] +4.014s [TEST: Leader Maintains Authority via Heartbeats]
[**] +7.038s [TEST: Followers Redirect to Leader]
[**] +7.041s [TEST: Leader Handles KV Operations]
[**] +7.057s [TEST: New Leader Elected After Leader Crash]
[n1] +7.059s [KILL]
[n1] +7.948s [START]
[n1] +8.114s Node started addr=10.0.42.101:8080
[n1] +8.142s Received heartbeat from 10.0.42.104:8080, following in term 1
[**] +8.628s [TEST: Partition Enforces Quorum]
[n1] +8.629s [PARTITIONED FROM: n3, n4, n5]
[n4] +8.628s [PARTITIONED FROM: n1, n2]
[n1] +9.299s Election timeout, starting election for term 2
[n1] +9.301s Received vote from 10.0.42.102:8080 (2/5)
[n4] +9.268s Heartbeat to 10.0.42.101:8080 failed
[n4] +9.268s Heartbeat to 10.0.42.102:8080 failed
[n4] +9.268s Maintaining leadership in term 1, majority reachable
[n1] +10.423s Election timeout, starting election for term 3
[n1] +10.425s Received vote from 10.0.42.102:8080 (2/5)
[**] +14.866s [TEST: Leaderless Nodes Return 503]
[**] +14.869s [TEST: Cluster Converges After Partition Heals]
[n1] +14.870s [PARTITION HEALED]
[n4] +14.870s [PARTITION HEALED]
[n1] +15.123s Vote request from 10.0.42.102:8080 for term 12
[n4] +15.124s Voted for 10.0.42.102:8080 in term 12
[n1] +15.125s Voted for 10.0.42.102:8080 in term 12
[n4] +15.125s Received heartbeat from 10.0.42.102:8080, following in term 12
[n1] +15.125s Received heartbeat from 10.0.42.102:8080, following in term 12