Leader Election

The previous stages built a solid single-node store. In this stage, you’ll extend it into a 5-node cluster using Raft leader election, so the cluster can survive node failures and always agree on who’s in charge.

Cluster Formation

The test harness starts each node as a Docker container on a shared network and sets two environment variables:

ADDR=10.0.42.101:8080
PEERS=10.0.42.102:8080,10.0.42.103:8080,10.0.42.104:8080,10.0.42.105:8080

ADDR is this node’s own address and ID. PEERS is a comma-separated list of the other nodes’ addresses. Each node listens on port 8080. The cluster is static: membership doesn’t change in this stage.

When PEERS is empty or unset, your node should still work on its own. You can verify this with clstr test --so-far.
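As a rough illustration of reading this configuration, the sketch below parses ADDR and PEERS into a usable form. The function name load_cluster_config is hypothetical, and the choice to treat blank entries as absent is an assumption rather than something the harness requires.

```python
import os


def load_cluster_config(env=os.environ):
    """Return (addr, peers) read from the ADDR and PEERS variables."""
    addr = env.get("ADDR", "")
    raw_peers = env.get("PEERS", "")
    # An empty or unset PEERS yields an empty list, i.e. a single-node cluster.
    peers = [p.strip() for p in raw_peers.split(",") if p.strip()]
    return addr, peers
```

With PEERS unset, the function returns an empty peer list, so the node can skip election traffic entirely and serve requests on its own.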

Leader Election

All nodes start as followers. If a follower doesn’t receive heartbeats within the election timeout (randomized between 500 and 1,000ms), it becomes a candidate and starts an election.

This is more generous than the Raft paper’s suggested 150-300ms election timeout. Each node runs in a Docker container on a shared host, which adds container network latency and scheduling jitter on top of any GC or runtime pauses in your implementation. Higher values give varying implementations room to handle these issues without spurious elections.
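One way to model the randomized timeout is a small deadline-based timer, sketched below. The class name ElectionTimer and its injectable clock and rng parameters are assumptions made for illustration and testability; they are not part of the stage's requirements.

```python
import random
import time

ELECTION_TIMEOUT_MIN = 0.5  # seconds, from the 500ms lower bound
ELECTION_TIMEOUT_MAX = 1.0  # seconds, from the 1,000ms upper bound


class ElectionTimer:
    """Tracks a randomized election deadline on a monotonic clock."""

    def __init__(self, clock=time.monotonic, rng=random.uniform):
        self._clock = clock
        self._rng = rng
        self.reset()

    def reset(self):
        # Pick a fresh random timeout on every reset so that nodes
        # are unlikely to time out at the same moment repeatedly.
        timeout = self._rng(ELECTION_TIMEOUT_MIN, ELECTION_TIMEOUT_MAX)
        self.deadline = self._clock() + timeout

    def expired(self):
        return self._clock() >= self.deadline
```

A follower would typically call reset() whenever it accepts a heartbeat or grants a vote, and start an election once expired() returns true.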

Candidates request votes from other nodes. A candidate becomes leader if it receives votes from a majority of the cluster, counting its own vote: $\lceil\frac{n+1}{2}\rceil$ where $n$ is the cluster size (e.g., 3 votes in a 5-node cluster). Each node grants at most one vote per term.
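The majority threshold can be computed with integer arithmetic, as in this small sketch. The helper names majority and has_won are hypothetical. For whole numbers, n // 2 + 1 gives the same value as the ceiling formula above, and the vote count is assumed to include the candidate's vote for itself.

```python
def majority(cluster_size):
    """Smallest number of votes that is more than half the cluster."""
    return cluster_size // 2 + 1


def has_won(votes_received, cluster_size):
    """votes_received includes the candidate's vote for itself."""
    return votes_received >= majority(cluster_size)
```

In a 5-node cluster, a candidate that has voted for itself needs two more votes before has_won returns true.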

Terms act as a logical clock. When a node discovers a higher term in any RPC request or response, it immediately updates its term and reverts to follower.
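The voting and term rules above might be combined on the receiving side roughly as follows. This is a minimal sketch: NodeState, observe_term, and handle_request_vote are hypothetical names, the log comparison from the Raft paper is left out because there are no log entries in this stage, and persisting the state before replying is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeState:
    current_term: int = 0
    voted_for: Optional[str] = None
    role: str = "follower"


def observe_term(state, term):
    """Adopt a newer term and revert to follower; return True if we did."""
    if term > state.current_term:
        state.current_term = term
        state.voted_for = None  # no vote cast yet in the new term
        state.role = "follower"
        return True
    return False


def handle_request_vote(state, term, candidate_id):
    """Build the response body for a request-vote RPC."""
    observe_term(state, term)
    # Grant only for the current term, and only if we have not already
    # voted for a different candidate in this term.
    grant = term == state.current_term and state.voted_for in (None, candidate_id)
    if grant:
        state.voted_for = candidate_id
    return {"term": state.current_term, "vote-granted": grant}
```

Allowing a repeat grant to the same candidate keeps the handler idempotent when a candidate retries a request whose response was lost.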

Heartbeats

Leaders send AppendEntries RPC heartbeats every 100ms to maintain authority. The entries array is empty in this stage.

Each valid heartbeat resets the follower’s election timer. If no heartbeat arrives within the election timeout, the follower starts a new election.

If a leader fails to receive acknowledgments from a quorum of nodes within 500ms (the minimum election timeout), it steps down to follower and clears its known leader, rather than continuing to serve requests it cannot commit.
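The step-down check could be expressed as a pure function over the time of each peer's last acknowledgment. In this sketch the name should_step_down and the last_ack mapping are assumptions; timestamps are taken to be seconds on a monotonic clock, and the leader counts itself toward the quorum.

```python
def should_step_down(last_ack, now, cluster_size, window=0.5):
    """Return True if the leader has lost contact with a quorum.

    last_ack maps each peer address to the monotonic time of its most
    recent successful heartbeat response.
    """
    # The leader is always reachable from its own point of view.
    reachable = 1 + sum(1 for t in last_ack.values() if now - t <= window)
    return reachable < cluster_size // 2 + 1
```

If you take this approach, initializing every entry in last_ack to the moment the node wins its election avoids stepping down before the first round of heartbeats has had a chance to complete.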

Client Requests

Leaders handle the same key-value GET, PUT, and DELETE requests from earlier stages.

Followers redirect all requests to the leader with 307 Temporary Redirect and a Location header:

HTTP/1.1 307 Temporary Redirect
Location: http://10.0.42.101:8080/kv/mykey

Nodes that don’t yet know the leader return 503 Service Unavailable.
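The routing rules for client requests can be summarized in one decision function. This is only a sketch: route_client_request is a hypothetical helper that returns the status and headers to send, or None when the node is the leader and should serve the request itself.

```python
def route_client_request(role, leader, path):
    """Decide how a node answers a KV request for the given path."""
    if role == "leader":
        return None  # serve the request locally
    if leader is None:
        return 503, {}  # no known leader to redirect to
    # Preserve the original path so the client retries the same key.
    return 307, {"Location": f"http://{leader}{path}"}
```

A 307 response tells the client to repeat the request with the same method and body, which is what a redirected PUT or DELETE needs.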

Storage

Persist currentTerm and votedFor to disk before responding to any RPC that changes them. Use fsync to ensure durability. Without this, a node that restarts could grant a second vote in the same term, or accept RPCs with a stale term, violating Raft’s safety guarantees.
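A common way to make such a write durable is to write a temporary file, fsync it, and rename it over the old one. The sketch below assumes a POSIX filesystem; the JSON layout and the names save_state and load_state are illustrative choices, not a format the tests expect.

```python
import json
import os


def save_state(path, current_term, voted_for):
    """Durably write currentTerm and votedFor to `path`."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"currentTerm": current_term, "votedFor": voted_for}, f)
        f.flush()
        os.fsync(f.fileno())  # force the file contents to disk
    os.replace(tmp, path)  # atomically swap in the new file
    # fsync the directory so the rename itself survives a crash (POSIX).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def load_state(path):
    """Return (currentTerm, votedFor), or (0, None) on first start."""
    try:
        with open(path) as f:
            data = json.load(f)
        return data["currentTerm"], data["votedFor"]
    except FileNotFoundError:
        return 0, None
```

Calling save_state before sending the RPC response means a restarted node reloads the same term and vote it last acknowledged.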

API

POST /raft/request-vote

Invoked by candidates to gather votes during leader election.

POST /raft/request-vote
Content-Type: application/json

{
  "term": 3,
  "candidate-id": "10.0.42.102:8080",
  "last-log-index": 0,
  "last-log-term": 0
}

----

200
Content-Type: application/json

{
  "term": 3,
  "vote-granted": true
}

term: candidate’s current term
candidate-id: the candidate’s own address (from ADDR)
last-log-index / last-log-term: set to 0 in this stage; no log entries yet
vote-granted: whether the vote was granted
response term: the responder’s current term, so a stale candidate can update itself

POST /raft/append-entries

Used for heartbeats (empty entries) to maintain leader authority.

POST /raft/append-entries
Content-Type: application/json

{
  "term": 3,
  "leader-id": "10.0.42.101:8080",
  "prev-log-index": 0,
  "prev-log-term": 0,
  "entries": [],
  "leader-commit": 0
}

----

200
Content-Type: application/json

{
  "term": 3,
  "success": true
}

leader-id: the leader’s own address (from ADDR), so followers can redirect clients
entries: empty in this stage; will carry log entries in later stages
prev-log-index / prev-log-term: set to 0 in this stage; no log entries yet
success: true if the follower’s log matched prev-log-index and prev-log-term; false if the request’s term is older than the follower’s current term
response term: the responder’s current term, so a stale leader can step down
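For this stage, a heartbeat handler mostly compares terms and records the leader. The sketch below uses a plain dict for node state with assumed keys (term, voted_for, role, leader); handle_append_entries is a hypothetical name, and the log-matching check is skipped because the entries are always empty here.

```python
def handle_append_entries(state, term, leader_id):
    """Return (response_body, reset_timer) for an append-entries RPC."""
    if term < state["term"]:
        # Stale leader: reject and report our term so it can step down.
        return {"term": state["term"], "success": False}, False
    if term > state["term"]:
        state["term"] = term
        state["voted_for"] = None  # new term, no vote cast yet
    # A valid heartbeat means a leader exists for this term.
    state["role"] = "follower"
    state["leader"] = leader_id
    return {"term": state["term"], "success": True}, True
```

The second return value signals whether the caller should reset its election timer, which keeps the handler free of timing concerns and easy to test.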

GET /cluster/info

Returns the node’s current cluster state.

GET /cluster/info

----

200
Content-Type: application/json

{
  "id": "10.0.42.101:8080",
  "role": "leader",
  "term": 3,
  "leader": "10.0.42.101:8080",
  "peers": ["10.0.42.102:8080", "10.0.42.103:8080", "10.0.42.104:8080", "10.0.42.105:8080"]
}

id: this node’s own address (from ADDR)
role: this node’s current role, one of leader, candidate, or follower
term: current term; starts at 0 before any election
leader: the known leader’s address, or null before an election and after leader failure
peers: all cluster members except this node, sorted lexicographically
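Building this response is mostly a matter of serializing the node's state. In this sketch the function name cluster_info is hypothetical; Python's sorted gives lexicographic order for strings, and None serializes to JSON null.

```python
import json


def cluster_info(addr, peers, role, term, leader):
    """Serialize the node's cluster state as a JSON string."""
    return json.dumps({
        "id": addr,
        "role": role,
        "term": term,
        "leader": leader,        # None becomes null
        "peers": sorted(peers),  # lexicographic order
    })
```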

Testing

Your server will be started as a 5-node cluster, with its own address in ADDR and the other four nodes’ addresses in PEERS. The tests will verify leader election behavior:

$ clstr test leader-election
Testing leader-election: Cluster Elects and Maintains Leader

✓ /cluster/info Returns Cluster State (3ms)
✓ Leader Election Completes (506ms)
✓ Exactly One Leader Per Term (3.04s)
✓ Leader Maintains Authority via Heartbeats (3.02s)
✓ Followers Redirect to Leader (2ms)
✓ Leader Handles KV Operations (15ms)
✓ New Leader Elected After Leader Crash (1.57s)
✓ Partition Enforces Quorum (6.24s)
✓ Leaderless Nodes Return 503 (2ms)
✓ Cluster Converges After Partition Heals (3.07s)
✓ Slow Leader Steps Down and Cluster Re-elects (911ms)
✓ Election Completes Under Packet Loss (1.07s)

PASSED ✓

Run clstr next to advance to the next stage.

If a test fails, clstr shows what went wrong:

$ clstr test leader-election
Testing leader-election: Cluster Elects and Maintains Leader

✓ /cluster/info Returns Cluster State (3ms)
✓ Leader Election Completes (506ms)
✓ Exactly One Leader Per Term (3.04s)
✓ Leader Maintains Authority via Heartbeats (3.02s)
✓ Followers Redirect to Leader (2ms)
✓ Leader Handles KV Operations (15ms)
✓ New Leader Elected After Leader Crash (1.57s)
✗ Partition Enforces Quorum (3.04s)

GET - 1 of 2 nodes passed (expected all 2)
  http://n1:8080/cluster/info → 200
    Expected field "role": one of [follower, candidate]
      {
        "id": "10.0.42.101:8080",
        "leader": "10.0.42.101:8080",
        "peers": [
          "10.0.42.102:8080",
          "10.0.42.103:8080",
          "10.0.42.104:8080",
          "10.0.42.105:8080"
        ],
        "role": "leader",
        "term": 5
      }

  The minority partition [n1, n2] must not elect a leader.
  A candidate needs votes from at least 3 nodes; with only n1 and n2 reachable, no election can succeed.

FAILED ✗

Read the guide: https://clstr.io/kv-store/leader-election

Network Partitions

Partition tests cut traffic between groups of nodes using iptables rules inside each container (see the test environment guide). When a node is partitioned, RPCs to it will time out or fail at the TCP level.
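Because RPCs to a partitioned node can hang or fail, it helps to give every outgoing call a timeout and to treat any failure as "no response". The sketch below uses only Python's standard library; post_json is a hypothetical helper, and the 100ms default timeout is an assumption chosen to stay well below the election timeout.

```python
import http.client
import json
import urllib.request


def post_json(peer, path, payload, timeout=0.1):
    """POST payload as JSON to http://{peer}{path}.

    Returns the decoded JSON response, or None if the peer is
    unreachable, too slow, or sends back something unusable.
    """
    req = urllib.request.Request(
        f"http://{peer}{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())
    except (OSError, ValueError, http.client.HTTPException):
        # Refused connections, timeouts, HTTP errors, malformed replies.
        return None
```

Sending these calls concurrently, one per peer, keeps a single unreachable node from delaying heartbeats to the rest of the cluster.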

Network Impairments

Some tests use tc netem to degrade the network rather than partition it (see the test environment guide). The slow leader test delays the leader’s outgoing traffic past the election timeout, causing followers to time out and hold a new election. The packet loss test introduces 20% random packet loss across all nodes while the leader restarts, verifying that candidates can still collect a majority of votes.

Debugging

Each node’s output is captured throughout the run, including across restarts and partitions (see log capture). Use clstr logs to see all nodes interleaved, or pass node names to filter:

$ clstr logs n1 n4
[n4]  +0.000s     [START]
[n1]  +0.000s     [START]
[n4]  +0.142s     Node started addr=10.0.42.104:8080
[n1]  +0.194s     Node started addr=10.0.42.101:8080
[**]  +0.458s     [CLUSTER READY]
[**]  +0.458s     [TEST: /cluster/info Returns Cluster State]
[**]  +0.462s     [TEST: Leader Election Completes]
[n4]  +0.729s     Election timeout, starting election for term 1
[n4]  +0.731s     Received vote from 10.0.42.101:8080 (2/5)
[n4]  +0.732s     Received vote from 10.0.42.102:8080 (3/5)
[n4]  +0.732s     Became leader in term 1
[n1]  +0.733s     Voted for 10.0.42.104:8080 in term 1
[n1]  +0.968s     Received heartbeat from 10.0.42.104:8080, following in term 1
[**]  +0.969s     [TEST: Exactly One Leader Per Term]
[**]  +4.014s     [TEST: Leader Maintains Authority via Heartbeats]
[**]  +7.038s     [TEST: Followers Redirect to Leader]
[**]  +7.041s     [TEST: Leader Handles KV Operations]
[**]  +7.057s     [TEST: New Leader Elected After Leader Crash]
[n1]  +7.059s     [KILL]
[n1]  +7.948s     [START]
[n1]  +8.114s     Node started addr=10.0.42.101:8080
[n1]  +8.142s     Received heartbeat from 10.0.42.104:8080, following in term 1
[**]  +8.628s     [TEST: Partition Enforces Quorum]
[n1]  +8.629s     [PARTITIONED FROM: n3, n4, n5]
[n4]  +8.628s     [PARTITIONED FROM: n1, n2]
[n1]  +9.299s     Election timeout, starting election for term 2
[n1]  +9.301s     Received vote from 10.0.42.102:8080 (2/5)
[n4]  +9.268s     Heartbeat to 10.0.42.101:8080 failed
[n4]  +9.268s     Heartbeat to 10.0.42.102:8080 failed
[n4]  +9.268s     Maintaining leadership in term 1, majority reachable
[n1]  +10.423s    Election timeout, starting election for term 3
[n1]  +10.425s    Received vote from 10.0.42.102:8080 (2/5)
[**]  +14.866s    [TEST: Leaderless Nodes Return 503]
[**]  +14.869s    [TEST: Cluster Converges After Partition Heals]
[n1]  +14.870s    [PARTITION HEALED]
[n4]  +14.870s    [PARTITION HEALED]
[n1]  +15.123s    Vote request from 10.0.42.102:8080 for term 12
[n4]  +15.124s    Voted for 10.0.42.102:8080 in term 12
[n1]  +15.125s    Voted for 10.0.42.102:8080 in term 12
[n4]  +15.125s    Received heartbeat from 10.0.42.102:8080, following in term 12
[n1]  +15.125s    Received heartbeat from 10.0.42.102:8080, following in term 12

Resources

Raft Visualization by The Secret Lives of Data
The Raft Consensus Algorithm
Distributed Systems 6.2: Raft by Martin Kleppmann
Students’ Guide to Raft by Jon Gjengset
Database Internals Chapter 10: Leader Election by Alex Petrov