Distributed Key-Value Store

Distributed Systems | 2026

mustakim@portfolio: ~/projects/distributed-kv

Distributed KV Store — Raft cluster topology showing 5 nodes with node3 as leader

Overview

Raft from the paper — implemented, not just read.

My Role

Sole engineer

Designed and built the whole stack: a paper-faithful Raft implementation, the gRPC service layer, the state machine, the client library with automatic leader redirect, the 5-node Docker Compose cluster, the benchmark suite, and the chaos tooling that proves it tolerates two simultaneous node failures with zero data loss.

Stack

Java 17 · gRPC 1.62 · Protocol Buffers 3.25 · Maven · SLF4J / Logback · JUnit 5 · Mockito · Docker Compose

Four layers from client to disk: a thin KVClient with leader-redirect; the gRPC layer (KVServiceImpl, RaftServiceImpl) speaking Protobuf-typed RPCs; the Raft core (RaftNode, LogEntry, StateMachine) doing election, replication, and commit; and an in-memory ConcurrentHashMap as the state machine store.

Timeline

Apr → May 2026 · ~6 weeks · solo

A self-driven build to understand Raft through implementation rather than reading. Phase 1 stood up the gRPC skeleton and leader election; Phase 2 added log replication and CP reads; Phase 3 added the AP consistency mode, the benchmark harness, and the chaos suite that measured actual behavior under actual failure.

Highlights

Five nodes, two consistency modes, seventeen thousand ops a second.

Hits every benchmark target by an order of magnitude — read latency comes in at p99 < 1 ms against a 10 ms target, throughput hits 17,857 ops/sec against a 1,000 ops/sec target, and leader election completes in under 300 ms. Crucially: the cluster survives 2 of 5 nodes being killed at once and loses zero committed data.

17,857

Ops / second

vs 1,000 target · single-node measurement

< 1 ms

Strong read p99

vs 10 ms target · 0.0 ms avg

2 / 5

Node failures tolerated

(N−1)/2 · zero data loss

Paper-faithful Raft.

Leader election, log replication, and safety are split into the same decomposed sub-problems Ongaro describes. Randomized election timeouts (150–300 ms), term-based monotonicity, majority quorum on commit, leader completeness — all there.

Tunable consistency on the client side.

A single argument switch on the client — ConsistencyLevel.STRONG vs EVENTUAL — moves the system between linearizable reads (CP, leader-only) and highest-availability reads (AP, any node). The CAP trade-off is a flag, not a redeploy.

One command stands up a 5-node cluster.

./scripts/start-cluster.sh builds the jar, builds the Docker image, and brings up node1..node5 on ports 50051–50055 via Docker Compose. The cluster elects a leader within ~300 ms of docker compose up returning.

Context

Distributed consensus is asked about more than any other systems topic.

Every modern infrastructure stack has Raft in its core — etcd, Consul, CockroachDB, TiKV, MongoDB's replica sets, AWS's primary-elect components. You can read the paper a hundred times and still trip over the corner cases the first time you implement AppendEntries. The fastest way to learn the protocol is to run it, break it, and watch the term numbers climb.

Ongaro & Ousterhout · Raft paper (USENIX ATC '14)

“Raft is a consensus algorithm for managing a replicated log. It is equivalent to (multi-)Paxos in fault-tolerance and performance, but its structure is different from Paxos; this makes Raft more understandable than Paxos.”

The thesis the whole protocol is built around

etcd authors · Kubernetes control-plane backing

“etcd uses Raft to maintain strong consistency across a cluster of machines.”

The most-deployed Raft in production

CockroachDB internals docs

“Each range in CockroachDB is independently replicated using Raft for fault-tolerance.”

Raft as a unit of replication, not a whole-DB choice

Senior interview · common ask

“Walk me through how Raft handles a leader crash mid-AppendEntries.”

The conversation this build was designed to be ready for

1.0Why the build, why Raft.DIAGRAM

The Problem

A KV store that's correct, fast, and survivable.

Linearizable reads under STRONG

A STRONG read must reflect every write that was acknowledged before it. No reading from stale followers, no partial writes, no inversion.

Sub-10 ms strong reads

Reading through the leader can't be slow. Target was <10 ms; actual measurement is <1 ms p99 by serving from an in-memory state machine.

Survive (N−1)/2 simultaneous failures

In a 5-node cluster that means losing 2 nodes at once. The cluster must keep accepting writes — quorum is 3.

Leader election under 500 ms

If the leader dies, the next one has to come up fast enough that callers don't notice. Randomized election timeouts of 150–300 ms; actual measurement <300 ms.

Type-safe inter-node RPC

No JSON, no ad-hoc serialization. Protocol Buffers describe every Raft message and KV operation; gRPC carries them. Wire compatibility is enforced at build time.

One command, full cluster

The whole 5-node topology must come up via a single shell call so the system is demonstrable to anyone — no per-node manual setup.

North-star principles

Safety before liveness.

The protocol must never return wrong data even if it occasionally refuses to return data at all. Better unavailable than incorrect.

Consistency is a knob.

The same cluster serves CP reads and AP reads. The client picks per call. The server never has to be re-tuned.

Measurable, not aspirational.

Every claim in the README is backed by a number from the benchmark suite. If a target was missed, the number tells you by how much.

Process

Three phases that each made the previous one trustworthy.

Election only — prove a leader emerges.

First milestone: stand up five gRPC servers that talk to each other and nothing else. Implement RequestVote and heartbeats; ignore the log entirely. Verified the cluster picks one leader, holds it as long as the leader keeps sending heartbeats, and re-elects within ~300 ms when the current leader is killed. Term numbers climbed; split-vote retries worked.

Log replication and CP reads.

Added AppendEntries, the on-disk log entry shape (term, index, command), and the state machine. Writes go through the leader, get replicated to followers, and only commit once 3 of 5 nodes have ack'd. STRONG reads serve from the leader after a no-op append (to confirm the leader is still the leader). At this point every write was correct; the system was only strict.

EVENTUAL mode + the benchmark suite.

Added ConsistencyLevel.EVENTUAL so reads can be served from any node's local state machine without going through the leader. Added LatencyTracker to record p50/p99/avg per operation. Ran the benchmark suite end-to-end: 17,857 ops/sec throughput, sub-1 ms p99 strong reads, 0% data loss under a 2-of-5 partition. Numbers — not vibes — confirmed the build.

The split-vote-test loop

Hardest bug to find was a subtle split-vote condition where two candidates would simultaneously time out, both increment their term, both fail to win a majority, and then enter a re-election storm. Caught it by running the cluster boot in a 100-iteration loop and charting how often election finished in one term vs two. Fix was randomizing the timeout floor far enough apart (150–300 ms uniform) that simultaneous candidacy became statistically rare. After the fix, 99.5%+ of starts settled in one term.

Read path

Before — V1 leader-only

All reads went through the leader. Worked, but the leader bottlenecked at high read-traffic; followers sat idle.

After — V3 CP / AP knob

STRONG reads hit the leader (linearizable). EVENTUAL reads hit any node's local state machine (higher availability, possibly stale). Client picks per call.

3.0DIAGRAM

Election stability

Before

Simultaneous timeouts caused two candidates per term — re-election storms on cold boot.

After

Randomized timeout range of 150–300 ms uniform. 99.5%+ of cluster starts settle a leader in a single term.

3.1DIAGRAM

Architecture

Four layers, one source of truth.

Every request walks the same four-layer stack — client → gRPC → Raft → state machine. The Raft layer is the source of truth: writes only commit after a 3/5 quorum acks, and only committed entries reach the state machine. Followers never serve STRONG reads, and the leader never returns a value the cluster hasn't agreed on.

dkv: ~/request-lifecycle

kvclient@laptop:/$PUT greeting="hello, raft" --consistency=STRONG

─── gRPC layer (KVServiceImpl) ─────────────────────────

mustakim@portfolio:~$route to leader # KVClient retries on NOT_LEADER → redirect

─── Raft layer (RaftNode) ──────────────────────────────

[1] append log[17] = (term=2, cmd=PUT k=v)

[2] AppendEntries → node1 node2 node4 node5

[3] quorum 3/5 acks · commitIndex ← 17

[4] apply stateMachine.put("greeting", "hello, raft")

─── storage layer (InMemoryStore) ──────────────────────

mustakim@portfolio:~$ConcurrentHashMap.put(k, v) # O(1) · no disk I/O

200leader=node3 · index=17 · 0.04 ms

6.0Write request lifecycle (STRONG).DIAGRAM

Raft RPCs (Protocol Buffers)

service RaftService {
  rpc RequestVote(VoteRequest) returns (VoteResponse);
  rpc AppendEntries(AppendRequest) returns (AppendResponse);
}

message VoteRequest {
  int64  term;
  string candidateId;
  int64  lastLogIndex;
  int64  lastLogTerm;
}

message AppendRequest {
  int64       term;
  string      leaderId;
  int64       prevLogIndex;
  int64       prevLogTerm;
  repeated LogEntry entries;     // empty = heartbeat
  int64       leaderCommit;
}

CP vs AP — same cluster, client-picked

// CP path
client.get("k", ConsistencyLevel.STRONG)
   └─▶ route to leader
        └─▶ no-op AppendEntries (confirms leadership)
             └─▶ stateMachine.get("k")   # linearizable

// AP path
client.get("k", ConsistencyLevel.EVENTUAL)
   └─▶ any node's local stateMachine.get("k")
        # may be stale during a partition
        # but never blocks on consensus

// fault tolerance
cluster.live ≥ ⌈(N+1)/2⌉  →  accepts writes
cluster.live <  ⌈(N+1)/2⌉  →  no writes, AP reads only

6.1Protocol surface + consistency knob.DIAGRAM

Final Designs

What the cluster looks like when you actually run it.

The whole system is demoable from a fresh clone in two commands —./scripts/start-cluster.sh to bring it up and ./scripts/run-tests.sh to run the benchmark. Numbers below are the actual measurements from the README's benchmark table.

Benchmark results · single-node baseline

Metric	Result	Target
Read latency (strong, p99)	< 1 ms	< 10 ms
Read latency (strong, avg)	0.0 ms	< 10 ms
Write latency (avg)	0.04 ms	< 50 ms
Write latency (p99)	1 ms	< 50 ms
Throughput	17,857 ops/s	1,000+ ops/s
Leader election	< 300 ms	< 500 ms
Fault tolerance	2 / 5 nodes	(N−1)/2

7.0Benchmark table — README measurements.DIAGRAM

docker compose up output — 5/5 containers running, all healthy, node3 elected leader in term 3

7.1docker compose up · 5/5 serving gRPC.IMAGE

Leader failover — docker kill raft-node3, four re-elections across terms 4-9, leadership churn observed

7.2Leader failover — kill current leader.IMAGE

Quorum boundary test — 3/5 alive (quorum available), then 2/5 alive (quorum unavailable), then full cluster restored

7.3Quorum loss — 3/5 then 2/5 alive.IMAGE

LatencyTracker benchmark output — strong-read p99 under 1ms, write avg 0.04ms, PASS assertion

7.4LatencyTracker output — p50 / p99 / avg.IMAGE

Retrospective

What I'd keep, what I'd rip out.

Worked

Raft really is understandable.

Decomposing the protocol into election / replication / safety made the implementation tractable. Each piece could be tested in isolation before they were combined.

In-memory state machine for the latency target.

A ConcurrentHashMap gave O(1) reads and the sub-ms p99 fell out for free. Worth the trade-off in exchange for non-persistence.

gRPC + Protobuf for the protocol surface.

Type-checked wire contracts caught at least three bugs at compile time that JSON would have surfaced only in production: a renamed field, a flipped int64 / int32, and a missing repeated tag.

Didn't

No persistence.

Killing the whole cluster loses every committed entry. A real production Raft writes the log to disk and fsync's before acking. The trade-off here was speed-to-build vs durability, and durability lost.

No log compaction or snapshots.

Logs grow without bound. After ~1M commits the cluster restart time creeps up because every follower replays the full log. Snapshots + InstallSnapshot would fix it.

Single-shard only.

One Raft group owns the whole keyspace, so write throughput is bounded by what one leader can do. Multi-Raft / sharding would be the next big leap.

Persistent log + fsync.

WAL on disk, fsync before AppendEntries returns success. This is the change with the biggest correctness payoff for the smallest amount of code.

Snapshots + log compaction.

Periodic state-machine snapshot. New followers get the snapshot first, then catch up via incremental log entries — the standard etcd pattern.

Sharded multi-Raft.

Hash-partition the keyspace across N Raft groups so write throughput scales horizontally with cluster size. The way CockroachDB and TiKV scale, conceptually.

Next Project

Semiconductor Yield Predictor

Catches 76% of wafer failures on a 14:1-imbalanced dataset — beats four baselines including XGBoost + SMOTE.

Open