System design for a Global Distributed Cache

Author: Chengchang Yu (@chengchangyu)
1. Functionality (Requirements)
We are designing a distributed key-value store acting as a cache for a database with 100 billion records.
- Core Operations: Support `get(key)` and `set(key, value)` operations.
- TTL Support: Keys must expire after a defined time-to-live (TTL).
- Eviction: When memory is full, the system must evict less useful data (e.g., LRU).
- Global Access: The cache is accessible from multiple geographical regions (assuming the "Global" in the prompt implies multi-region support).
2. Non-Functional Requirements
- Latency: Extremely low. Reading from cache should be < 5ms (p99).
- Availability: High (99.99%). A cache miss is acceptable, but a cache downtime that takes down the DB (thundering herd) is not.
- Scalability: Must handle the dataset derived from 100B records.
- Consistency: Eventual consistency is acceptable for a cache, but we must minimize stale data windows.
- Observability: Hit/Miss ratios, memory usage, and eviction rates must be visible.
3. Back-of-the-Envelope Estimation (Sizing)
This is the most critical step to justify your architecture.
- Total Records: 100 billion.
- Record Size:
  - Key (UUID): 16 bytes.
  - Value (JSON/blob): assume 500 bytes on average.
  - Metadata (TTL, pointers): 64 bytes.
  - Total per record: ~600 bytes.
- Total Data Volume: 100B records × ~600 bytes ≈ 60TB.
- Cache Sizing (the "Hot" Set): Storing 60TB in RAM is expensive. Following the Pareto Principle (80/20 rule), we assume 20% of the keys generate 80% of the traffic.
- Target Capacity: 20% × 60TB = 12TB.
- Hardware:
  - Assume high-memory instances (e.g., 256GB RAM).
  - Usable memory per node (after OS overhead): ~200GB.
- Node Count: 12TB / 200GB ≈ 60 nodes.
- With replication (factor of 3) and room for growth, we need roughly 180-200 nodes.
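The arithmetic is easy to sanity-check in code. A minimal Go sketch, plugging in the assumed figures above (600 bytes per record, a 20% hot set, 200GB usable per node):

```go
package main

import "fmt"

func main() {
	// Assumed inputs from the estimation above.
	const (
		records       = 100e9 // 100 billion records
		bytesPerRec   = 600.0 // ~600 bytes per record (key + value + metadata)
		hotFraction   = 0.20  // Pareto: 20% of keys serve 80% of traffic
		usablePerNode = 200e9 // ~200GB usable RAM per node
		replication   = 3     // replication factor
	)
	total := records * bytesPerRec // ≈ 60TB of raw data
	hot := total * hotFraction     // ≈ 12TB hot set to keep in RAM
	nodes := hot / usablePerNode   // ≈ 60 nodes before replication
	fmt.Printf("total=%.0fTB hot=%.0fTB nodes=%.0f (x%d replicas => %.0f)\n",
		total/1e12, hot/1e12, nodes, replication, nodes*replication)
}
```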
4. Key Entities
- Cache Client: A smart library or sidecar (like Envoy) living in the application service. It handles hashing and routing.
- Cache Node: A single server instance holding a shard of the data in RAM.
- Cluster Manager: Maintains the health of nodes and the topology (e.g., ZooKeeper or Etcd).
5. API Design
We will use a simple internal RPC API (e.g., gRPC).
```protobuf
service CacheService {
  // Returns the value or a "Not Found" error
  rpc Get(GetRequest) returns (GetResponse);
  // Returns success or an error
  rpc Set(SetRequest) returns (SetResponse);
}

// Request/response field layouts below are illustrative.
message GetRequest {
  string key = 1;
}
message GetResponse {
  bytes value = 1;
}
message SetRequest {
  string key = 1;
  bytes value = 2;
  int32 ttl_seconds = 3;
  bool write_through = 4; // Option for consistency strategy
}
message SetResponse {
  bool ok = 1;
}
```
6. High-Level Design

(Figure: Distributed Cache Workflow & Architecture)
We will use a Distributed Hash Table (DHT) architecture built on Consistent Hashing.
The Flow:
- Request: Service A requests key `user_123`.
- Hash: The Cache Client hashes `user_123` (e.g., with MD5 or MurmurHash).
- Route: The client looks up the Configuration Map (managed by ZooKeeper) to see which Cache Node owns the hash range for `user_123`.
- Execute: The client connects directly to that specific Cache Node (e.g., Node 4) to fetch the data.
- Fallback: If the cache returns a miss, the client fetches from the DB and performs a `Set` on the cache (the Cache-Aside pattern), as sketched below.
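A minimal Go sketch of the cache-aside fallback in the flow above. The `Cache` and `DB` interfaces are assumptions standing in for the cache client and the database layer:

```go
package cacheaside

import (
	"context"
	"time"
)

// Cache is an assumed interface over the distributed cache client.
type Cache interface {
	Get(ctx context.Context, key string) (value []byte, found bool, err error)
	Set(ctx context.Context, key string, value []byte, ttl time.Duration) error
}

// DB is an assumed interface over the backing database.
type DB interface {
	Load(ctx context.Context, key string) ([]byte, error)
}

// GetWithFallback reads from the cache first; on a miss it loads from the
// database and repopulates the cache (the cache-aside pattern).
func GetWithFallback(ctx context.Context, c Cache, db DB, key string, ttl time.Duration) ([]byte, error) {
	if v, ok, err := c.Get(ctx, key); err == nil && ok {
		return v, nil // cache hit: serve directly
	}
	v, err := db.Load(ctx, key) // cache miss: fall back to the database
	if err != nil {
		return nil, err
	}
	// Best-effort repopulation; a failed Set only costs us a future miss.
	_ = c.Set(ctx, key, v, ttl)
	return v, nil
}
```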
7. Dive Deep (Addressing the Specific Issues)
A. Data Distribution & Consistency Hashing
With 200 nodes, simple modulo hashing (`key % N`) is dangerous: if one node crashes, N changes and almost all keys are remapped (a massive cache stampede).
- Solution: Consistent Hashing. Nodes are placed on a "Ring"; a key is assigned to the first node found moving clockwise on the ring (see the sketch after this list).
- Virtual Nodes: To prevent hotspots (where one node gets a disproportionate amount of traffic), each physical node is mapped to multiple points on the ring (e.g., 100 virtual nodes per physical machine). This ensures uniform distribution.
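A minimal Go sketch of a consistent-hash ring with virtual nodes. CRC32 stands in for MD5/MurmurHash for brevity, and the `Ring` type is illustrative, not a production implementation:

```go
package ring

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// Ring maps keys to nodes via consistent hashing with virtual nodes.
type Ring struct {
	vnodes int               // virtual nodes per physical node
	hashes []uint32          // sorted ring positions
	owners map[uint32]string // ring position -> physical node
}

func New(vnodes int) *Ring {
	return &Ring{vnodes: vnodes, owners: make(map[uint32]string)}
}

// Add places vnodes points for a physical node on the ring.
func (r *Ring) Add(node string) {
	for i := 0; i < r.vnodes; i++ {
		h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", node, i)))
		r.owners[h] = node
		r.hashes = append(r.hashes, h)
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
}

// Lookup returns the first node found moving clockwise from the key's hash.
func (r *Ring) Lookup(key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.owners[r.hashes[i]]
}
```

Removing a failed node is symmetric: delete its virtual points and re-sort, and only the keys that hashed to those points move to the next node clockwise.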
B. Replication & High Availability
If a node crashes, the data on it is lost. We cannot afford to hammer the DB for that 200GB of data.
- Leader-Follower: Every partition on the hash ring has 1 Leader and 2 Followers.
- Writes: Go to the Leader → replicated to Followers asynchronously (for speed) or synchronously (for consistency), as sketched below.
- Reads: Can go to the Leader (strong consistency) or Followers (eventual consistency).
- Failover: If a Leader crashes, the Cluster Manager (ZooKeeper) detects the heartbeat loss and promotes a Follower to Leader.
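A sketch of the leader write path with the sync/async choice made explicit. `Leader` and `Follower` are assumed types, not part of any real library:

```go
package replication

import "context"

// Follower is an assumed interface for a replica that accepts writes.
type Follower interface {
	Replicate(ctx context.Context, key string, value []byte) error
}

// Leader owns a partition and fans writes out to its followers.
// (store is assumed to be initialized; locking is omitted for brevity.)
type Leader struct {
	store     map[string][]byte
	followers []Follower
}

// Set applies the write locally, then replicates either synchronously
// (waits for every follower: stronger consistency, higher latency) or
// asynchronously (fire-and-forget: faster, may lose the newest writes on failover).
func (l *Leader) Set(ctx context.Context, key string, value []byte, sync bool) error {
	l.store[key] = value
	if sync {
		for _, f := range l.followers {
			if err := f.Replicate(ctx, key, value); err != nil {
				return err
			}
		}
		return nil
	}
	for _, f := range l.followers {
		go func(f Follower) { _ = f.Replicate(context.Background(), key, value) }(f)
	}
	return nil
}
```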
C. Handling "Crashing of Cache Servers"
When a node crashes completely:
- Detection: The Cluster Manager detects the node is unreachable via gossip protocol/heartbeats.
- Topology Update: The map is updated to remove the failed node.
- Request Re-routing: Clients pull the new config. Traffic for that hash range is routed to the next node on the ring (which holds the replica).
- Recovery: A new node is spun up. It requests a "state transfer" from its peers to populate its cache.
D. Cache Warm-up
A cold cache is useless and dangerous (DB overload).
- Scenario: We are deploying a new region or recovering a massive cluster failure.
- Strategy 1: Snapshot/Dump: Periodically dump cache snapshots to disk (like Redis RDB). On boot, load the file (see the sketch after this list).
- Strategy 2: Population from DB Logs: Analyze DB read logs to identify the "hot keys" and pre-populate the cache before allowing traffic to hit it.
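A minimal Go sketch of Strategy 1, using `encoding/gob` as a stand-in serialization format (Redis uses its own RDB encoding):

```go
package warmup

import (
	"encoding/gob"
	"os"
)

// DumpSnapshot writes the in-memory map to disk, akin to a Redis RDB dump.
func DumpSnapshot(path string, data map[string][]byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewEncoder(f).Encode(data)
}

// LoadSnapshot repopulates the cache on boot so traffic never hits a cold node.
func LoadSnapshot(path string) (map[string][]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	data := make(map[string][]byte)
	err = gob.NewDecoder(f).Decode(&data)
	return data, err
}
```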
E. TTL & Eviction
How do we actually delete data?
- Eviction (Memory Full): Use Approximated LRU. Exact LRU requires maintaining a global linked list under locks; instead, sample a handful of random keys (e.g., 5) and evict the one with the oldest access time (see the sketch after this list).
- TTL (Time Expiration):
  - Lazy Deletion: When a client tries to `get` a key, check the timestamp. If expired, return null and delete it.
  - Active Deletion: A background thread randomly samples keys with TTLs and deletes expired ones to free memory.
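A sketch combining lazy TTL deletion with sampled (approximated) LRU eviction. The `Store` type is illustrative and ignores locking for brevity:

```go
package eviction

import "time"

type entry struct {
	value      []byte
	expiresAt  time.Time // zero value means no TTL
	lastAccess time.Time
}

// Store is a toy single-shard store (a real one needs a mutex around data).
type Store struct{ data map[string]*entry }

// Get implements lazy deletion: an expired key is removed on access.
func (s *Store) Get(key string) ([]byte, bool) {
	e, ok := s.data[key]
	if !ok {
		return nil, false
	}
	if !e.expiresAt.IsZero() && time.Now().After(e.expiresAt) {
		delete(s.data, key) // expired: delete lazily and report a miss
		return nil, false
	}
	e.lastAccess = time.Now()
	return e.value, true
}

// EvictOne approximates LRU: sample a few keys and evict the least recently
// accessed, avoiding a global linked list and the locking it implies.
func (s *Store) EvictOne(sampleSize int) {
	var victim string
	var oldest time.Time
	i := 0
	for k, e := range s.data { // Go's map iteration order is randomized,
		if i >= sampleSize { //    which serves as a cheap random sample here
			break
		}
		if victim == "" || e.lastAccess.Before(oldest) {
			victim, oldest = k, e.lastAccess
		}
		i++
	}
	if victim != "" {
		delete(s.data, victim)
	}
}
```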
F. Global Distribution (The "Global" Requirement)
If we have a cluster in US-East and one in EU-West:
- Read Local: Clients always read from the local region cache.
- Write Propagation:
  - Option A (Simple): Cache invalidation. If the DB is updated in US-East, send an invalidation event (via Kafka) to the EU-West cache to delete the key (see the sketch after this list).
  - Option B (Active-Active): Use CRDTs (Conflict-Free Replicated Data Types) if we are caching write-heavy counters, though this is complex for general caching.
- Recommendation: Use Cache Invalidation. It is safer to have a cache miss in Europe than to serve stale data indefinitely.
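A sketch of the receiving side of Option A. The transport (Kafka in the design above) is abstracted into a plain channel so the example stays self-contained; `Event` and `Deleter` are assumed types:

```go
package invalidation

import "context"

// Event is a cross-region invalidation message carrying the key to drop.
type Event struct{ Key string }

// Deleter is an assumed interface over the local region's cache.
type Deleter interface {
	Delete(ctx context.Context, key string) error
}

// Consume applies invalidation events from the remote region to the local
// cache: deleting the key forces the next local read to miss and refetch
// fresh data from the DB, rather than serving stale data indefinitely.
func Consume(ctx context.Context, events <-chan Event, cache Deleter) {
	for {
		select {
		case <-ctx.Done():
			return
		case ev := <-events:
			_ = cache.Delete(ctx, ev.Key) // best effort; a miss is safe
		}
	}
}
```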
G. The "Thundering Herd" Problem
If a "mega-hot" key (e.g., Justin Bieber's profile) expires, 10,000 requests might hit the cache simultaneously, find it missing, and all hit the database at once.
- Solution 1: Probabilistic Early Expiration: With a 60s TTL, reads in the last ~5s before expiry roll the dice; a "losing" reader fetches from the DB and refreshes the cache before the key actually expires (see the sketch after this list).
- Solution 2: Locking: The first client to miss gets a lock (in the cache or ZK) to populate the data. Others wait or return stale data.
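A sketch of Solution 1's dice roll. The refresh probability ramps linearly from 0 to 1 over the final window before expiry (the 55s-60s band in the example above); the function name and signature are illustrative:

```go
package earlyexpire

import (
	"math/rand"
	"time"
)

// ShouldRefreshEarly decides, probabilistically, whether this reader should
// refresh the key before its real TTL expires. The closer to expiry, the
// likelier a refresh, so one lucky reader repopulates the key while everyone
// else keeps serving the cached value -- no stampede at the expiry instant.
func ShouldRefreshEarly(expiresAt time.Time, window time.Duration) bool {
	remaining := time.Until(expiresAt)
	if remaining <= 0 {
		return true // already expired: must refresh
	}
	if remaining >= window {
		return false // far from expiry: keep serving from cache
	}
	// Probability ramps from 0 (a full window away) to 1 (at expiry).
	p := 1 - float64(remaining)/float64(window)
	return rand.Float64() < p
}
```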
Summary of Design
| Component | Technology/Strategy | Why? |
|---|---|---|
| Partitioning | Consistent Hashing with V-Nodes | Minimal movement during resizing; uniform load. |
| Eviction | Approximated LRU | High performance, low memory overhead. |
| Availability | Leader-Follower Replication (×3) | Read scalability and fault tolerance. |
| Communication | gRPC (Protobuf over TCP) | Low-overhead networking. |
| Discovery | ZooKeeper / Etcd | Managing the ring state and failure detection. |
| Global Sync | Async Invalidation Queue | Keeps regions eventually consistent without latency penalty. |