System design for a Global Distributed Cache

Chengchang Yu

1. Functionality (Requirements)

We are designing a distributed key-value store acting as a cache for a database with 100 billion records.

  • Core Operations: Support get(key) and set(key, value) operations.
  • TTL Support: Keys must expire after a defined time-to-live.
  • Eviction: When memory is full, the system must evict less useful data (e.g., LRU).
  • Global Access: The cache is accessible from multiple geographical regions (assuming the "Global" in the prompt implies multi-region support).

2. Non-Functional Requirements

  • Latency: Extremely low. Reading from cache should be < 5ms (p99).
  • Availability: High (99.99%). A cache miss is acceptable, but a cache downtime that takes down the DB (thundering herd) is not.
  • Scalability: Must handle the dataset derived from 100B records.
  • Consistency: Eventual consistency is acceptable for a cache, but we must minimize stale data windows.
  • Observability: Hit/Miss ratios, memory usage, and eviction rates must be visible.

3. Back-of-the-Envelope Estimation (Sizing)

This is the most critical step to justify your architecture.

  • Total Records: 100 Billion.

  • Record Size:

  • Key (UUID): 16 bytes.

  • Value (JSON/Blob): Assume average 500 bytes.

  • Metadata (TTL, pointers): 64 bytes.

  • Total per record: ~600 bytes.

  • Total Data Volume: 100 × 10⁹ × 600 bytes ≈ 60 TB.

  • Cache Sizing (The "Hot" Set): Storing 60TB in RAM is expensive. Following the Pareto Principle (80/20 rule), we assume 20% of the keys generate 80% of the traffic.

  • Target Capacity: 60 TB × 0.20 = 12 TB.

  • Hardware:

  • Assume High-Memory instances (e.g., 256GB RAM).

  • Usable memory per node (after OS overhead): ~200GB.

  • Node Count: 12,000 GB / 200 GB ≈ 60 nodes.

  • To handle replication (factor of 3) and growth, we need roughly 180–200 nodes.
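
As a sanity check, the arithmetic above fits in a few lines (a throwaway Go sketch; the constants simply restate the assumptions listed in this section):

package main

import (
    "fmt"
    "math"
)

func main() {
    const (
        totalRecords   = 100e9 // 100 billion records
        bytesPerRecord = 600.0 // 16 B key + 500 B value + ~64 B metadata, rounded up
        hotFraction    = 0.20  // Pareto assumption: 20% of keys serve 80% of traffic
        usablePerNode  = 200e9 // ~200 GB usable RAM on a 256 GB instance
        replication    = 3     // leader + 2 followers
    )

    totalBytes := totalRecords * bytesPerRecord      // ~60 TB
    hotBytes := totalBytes * hotFraction             // ~12 TB
    baseNodes := math.Ceil(hotBytes / usablePerNode) // ~60 nodes
    fmt.Printf("hot set: %.0f TB, nodes incl. replication: %.0f\n",
        hotBytes/1e12, baseNodes*replication) // 12 TB, 180 nodes
}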


4. Key Entities

  1. Cache Client: A smart library or sidecar (like Envoy) living in the application service. It handles hashing and routing.
  2. Cache Node: A single server instance holding a shard of the data in RAM.
  3. Cluster Manager: Maintains the health of nodes and the topology (e.g., ZooKeeper or Etcd).

5. API Design

We will use a simple internal RPC API (e.g., gRPC).

service CacheService {
  // Returns value or error "Not Found"
  rpc Get(GetRequest) returns (GetResponse); 
  
  // Returns success or error
  rpc Set(SetRequest) returns (SetResponse); 
}

message SetRequest {
  string key = 1;
  bytes value = 2;
  int32 ttl_seconds = 3;
  bool write_through = 4; // Option for consistency strategy
}
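
The request and response messages not shown above might look roughly like the following; the field layout (including a found flag on GetResponse) is an assumption for illustration, and a real .proto file would also open with syntax = "proto3";.

// Assumed message shapes for the RPCs above.
message GetRequest {
  string key = 1;
}

message GetResponse {
  bytes value = 1;
  bool found = 2; // false on a cache miss
}

message SetResponse {
  bool ok = 1;
}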


6. High-Level Design

[Figure: Distributed Cache Workflow & Architecture]

We will use a Distributed Hash Table (DHT) architecture built on Consistent Hashing.

The Flow:

  1. Request: Service A requests key user_123.
  2. Hash: The Cache Client hashes user_123 (e.g., MD5 or MurmurHash).
  3. Route: The client looks up the Configuration Map (managed by ZooKeeper) to see which Cache Node owns the hash range for user_123.
  4. Execute: The client connects directly to the specific Cache Node (e.g., Node 4) to fetch the data.
  5. Fallback: If the cache returns "Miss", the client fetches from the DB and performs a Set on the cache (Cache-Aside pattern).
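
A minimal sketch of this cache-aside read path, assuming hypothetical Cache and Database interfaces (Go):

package cacheclient

import "time"

// Cache and Database stand in for the real cache client and DB access layer.
type Cache interface {
    Get(key string) ([]byte, bool)
    Set(key string, value []byte, ttl time.Duration)
}

type Database interface {
    Load(key string) ([]byte, error)
}

// GetWithCacheAside tries the cache first, falls back to the DB on a miss,
// and repopulates the cache so the next reader hits.
func GetWithCacheAside(cache Cache, db Database, key string) ([]byte, error) {
    if value, ok := cache.Get(key); ok {
        return value, nil // cache hit, served from RAM
    }
    value, err := db.Load(key) // cache miss: go to the source of truth
    if err != nil {
        return nil, err
    }
    cache.Set(key, value, 10*time.Minute) // repopulate with a TTL
    return value, nil
}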

7. Dive Deep (Addressing the Specific Issues)


A. Data Distribution & Consistent Hashing

With ~200 nodes, simple modulo hashing (key % N) is dangerous: if one node crashes, N changes and almost all keys are remapped, causing a massive cache stampede.

  • Solution: Consistent Hashing. Nodes are placed on a "Ring". A key is assigned to the first node found moving clockwise on the ring.
  • Virtual Nodes: To prevent hotspots (where one node gets a disproportionate amount of traffic), each physical node is mapped to multiple points on the ring (e.g., 100 virtual nodes per physical machine). This ensures uniform distribution.
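
A bare-bones sketch of such a ring with virtual nodes (Go; CRC32 is used purely for brevity, where the design above suggests MD5 or MurmurHash):

package ring

import (
    "fmt"
    "hash/crc32"
    "sort"
)

// Ring maps keys to physical nodes via virtual nodes placed on a hash circle.
type Ring struct {
    vnodes int               // virtual nodes per physical node
    keys   []uint32          // sorted hash positions on the ring
    owner  map[uint32]string // hash position -> physical node
}

func New(vnodes int) *Ring {
    return &Ring{vnodes: vnodes, owner: make(map[uint32]string)}
}

// Add places vnodes points on the ring for one physical node.
func (r *Ring) Add(node string) {
    for i := 0; i < r.vnodes; i++ {
        h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", node, i)))
        r.keys = append(r.keys, h)
        r.owner[h] = node
    }
    sort.Slice(r.keys, func(i, j int) bool { return r.keys[i] < r.keys[j] })
}

// Remove drops all of a node's virtual points, e.g. after a crash.
func (r *Ring) Remove(node string) {
    kept := r.keys[:0]
    for _, h := range r.keys {
        if r.owner[h] == node {
            delete(r.owner, h)
            continue
        }
        kept = append(kept, h)
    }
    r.keys = kept
}

// Get returns the first node found moving clockwise from the key's hash.
func (r *Ring) Get(key string) string {
    if len(r.keys) == 0 {
        return ""
    }
    h := crc32.ChecksumIEEE([]byte(key))
    i := sort.Search(len(r.keys), func(i int) bool { return r.keys[i] >= h })
    if i == len(r.keys) {
        i = 0 // wrap around the ring
    }
    return r.owner[r.keys[i]]
}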

B. Replication & High Availability

If a node crashes, the data on it is lost. We cannot afford to hammer the DB for that 200GB of data.

  • Leader-Follower: Every partition on the hash ring has 1 Leader and 2 Followers.

  • Writes: Go to the Leader → replicated to Followers asynchronously (for speed) or synchronously (for consistency).

  • Reads: Can go to Leader (strong consistency) or Followers (eventual consistency).

  • Failover: If a Leader crashes, the Cluster Manager (ZooKeeper) detects the heartbeat loss and promotes a Follower to Leader.
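
A toy write path illustrating the synchronous vs. asynchronous choice (Go; Follower.Apply is a hypothetical stub for the replication RPC):

package replication

// Follower stands in for the RPC client of a replica node.
type Follower interface {
    Apply(key string, value []byte) error
}

type Leader struct {
    store     map[string][]byte
    followers []Follower
}

func NewLeader(followers ...Follower) *Leader {
    return &Leader{store: make(map[string][]byte), followers: followers}
}

// Set writes locally, then replicates. Synchronous replication trades write
// latency for consistency; asynchronous keeps writes fast but followers may
// briefly serve stale data.
func (l *Leader) Set(key string, value []byte, sync bool) error {
    l.store[key] = value
    for _, f := range l.followers {
        if sync {
            if err := f.Apply(key, value); err != nil {
                return err // fail the write if a replica cannot apply it
            }
        } else {
            go f.Apply(key, value) // fire-and-forget
        }
    }
    return nil
}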


C. Handling "Crashing of Cache Servers"

When a node crashes completely:

  1. Detection: The Cluster Manager detects the node is unreachable via gossip protocol/heartbeats.
  2. Topology Update: The map is updated to remove the failed node.
  3. Request Re-routing: Clients pull the new config. Traffic for that hash range is routed to the next node on the ring (which holds the replica).
  4. Recovery: A new node is spun up. It requests a "state transfer" from its peers to populate its cache.
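
Continuing the hypothetical ring sketch from section A, the client-side effect of steps 2 and 3 is simply a topology change:

// After the Cluster Manager reports node-4 dead, clients drop it from their
// local ring; keys it owned now resolve to the next node clockwise, which
// already holds the replica.
func handleNodeFailure() {
    r := New(100) // 100 virtual nodes per physical node
    for _, n := range []string{"node-1", "node-2", "node-3", "node-4"} {
        r.Add(n)
    }
    before := r.Get("user_123") // may resolve to node-4
    r.Remove("node-4")          // topology update after heartbeat loss
    after := r.Get("user_123")  // next node on the ring (holds the replica)
    fmt.Println(before, after)
}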

D. Cache Warm-up

A cold cache is useless and dangerous (DB overload).

  • Scenario: We are deploying a new region or recovering a massive cluster failure.
  • Strategy 1: Snapshot/dump: Periodically dump cache snapshots to disk (like Redis RDB). On boot, load the file.
  • Strategy 2: Population from DB Logs: Analyze DB read logs to identify the "hot keys" and pre-populate the cache before allowing traffic to hit it.
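
A minimal sketch of Strategy 1, assuming an RDB-like snapshot is simply the serialized in-memory map (Go, using encoding/gob):

package warmup

import (
    "encoding/gob"
    "os"
)

// DumpSnapshot periodically writes the in-memory data to disk so a restarted
// node can warm itself up without hammering the database.
func DumpSnapshot(path string, data map[string][]byte) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return gob.NewEncoder(f).Encode(data)
}

// LoadSnapshot repopulates the cache from the last dump on boot.
func LoadSnapshot(path string) (map[string][]byte, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    data := make(map[string][]byte)
    err = gob.NewDecoder(f).Decode(&data)
    return data, err
}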

E. TTL & Eviction

How do we actually delete data?

  • Eviction (Memory Full): Use Approximated LRU. Exact LRU requires maintaining a global linked list under a lock, which is too expensive at this scale; instead, sample 5 random keys and evict the one with the oldest access time.
  • TTL (Time Expiration):
  • Lazy Deletion: When a client tries to get a key, check the timestamp. If expired, return null and delete.
  • Active Deletion: A background thread randomly samples keys with TTLs and deletes expired ones to free memory.
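
A sketch combining lazy TTL deletion with sampled ("approximated LRU") eviction (Go; the 5-key sample follows the text, everything else is illustrative):

package cache

import "time"

type entry struct {
    value      []byte
    expiresAt  time.Time
    lastAccess time.Time
}

type Store struct {
    data map[string]*entry
}

// Get applies lazy deletion: an expired key is removed on access.
func (s *Store) Get(key string) ([]byte, bool) {
    e, ok := s.data[key]
    if !ok {
        return nil, false
    }
    if time.Now().After(e.expiresAt) {
        delete(s.data, key) // lazy TTL deletion
        return nil, false
    }
    e.lastAccess = time.Now()
    return e.value, true
}

// evictOne approximates LRU: sample a handful of keys (Go map iteration is
// effectively random) and evict the least recently accessed one, instead of
// maintaining an exact global list.
func (s *Store) evictOne(sampleSize int) {
    var victim string
    var oldest time.Time
    seen := 0
    for k, e := range s.data {
        if seen == 0 || e.lastAccess.Before(oldest) {
            victim, oldest = k, e.lastAccess
        }
        seen++
        if seen >= sampleSize { // e.g. sampleSize = 5
            break
        }
    }
    if victim != "" {
        delete(s.data, victim)
    }
}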

F. Global Distribution (The "Global" Requirement)

If we have a cluster in US-East and one in EU-West:

  • Read Local: Clients always read from the local region cache.
  • Write Propagation:
  • Option A (Simple): Cache invalidation. If DB updates in US-East, send an invalidation event (via Kafka) to EU-West cache to delete the key.
  • Option B (Active-Active): Use CRDTs (Conflict-Free Replicated Data Types) if we are caching write-heavy counters, though this is complex for general caching.
  • Recommendation: Use Cache Invalidation. It is safer to have a cache miss in Europe than to serve stale data indefinitely.
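
A sketch of the receiving side of Option A; the events channel stands in for the Kafka consumer, which is not shown (Go):

package invalidation

// InvalidationEvent is published by the writing region whenever the DB row
// behind a cached key changes.
type InvalidationEvent struct {
    Key string
}

type LocalCache interface {
    Delete(key string)
}

// Consume deletes keys locally as invalidations arrive; the next read in this
// region is a cache miss that repopulates from the now up-to-date database.
func Consume(events <-chan InvalidationEvent, cache LocalCache) {
    for ev := range events {
        cache.Delete(ev.Key)
    }
}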

G. The "Thundering Herd" Problem

If a "mega-hot" key (e.g., Justin Bieber's profile) expires, 10,000 requests might hit the cache simultaneously, find it missing, and all hit the database at once.

  • Solution 1: Probabilistic Early Expiration: If the TTL is 60s and a read finds the key is already older than 55s, roll a die; if you "lose", fetch from the DB and refresh the cache before the key actually expires.
  • Solution 2: Locking: The first client to miss gets a lock (in the cache or ZK) to populate the data. Others wait or return stale data.
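
A sketch of Solution 1; the 5-second window mirrors the 55s/60s example above, and the linear probability ramp is an illustrative choice (Go):

package herd

import (
    "math/rand"
    "time"
)

// shouldRefreshEarly returns true with growing probability as a key nears its
// TTL, so one of many concurrent readers refreshes the value before it
// actually expires and triggers a stampede.
func shouldRefreshEarly(age, ttl time.Duration) bool {
    window := 5 * time.Second // start rolling the dice in the last 5 seconds
    remaining := ttl - age
    if remaining > window {
        return false // far from expiry: serve from cache as usual
    }
    if remaining <= 0 {
        return true // already expired: must refresh
    }
    // Probability ramps linearly from 0 to 1 over the final window.
    p := 1 - float64(remaining)/float64(window)
    return rand.Float64() < p
}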

Summary of Design

Component     | Technology/Strategy               | Why?
Partitioning  | Consistent Hashing with V-Nodes   | Minimal movement during resizing; uniform load.
Eviction      | Approximated LRU                  | High performance, low memory overhead.
Availability  | Leader-Follower Replication (x3)  | Read scalability and fault tolerance.
Communication | TCP / Protobuf                    | Low overhead networking.
Discovery     | ZooKeeper / Etcd                  | Managing the ring state and failure detection.
Global Sync   | Async Invalidation Queue          | Keeps regions eventually consistent without latency penalty.