System design for a Global Distributed Cache

Chengchang Yu

1. Functionality (Requirements)

We are designing a distributed key-value store acting as a cache for a database with 100 billion records.

  • Core Operations: Support get(key) and set(key, value) operations.
  • TTL Support: Keys must expire after a defined time-to-live.
  • Eviction: When memory is full, the system must evict less useful data (e.g., LRU).
  • Global Access: The cache is accessible from multiple geographical regions (assuming the "Global" in the prompt implies multi-region support).

2. Non-Functional Requirements

  • Latency: Extremely low. Reading from cache should be < 5ms (p99).
  • Availability: High (99.99%). A cache miss is acceptable, but a cache downtime that takes down the DB (thundering herd) is not.
  • Scalability: Must handle the dataset derived from 100B records.
  • Consistency: Eventual consistency is acceptable for a cache, but we must minimize stale data windows.
  • Observability: Hit/Miss ratios, memory usage, and eviction rates must be visible.

3. Back-of-the-Envelope Estimation (Sizing)

This is the most critical step to justify your architecture.

  • Total Records: 100 Billion.

  • Record Size:

  • Key (UUID): 16 bytes.

  • Value (JSON/Blob): Assume average 500 bytes.

  • Metadata (TTL, pointers): 64 bytes.

  • Total per record: ~600 bytes.

  • Total Data Volume: 100 × 10⁹ × 600 bytes ≈ 60 TB.

  • Cache Sizing (The "Hot" Set): Storing 60TB in RAM is expensive. Following the Pareto Principle (80/20 rule), we assume 20% of the keys generate 80% of the traffic.

  • Target Capacity: 60 TB × 0.20 = 12 TB.

  • Hardware:

  • Assume High-Memory instances (e.g., 256GB RAM).

  • Usable memory per node (after OS overhead): ~200GB.

  • Node Count: 12,000 GB / 200 GB ≈ 60 nodes.

  • To handle replication (factor of 3) and growth, we need roughly 180–200 nodes.
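
As a sanity check, the arithmetic above fits in a few lines (a throwaway Go sketch; the constants simply restate the assumptions listed in this section):

package main

import (
    "fmt"
    "math"
)

func main() {
    const (
        totalRecords   = 100e9 // 100 billion records
        bytesPerRecord = 600.0 // 16 B key + 500 B value + ~64 B metadata, rounded up
        hotFraction    = 0.20  // Pareto assumption: 20% of keys serve 80% of traffic
        usablePerNode  = 200e9 // ~200 GB usable RAM on a 256 GB instance
        replication    = 3     // leader + 2 followers
    )

    totalBytes := totalRecords * bytesPerRecord      // ~60 TB
    hotBytes := totalBytes * hotFraction             // ~12 TB
    baseNodes := math.Ceil(hotBytes / usablePerNode) // ~60 nodes
    fmt.Printf("hot set: %.0f TB, nodes incl. replication: %.0f\n",
        hotBytes/1e12, baseNodes*replication) // 12 TB, 180 nodes
}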


4. Key Entities

  1. Cache Client: A smart library or sidecar (like Envoy) living in the application service. It handles hashing and routing.
  2. Cache Node: A single server instance holding a shard of the data in RAM.
  3. Cluster Manager: Maintains the health of nodes and the topology (e.g., ZooKeeper or Etcd).

5. API Design

We will use a simple internal RPC API (e.g., gRPC).

service CacheService {
  // Returns value or error "Not Found"
  rpc Get(GetRequest) returns (GetResponse); 
  
  // Returns success or error
  rpc Set(SetRequest) returns (SetResponse); 
}

message SetRequest {
  string key = 1;
  bytes value = 2;
  int32 ttl_seconds = 3;
  bool write_through = 4; // Option for consistency strategy
}
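
The request and response messages not shown above might look roughly like the following; the field layout (including a found flag on GetResponse) is an assumption for illustration, and a real .proto file would also open with syntax = "proto3";.

// Assumed message shapes for the RPCs above.
message GetRequest {
  string key = 1;
}

message GetResponse {
  bytes value = 1;
  bool found = 2; // false on a cache miss
}

message SetResponse {
  bool ok = 1;
}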


6. High-Level Design

[Figure: Distributed Cache Workflow & Architecture]

We will use a Distributed Hash Table (DHT) architecture built on Consistent Hashing.

The Flow:

  1. Request: Service A requests key user_123.
  2. Hash: The Cache Client hashes user_123 (e.g., MD5 or MurmurHash).
  3. Route: The client looks up the Configuration Map (managed by ZooKeeper) to see which Cache Node owns the hash range for user_123.
  4. Execute: The client connects directly to the specific Cache Node (e.g., Node 4) to fetch the data.
  5. Fallback: If the cache returns "Miss", the client fetches from the DB and performs a Set on the cache (Cache-Aside pattern).
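
A minimal sketch of this cache-aside read path, assuming hypothetical Cache and Database interfaces (Go):

package cacheclient

import "time"

// Cache and Database stand in for the real cache client and DB access layer.
type Cache interface {
    Get(key string) ([]byte, bool)
    Set(key string, value []byte, ttl time.Duration)
}

type Database interface {
    Load(key string) ([]byte, error)
}

// GetWithCacheAside tries the cache first, falls back to the DB on a miss,
// and repopulates the cache so the next reader hits.
func GetWithCacheAside(cache Cache, db Database, key string) ([]byte, error) {
    if value, ok := cache.Get(key); ok {
        return value, nil // cache hit, served from RAM
    }
    value, err := db.Load(key) // cache miss: go to the source of truth
    if err != nil {
        return nil, err
    }
    cache.Set(key, value, 10*time.Minute) // repopulate with a TTL
    return value, nil
}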

7. Dive Deep (Addressing the Specific Issues)


A. Data Distribution & Consistent Hashing

With ~200 nodes, simple modulo hashing (key % N) is dangerous: if one node crashes, N changes and almost all keys are remapped, causing a massive cache stampede.

  • Solution: Consistent Hashing. Nodes are placed on a "Ring". A key is assigned to the first node found moving clockwise on the ring.
  • Virtual Nodes: To prevent hotspots (where one node gets a disproportionate amount of traffic), each physical node is mapped to multiple points on the ring (e.g., 100 virtual nodes per physical machine). This ensures uniform distribution.
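
A bare-bones sketch of such a ring with virtual nodes (Go; CRC32 is used purely for brevity, where the design above suggests MD5 or MurmurHash):

package ring

import (
    "fmt"
    "hash/crc32"
    "sort"
)

// Ring maps keys to physical nodes via virtual nodes placed on a hash circle.
type Ring struct {
    vnodes int               // virtual nodes per physical node
    keys   []uint32          // sorted hash positions on the ring
    owner  map[uint32]string // hash position -> physical node
}

func New(vnodes int) *Ring {
    return &Ring{vnodes: vnodes, owner: make(map[uint32]string)}
}

// Add places vnodes points on the ring for one physical node.
func (r *Ring) Add(node string) {
    for i := 0; i < r.vnodes; i++ {
        h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", node, i)))
        r.keys = append(r.keys, h)
        r.owner[h] = node
    }
    sort.Slice(r.keys, func(i, j int) bool { return r.keys[i] < r.keys[j] })
}

// Remove drops all of a node's virtual points, e.g. after a crash.
func (r *Ring) Remove(node string) {
    kept := r.keys[:0]
    for _, h := range r.keys {
        if r.owner[h] == node {
            delete(r.owner, h)
            continue
        }
        kept = append(kept, h)
    }
    r.keys = kept
}

// Get returns the first node found moving clockwise from the key's hash.
func (r *Ring) Get(key string) string {
    if len(r.keys) == 0 {
        return ""
    }
    h := crc32.ChecksumIEEE([]byte(key))
    i := sort.Search(len(r.keys), func(i int) bool { return r.keys[i] >= h })
    if i == len(r.keys) {
        i = 0 // wrap around the ring
    }
    return r.owner[r.keys[i]]
}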

B. Replication & High Availability

If a node crashes, the data on it is lost. We cannot afford to hammer the DB for that 200GB of data.

  • Leader-Follower: Every partition on the hash ring has 1 Leader and 2 Followers.

  • Writes: Go to the Leader → replicated to Followers asynchronously (for speed) or synchronously (for consistency).

  • Reads: Can go to Leader (strong consistency) or Followers (eventual consistency).

  • Failover: If a Leader crashes, the Cluster Manager (ZooKeeper) detects the heartbeat loss and promotes a Follower to Leader.
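
A toy write path illustrating the synchronous vs. asynchronous choice (Go; Follower.Apply is a hypothetical stub for the replication RPC):

package replication

// Follower stands in for the RPC client of a replica node.
type Follower interface {
    Apply(key string, value []byte) error
}

type Leader struct {
    store     map[string][]byte
    followers []Follower
}

func NewLeader(followers ...Follower) *Leader {
    return &Leader{store: make(map[string][]byte), followers: followers}
}

// Set writes locally, then replicates. Synchronous replication trades write
// latency for consistency; asynchronous keeps writes fast but followers may
// briefly serve stale data.
func (l *Leader) Set(key string, value []byte, sync bool) error {
    l.store[key] = value
    for _, f := range l.followers {
        if sync {
            if err := f.Apply(key, value); err != nil {
                return err // fail the write if a replica cannot apply it
            }
        } else {
            go f.Apply(key, value) // fire-and-forget
        }
    }
    return nil
}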


C. Handling "Crashing of Cache Servers"

When a node crashes completely:

  1. Detection: The Cluster Manager detects the node is unreachable via gossip protocol/heartbeats.
  2. Topology Update: The map is updated to remove the failed node.
  3. Request Re-routing: Clients pull the new config. Traffic for that hash range is routed to the next node on the ring (which holds the replica).
  4. Recovery: A new node is spun up. It requests a "state transfer" from its peers to populate its cache.
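
Continuing the hypothetical ring sketch from section A, the client-side effect of steps 2 and 3 is simply a topology change:

// After the Cluster Manager reports node-4 dead, clients drop it from their
// local ring; keys it owned now resolve to the next node clockwise, which
// already holds the replica.
func handleNodeFailure() {
    r := New(100) // 100 virtual nodes per physical node
    for _, n := range []string{"node-1", "node-2", "node-3", "node-4"} {
        r.Add(n)
    }
    before := r.Get("user_123") // may resolve to node-4
    r.Remove("node-4")          // topology update after heartbeat loss
    after := r.Get("user_123")  // next node on the ring (holds the replica)
    fmt.Println(before, after)
}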

D. Cache Warm-up

A cold cache is useless and dangerous (DB overload).

  • Scenario: We are deploying a new region or recovering a massive cluster failure.
  • Strategy 1: Snapshot/dump: Periodically dump cache snapshots to disk (like Redis RDB). On boot, load the file.
  • Strategy 2: Population from DB Logs: Analyze DB read logs to identify the "hot keys" and pre-populate the cache before allowing traffic to hit it.
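
A minimal sketch of Strategy 1, assuming an RDB-like snapshot is simply the serialized in-memory map (Go, using encoding/gob):

package warmup

import (
    "encoding/gob"
    "os"
)

// DumpSnapshot periodically writes the in-memory data to disk so a restarted
// node can warm itself up without hammering the database.
func DumpSnapshot(path string, data map[string][]byte) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return gob.NewEncoder(f).Encode(data)
}

// LoadSnapshot repopulates the cache from the last dump on boot.
func LoadSnapshot(path string) (map[string][]byte, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    data := make(map[string][]byte)
    err = gob.NewDecoder(f).Decode(&data)
    return data, err
}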

E. TTL & Eviction

How do we actually delete data?

  • Eviction (Memory Full): Use Approximated LRU. Exact LRU requires maintaining a global linked list under a lock, which is too expensive at this scale; instead, sample 5 random keys and evict the one with the oldest access time.
  • TTL (Time Expiration):
  • Lazy Deletion: When a client tries to get a key, check the timestamp. If expired, return null and delete.
  • Active Deletion: A background thread randomly samples keys with TTLs and deletes expired ones to free memory.
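
A sketch combining lazy TTL deletion with sampled ("approximated LRU") eviction (Go; the 5-key sample follows the text, everything else is illustrative):

package cache

import "time"

type entry struct {
    value      []byte
    expiresAt  time.Time
    lastAccess time.Time
}

type Store struct {
    data map[string]*entry
}

// Get applies lazy deletion: an expired key is removed on access.
func (s *Store) Get(key string) ([]byte, bool) {
    e, ok := s.data[key]
    if !ok {
        return nil, false
    }
    if time.Now().After(e.expiresAt) {
        delete(s.data, key) // lazy TTL deletion
        return nil, false
    }
    e.lastAccess = time.Now()
    return e.value, true
}

// evictOne approximates LRU: sample a handful of keys (Go map iteration is
// effectively random) and evict the least recently accessed one, instead of
// maintaining an exact global list.
func (s *Store) evictOne(sampleSize int) {
    var victim string
    var oldest time.Time
    seen := 0
    for k, e := range s.data {
        if seen == 0 || e.lastAccess.Before(oldest) {
            victim, oldest = k, e.lastAccess
        }
        seen++
        if seen >= sampleSize { // e.g. sampleSize = 5
            break
        }
    }
    if victim != "" {
        delete(s.data, victim)
    }
}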

F. Global Distribution (The "Global" Requirement)

If we have a cluster in US-East and one in EU-West:

  • Read Local: Clients always read from the local region cache.
  • Write Propagation:
  • Option A (Simple): Cache invalidation. If DB updates in US-East, send an invalidation event (via Kafka) to EU-West cache to delete the key.
  • Option B (Active-Active): Use CRDTs (Conflict-Free Replicated Data Types) if we are caching write-heavy counters, though this is complex for general caching.
  • Recommendation: Use Cache Invalidation. It is safer to have a cache miss in Europe than to serve stale data indefinitely.
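
A sketch of the receiving side of Option A; the events channel stands in for the Kafka consumer, which is not shown (Go):

package invalidation

// InvalidationEvent is published by the writing region whenever the DB row
// behind a cached key changes.
type InvalidationEvent struct {
    Key string
}

type LocalCache interface {
    Delete(key string)
}

// Consume deletes keys locally as invalidations arrive; the next read in this
// region is a cache miss that repopulates from the now up-to-date database.
func Consume(events <-chan InvalidationEvent, cache LocalCache) {
    for ev := range events {
        cache.Delete(ev.Key)
    }
}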

G. The "Thundering Herd" Problem

If a "mega-hot" key (e.g., Justin Bieber's profile) expires, 10,000 requests might hit the cache simultaneously, find it missing, and all hit the database at once.

  • Solution 1: Probabilistic Early Expiration: If the TTL is 60s and a read finds the key is already older than 55s, roll a die; if you "lose", fetch from the DB and refresh the cache before the key actually expires.
  • Solution 2: Locking: The first client to miss gets a lock (in the cache or ZK) to populate the data. Others wait or return stale data.
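
A sketch of Solution 1; the 5-second window mirrors the 55s/60s example above, and the linear probability ramp is an illustrative choice (Go):

package herd

import (
    "math/rand"
    "time"
)

// shouldRefreshEarly returns true with growing probability as a key nears its
// TTL, so one of many concurrent readers refreshes the value before it
// actually expires and triggers a stampede.
func shouldRefreshEarly(age, ttl time.Duration) bool {
    window := 5 * time.Second // start rolling the dice in the last 5 seconds
    remaining := ttl - age
    if remaining > window {
        return false // far from expiry: serve from cache as usual
    }
    if remaining <= 0 {
        return true // already expired: must refresh
    }
    // Probability ramps linearly from 0 to 1 over the final window.
    p := 1 - float64(remaining)/float64(window)
    return rand.Float64() < p
}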

Summary of Design

Component     | Technology/Strategy               | Why?
Partitioning  | Consistent Hashing with V-Nodes   | Minimal movement during resizing; uniform load.
Eviction      | Approximated LRU                  | High performance, low memory overhead.
Availability  | Leader-Follower Replication (x3)  | Read scalability and fault tolerance.
Communication | TCP / Protobuf                    | Low overhead networking.
Discovery     | ZooKeeper / Etcd                  | Managing the ring state and failure detection.
Global Sync   | Async Invalidation Queue          | Keeps regions eventually consistent without latency penalty.