Caching Demystified - Part 1: Fundamentals
Table of Contents
- What is Caching?
- The Memory Hierarchy - Speed vs Cost vs Capacity
- Why Caching Matters - The Numbers Behind the Decisions
- Core Terminology - The Language of Caching
- Types of Caches - The Full Landscape
- Cache Eviction Policies - What Gets Removed and Why
- Cache Metrics - Measuring What Matters
- Cache Invalidation - The Hardest Problem in Computer Science
- Cache Sizing and Capacity Planning
1. What is Caching?
A cache is a high-speed, temporary storage layer that holds copies of data so that future requests for that same data can be served faster - without going back to the original, slower source.
The core principle behind caching is simple: avoid repeating expensive work.
Analogy 1 - The Kitchen Counter
Imagine a chef in a busy restaurant kitchen:
- The grocery store is 10 minutes away. You go there when you need something rare.
- The pantry stores hundreds of ingredients. Fetching from there takes 2 minutes.
- The kitchen counter holds the 10 most-used ingredients right at hand. Fetching takes 2 seconds.
The kitchen counter is the cache. A smart chef places frequently used items on the counter to make cooking fast. When a new ingredient is needed, they fetch it from the pantry and may place it on the counter if they expect to use it again soon.
When the counter is full and a new item is needed, the chef removes the item they have not used in the longest time. That is an eviction.
Analogy 2 - The Human Brain
When someone asks you "What is 2 + 2?", you do not go through the mental process of counting. Your brain has cached that result. The answer 4 comes instantly.
But if someone asks "What is 7,463 x 9,872?", you do not have that cached. You have to compute it from scratch (a cache miss). That computation is expensive and takes time.
Analogy 3 - The Office Desk
Think of a developer's workspace:
- Desk surface: Papers currently being worked on (fastest, very limited space)
- Filing cabinet in the office: Documents accessed weekly (fast, limited space)
- Central company archive: All documents ever created (slow, unlimited space)
You keep active documents on your desk. When you finish with one, you file it. When you need an archived document, you retrieve it and put it on your desk.
The Technical Picture
Without caching:
Every Request:
User --> Application --> Database (100ms) --> Application --> User
1,000 requests for "product:123":
1,000 x 100ms = 100,000ms total database time
Database receives 1,000 queries for the same data
With caching:
First Request (Cache Miss):
User --> Application --> Cache (miss, 1ms) --> Database (100ms) --> Cache (store) --> User
All Subsequent Requests (Cache Hit):
User --> Application --> Cache (hit, 1ms) --> User
1,000 requests for "product:123" with 99% hit rate:
- 10 database queries: 10 x 100ms = 1,000ms
- 990 cache hits: 990 x 1ms = 990ms
- Total: ~2 seconds vs 100 seconds without cache
2. The Memory Hierarchy
Computers are built on a fundamental trade-off: faster storage is physically smaller and more expensive. This creates a natural hierarchy where data is stored closer to the CPU if it is accessed frequently.
+------------------------------------------------------------------+
| SPEED LAYER SIZE LATENCY |
+------------------------------------------------------------------+
| Fastest CPU Registers < 1 KB < 0.3 ns |
| L1 Cache (per core) 32 - 128 KB ~1 ns |
| L2 Cache (per core) 256 KB - 1 MB ~4 ns |
| L3 Cache (shared) 4 MB - 64 MB ~10-30 ns |
| Main Memory (RAM) GBs ~100 ns |
| NVMe SSD TBs ~100 us |
| SATA SSD TBs ~500 us |
| Hard Disk Drive (HDD) TBs ~10 ms |
| Slowest Network (Remote DB) Unlimited ~100-1000 ms |
+------------------------------------------------------------------+
Latency Numbers Every Engineer Must Know
These numbers, popularized by Jeff Dean at Google, are the foundation of every performance discussion:
Operation Approximate Latency
-----------------------------------------------------------
L1 cache reference 0.5 ns
Branch misprediction 5 ns
L2 cache reference 7 ns
Mutex lock / unlock 25 ns
Main memory reference (RAM) 100 ns
Compress 1 KB with fast compression 3,000 ns (3 us)
Send 1 KB over 1 Gbps network 10,000 ns (10 us)
Read 4 KB randomly from NVMe SSD 150,000 ns (150 us)
Read 1 MB sequentially from RAM 250,000 ns (250 us)
Round trip within same datacenter 500,000 ns (0.5 ms)
Read 1 MB sequentially from SSD 1,000,000 ns (1 ms)
Disk seek (HDD) 10,000,000 ns (10 ms)
Read 1 MB sequentially from HDD 20,000,000 ns (20 ms)
Send packet from US West Coast to Europe 150,000,000 ns (150 ms)
What These Numbers Mean for Caching
If your application fetches from a remote database over a network:
- Average latency: ~1-10 ms (1,000,000 to 10,000,000 ns)
If your application reads from a local in-process cache (RAM):
- Average latency: ~100 ns to ~1 us
The cache is 10,000x to 100,000x faster than the database for a cache hit.
Application to Software Architecture
Every software system has its own "memory hierarchy":
Level Storage Latency Capacity Cost
------ ------- ------- -------- ----
L1 In-process memory cache < 1 ms 100 MB Low
L2 Distributed cache 1-5 ms 100 GB Medium
L3 Database (with indexes) 5-50 ms 1 TB High
L4 Cold storage / archive 100-1000ms Unlimited Very Low
Understanding this hierarchy tells you where to put your cache and why.
3. Why Caching Matters - The Numbers
Scenario: E-Commerce Product Page
A product detail page hits 5 database tables:
- Products table: 20ms
- Inventory table: 15ms
- Pricing/promotions table: 25ms
- Reviews aggregate: 40ms
- Recommendations table: 30ms
- Total uncached: 130ms per request
With caching, all 5 queries are replaced by a single cache lookup: 1ms.
At 10,000 requests per second:
- Uncached: 10,000 x 130ms = requires massive database cluster
- Cached (95% hit rate): 500 x 130ms + 9,500 x 1ms = handles with a much smaller database
The Effective Access Time Formula
EAT (Effective Access Time) quantifies the average time to access data given a specific hit rate:
EAT = (Hit Rate x Cache Latency) + (Miss Rate x (Cache Latency + DB Latency))
Example:
- Cache latency: 1ms
- Database latency: 100ms
- Hit Rate: 90% (0.9), Miss Rate: 10% (0.1)
EAT = (0.9 x 1ms) + (0.1 x (1ms + 100ms))
EAT = 0.9ms + 10.1ms
EAT = 11ms (vs 100ms without cache = 9x improvement)
With 99% hit rate:
EAT = (0.99 x 1ms) + (0.01 x 101ms)
EAT = 0.99ms + 1.01ms
EAT = 2ms (vs 100ms without cache = 50x improvement)
Lesson: Pushing hit rate from 90% to 99% improves performance by another 5x. Every percentage point matters.
When Caching Is the Right Tool
Caching works best when:
- Read-heavy workloads: Data is read far more often than it is written (10:1 or higher read/write ratio).
- Repeated access patterns: The same data is requested by multiple users or requests.
- Expensive origin fetches: Database queries involve joins, aggregations, or full table scans.
- External API calls: Third-party services have rate limits, latency, or cost per call.
- Computationally expensive results: Machine learning inference, complex calculations, report generation.
- Tolerable staleness: The business can accept data being slightly behind (seconds to minutes).
Caching is the wrong tool when:
- Every request accesses unique data (no temporal locality).
- Data changes on every single write and cannot be stale at all.
- The operation is write-heavy with very infrequent reads.
- Compliance requirements prohibit storing data in additional systems.
4. Core Terminology - The Language of Caching
Understanding these terms precisely is essential for both building cache systems and discussing them in interviews.
Cache Hit
A cache hit occurs when the requested data key is found in the cache and the cached value is returned directly.
Request: GET user:123
Cache state: { "user:123": { name: "Alice", role: "admin" } }
Result: Cache HIT - return { name: "Alice", role: "admin" } immediately
Cache Miss
A cache miss occurs when the requested data key is NOT found in the cache. The system must fetch from the origin store.
Request: GET user:456
Cache state: { "user:123": { ... } } -- user:456 not here
Result: Cache MISS - go to database, store in cache, return value
Types of Cache Misses
Cold Miss (Compulsory Miss):
- The data is being accessed for the very first time ever.
- No cache can prevent this. It is unavoidable.
- Example: Application just started, cache is empty.
Capacity Miss:
- The cache had the data at some point but evicted it because the cache was full.
- Solution: Increase cache capacity or improve eviction policy.
Conflict Miss (CPU caches only):
- In direct-mapped or N-way set-associative CPU caches, two memory addresses may hash to the same cache slot, forcing eviction even when the cache has logical space.
- Not applicable to application-level caches (which are fully associative).
Cache Hit Ratio (Hit Rate)
The percentage of total cache requests that result in a hit:
Hit Ratio = Cache Hits / (Cache Hits + Cache Misses) x 100%
Example: 9,500 hits and 500 misses out of 10,000 requests:
Hit Ratio = 9,500 / 10,000 = 95%
What Constitutes a Good Hit Ratio?
< 70% - Cache may not be helping much. Check access patterns and TTL.
70 - 85% - Acceptable for variable/user-specific workloads.
85 - 95% - Good. Typical for well-configured production systems.
95 - 99% - Excellent. Target for read-heavy, repetitive workloads.
> 99% - Outstanding. Common for static content and reference data.
Time to Live (TTL)
TTL is the maximum duration a cache entry is considered valid. After the TTL elapses, the entry expires and must be refreshed from the origin.
cache.set("product:789", productData, ttl=300) // 300 seconds = 5 minutes
After 300 seconds, any request for "product:789" results in a cache miss.
Absolute TTL: Entry expires at a fixed time from when it was created, regardless of access.
Sliding TTL (Idle TTL): The TTL is reset every time the entry is accessed. Keeps hot data alive indefinitely. Can cause stale data to persist if the entry is accessed frequently but the source changes.
Cache Eviction
Eviction is the automatic removal of cache entries to make room for new ones when the cache reaches its capacity limit. Eviction is driven by capacity constraints and is controlled by the eviction policy.
Eviction is about space management, not about data correctness.
Cache Invalidation
Invalidation is the deliberate removal or marking as stale of a specific cache entry because the underlying source data has changed. Invalidation is about data correctness, not space.
User updates their profile photo:
1. Database is updated with new photo URL.
2. Cache entry for "user:123:profile" is INVALIDATED (deleted).
3. Next read for user:123 goes to DB and re-populates the cache.
The famous quote from Phil Karlton: "There are only two hard things in Computer Science: cache invalidation and naming things."
Stale Data
A cache entry is stale when the source data has been updated but the cached copy has not been refreshed. Stale data represents a form of cache-origin inconsistency.
Staleness is acceptable when:
- The data does not change often (reference data, product catalog).
- The business can tolerate reading slightly outdated values.
- The TTL is short enough that the stale window is acceptable.
Staleness is NOT acceptable when:
- Financial data (account balance, transaction records).
- Inventory levels (stock availability).
- Any data that drives critical business decisions requiring real-time accuracy.
Cache Warm vs Cache Cold
Warm Cache: The cache contains the anticipated working set. Cache hit rates are high. The system performs well.
Cold Cache: The cache is empty or contains irrelevant data. Cache miss rates are high. Every request goes to the database. This happens:
- When the application first starts.
- After a cache flush.
- After a cache server restart.
- After a rolling deployment.
The transition from cold to warm is called cache warming.
Working Set
The working set is the subset of data that is actively accessed during a given time window. In most real-world systems, 20% of the data accounts for 80% of the reads (Pareto principle / power-law distribution).
Caching only the working set provides most of the benefit at a fraction of the storage cost.
Cache Stampede (Thundering Herd)
When a popular cache entry expires, many concurrent requests all simultaneously experience a cache miss and all simultaneously query the database. This creates a sudden spike in database load. This problem is covered in depth in Part 4.
Cache Coherence
In systems with multiple cache instances (e.g., multiple app servers each with a local in-memory cache), coherence is the guarantee that all instances see consistent data.
If Server A updates data and only invalidates its own local cache, Server B still serves the old value. This is a coherence violation.
Cache Penetration
Requesting data that does not exist in either the cache or the database, causing every request to pass through to the database. Covered in depth in Part 4.
Cache Avalanche
When a large number of cache entries expire simultaneously, causing a flood of cache misses and a corresponding spike in database queries. Covered in depth in Part 4.
Write-Allocate vs No-Write-Allocate
Write-Allocate: On a write to a key that is NOT in the cache, the data is loaded into the cache before being updated. Keeps the cache in sync with writes.
No-Write-Allocate: On a write to a key that is NOT in the cache, the data is written directly to the backing store without loading it into cache first.
5. Types of Caches - The Full Landscape
Caches exist at every layer of the computing stack. Knowing the full landscape helps you choose the right cache for each problem.
5.1 Hardware Caches (CPU L1, L2, L3)
Built directly into the processor chip. Managed entirely by the CPU hardware and are invisible to application code.
+---------------------------+
| CPU Chip |
| Core 0 Core 1 |
| [L1 I$] [L1 I$] | L1 Instruction Cache (per core)
| [L1 D$] [L1 D$] | L1 Data Cache (per core)
| [L2 ] [L2 ] | L2 Cache (per core, combined)
| +------------------+ |
| | L3 Cache | | L3 Cache (shared across all cores)
| +------------------+ |
+---------------------------+
- L1: ~32-128 KB per core, ~1 ns latency. Split into instruction cache and data cache.
- L2: ~256 KB - 1 MB per core, ~4 ns latency. Combined instructions and data.
- L3: 4 MB - 64 MB shared, ~10-30 ns latency. Shared across all cores on the chip.
CPU caches work on cache lines (typically 64 bytes). Accessing one byte loads an entire 64-byte cache line. This is why iterating over an array sequentially is orders of magnitude faster than random access through a linked list.
Cache-friendly programming:
- Use arrays instead of linked lists for sequential access.
- Access 2D arrays row-by-row, not column-by-column.
- Keep hot data structures small (fits in L1/L2).
5.2 Operating System Page Cache
The OS kernel maintains a page cache (also called buffer cache) in RAM that caches disk pages that have been recently read or written. All file I/O goes through the page cache.
When your database reads a data file, the OS caches those disk pages. Subsequent reads of the same pages come from RAM, not disk. This is why database performance often improves over time after a cold start.
Tools to observe:
- Linux:
free -hshows buff/cache column - Linux:
vmstat -sshows page cache statistics
5.3 Browser Cache
The web browser stores static assets (HTML, CSS, JavaScript, images, fonts) on the local filesystem. Subsequent visits to the same site serve assets from local disk instead of re-downloading.
Cache behavior is controlled by HTTP response headers:
Cache-Control: max-age=31536000- cache for 1 yearETag: "abc123"- version identifier for conditional requestsLast-Modified: Wed, 05 Jun 2026 10:00:00 GMT- timestamp for conditional requests
5.4 DNS Cache
Every DNS resolution (converting hostname to IP address) involves network round trips to DNS servers. DNS caches store resolved hostnames locally at multiple levels:
- Browser DNS cache (typically 1 minute)
- OS DNS resolver cache (follows record TTL)
- Router DNS cache
- ISP recursive resolver cache
When you update a DNS record, the change propagates slowly because old entries are cached across this hierarchy until their TTL expires. This is why "DNS propagation" can take up to 48 hours in the worst case.
5.5 Application-Level In-Process Cache
A cache that lives inside the application process itself, using the application's own memory (JVM heap for Java, process heap for Python, etc.).
Examples:
- Java: Caffeine, Guava Cache, Ehcache (in-process mode)
- Python: functools.lru_cache, cachetools
- .NET: IMemoryCache
Characteristics:
- Extremely fast: no network overhead, sub-millisecond access
- Private: data is NOT shared between different application instances
- Volatile: cleared when the application restarts
- Bounded by application server memory
- Risk: Too much data cached in-process causes JVM Garbage Collection pauses or OOM
Best use cases:
- Reference data (country codes, currency codes, configuration constants)
- User session data for a single-server application
- Results of expensive computations that are the same for all users
- Lookup tables, dictionaries, static mappings
5.6 Distributed / Remote Cache
A cache that runs as a separate service or cluster, accessible by all application instances over the network.
Examples: Redis, Memcached, Hazelcast, Apache Ignite, Couchbase
Characteristics:
- Network overhead: typically 0.5ms - 2ms on a local network (much slower than in-process)
- Shared: all application instances see the same data (solves coherence problem)
- Durable: data survives application restarts (Redis with persistence)
- Horizontally scalable: add more cache nodes as data grows
- Independent: scales and deploys independently of the application
Best use cases:
- User sessions in a multi-instance deployment
- Shared application state (rate limiting counters, distributed locks)
- Results of expensive database queries shared across all users
- Inter-service caching in microservices
5.7 CDN (Content Delivery Network) Cache
A globally distributed network of edge servers that cache content physically close to end users, reducing latency by serving from a nearby geographic location.
User in Tokyo User in London
| |
v v
[CDN Edge - Asia] [CDN Edge - Europe]
| Cache HIT | Cache MISS
| (served locally) | |
v v v
User gets response fast [Origin Server - US West]
(~5ms) (100ms+ from London)
Cache hierarchy:
- Browser cache (client-side)
- CDN edge cache (geographically closest server)
- CDN origin shield (intermediate CDN layer, regional)
- Origin server (your application server)
CDN caches respect HTTP Cache-Control headers and support cache purging via API.
Examples: Cloudflare, AWS CloudFront, Akamai, Fastly, Google Cloud CDN
5.8 Database Internal Caches
Databases maintain their own internal caching mechanisms:
InnoDB Buffer Pool (MySQL):
- The primary MySQL performance tuning parameter.
- Caches frequently accessed data pages and index pages in RAM.
- Typically configured to 70-80% of available RAM on a dedicated database server.
- Most MySQL read queries are served from the buffer pool.
SHOW STATUS LIKE 'Innodb_buffer_pool_read_requests'- total reads attemptedSHOW STATUS LIKE 'Innodb_buffer_pool_reads'- reads that required disk I/O- Buffer pool hit rate = 1 - (reads / read_requests), should be > 99%
PostgreSQL shared_buffers:
- Similar to InnoDB Buffer Pool.
- Recommended: 25% of available RAM.
- Works in conjunction with the OS page cache.
MySQL Query Cache (deprecated in 8.0, removed in 8.0.3):
- Cached the complete result set of SELECT queries by the query text.
- Was a global mutex bottleneck under high concurrency.
- Deprecated because it caused more problems than it solved at scale.
- Lesson: even major databases make caching mistakes.
Materialized Views:
- Pre-computed query results stored as actual database tables.
- Must be refreshed periodically or on data change.
- Extremely effective for heavy aggregation queries (dashboards, reports).
5.9 ORM-Level Cache (Hibernate / JPA)
First-Level Cache (Session Cache):
- Attached to the Hibernate Session (scoped to a single transaction or unit of work).
- Automatically deduplicates queries within a transaction.
- Example: Calling
session.get(User.class, 1L)twice in one transaction only hits the DB once. - Always on, cannot be disabled.
Second-Level Cache (SessionFactory Cache):
- Shared across all sessions.
- Requires explicit configuration.
- Cache providers: Ehcache, Hazelcast, Infinispan, Redis via Redisson.
- Configured per entity:
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE).
Query Cache:
- Caches the list of entity IDs returned by a query.
- Works in conjunction with the second-level cache.
- Invalidated on any change to the queried entity type.
5.10 Web Server Cache
Reverse proxy servers like Nginx and Varnish can cache HTTP responses:
# Nginx proxy cache configuration
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=STATIC:10m inactive=24h max_size=1g;
server {
location /api/ {
proxy_cache STATIC;
proxy_cache_valid 200 5m;
proxy_cache_key "$scheme$request_method$host$request_uri";
}
}Varnish is purpose-built for HTTP caching and supports complex cache invalidation rules (cache tags, regex-based purging).
6. Cache Eviction Policies - What Gets Removed and Why
When a cache is full and a new entry must be stored, the eviction policy determines which existing entry is removed. The goal is to keep entries that are most likely to be requested again.
6.1 LRU - Least Recently Used
The most widely used eviction policy in production systems.
Core idea: Evict the entry that was last accessed (read or written) the longest time ago. The assumption: recently accessed data is likely to be accessed again soon (temporal locality).
Implementation: A doubly linked list combined with a hash map:
- Hash map: O(1) lookup by key
- Doubly linked list: maintains access order
- Most recently accessed entry: head of list
- Least recently accessed entry: tail of list
- On cache hit: move the accessed node to the head
- On eviction: remove the node at the tail
Cache capacity: 4 entries. Operations: A, B, C, D, E, access B again, F
After A: [A]
After B: [B, A]
After C: [C, B, A]
After D: [D, C, B, A] <- Cache full
Request E (miss): Evict A (tail), add E -> [E, D, C, B]
Access B: Move B to head -> [B, E, D, C]
Request F (miss): Evict C (tail), add F -> [F, B, E, D]
Time complexity: O(1) for both get and put with the doubly linked list + hash map combination.
Drawback - Cache Pollution / Scan Resistance:
A batch job that scans every database record at 2 AM floods the LRU cache, evicting all the hot web app data. The next morning, users experience high miss rates until the cache warms back up. This is called cache pollution.
Solution: Use LRU-K (only promote entries to cache after K accesses), or use a segmented LRU.
6.2 LFU - Least Frequently Used
Core idea: Evict the entry with the lowest access count. Assumes that data accessed frequently will continue to be frequently accessed.
Implementation:
- Maintain a frequency map: key -> access count
- Maintain a frequency list: count -> list of keys with that count
- Track the minimum frequency for O(1) eviction
Example:
Keys and access counts:
user:123 -> accessed 47 times
product:A -> accessed 3 times
config:v2 -> accessed 102 times
session:X -> accessed 1 time
Evict: session:X (lowest frequency = 1)
Advantage over LRU: Resistant to cache pollution from sequential scans. A batch job that touches each record once will not evict items with high frequency counts.
Drawbacks:
- Frequency bias: Data that was popular months ago but is no longer relevant keeps its high count and blocks new hot data. This is the "cache aging" problem.
- More complex implementation than LRU.
Solution to frequency bias: Use time-decayed frequency (TinyLFU algorithm) where old access counts are gradually reduced.
6.3 FIFO - First In, First Out
Core idea: Evict the entry that has been in the cache the longest, regardless of access patterns. A simple queue.
Example:
Cache capacity: 3
Add A, B, C: Queue = [A(front), B, C(back)]
Add D: Evict A -> Queue = [B(front), C, D(back)]
Access B: (no position change in FIFO)
Add E: Evict B -> Queue = [C(front), D, E(back)]
Advantage: Extremely simple to implement. O(1) operations.
Disadvantage: Does not consider access frequency or recency. Frequently accessed entries get evicted just because they were added early. Generally inferior to LRU for real-world workloads.
6.4 MRU - Most Recently Used
Core idea: The exact opposite of LRU. Evict the entry that was accessed most recently.
This seems counterintuitive, but it is optimal for specific access patterns:
Use case 1 - Video streaming: Users watch a video once and rarely re-watch it immediately. Keeping recently watched videos in cache wastes space; older favorites are more likely to be re-watched.
Use case 2 - Database scan operations: In certain join algorithms, the most recently accessed page is least likely to be needed again. MRU outperforms LRU for these sequential scan workloads.
6.5 Random Replacement
Core idea: Evict a randomly chosen cache entry.
Advantages:
- Trivial to implement
- Zero overhead (no need to track access time or frequency)
- Avoids pathological worst-case behavior of deterministic algorithms
- Surprisingly good performance for uniform access patterns
Disadvantage: No intelligence - may evict a very hot entry by chance.
6.6 ARC - Adaptive Replacement Cache
Developed by IBM Research. ARC is a sophisticated self-tuning algorithm that combines LRU and LFU.
How it works: ARC maintains four lists of equal total capacity:
- T1: Recently accessed data (cache entries accessed exactly once recently)
- T2: Frequently accessed data (cache entries accessed more than once)
- B1: Ghost entries (metadata only, no data) of recently evicted T1 entries
- B2: Ghost entries of recently evicted T2 entries
The adaptive behavior:
- Cache hit in B1 (was recently evicted from T1): Increase T1's target size (recency was more valuable)
- Cache hit in B2 (was recently evicted from T2): Increase T2's target size (frequency was more valuable)
- ARC continuously adjusts the balance between T1 and T2 based on observed workload
Advantage: Self-tuning without any manual configuration. Adapts to both recency-biased and frequency-biased workloads automatically.
6.7 Redis Eviction Policies
Redis exposes 8 configurable eviction policies via the maxmemory-policy configuration:
Policy Eviction Pool Algorithm
noeviction None Return error when memory is full (default)
allkeys-lru All keys Approximate LRU (5 random samples)
volatile-lru Keys with TTL only Approximate LRU (5 random samples)
allkeys-lfu All keys Approximate LFU
volatile-lfu Keys with TTL only Approximate LFU
allkeys-random All keys Random eviction
volatile-random Keys with TTL only Random eviction
volatile-ttl Keys with TTL only Evict key with smallest TTL first
Notes:
- Redis uses an approximate LRU/LFU algorithm, not a perfect one. It samples 5 random keys (configurable via
maxmemory-samples) and evicts the best candidate from the sample. This avoids maintaining a sorted data structure for all keys. - noeviction is the correct policy when Redis is used as a primary database (not a cache).
- allkeys-lru is the most common choice for a pure caching deployment.
- volatile-lru is appropriate when Redis serves as both cache and persistent store. Keys without TTL (persistent) are never evicted.
6.8 LRU Implementation in Java
import java.util.LinkedHashMap;
import java.util.Map;
public class LRUCache<K, V> extends LinkedHashMap<K, V> {
private final int capacity;
public LRUCache(int capacity) {
// accessOrder=true: iteration order is access order (LRU at tail)
super(capacity, 0.75f, true);
this.capacity = capacity;
}
@Override
protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
// Remove the eldest (least recently used) when over capacity
return size() > capacity;
}
}
// Usage
LRUCache<String, String> cache = new LRUCache<>(100);
cache.put("user:123", "Alice");
cache.get("user:123"); // marks user:123 as most recently used6.9 Eviction Policy Selection Guide
Workload Pattern Recommended Policy
-----------------------------------------
General web app (mixed reads) LRU (allkeys-lru in Redis)
Skewed access (hot keys stay hot) LFU (allkeys-lfu in Redis)
Scan-resistant needed LFU or ARC
Only expire keys with TTL volatile-lru or volatile-lfu
Streaming / scan workloads MRU
Need simplicity + no overhead Random
Zero tolerance for data loss noeviction (Redis as DB)
7. Cache Metrics - Measuring What Matters
You cannot optimize what you cannot measure. Monitor these metrics for any cache.
Primary Metrics
Hit Rate (most important):
Hit Rate = Hits / (Hits + Misses) x 100%
Monitor this continuously. A declining hit rate is an early warning signal.
Miss Rate:
Miss Rate = 1 - Hit Rate = Misses / (Hits + Misses) x 100%
Eviction Rate:
Number of entries evicted per second. High eviction rates mean the cache is too small for the working set. Increasing cache size will improve the hit rate.
Latency (p50, p95, p99):
Cache access should be very fast. Use percentile metrics, not averages:
- p50 (median): Typical experience
- p95: Most users' experience
- p99: Worst experience affecting 1% of requests
If p99 latency is high, investigate serialization overhead, network issues, or lock contention.
Memory Utilization:
Memory Utilization = Used Memory / Total Allocated Memory x 100%
Near 100% utilization with high eviction rates = cache is undersized. Very low utilization = cache may be over-allocated.
Key Expiration Rate:
How many keys expire per second. A sudden spike in expirations can trigger a cache avalanche (see Part 4).
Redis-Specific Metrics to Monitor
redis-cli INFO stats | grep -E "keyspace_hits|keyspace_misses|evicted_keys|expired_keys"
keyspace_hits: 9500000
keyspace_misses: 500000
evicted_keys: 1234
expired_keys: 45678
Derived metrics:
- Hit Rate: 9500000 / (9500000 + 500000) = 95%
- Eviction rate per minute: evicted_keys / uptime_in_seconds x 60
redis-cli INFO memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"
used_memory_human: 3.50G
maxmemory_human: 4.00G (87.5% utilized)
mem_fragmentation_ratio: 1.2 (acceptable, < 1.5 is fine)
8. Cache Invalidation - The Hardest Problem in Computer Science
Phil Karlton's famous quote exists because cache invalidation appears conceptually simple but has subtle, hard-to-avoid failure modes in real distributed systems.
Why It Is Hard
Consider this deceptively simple scenario:
Thread A: Thread B:
1. Read user:123 from DB
2. Update user:123 in DB
3. Delete cache key user:123
4. Write user:123 to cache (stale value!)
Thread A reads the OLD value from the database, then Thread B deletes the cache entry (correctly), but Thread A then writes the old value into the cache. The cache now contains stale data.
This race condition between reads and writes is the fundamental challenge of cache invalidation.
Invalidation Strategies
Strategy 1: TTL-Based Invalidation (Passive Expiration)
Every cache entry carries a TTL. When the TTL expires, the entry is automatically removed. The next request fetches fresh data.
cache.set("user:123", userData, ttl=300) // expires in 5 minutes
Pros:
- Simple to implement
- Self-healing: stale data automatically expires
- No need to detect or react to data changes
Cons:
- Data can be stale for up to the full TTL duration
- Too-short TTL degrades hit rate; too-long TTL allows stale data
Best for: Data that changes infrequently and can tolerate staleness:
- Product catalog
- Configuration data
- Country/city lookup tables
- User profile (non-critical fields)
Strategy 2: Event-Driven Invalidation (Active Invalidation)
When data changes in the source system, an explicit invalidation event is triggered to remove or update the cache entry.
Sub-strategy 2a: Write-Invalidate (Delete on Write)
public void updateUser(User user) {
userRepository.save(user); // 1. Update database
cacheManager.evict("users", user.getId()); // 2. Delete cache entry
// Next read will re-populate cache from DB
}Sub-strategy 2b: Write-Update (Update on Write)
public void updateUser(User user) {
userRepository.save(user); // 1. Update database
cacheManager.put("users", user.getId(), user); // 2. Update cache with new value
}Why Write-Invalidate is safer than Write-Update:
Write-Update has a race condition:
Time: Thread A (update) Thread B (concurrent update)
T1: DB write (version 2)
T2: DB write (version 3)
T3: Cache update (version 3) <- correct
T4: Cache update (version 2) <- STALE! Thread A writes old value last
Write-Invalidate avoids this: deleting the entry forces any subsequent reader to fetch from DB, which will always return the latest value.
Cons of event-driven invalidation:
- More complex to implement
- Invalidation event might fail (cache entry remains stale)
- Requires tight coupling between write path and cache layer
Strategy 3: Tag-Based Invalidation
Cache entries are associated with tags representing related entities. When an entity changes, all entries with that tag are invalidated atomically.
// Store product data with tags
cache.set("product:list:page:1", productList,
tags=["category:electronics", "brand:apple"])
cache.set("product:789:detail", product,
tags=["product:789", "category:electronics"])
// When Apple updates pricing, invalidate all Apple-tagged entries
cache.invalidateByTag("brand:apple")
// This removes all cache entries tagged with brand:apple
Implementations:
- Varnish cache (Surrogate-Keys / xkey module)
- Drupal cache tags
- Custom implementation with Redis SETS (store keys per tag, delete all)
Strategy 4: Change Data Capture (CDC)
A database-level trigger mechanism that captures every change to the database and publishes it as an event stream. The cache layer subscribes to this stream and invalidates/updates entries.
Technology stack:
- Debezium: captures MySQL/PostgreSQL/MongoDB changes as Kafka events
- Kafka: event stream
- Application/service: subscribes to Kafka, invalidates cache
This is the most robust but also most complex approach. It guarantees the cache is updated even for changes made directly to the database (bypassing the application).
The Double-Write Race Condition
Thread A (Reader): Thread B (Writer):
1. Cache miss for user:123
2. Update user:123 in DB (new value)
3. Delete cache key user:123
4. Read user:123 from DB (old value!) <-- reads BEFORE Thread B's DB update commits
5. Write OLD value to cache <-- stale data persists until TTL expires
Best practice to minimize this race condition:
- Always use Write-Invalidate (delete), never Write-Update.
- Use a very short TTL as a safety net (even 5 seconds catches most races).
- Use distributed locks around the cache population logic (see Cache Stampede solutions in Part 4).
- For critical data: use database read-your-writes consistency for the invalidating transaction itself.
Distributed Invalidation in Microservices
When Service A owns the data and Service B caches it, invalidation across services is challenging:
Approach 1: Shared cache with event publication
Service A writes data --> publishes "user.updated" event to message bus
Service B subscribes to "user.updated" --> invalidates its cache entry
Approach 2: Short TTLs with eventual consistency
Accept that Service B's cache may be stale for up to N seconds (the TTL). After TTL, data self-heals. Simple, but data accuracy is bounded by TTL.
Approach 3: Versioned cache keys
// Version is stored in a central registry or passed in the request
String cacheKey = "user:123:v" + userVersion;
// When user is updated, version increments, old cache key is automatically "orphaned"
// Old entries eventually evict (but can waste memory until then)
9. Cache Sizing and Capacity Planning
The Working Set Principle
You do not need to cache everything. The working set is the subset of data that drives the vast majority of traffic. Cache the working set and you capture most of the value.
Pareto principle applied to caches:
- 80% of requests access 20% of data
- Caching that 20% gives you 80% hit rate
- For high-traffic systems, you may achieve 90%+ hit rates by caching 10-15% of total data
Estimating Required Cache Size
Cache Size = (Number of hot objects) x (Average object size in bytes)
+ (Overhead per key: ~64-100 bytes in Redis)
+ (25-30% buffer for metadata and fragmentation)
Worked example:
- 50 million registered users, 10% active (working set): 5 million user objects
- Average serialized user object: 800 bytes
- Base estimate: 5,000,000 x 800 bytes = 4 GB
- Key overhead: 5,000,000 x 80 bytes = 400 MB
- 30% buffer: ~1.3 GB
- Total recommended cache size: ~5.7 GB, allocate 6 GB
Monitoring and Adjusting Cache Size
Cache sizing is empirical. Monitor, measure, and adjust:
Observation Action
----------------------- ------
High eviction rate + Low hit rate Increase cache size
High eviction rate + Acceptable hit rate Acceptable (working set > cache, but hot data stays)
Near-zero eviction + Very high memory usage Consider reducing allocated memory
Sudden hit rate drop Check if working set has changed or TTL too short
Hit rate gradually declining Data access patterns changed, re-evaluate TTL and keys
Cost-Benefit Framework for Cache Sizing
Monthly DB Cost (queries/sec x DB compute cost) vs Monthly Cache Cost (RAM x price)
Example:
- Without cache: Database cluster handles 50,000 queries/sec
Cost: $15,000/month
- With 10 GB Redis cache at 95% hit rate: Database handles 2,500 queries/sec
Database cost drops to: $2,000/month
Redis cost (10 GB): $300/month
Total: $2,300/month
Net savings: $12,700/month
Key Design for Effective Caching
Cache keys should be:
- Unique: No two different data items should share the same key
- Descriptive: Clear what data the key refers to
- Namespaced: Prefixed to avoid collisions across different services or data types
- Predictable: Deterministic from the input parameters
Best practice patterns:
{service}:{entity}:{id}
{service}:{entity}:{id}:{version}
{service}:{entity}:{query_param1}:{query_param2}
Examples:
users-service:user:123
catalog-service:product:456
catalog-service:products:category:electronics:page:2
auth-service:session:abc123def456
Anti-patterns:
user_123 // No namespace, collision risk
user:123:full:data:v2 // Too verbose
u123 // Too cryptic
SELECT * FROM users... // Full query text as key (dangerous with user input)
Summary
Caching is not a magic solution but a powerful engineering tool when applied thoughtfully:
-
Understand your data access patterns before adding a cache. Caching uniform or constantly changing data yields little benefit.
-
Know the memory hierarchy. The further from the CPU, the more expensive the access. Caching brings data closer to the access point.
-
Measure first. Use hit rate, eviction rate, and latency metrics to validate that caching is working and correctly sized.
-
Choose the right eviction policy. LRU is the default right choice for most web applications. Use LFU for highly skewed access patterns.
-
Cache invalidation is the hardest part. Use Write-Invalidate (not Write-Update). Use TTL as a safety net. Accept eventual consistency unless strong consistency is a hard requirement.
-
Size the cache to fit the working set, not the full dataset. The Pareto principle usually applies: caching 20% of data captures 80% of reads.