← Back to Articles
6/6/2026Admin Post

sharding interview questions

Series: Sharding Demystified

Sharding Interview Questions - Complete Preparation Guide

From Junior Engineer to Principal Architect Level


Table of Contents

  1. Level 1 - Foundational (Every engineer must know)
  2. Level 2 - Intermediate (Mid-level engineers)
  3. Level 3 - Advanced (Senior engineers)
  4. Level 4 - Expert and Architect Level
  5. Tricky Questions That Candidates Get Wrong
  6. Scenario-Based Design Problems
  7. What Interviewers Actually Look For
  8. Common Mistakes to Avoid

Level 1 - Foundational

Expected at: Junior to Mid-level engineers. These questions test basic understanding.


Q1. What is database sharding and why would you use it?

What a poor answer sounds like:
"Sharding splits a big database into smaller databases."

What an excellent answer sounds like:

Sharding is the practice of horizontally partitioning a dataset across multiple independent database instances, called shards, where each shard holds a distinct, non-overlapping subset of the data. Together, all shards contain the complete dataset.

You would use sharding when:

  1. A single database server cannot hold all your data (disk limit)
  2. A single database server cannot handle your write throughput (CPU/IO limit)
  3. Vertical scaling (buying a bigger server) is no longer cost-effective or not feasible
  4. You need fault isolation — one shard failing should not take down the entire system

Sharding is a last resort after exhausting simpler solutions: caching, read replicas, query optimization, and vertical scaling.


Q2. What is a shard key and why is it the most important decision?

Answer:

The shard key is the column (or combination of columns) whose value determines which shard a given row lives on. The shard key is passed through a routing function (hash, range lookup, directory lookup) that returns the target shard.

It is the most important decision because it is practically irreversible. Changing a shard key after data is written requires:

  • Reading every row in every shard
  • Computing the new shard assignment for each row
  • Physically moving rows to their new shards
  • Handling all writes arriving during the migration
  • Updating all application queries

For billions of rows, this is a months-long project. Choose wrong and you live with the consequences at scale.

The shard key must satisfy four criteria:

  • High cardinality (many distinct values, at minimum as many as you want shards)
  • Even write distribution (no sequential or skewed inserts)
  • Query locality (most frequent queries hit one shard, not all)
  • Immutability (the value never changes for a given row)

Q3. What is the difference between horizontal and vertical scaling?

Answer:

Vertical scaling (scale up): Adding more resources to a single server — more CPU, more RAM, faster disk. Simple. No code changes. But has a hard ceiling (you can only buy so large a server) and creates a single point of failure.

Horizontal scaling (scale out): Adding more servers. Each server holds a subset of data and handles a subset of requests. No ceiling — you can keep adding servers. But requires distributing data and requests across servers, which is what sharding enables for databases.

The application tier is usually horizontally scaled first (10 identical app servers behind a load balancer). Databases are harder to scale horizontally because of data consistency and transaction requirements. Sharding is how you horizontally scale your write-heavy database.


Q4. What is the difference between database partitioning and sharding?

Answer:

Partitioning: Divides data within a single database instance. All partitions live on the same physical server, same database process. It is a performance optimization — queries can skip irrelevant partitions (partition pruning). It does NOT increase write throughput (still one server) and does NOT add fault tolerance.

Sharding: Divides data across multiple independent database instances on different physical servers. Each shard is a separate server with its own CPU, RAM, and disk. It scales both capacity AND throughput. One shard failing affects only a subset of users.

Think of it this way: partitioning is a filing cabinet with labeled drawers on one desk. Sharding is having multiple desks in different rooms, each with their own filing cabinet.


Q5. What are the three main types of sharding?

Answer:

  1. Range-based sharding: Divides the shard key value range into continuous segments. Example: user IDs 1-1M on Shard 1, 1M-2M on Shard 2. Good for range queries. Bad if writes are sequential (all new users go to the last shard).

  2. Hash-based sharding: Applies a hash function to the shard key and uses modulo to determine the shard. Example: shard = hash(user_id) mod 4. Even distribution. Bad for range queries (scatter-gather). Painful to reshard with simple modulo.

  3. Directory-based sharding: Maintains an explicit lookup table mapping entities to shards. Maximum flexibility. Bad if the directory becomes a bottleneck or single point of failure.

A fourth type worth mentioning:

  1. Geographic sharding: Assigns data to shards based on geographic region. Used for data sovereignty (GDPR) and latency optimization.

Level 2 - Intermediate

Expected at: Mid-level to Senior engineers. Tests application of concepts.


Q6. You have a users table with 1 billion rows. What shard key would you choose and why?

Answer:

I would shard by user_id using hash-based sharding.

Reasons:

  • user_id has high cardinality (1 billion distinct values = can have up to billions of shards)
  • user_id is immutable (assigned at registration, never changes)
  • user_id with a hash function distributes writes evenly — no hot shards
  • The most common query patterns ("get this user's profile", "get this user's orders") are by user_id — single shard hit

What I would NOT choose:

  • created_at: All new registrations go to the latest shard (hot shard). Range queries are efficient but writes are not.
  • email: Mutable (users change email). Putting PII in routing logic is a security concern.
  • country: Low cardinality (200 values). US users would dominate one shard.
  • gender: Only 2-3 values. You cannot have more than 3 shards.

If the system also has a strong geographic data residency requirement (GDPR), I would use a compound shard key: (region, user_id_hash) — India data stays in India shards, EU data stays in EU shards, within each region distributed by user_id hash.


Q7. What is the hot shard problem and how do you solve it?

Answer:

A hot shard is one that receives disproportionately more reads or writes than other shards, becoming the bottleneck for the entire system.

Causes:

  1. Uneven data distribution: Bad shard key with low cardinality or skewed values. Example: sharding orders by status when 90% of rows have status = 'DELIVERED'.
  2. Sequential key inserts: Sharding by an auto-incrementing ID with range-based sharding means all new writes go to the last shard permanently.
  3. Behavioral skew: Data is evenly distributed but some entities are accessed far more than others — the celebrity problem. Sachin Tendulkar has 40 million followers; his data shard gets 40 million times more reads.

Solutions:

For uneven distribution: Choose a better shard key with hash function. Hash distributes even if the original key values are skewed.

For sequential inserts: Use non-sequential IDs (Snowflake, UUID) with hash sharding.

For celebrity/behavioral skew:

  • Replicate celebrity data to all shards (fan-out replication). Any shard can serve reads for celebrities.
  • Add a Redis cache layer in front of the hot shard. Database sees only cache-miss traffic.
  • Move the hot entity to a dedicated high-capacity shard with a custom routing rule.

Q8. What happens when you add a new shard? How does data get redistributed?

Answer:

This depends on which sharding strategy you use.

With simple modulo hashing (hash mod N, then moving to hash mod N+1):
The problem is severe. When you go from 4 to 5 shards, approximately 80% of all keys map to a different shard under the new modulo. 80% of your data must be physically moved. During this migration, the old and new mappings are inconsistent. You need a dual-read strategy: check both old and new shard for a period, write to both.

With consistent hashing:
Adding a new node means only 1/N of data (proportionally) needs to move. The new node takes over a portion of the ring from its neighbors. Other shards are completely unaffected. This is the primary reason consistent hashing exists.

With logical sharding:
You pre-define many more logical shards than physical shards (say, 1000 logical shards on 10 physical servers). When you add an 11th physical server, you move some logical shards (with their data) to it. No change to the overall number of logical shards, no recomputation of shard assignments. Only the physical hosting changes.

With database-native sharding (MongoDB, Cassandra):
The database handles this automatically. MongoDB's balancer redistributes chunks. Cassandra's bootstrap process streams data from existing nodes to the new node.

In practice:
Most production systems use one of: (a) consistent hashing, (b) logical shards, or (c) a managed database that handles it automatically. Simple modulo resharding is only acceptable for very low-traffic systems.


Q9. How do you handle cross-shard JOIN queries?

Answer:

JOINs across shards are one of the most painful aspects of sharding. A standard SQL JOIN requires data from two sources to be co-located in the same database process. Across shards, this is not possible natively.

Options:

Option 1: Application-level JOIN
Fetch data from Shard A into application memory. Fetch data from Shard B into application memory. Perform the join in application code (nested loop join or sort-merge join in code). Works but uses application memory and CPU. Bad for large result sets.

Option 2: Denormalization — store joined data together
Instead of joining at query time, pre-compute and store the joined data on the same shard.
Example: Instead of separate users and orders tables (users might be on different shards), store the relevant user details (name, email, address) directly on the orders shard. Slight data duplication but eliminates the cross-shard join.

Option 3: Route all related data to the same shard
Design your shard key so that data that is frequently joined lives on the same shard. If you always JOIN orders with order_items, shard both by order_id. If you always JOIN orders with users, shard both by user_id.

Option 4: Separate OLAP database for analytics
For analytical queries that need to join many tables across many shards, pipe all data to a data warehouse (BigQuery, Redshift). JOINs in the data warehouse are not cross-shard (all data is consolidated). This is the standard enterprise architecture.

The right answer in an interview is Option 2 or Option 3 for the hot path, and Option 4 for analytics. Option 1 is a pragmatic fallback for low-frequency queries.


Q10. What is consistent hashing and why does it matter for sharding?

Answer:

Consistent hashing is an algorithm that distributes keys across nodes such that when a node is added or removed, only a minimum number of keys need to be remapped.

How it works:
Imagine a ring (circle) representing the entire hash space (0 to 2^32). Both nodes and keys are placed on this ring by hashing them. A key is assigned to the first node you encounter walking clockwise from the key's position.

Why it matters for sharding:
With simple modulo hashing (key mod N), adding or removing a node changes N, causing most keys (approximately (N-1)/N of them) to remap to different nodes. For a 1 TB database with 4 shards going to 5 shards, 800 GB of data must move.

With consistent hashing, adding one node means only the keys between the new node and its predecessor need to move. That is approximately 1/N of all keys — only 200 GB for the same example.

Virtual nodes improve this further: each physical node occupies multiple positions on the ring (100-200 virtual positions), ensuring even distribution and granular load transfer when nodes change.

Used by: Cassandra (Murmur3 hash ring), Amazon DynamoDB (internally), Redis Cluster (16,384 hash slots — conceptually similar), Apache Cassandra, Riak, and many more.


Q11. What is a distributed transaction and why is it hard across shards?

Answer:

A distributed transaction is a transaction that modifies data on multiple independent database nodes and requires all-or-none (atomic) semantics across all nodes.

In a single database with @Transactional, atomicity is guaranteed by the database's WAL (Write-Ahead Log) and lock manager. In a sharded system, there is no single WAL or lock manager.

Why it is hard:

The fundamental challenge is the Two-Generals Problem: how do you get two nodes to agree to commit when communication can fail? If Node A commits but the commit message to Node B is lost, you have partial commit.

Two-Phase Commit (2PC) is the classical solution:

  • Phase 1 (Prepare): Coordinator tells all nodes to prepare and lock resources. All nodes vote yes or no.
  • Phase 2 (Commit or Rollback): If all votes are yes, coordinator tells all to commit. If any vote is no, coordinator tells all to rollback.

Problems with 2PC:

  1. Blocking: If the coordinator crashes between Phase 1 and Phase 2, all participating nodes are stuck holding locks indefinitely (in-doubt transaction). The system is frozen until the coordinator recovers.
  2. Latency: Every cross-shard write now requires 2 network round trips minimum.
  3. Does not scale: Lock contention under high write load is severe.

Better alternative for most applications: Saga pattern
Break the distributed transaction into a sequence of local transactions. Each step publishes an event. If a step fails, compensating transactions (explicit undo actions) are executed for all previous steps. Eventually consistent, not atomically consistent.

The key insight for the interview: avoid designing systems that require cross-shard transactions. If you find yourself needing them, reconsider your data model and shard key.


Level 3 - Advanced

Expected at: Senior to Staff engineers. Tests depth and real-world experience.


Q12. Your orders table is sharded by customer_id. A new requirement is to show a leaderboard of top-selling products today. How do you implement this?

Answer:

This is a scatter-gather problem. The products sold today are distributed across all shards (since they belong to different customers, different shards). There is no way to answer "top-selling products today" from a single shard.

Naive approach (wrong for production):
Query all shards: SELECT product_id, COUNT(*) FROM orders WHERE date = today GROUP BY product_id. Aggregate results across shards. This works but hits all shards for every leaderboard refresh — expensive at scale.

Correct production approach:

  1. Event streaming: Every time an order is placed (any shard), publish an OrderPlaced event to Kafka with the product_id.

  2. Stream processing: A Flink or Spark Streaming job consumes all OrderPlaced events, counts product sales using a 24-hour sliding window, and writes the current top-N to Redis:

    Redis Sorted Set: "leaderboard:products:today"
    ZADD leaderboard:products:today <sales_count> <product_id>
    
  3. Serving: Leaderboard reads come from Redis (single O(log N) query). Zero shard scatter-gather.

  4. Refresh: The stream job updates Redis continuously (near real-time). TTL on the Redis key = 24 hours.

This is how Zerodha serves live "most traded stocks today", how Swiggy shows "trending restaurants in your city", and how Amazon shows "best sellers today". The write path (orders) is sharded for scale. The read path (analytics) goes through a real-time pipeline to a denormalized read store.


Q13. How does Instagram generate unique photo IDs across their sharded PostgreSQL setup?

Answer:

Instagram uses a 64-bit ID scheme where the shard ID is embedded in the generated ID itself.

Bit layout:

[41 bits: milliseconds since Instagram epoch]
[13 bits: logical shard ID]
[10 bits: per-shard sequence (from PostgreSQL SEQUENCE)]

How it works on each shard:
Each PostgreSQL shard runs a stored function that:

  1. Gets current timestamp in milliseconds
  2. Knows its own logical shard ID (hardcoded per shard)
  3. Fetches the next value from its local sequence (resets at 1024)
  4. Assembles the three parts into a 64-bit integer

Since each shard has a unique shard ID (embedded in bits 11-23), two shards can never generate the same ID even if they happen to generate at the same millisecond with the same sequence number.

The key property:
Given any photo ID, you can extract the shard ID:

int shardId = (int)((photoId >> 10) & 0x1FFF);

No lookup table needed. The routing information is baked into the ID.

Why this is brilliant: No central ID coordinator. No single point of failure for ID generation. Works at any scale. IDs are time-sortable (41-bit timestamp prefix). IDs fit in PostgreSQL BIGINT.


Q14. You have 10 MySQL shards. A developer runs this query: SELECT * FROM users WHERE email = 'user@example.com'. What happens and how do you fix the design?

Answer:

What happens: Email is not the shard key (let us assume user_id is). The routing layer does not know which shard holds this email. It has two options:

  1. Scatter-gather: Send the query to all 10 shards. Each shard searches its users table for that email. Aggregate the results (at most one result). This works but is 10x more expensive than it needs to be.

  2. Fail with an error: Some sharding middlewares refuse to execute queries that do not include the shard key, forcing the developer to rewrite the query.

The root problem: You have a sharded table but you need to look up by a secondary attribute (email) that is not the shard key.

Solutions:

Solution 1: Secondary index table (most common)

Maintain a separate, non-sharded lookup table that maps email to user_id:

-- In a separate non-sharded "index database"
CREATE TABLE email_to_user_id (
    email   VARCHAR(255) PRIMARY KEY,
    user_id BIGINT NOT NULL
);

Lookup by email:

1. Query email_to_user_id for user_id -> fast, single table
2. Use user_id to route to correct shard -> single shard query

Write path: When creating a user, write to both the users shard AND this index table. Both writes should be in an atomic operation (outbox pattern + event-driven).

Solution 2: Dual-write to a search index

Write user data to both the sharded database and an Elasticsearch index. Email lookups go to Elasticsearch. User data mutations go to the sharded DB. Elasticsearch is not sharded by user_id so it can search any field.

Solution 3: Denormalize — store user_id in the auth system

When a user logs in with email, the authentication service (which has its own un-sharded credential store) returns the user_id. All subsequent requests use user_id. Email is only needed at login.

The correct answer in an interview shows that you understand secondary index tables and the tradeoff between lookup complexity and query performance.


Q15. Explain the concept of logical vs physical sharding and why it is important.

Answer:

Physical shard: An actual database server (or VM) that holds data. You have, say, 10 physical MySQL servers.

Logical shard: An abstract unit of data partitioning that maps to a physical shard. You might define 1,000 logical shards while having only 10 physical servers. Each physical server hosts 100 logical shards.

Why this is important:

Without logical sharding (N physical shards = N routing buckets):

Your routing function is: physical_shard = hash(user_id) mod 10

When you add an 11th physical server, you must change the mod from 10 to 11. This causes ~91% of keys to remap to different shards. You must move almost all data. Catastrophic.

With logical sharding (1000 logical shards, 10 physical shards):

Your routing function has two steps:

logical_shard = hash(user_id) mod 1000      (stays the same forever)
physical_shard = shard_map[logical_shard]   (lookup: which server hosts this logical shard?)

The shard_map might say:

Logical shards 0-99:   Physical server 1
Logical shards 100-199: Physical server 2
...
Logical shards 900-999: Physical server 10

When you add the 11th server, you update the shard_map to move some logical shards:

Logical shards 0-90:   Physical server 1    (moved 10 logical shards from server 1)
Logical shards 100-190: Physical server 2   (moved 10 logical shards from server 2)
...
Logical shards 900-990: Physical server 10  (moved 10 logical shards from server 10)
Logical shards 91-99, 191-199, ..., 991-999: Physical server 11 (10 shards each, 100 total)

Only 10% of data moves. The logical shard count (1000) never changes, so the routing function hash(key) mod 1000 never changes, so no key remapping. You only update the physical mapping table.

Real-world use:

  • Facebook: 10,000 logical shards on ~500 physical MySQL servers
  • MongoDB: Chunks are logical shards automatically managed by the balancer
  • Vitess: Keyspace shards that can be split and merged

Q16. What is a "scatter-gather" query? What are its performance characteristics and when is it acceptable?

Answer:

A scatter-gather query is one that cannot be satisfied from a single shard and must be fanned out to all shards simultaneously.

How it works:

  1. Scatter: The routing layer broadcasts the query to all N shards simultaneously
  2. Each shard executes the query on its local data subset
  3. Gather: The routing layer collects all N results
  4. Merge: Results are aggregated (sort, union, aggregate) and returned to the caller

Performance characteristics:

  • Latency: Bounded by the SLOWEST shard. If shard 7 is having a slow disk day, the entire scatter-gather waits for it. The P99 latency of a scatter-gather is the P99 of the slowest shard.
  • Throughput impact: Each query consumes resources on ALL shards. A 100-shard scatter-gather query has 100x the resource impact of a single-shard query. Under high scatter-gather load, you can overload the cluster.
  • Aggregation overhead: Sorting 10 million rows returned from 100 shards is expensive in the aggregation layer.
  • Partial failures: If one shard is down or times out, you return a partial result (or fail the entire query). Both outcomes are bad.

When scatter-gather is acceptable:

  • Low-frequency queries (once per hour for a dashboard, not per user request)
  • Queries where approximate results are acceptable (counts, HyperLogLog estimates)
  • Migrations or one-off data corrections
  • Internal admin tools where latency is not customer-facing

When scatter-gather is NOT acceptable:

  • Any query in the hot path (user-facing, called > 10 times/second)
  • Queries that return large result sets (sorting 1 million rows from 100 shards)

Better alternatives to scatter-gather:

  • Real-time aggregation pipeline (Kafka + Flink writing aggregates to Redis)
  • Separate OLAP database (BigQuery, Redshift) for analytical queries
  • Denormalized secondary indexes for common lookup patterns

Level 4 - Expert and Architect Level

Expected at: Staff, Principal, or Architect level. Tests system thinking and tradeoffs.


Q17. You are the principal engineer at a startup that has just hit 100 million users and needs to shard for the first time. Walk me through your complete approach.

Answer:

I would follow a systematic process:

Step 1: Profile before sharding — validate the need

Before touching anything, spend two weeks profiling:

  • Which tables are large? (> 500 GB is a candidate)
  • Which queries are slow? (identify if it is query design or data volume)
  • What is the read:write ratio? (high read:write = caching might be enough)
  • What is the write throughput? (if < 10K writes/second, a properly sized server might still work)
  • Can read replicas solve the read problem?

Many startups at 100M users do not actually need sharding yet. They need better indexes, caching, and read replicas.

Step 2: If sharding is needed — choose the shard key carefully

Analyze query patterns across all services:

  • What is the most common query shape? (by user_id, by order_id, by region?)
  • What tables join frequently? (shard them by the same key)
  • What are the cross-shard query candidates? (design around them)
  • Are there celebrity/hot entity risks?

I would spend 2-3 weeks on shard key analysis. I would model different shard key choices against our actual production query log.

Step 3: Choose the sharding architecture

For a startup on MySQL/PostgreSQL, I would evaluate Vitess or Citus (for PostgreSQL) as the sharding layer. Application-level sharding requires every team to implement routing logic, which is error-prone and creates maintenance burden.

Step 4: Plan the ID migration

If the current system uses auto-increment IDs (it usually does), I must migrate to Snowflake IDs or similar before sharding. Auto-increment across shards creates collisions. This step alone can take a month: generate new IDs for all existing rows, update all foreign key references, deploy application code to use new ID format.

Step 5: Start with logical sharding

Define 1,000 logical shards. Initially, all 1,000 map to the existing 1 physical database. This is just a config change. No data movement yet.

Step 6: Migrate incrementally

Gradually move logical shards from the single physical database to new physical shards:

  • First move 10% of traffic (100 logical shards) to a new server. Validate.
  • Then move another 10%. Validate.
  • Continue until evenly distributed.

Each move is a logical shard migration, not a full resharding event.

Step 7: Harden operations

  • Per-shard monitoring dashboards (query rate, CPU, data size)
  • Automated backup and restore testing for all shards
  • Runbook for adding new shards (should be a documented, tested procedure)
  • Alert on hot shards (any shard > 150% of average load)
  • Cross-shard transaction audit (find all @Transactional calls that span shards)

Timeline: For a serious 100M user system, I would estimate 3-6 months for the full migration, done with zero downtime.


Q18. At Uber, trip data needs to be accessible globally (a trip started in Mumbai might be completed using a different datacenter). How would you design the sharding strategy?

Answer:

This is a geo-distributed sharding problem with an important constraint: a trip starts and ends in the same city but the system must be resilient to datacenter failures.

The Core Design Decisions:

1. Shard key: city_id or region_id

Trip data is fundamentally geographic. A Mumbai trip involves a Mumbai driver and a Mumbai rider. This data should live in the Mumbai/India region for:

  • Low latency (driver app and rider app connect to local datacenter)
  • Data sovereignty (Indian user data in India per PDPA regulations)
  • Failure isolation (Mumbai datacenter going down should not affect Singapore trips)

2. Within a region: shard by driver_id or trip_id

Within the India region, further shard by driver_id or trip_id for scale. Driver's entire trip history on one shard. trip_id hash for even distribution.

3. Handle the global query problem:

"Show me all trips globally" (operations dashboard) requires cross-region scatter-gather. This is acceptable for operations dashboards (low frequency, internal use). For real-time trip tracking, you only ever need the local region's data.

4. Cross-region resilience — not cross-region transactions:

If the Mumbai datacenter goes down during a trip:

  • Active trips must be rerouted to a backup region (Singapore)
  • This requires trip state to be periodically replicated to the backup region
  • Uber uses a "warm standby" model: trip state is replicated to the DR region asynchronously, with acceptable staleness of a few seconds

5. Trip lifecycle as a Saga (not a distributed transaction):

A trip: request -> match -> pickup -> in-transit -> dropoff -> payment. Each step is a local transaction on the city shard. If a step fails, compensating actions handle it. No cross-shard 2PC is used for the trip lifecycle.

6. Global ID: Snowflake IDs with datacenter ID embedded (Uber's own Snowflake variant).

The answer demonstrates: geographic sharding + data sovereignty + failure isolation + eventual consistency + Saga pattern. This is how Uber actually works.


Q19. Facebook has 3 billion users and each user can have up to 5,000 friends. The social graph (friendships) needs to be sharded. How would you do it?

Answer:

The social graph is a bipartite problem: if I shard friends by user_id, then "Alice is friends with Bob" would have:

  • Alice's friendship list on Alice's shard
  • But Bob's friendship list on Bob's shard

Reading "all of Alice's friends" = single shard.
Reading "who is friends with Alice" (reverse lookup) = scatter-gather, unless you store both directions.

Facebook's actual approach (TAO):

Facebook built TAO (The Associations and Objects), a graph database built on top of sharded MySQL.

Object: A node in the graph (a user, a post, a page). Sharded by object_id hash. Object lives on one shard.

Association: A directed edge in the graph (Alice -> Bob: friends). Stored TWICE:

  • Forward edge: (Alice_id, FRIEND, Bob_id) stored on Alice's shard
  • Inverse edge: (Bob_id, FRIEND_INVERSE, Alice_id) stored on Bob's shard

This doubles storage but eliminates scatter-gather for both "who does Alice follow" and "who follows Alice".

The TAO cache layer:

  • MySQL shards are the persistent store (20,000+ MySQL servers)
  • TAO cache servers are in front (Memcached-based, organized by region and rack)
  • Cache is the primary read path — 99%+ of reads are served from TAO cache
  • MySQL is the persistence layer — written to asynchronously via TAO write-through

For interview discussion, the key insights are:

  1. Store edges in both directions (forward + inverse) to avoid scatter-gather
  2. Identify what queries you need (Alice's friends, mutual friends, second-degree connections) and design storage to answer each efficiently
  3. Cache is essential — 3 billion users means the social graph is mostly static (friendships change rarely). Cache hit rate is very high.
  4. Sharding by user_id hash works if you store both directions of edges.

Q20. How do you perform an online schema migration (adding a column) across 500 database shards without downtime?

Answer:

This is a critical operational problem. Naive approaches cause either downtime or hours of inconsistent schema state where some shards have the new column and others do not.

The key principle: backward-compatible migrations using the expand-contract pattern.

Phase 1: Expand (deploy this first, no schema change yet)

Deploy application code that is capable of reading from BOTH the old schema and new schema. The code does not USE the new column yet — it is just ready for it. No schema changes applied.

Phase 2: Apply schema migration (backward-compatible change only)

For adding a nullable column with no default, this is safe:

ALTER TABLE orders ADD COLUMN delivery_notes TEXT;

Use an online schema change tool to apply this to all 500 shards without locking:

  • pt-online-schema-change (MySQL): Creates a shadow table, copies data, uses triggers for live changes, atomic rename.
  • gh-ost (GitHub): Uses binary log streaming instead of triggers — less write amplification.
  • pg_repack (PostgreSQL): Online table reorganization.

These tools can run on all 500 shards in parallel (they do not lock, they copy incrementally). A 1 TB shard might take 2-3 hours. You monitor progress and errors.

Phase 3: Contract (deploy application code that uses the new column)

Only after ALL 500 shards have the new column do you deploy the code that writes to it. Now the application can use delivery_notes.

Phase 4: Backfill (if needed)

If existing rows need values in the new column, run a background backfill job. Update rows in small batches (100 at a time) with sleeps between batches to not overload the shards.

Phase 5: Final contract (if removing old columns)

Only remove old columns after you have verified no code reads them anywhere. This is the final contract step.

Orchestration for 500 shards:

Use a migration management system:

  • Liquibase or Flyway with a custom executor that runs against each shard's JDBC URL
  • Track which shards have completed which migration version
  • Retry failed shards automatically
  • Alert on shards that fall behind

The most important interview point: You never deploy schema changes simultaneously with application code changes. Schema change goes first (and must be backward-compatible). Application code that uses the new schema goes second, after all shards are migrated.


Tricky Questions That Candidates Get Wrong

These questions have counterintuitive answers or require distinguishing between similar-seeming but different concepts.


T1. "I have 10 shards with hash-based routing. My data is evenly distributed. But one shard is still getting 5x more traffic than others. How is this possible?"

Why candidates get this wrong:
They assume even data distribution = even load distribution. They conflate data volume with access frequency.

The correct answer:

Even distribution means equal DATA. It does not mean equal ACCESS.

This is the behavioral hot shard problem. Even if each shard holds exactly 10% of users, if the top 1% of most-active users happen to hash to the same shard, that shard serves 5x more requests.

Why "happen to hash to the same shard" happens:
It should be statistically unlikely, but access patterns are correlated. If you have 1000 enterprise customers each with 100 employees, and all employees of each company hash to different shards (which they should), that is fine. But if you have a celebrity feature and all 1000 most-followed accounts happen to cluster on one shard, that is a hot shard.

Diagnosis: Monitor per-shard queries per second (not data size). If data size is even but QPS is not, it is behavioral skew.

Solutions:

  • Cache the data for hot entities (Redis in front of the hot shard)
  • Move identified hot entities to a dedicated high-capacity shard
  • Shard with a more granular key that better distributes access (add a sub-shard dimension)

T2. "You shard by user_id. A user sends a message to another user. Both are on different shards. You need to update both users' message counts atomically. How do you do it?"

Why candidates get this wrong:
They say "use @Transactional" or "use a distributed transaction with 2PC".

The correct answer:

@Transactional does not work across shards. 2PC is a valid but poor choice (blocking, SPOF at coordinator).

The correct answer for most systems is: do not require atomic updates to both users simultaneously.

Rethink the data model:

  • Message count for User A is on User A's shard. Update it when User A sends a message.
  • Message count for User B is on User B's shard. Update it when User B receives a message.
  • These are two separate local transactions. They do not need to be atomic with each other.

If User A's count is updated but User B's count update fails (B's shard is down), the counts are temporarily inconsistent. A reconciliation job runs periodically and fixes discrepancies.

For a message counter, brief inconsistency is acceptable. If this were a financial balance transfer (must be atomic), the correct answer is the Saga pattern with compensation: debit A's shard, then credit B's shard. If the credit fails, run a compensating debit reversal on A's shard.

The key insight: Question whether atomicity is truly required, and if yes, use Saga not 2PC.


T3. "If I double the number of shards (from 4 to 8), will my query performance double?"

Why candidates get this wrong:
They say yes, expecting linear scaling.

The correct answer: it depends, and often no.

Where doubling shards DOES help:

  • Write throughput: Yes, approximately halved write load per shard. Single-shard write queries benefit.
  • Storage capacity: Yes, each shard holds half the data.
  • CPU for single-shard queries: Yes, each query touches half as many rows.

Where doubling shards does NOT help:

  • Scatter-gather queries: Now you query 8 shards instead of 4. The gather step is more complex and latency increases (more network round trips in the gather phase).
  • Cross-shard coordination: More shards = more coordination points = higher overhead for any operation requiring cross-shard communication.
  • Application connection overhead: Each application instance now maintains connection pools to 8 shards instead of 4.

The real answer:
Doubling shards approximately doubles throughput for single-shard queries and write operations, but can actually degrade performance for scatter-gather queries. The sweet spot for shard count depends heavily on your workload mix.


T4. "Can you use @Transactional in a Spring Boot application that uses a sharded database?"

Why candidates get this wrong:
They say either "yes, always" (too optimistic) or "no, never" (too pessimistic).

The correct answer: yes, within a single shard.

@Transactional in Spring Boot controls a transaction on a single JDBC connection to a single database. If your operation routes to a single shard (which should be true for most transactional operations if your shard key is chosen correctly), @Transactional works perfectly.

@Transactional   // This works: both operations go to user 5042's shard
public void updateUserProfileAndLogActivity(long userId) {
    userRepo.updateProfile(userId, newName);      // Shard for user 5042
    activityLog.record(userId, "PROFILE_UPDATED"); // Same shard for user 5042
}

Where @Transactional BREAKS:

@Transactional  // This is broken: two different shards involved
public void transferBalance(long fromUserId, long toUserId, BigDecimal amount) {
    accountRepo.debit(fromUserId, amount);   // May go to Shard 1
    accountRepo.credit(toUserId, amount);    // May go to Shard 4 - different connection!
}

Spring's @Transactional does not know that two different datasource connections were opened. It manages only the transaction on the primary connection. The secondary shard's operation is auto-committed and not part of the transaction.

The key principle: @Transactional is safe in sharded systems as long as all database operations within the transaction scope hit a single shard.


T5. "A competitor database product claims to support 'transparent sharding' — the application sees one database, sharding is invisible. Is this the ideal solution? What are the limitations?"

Why candidates get this wrong:
They say "yes, this is perfect, no more sharding headaches" without thinking through the limitations.

The correct answer:

Transparent sharding (as in Vitess, Citus, MongoDB mongos) is excellent but NOT a magic solution. Limitations include:

1. Scatter-gather still happens transparently:
If you run SELECT COUNT(*) FROM orders WHERE status = 'PENDING', the transparent layer runs scatter-gather on your behalf. Your application does not see the complexity, but the performance cost is real. "Transparent" does not mean "free".

2. Not all SQL works:
Complex JOINs between sharded tables, subqueries with cross-shard data, stored procedures, and some aggregation functions may not work or behave unexpectedly. The proxy must parse SQL and understand shard keys, which is computationally complex.

3. The proxy is a new SPOF and bottleneck:
Every single database query now goes through the proxy. The proxy must be highly available (clustered), highly performant, and correctly configured. One misconfiguration in Vitess routing rules can misdirect all traffic.

4. Operational complexity moves, not disappears:
You no longer write routing logic in application code. Instead, you configure and operate a complex distributed system (Vitess, Citus). The complexity still exists — it is just in a different layer.

5. Schema changes still require planning:
The proxy does not help you run ALTER TABLE across 500 shards safely. You still need the expand-contract pattern.

Conclusion: Transparent sharding is a massive improvement over application-level sharding for most teams. But it is not zero-cost. You must understand what it does under the hood to configure it correctly and debug performance problems.


T6. "You choose email as your shard key for the users table. What specific failure modes will you encounter over time?"

Why candidates answer partially:
They identify one or two problems but miss the full set.

The complete answer:

Failure 1: Email change triggers data migration
When a user changes their email (very common — people get married, change domains, leave companies), the shard key changes. The row must move from the old shard to the new shard. This is a cross-shard atomic write — extremely complex. If it fails halfway, the user's data is in a corrupt/lost state.

Failure 2: Non-uniform distribution with domain clustering
If you use the full email as shard key with range-based sharding:

  • All gmail.com users are adjacent in the range
  • A lot of India-based email might use yahoo.co.in or similar
  • Clustering by domain creates hot shards for popular email providers

If you hash the email: distribution is better but now you have Failure 1 when email changes.

Failure 3: PII in the routing layer
Email is Personally Identifiable Information. Your routing layer (Vitess config, application config, load balancer rules) now contains email addresses. This violates data minimization principles (GDPR) and increases your PII surface area. Security auditors will flag this.

Failure 4: Case sensitivity and normalization
Is User@Gmail.com the same as user@gmail.com? Email is case-insensitive by specification for the domain part. If your hash function is case-sensitive, User@Gmail.com and user@gmail.com route to different shards and you have two records for the same user.

Failure 5: Email reuse
Email addresses can be reused (a user deletes their account, another person creates a new account with the same email years later). With email as shard key, the new user routes to the same shard as the old user (correct) but also risks seeing stale data from the old user if soft-delete is used.

Bottom line: Use user_id (immutable, system-generated, no PII, uniform hash distribution). Look up user_id by email in a separate non-sharded index table used only at login time.


T7. "If one of your 10 shards goes down, what percentage of your users are affected? What is the actual user experience?"

Why candidates answer too simply:
They say "10% of users are affected" without thinking through the full experience.

The nuanced answer:

Immediate impact on reads: 10% of users cannot read their data. For a sharded e-commerce platform, users whose user_id hashes to the failed shard cannot see their orders, cart, or profile.

Immediate impact on writes: 10% of users cannot place orders, update profiles, etc.

Cross-shard impact: Operations that require the failed shard participate in fail too, even for users whose primary shard is healthy. Example: User A (healthy shard) tries to message User B (failed shard). The message send fails even though User A's shard is fine.

With replicas: If each shard has read replicas, reads continue from the replica. Writes fail until the primary is restored or a replica is promoted.

With replica promotion (Redis Sentinel / MySQL MHA):

  • Replica is promoted to primary: typically 30-60 seconds
  • During those 30-60 seconds: writes fail for 10% of users
  • After promotion: system fully operational

The user experience:

  • E-commerce: Users affected see error pages. Users on other shards shop normally.
  • Banking: Affected users cannot see balance or make transfers (critical). This triggers a P0 incident.
  • Social media: Affected users cannot post or see their feed (user-visible but lower severity).

The professional answer also covers prevention:

  • Replicas in multiple availability zones per shard
  • Automatic failover with health checks
  • Circuit breaker: quickly recognize the shard is down, return a clear error to affected users rather than hanging
  • Chaos engineering: test shard failures regularly in a staging environment to ensure failover works

Scenario-Based Design Problems

These are full system design scenarios where sharding is a key consideration.


S1. Design the database sharding strategy for a UPI payment system (like PhonePe) that processes 500 million transactions per month.

Key considerations:

Scale math first:
500M transactions/month = 17M/day = 190 transactions/second average.
Peak (salary day, festival): 10-20x average = 1,900 to 3,800 TPS.
Each transaction: ~5 KB stored (metadata, status, timestamps).
Monthly data: 500M x 5KB = 2.5 TB/month.
Annual data: 30 TB.

Shard key analysis:

  • Shard by sender_user_id: All transactions sent by a user on one shard. Fast for "show my transactions". But a power user (merchant with 10,000 daily transactions) creates a hot shard. And the query "show all transactions received by me" requires cross-shard lookup.

  • Shard by transaction_id (Snowflake ID, hash-based): Even distribution. But "show my transactions" requires scatter-gather — terrible for user-facing queries.

  • Best option: Shard by user_id (hash) for the user-facing transaction ledger. Maintain a separate event log sharded by transaction_id for the core payment processing.

Two separate stores:

  1. Transaction Event Log (append-only, sharded by transaction_id hash):

    • Immutable record of what happened
    • Used by settlement, reconciliation, regulatory reporting
    • Shard by transaction_id (Snowflake ID): even distribution, hot path for transaction processing
  2. User Ledger (sharded by user_id):

    • Each user's transaction history
    • Used for "show my transactions" in the app
    • Written to asynchronously after event log entry is confirmed
    • Sharded by user_id: single-shard for user queries

Compliance requirement: RBI mandates transaction data for 7 years. Use time-based partitioning WITHIN each shard (PostgreSQL range partition by month). Old partitions can be archived to cold storage while recent data stays hot.

Global IDs: Snowflake ID with PhonePe datacenter ID embedded. Transaction IDs must be globally unique and time-sortable.


S2. You are designing a notification system that must send 100 million push notifications per day for an Indian super-app. Messages must not be delivered twice and must be deliverable even if the user's device is offline. How do you shard this?

Core insight: This problem requires different shard keys for different operations.

The notifications table needs to satisfy:

  1. "Send notification to user X" — needs to find user X's notification queue
  2. "User X's unread notifications" — needs all of X's recent notifications
  3. "Has this notification been delivered?" — deduplication check
  4. "Send to all users in city Y" — broadcast, inherently scatter-gather

Shard by user_id for the notifications inbox:

Every user has a notification inbox. Sharding by user_id puts all of a user's notifications on one shard. When the user opens the app, one shard query returns all unread notifications.

Message deduplication:

Each notification has a notification_id (Snowflake ID). A separate Redis cluster checks deduplication:

SETNX "delivered:notification_id:XXXXX" 1 EX 259200  (3 days TTL)

If SETNX returns 0, this notification was already delivered. Do not deliver again.

Offline delivery:

Notifications for offline users are stored in the user's shard. The push service polls for pending notifications when the device reconnects. TTL on each notification (7 days: if the user doesn't open the app in 7 days, the notification is expired).

Broadcast notifications (system-wide announcements):

Do NOT store one row per user for broadcasts. That is 100 million rows per announcement. Instead:

  • Store one broadcast row with the message
  • Store per-user "seen" status only (a bitmap or a sparse table)
  • Users fetch broadcast notifications from a single-server "broadcasts" table + their own "seen" bitmask

What Interviewers Actually Look For

In a Junior/Mid-Level Candidate:

  • Can you explain the basic concept clearly? (phone book analogy, data split across servers)
  • Do you know the types? (range, hash, directory, geographic)
  • Can you identify a good shard key? (high cardinality, immutable, even distribution, query locality)
  • Do you know the basic problems? (hot shard, cross-shard queries, resharding complexity)

In a Senior Engineer:

  • Can you go beyond theory to tradeoffs? (why range is good for ranges but creates hot shards for sequential data)
  • Can you design for a specific scenario? (given this query pattern and this scale, choose and justify a shard key)
  • Do you understand the cross-shard transaction problem and Saga pattern?
  • Have you dealt with resharding? (consistent hashing, logical shards, online migration)

In a Staff/Principal Engineer:

  • Can you design the end-to-end sharding architecture including migration plan?
  • Do you know when NOT to shard? (read replicas, caching, partitioning might be enough)
  • Can you discuss real company architectures? (Facebook TAO, Vitess at YouTube, Instagram ID generation)
  • Do you understand the operational implications? (backup complexity, schema migrations, monitoring per shard)
  • Can you evaluate tradeoffs at the system level? (latency vs consistency, storage cost vs query simplicity)

Common Mistakes to Avoid

MistakeWhy It is WrongWhat to Say Instead
Saying sharding solves all database problemsSharding helps with scale but adds complexity. It is not for every problem."Sharding is a last resort after caching, read replicas, and vertical scaling are exhausted."
Recommending sharding for a system with 1 million users1 million users rarely needs sharding. Premature optimization."At this scale, a well-tuned single server with read replicas is sufficient. Sharding adds complexity you don't need yet."
Saying @Transactional works across shardsIt does not. It controls one database connection."Transactions within one shard work normally. Cross-shard operations require the Saga pattern."
Choosing email or name as shard keyMutable. PII. Non-uniform distribution."Always use an immutable system-generated ID as shard key."
Ignoring hot shards in the designEvery design has potential hot spots."The celebrity problem and sequential key insertion are the two main hot shard risks. I would mitigate them by..."
Proposing 2PC for cross-shard transactions2PC is blocking, has a coordinator SPOF, and does not scale."For cross-shard atomicity, I would use the Saga pattern with compensating transactions."
Confusing partitioning with shardingPartitioning is same server, different tables. Sharding is different servers."Partitioning is a performance optimization within one server. Sharding is a scaling architecture across multiple servers."
Forgetting about ID generationCross-shard ID collisions are a real problem."The first thing I would design is the global ID generation scheme — Snowflake IDs or Instagram-style IDs."
Ignoring the resharding problem"We'll add more shards when we need to" without a plan."I would use logical sharding from day one — 1000 logical shards on N physical servers — so adding capacity never requires a resharding event."
Designing for scatter-gather queries on hot pathsScatter-gather is 100x more expensive at scale."I would identify all scatter-gather query patterns and address them with a real-time aggregation pipeline or secondary indexes, not direct scatter-gather from the application."