Circuit Breaker Pattern: The Complete Guide
Who is this for? Whether you are a software engineer, architect, student, or a business professional trying to understand system reliability -- this guide covers everything you need to know about the Circuit Breaker Pattern from the ground up.
Table of Contents
- Origin & History
- The Simple Analogy
- What is the Circuit Breaker Pattern?
- How It Works -- The State Machine
- Sliding Window Types
- Why Is It Important?
- Use Cases & Applications
- The Resilience Pattern Family
- When NOT to Use Circuit Breaker
- Interview Preparation
- Common Pitfalls
- Circuit Breaker in Java
- Circuit Breaker in Python
- Circuit Breaker in Cloud-Native & Service Meshes
- Quick Reference Cheat Sheet
Origin & History
The Circuit Breaker Pattern was popularized by Michael T. Nygard in his landmark book "Release It! Design and Deploy Production-Ready Software" (2007). Martin Fowler documented it further in a widely read 2014 blog post, which became the de facto reference for engineers worldwide.
Netflix played a pivotal role in making this pattern mainstream in the microservices era. They built Hystrix, an open-source circuit breaker library, to manage the resilience of their massive distributed system serving millions of users. Netflix's engineering blog posts about Hystrix sparked industry-wide adoption.
Key Insight: Before this pattern was formalized, engineers handled failures with ad-hoc timeouts and retries -- resulting in "retry storms" and cascading failures that took down entire systems. The Circuit Breaker Pattern gave the industry a principled, structured approach to failure management.
The Simple Analogy (For Everyone)
Imagine the electrical circuit breaker box in your home or office:
- When everything is normal, electricity flows freely to all appliances. ✅
- If there is a power surge, a short circuit, or an overload, the breaker trips (switches off). ❌
- Once tripped, NO electricity flows to that circuit -- protecting your appliances and preventing a fire.
- After the electrician fixes the problem, they reset the breaker to restore power.
- A cautious electrician might first test with a small load before restoring full power.
Software circuit breakers work in much the same way:
- When everything is normal, requests flow freely to external services. ✅
- If a service keeps failing, the circuit breaker trips -- stopping all requests to that service. ❌
- This protects the rest of your system from being dragged down.
- After a timeout, the system tests with a few requests to see if the service recovered.
- If it did, normal traffic resumes. If not, the breaker stays open.
What is the Circuit Breaker Pattern?
The Circuit Breaker Pattern is a software design pattern used in distributed systems and microservices to detect repeated failures in external service calls and prevent the application from continuously retrying operations that are likely to fail. It wraps a potentially failing operation and monitors its health, automatically "opening" (stopping calls) when failures exceed a threshold, and "closing" (resuming calls) when the service recovers.
Core Problem it Solves:
In a distributed system, if Service A calls Service B and Service B is down, without a circuit breaker:
- Service A waits for each request to time out (e.g., 30 seconds each).
- Hundreds of threads pile up waiting.
- Service A exhausts its thread pool and crashes.
- Service C, which depends on Service A, also crashes.
- The entire platform goes down. This is called a Cascading Failure.
The Circuit Breaker stops this chain reaction at the source.
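A rough back-of-the-envelope calculation shows how quickly the thread pool runs dry. The sketch below is illustrative only: the function name, the pool size of 200 threads, and the arrival rates are assumptions, not measurements from any real system.

```python
from typing import Optional


def seconds_until_pool_exhausted(pool_size: int, requests_per_second: float,
                                 timeout_seconds: float) -> Optional[float]:
    """Estimate how long until every thread is blocked on a dead dependency.

    Each request holds a thread for the full timeout, so threads are consumed
    at the arrival rate and only released after `timeout_seconds`.
    Returns None if the pool never fills up.
    """
    # Little's law: steady-state blocked threads = arrival rate * hold time.
    blocked_at_steady_state = requests_per_second * timeout_seconds
    if blocked_at_steady_state < pool_size:
        return None
    return pool_size / requests_per_second


# 200 threads, 50 requests/second, 30-second timeout: the pool is gone in 4 seconds.
print(seconds_until_pool_exhausted(200, 50, 30))
```

With a breaker failing calls in milliseconds instead of 30 seconds, the hold time shrinks by several orders of magnitude and the same pool is no longer at risk.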
How It Works -- The State Machine
A circuit breaker operates as a finite state machine with three states:
              Failure threshold exceeded
      ┌─────────────────────────────────────────┐
      │                                         ▼
┌─────┴──────┐                            ┌────────────┐
│   CLOSED   │                            │    OPEN    │
│  (Normal)  │                            │ (Tripped)  │
└─────┬──────┘                            └─────┬──────┘
      │                                         │
      │  All calls pass through                 │  All calls fail immediately
      │  Failures are counted                   │  (no calls to service)
      │                                         │
      │                               Reset timeout expires
      │                                         │
      │                                         ▼
      │                               ┌──────────────────┐
      │   Test calls succeed          │    HALF-OPEN     │
      └───────────────────────────────│   (Testing...)   │
                                      └─────────┬────────┘
                                                │
                                         Test calls fail
                                                │
                                                ▼
                                          Back to OPEN
State 1: CLOSED (Normal Operation)
- What happens: All requests are forwarded to the downstream service normally.
- Monitoring: The circuit breaker counts failures (and/or measures response times) using a sliding window.
- Transition: If the failure rate or slow call rate exceeds the configured threshold → moves to OPEN.
- Analogy: Green traffic light -- traffic flows freely.
State 2: OPEN (Fault Detected -- Circuit Tripped)
- What happens: ALL requests are immediately rejected -- the downstream service is not called at all.
- Response: The caller immediately receives either a fallback response or an error (in milliseconds, not seconds).
- Timer: A reset timeout begins counting down (e.g., 30 seconds).
- Transition: When the timer expires → moves to HALF-OPEN.
- Analogy: Red traffic light -- all traffic is stopped.
- Key Benefit: "Fail fast" -- instead of each request waiting 30 seconds to timeout, it fails in <1ms. System resources are preserved.
State 3: HALF-OPEN (Probing Recovery)
- What happens: A limited, controlled number of test requests are allowed through to the downstream service.
- If test calls succeed: Breaker is confident the service has recovered → moves to CLOSED.
- If test calls fail: Service is still unhealthy → moves back to OPEN and resets the timer.
- Analogy: Amber/yellow traffic light -- proceed with caution, limited traffic only.
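The three states can be sketched in a few dozen lines of Python. This is a minimal illustration, not a production implementation: the class and parameter names are assumptions, it counts consecutive failures instead of using a sliding window, it is not thread-safe, and it does not cap the number of concurrent probe calls in HALF-OPEN.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 half_open_successes=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.half_open_successes = half_open_successes
        self._clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Reset timeout expired: allow probe calls through.
            self.state = State.HALF_OPEN
            self._successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self._failures += 1
        # Any failure while probing, or too many failures while closed, trips the breaker.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
            self._failures = 0

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self._successes += 1
            if self._successes >= self.half_open_successes:
                self.state = State.CLOSED
                self._failures = 0
        else:
            self._failures = 0  # only consecutive failures count in this sketch
```

Usage is a single wrapper call, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for any function that makes the remote call. The caller catches `CircuitOpenError` to return a fallback.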
Sliding Window Types
The circuit breaker uses a sliding window to evaluate whether it should trip. There are two types:
1. Count-Based Sliding Window
- Evaluates the last N calls (e.g., last 10 calls).
- If 5 out of the last 10 calls failed → 50% failure rate → breaker opens (assuming a 50% threshold).
- Best for: Lower traffic services where you want a fixed number of samples before deciding.
- Example: A batch job that runs 20 times an hour.
2. Time-Based Sliding Window
- Evaluates all calls made within the last N seconds (e.g., last 60 seconds).
- If 50% of calls in the last 60 seconds failed → breaker opens.
- Best for: High-traffic services, where a fixed number of calls spans too short a time to be representative.
- Example: An API handling 10,000 requests per second.
Count-Based (last 10 calls):
[✓][✓][✗][✗][✗][✗][✗][✓][✓][✗] → 6/10 = 60% failure → OPEN
Time-Based (last 60 seconds):
|------ 60 seconds -------|
✓✓✗✓✗✗✗✗✓✗✗✓✗✗✗✗✓✗✗✗ → Count failures/total → compare to threshold
Recommendation: Use time-based windows for production microservices. Use count-based for lower-traffic or batch processes.
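Both window types can be sketched with a `deque`. The class names below are assumptions for illustration. The time-based version stores every call, which is simple but memory-hungry at high traffic; production libraries typically aggregate calls into per-second buckets instead.

```python
import time
from collections import deque


class CountBasedWindow:
    """Failure rate over the last `size` calls."""

    def __init__(self, size):
        self._outcomes = deque(maxlen=size)  # True means the call failed

    def record(self, failed):
        self._outcomes.append(failed)  # oldest outcome drops off automatically

    def failure_rate(self):
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)


class TimeBasedWindow:
    """Failure rate over calls recorded in the last `seconds` seconds."""

    def __init__(self, seconds, clock=time.monotonic):
        self._seconds = seconds
        self._clock = clock
        self._events = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, failed):
        self._events.append((self._clock(), failed))

    def failure_rate(self):
        cutoff = self._clock() - self._seconds
        # Evict calls that have aged out of the window.
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        if not self._events:
            return 0.0
        return sum(failed for _, failed in self._events) / len(self._events)
```

The breaker compares `failure_rate()` against its configured threshold after each recorded call.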
Why Is It Important?
For Engineers & Architects:
- Prevents cascading failures: One bad service can't take down the entire system.
- Preserves resources: No wasted threads, connections, or memory waiting for a dead service.
- Enables graceful degradation: System continues to work (partially) even during outages.
- Reduces Mean Time To Recovery (MTTR): The affected service gets breathing room to recover without being hammered by requests.
- Improves observability: Breaker state is a real-time health signal of your system.
For Business Stakeholders:
- Protects revenue: An open circuit breaker on a non-critical feature doesn't bring down checkout.
- Maintains user trust: Users see a friendly message instead of a crash or indefinite spinner.
- Reduces incident cost: Faster automated recovery means less engineer time spent firefighting at 2 AM.
- Supports SLA compliance: By containing failures to the affected feature, circuit breakers help a system stay within availability targets such as 99.9%.
For Non-Technical Readers:
Think of it as an automatic safety valve in a factory. If one machine starts overheating, the safety valve shuts it down automatically -- preventing the entire factory from catching fire. The rest of the factory keeps running. Once the machine is repaired, the valve opens again.
Use Cases & Applications
Microservices & Distributed Systems
The most common use case. In a microservices architecture with dozens or hundreds of services, circuit breakers are placed around every inter-service HTTP call.
Example: Netflix runs hundreds of microservices. Without circuit breakers, a failing Recommendation Service could cascade to the Homepage Service, then to the API Gateway, and take down the entire Netflix app for millions of users. With Hystrix (the circuit breaker library Netflix built), only the recommendations widget goes blank -- everything else keeps working.
Third-Party API Integrations
When your application depends on external APIs (payment gateways, mapping services, SMS providers, weather APIs), those APIs can go down or rate-limit you.
Example: A food delivery app uses Google Maps API for routing. If Google Maps is down, without a circuit breaker, every order placement attempt waits 30 seconds and times out. With a circuit breaker, the breaker opens once the failure threshold is reached, and subsequent requests immediately get the fallback: a pre-cached route estimate. Orders keep flowing.
Database & Cache Connections
Databases can become overloaded, go into maintenance, or have network issues. Circuit breakers prevent your application from overwhelming a struggling database with thousands of connection attempts.
Example: A database failover (switching from primary to replica) can take 10-30 seconds. A circuit breaker opens during this window; the fallback returns cached data where possible and queues writes. The breaker closes once the new primary is available.
Artificial Intelligence & Machine Learning
This is a rapidly growing application area as AI systems become production infrastructure:
AI Model Serving
- Problem: ML model inference endpoints (e.g., TensorFlow Serving, TorchServe, Triton) can become overloaded or fail, especially with large models (LLMs).
- Solution: Circuit breaker wraps model inference calls. If the primary LLM endpoint fails → fallback to a smaller, faster model → fallback to a rule-based response → fallback to a cached response.
- Example: A customer service chatbot powered by GPT-4. If the OpenAI API is down, the circuit breaker opens and routes to a smaller local model or a pre-defined FAQ response -- the chatbot keeps working, with reduced capability.
AI Agent Pipelines & Orchestration
- Problem: Modern AI agents chain multiple LLM calls, tool calls, and external API calls. A failure in any step can hang the entire pipeline.
- Solution: Circuit breakers at each tool/service call boundary in the agent pipeline. If the web search tool fails repeatedly → circuit opens → agent continues without web search capability.
- Frameworks: LangChain, LlamaIndex, and AutoGen workflows benefit from circuit breakers around external tool calls.
LLM API Gateway
- Problem: Companies using cloud LLM APIs (OpenAI, Anthropic, Cohere, Google Gemini) face rate limits and outages.
- Solution: Circuit breaker + provider fallback. If OpenAI rate-limits you → circuit opens → route to Anthropic Claude → if that also fails → route to a local model.
- Real example: LiteLLM (an LLM proxy) implements circuit-breaker-style fallbacks across multiple AI providers.
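A provider fallback chain can be sketched as follows. Everything here is hypothetical: the providers are plain callables rather than real SDK clients, and `ProviderBreaker` is a deliberately tiny breaker that reopens fully after its cooldown instead of probing in a Half-Open state.

```python
import time


class ProviderBreaker:
    """Tiny per-provider breaker: opens after N consecutive failures."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._open_until = 0.0

    def allows_call(self):
        return self._clock() >= self._open_until

    def record(self, ok):
        if ok:
            self._failures = 0
            return
        self._failures += 1
        if self._failures >= self.threshold:
            self._open_until = self._clock() + self.cooldown
            self._failures = 0


def complete_with_fallback(prompt, providers, breakers):
    """Try each (name, call) pair in order, skipping providers whose breaker is open."""
    for name, call in providers:
        breaker = breakers[name]
        if not breaker.allows_call():
            continue  # fail fast: do not even attempt this provider
        try:
            result = call(prompt)
        except Exception:
            breaker.record(ok=False)
            continue
        breaker.record(ok=True)
        return name, result
    # Last resort when every provider is failing or skipped.
    return "static", "Sorry, this feature is temporarily unavailable."
```

Returning the provider name alongside the result lets the caller log which tier actually served the request, which is useful when tracking how often the system runs degraded.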
Vector Database & Embedding Services
- Problem: RAG (Retrieval-Augmented Generation) pipelines call embedding services and vector databases (Pinecone, Weaviate, ChromaDB). If these fail, the entire RAG pipeline fails.
- Solution: Circuit breaker around embedding service calls. Fallback: use keyword search (BM25) instead of vector search when the vector DB is down.
ML Data Pipelines & Feature Stores
- Problem: Streaming ML pipelines (Apache Kafka + Flink/Spark) read from feature stores and data sources that can go offline.
- Solution: Circuit breakers prevent the pipeline from endlessly retrying failed data source connections, allowing the pipeline to pause gracefully and resume when the source recovers.
AI Monitoring & Alerting
- Problem: AI observability tools (Arize, WhyLabs, Evidently) that monitor model drift call data APIs that can fail.
- Solution: Circuit breakers ensure monitoring failures don't affect the primary inference path.
Financial Services & Banking
- Payment processing: Circuit breakers around payment gateway calls prevent checkout failures from propagating to order management.
- Fraud detection: If the real-time fraud scoring service is slow, the circuit breaker can fall back to a simplified rule-based check rather than blocking all transactions.
- Market data feeds: Trading platforms use circuit breakers to handle momentary disruptions in live market data streams.
- Core banking systems: Integration with legacy core banking systems (often slow and unreliable) is protected by circuit breakers.
Healthcare
- EHR integrations: Hospital systems integrate with Electronic Health Records via HL7/FHIR APIs. Circuit breakers prevent a slow EHR from blocking patient admission workflows.
- Medical imaging: Radiology systems calling AI-powered diagnostic services (e.g., detecting tumors in scans) use circuit breakers to fall back to manual review queues when AI services are unavailable.
- Pharmacy systems: Drug interaction checks calling external databases use circuit breakers with cached last-known data as fallback.
Mobile & Web Applications (Backend)
- Social features: If the "likes/comments" service is down, the circuit breaker ensures the main feed still loads (without like counts) rather than the entire app crashing.
- Notifications: Push notification services (APNs, FCM) can throttle or go down. Circuit breakers prevent notification failures from affecting core application logic.
- Search: If the search service is overloaded, the circuit breaker can fall back to basic database queries.
IoT & Industrial Systems
- Sensor data ingestion: IoT devices send telemetry to cloud services. Circuit breakers handle intermittent connectivity of field devices.
- Industrial automation: SCADA systems calling remote PLCs (Programmable Logic Controllers) use circuit breakers to handle network interruptions in factory environments.
- Smart city infrastructure: Traffic management systems calling sensor APIs use circuit breakers to maintain partial functionality during sensor outages.
Gaming
- Matchmaking services: If the matchmaking service is overloaded, the circuit breaker opens and falls back to quicker, less optimal matching to keep players in games.
- Leaderboards & stats: Non-critical game services (leaderboards, achievements) use circuit breakers so their failures don't affect core gameplay.
- Anti-cheat services: If the anti-cheat API is slow, circuit breakers ensure it doesn't delay game sessions.
E-Commerce & Retail
- Inventory service: Prevents a failing inventory check from blocking all add-to-cart operations (can fall back to "in stock" assumption for low-risk items).
- Recommendation engine: Recommendations widget uses a circuit breaker -- if it fails, show bestsellers instead.
- Shipping calculator: If real-time shipping rate API is down, show estimated rates from last-known data.
Cloud & DevOps
- Kubernetes health probes: Circuit breakers complement Kubernetes readiness probes -- the breaker stops traffic at the application level, while readiness probes stop traffic at the infrastructure level.
- CI/CD pipelines: Build pipelines calling external artifact registries, test services, or deployment targets use circuit breakers to fail fast rather than hanging.
The Resilience Pattern Family -- Where Circuit Breaker Fits
The Circuit Breaker Pattern is one of several resilience patterns. They are most effective when used together as a layered defense:
 Incoming Request
         │
         ▼
  ┌─────────────┐
  │    Rate     │  ← Prevent overwhelming your own service
  │   Limiter   │
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │   Timeout   │  ← Never wait forever for a response
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │    Retry    │  ← Handle brief, transient failures
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │   Circuit   │  ← Stop calling a consistently failing service
  │   Breaker   │
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │  Bulkhead   │  ← Isolate resource pools per service
  └──────┬──────┘
         │
         ▼
  ┌─────────────┐
  │  Fallback   │  ← Graceful degraded response
  └─────────────┘
| Pattern | What It Does | Analogy |
|---|---|---|
| Timeout | Sets maximum wait time for a response | "I'll wait 5 minutes, then leave" |
| Retry | Automatically retries on transient failures | "Let me try knocking again" |
| Circuit Breaker | Stops calls to a failing service entirely | "That door is broken, I'll stop knocking" |
| Bulkhead | Isolates resource pools so one failure doesn't exhaust all resources | Ship compartments -- one leak doesn't sink the ship |
| Rate Limiter | Controls how many requests are sent per second | "Only 100 people can enter per minute" |
| Fallback | Provides alternative response when primary fails | "Out of coffee? Here's tea instead" |
When NOT to Use Circuit Breaker
Knowing when not to apply a pattern is just as important as knowing when to use it.
❌ Internal, In-Process Method Calls
If you are calling a method in the same application process (not a network call), a circuit breaker adds unnecessary overhead and complexity. Use standard exception handling instead.
❌ Simple, Single-Instance Applications
If your application is a small monolith that only makes one or two external calls and doesn't need high availability guarantees, the overhead of implementing and maintaining a circuit breaker may not be justified.
❌ Operations Where Partial Execution is Dangerous
Some operations must either fully succeed or fully fail -- there is no safe "fallback." For example, a funds transfer in which Account A has been debited but the credit to Account B failed. A circuit breaker with a fallback here could cause data inconsistency. Use sagas or distributed transactions instead for such scenarios.
❌ When the Failure IS the Expected Response
If you are calling a service where "not found" or "unavailable" is a normal, expected business response (not an infrastructure failure), wrapping it in a circuit breaker is incorrect. Those are business exceptions, not infrastructure failures.
❌ When You Can't Define a Meaningful Fallback
If a circuit breaker opens and you have no useful fallback (no cached data, no default, no secondary service), users still see an error. The breaker still protects the caller's resources by failing fast, but it does not improve what the user sees. Consider whether the circuit breaker is adding enough value, or whether better upstream design (caching, data replication) is the real solution.
Interview Preparation: Key Questions & Extensive Answers
Q1. What is the Circuit Breaker Pattern and why do we need it?
Answer:
The Circuit Breaker Pattern is a software design pattern used in distributed systems and microservices to detect repeated failures and prevent the application from repeatedly trying to execute an operation that is likely to fail. The name comes from the electrical engineering concept -- just like a home circuit breaker trips to prevent an electrical overload and potential fire, a software circuit breaker "trips" to prevent cascading failures.
Why we need it:
In modern distributed systems, services communicate over a network. Networks are inherently unreliable. If Service A calls Service B, and Service B is down or slow, Service A will keep waiting. If many threads in Service A are waiting, Service A itself becomes slow or unresponsive. This can cascade -- Service C depending on Service A also fails, and so on. The Circuit Breaker Pattern stops this chain reaction early by detecting the failure and short-circuiting the calls immediately, returning an error or a fallback response without wasting resources.
Q2. Explain the three states of a Circuit Breaker in detail.
Answer:
1. Closed (Normal Operation)
- All requests are passed through to the remote service.
- The circuit breaker monitors the number of failures (and/or success rate, response time).
- If failures stay within the acceptable threshold, the breaker remains closed.
- Think of this as: "Everything is fine, traffic is flowing."
2. Open (Fault Detected)
- The failure threshold has been breached (e.g., 5 consecutive failures or 50% failure rate in the last 10 seconds).
- All requests are immediately rejected without even attempting to contact the failing service.
- The circuit breaker starts a timer (the "reset timeout").
- Think of this as: "The fuse has blown. We're not even trying to send electricity until the problem is fixed."
- Benefit: Fails fast -- users/callers get an immediate response (even if it's an error or fallback), rather than waiting for a timeout.
3. Half-Open (Testing Recovery)
- After the reset timeout expires, the breaker moves to Half-Open.
- It allows a small, controlled number of test requests to pass through.
- If these test requests succeed → breaker transitions back to Closed (service has recovered).
- If these test requests fail → breaker transitions back to Open and resets the timer.
- Think of this as: "Let me carefully test if the electricity is safe again before fully turning it back on."
Q3. How do you configure thresholds for a Circuit Breaker? What factors do you consider?
Answer:
Thresholds are configurable parameters that govern when the breaker opens, how long it stays open, and how it tests recovery. Key thresholds include:
| Parameter | Description | Example Value |
|---|---|---|
| Failure Rate Threshold | % of failed calls that triggers the breaker to open | 50% |
| Slow Call Rate Threshold | % of slow calls (calls exceeding the slow call duration threshold) that triggers the breaker to open | 80% |
| Slow Call Duration Threshold | Duration above which a call is considered "slow" | 2 seconds |
| Minimum Number of Calls | Minimum calls needed before thresholds are evaluated (avoids false trips on small samples) | 10 calls |
| Wait Duration in Open State | How long the breaker stays open before moving to Half-Open | 30 seconds |
| Permitted Calls in Half-Open | Number of test calls allowed in Half-Open state | 3 calls |
Factors to consider:
- Nature of the service: A payment service needs tighter thresholds (less tolerance) than a recommendation service.
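- SLA requirements: A 99.9% uptime SLA allows under 9 hours of downtime per year, so the breaker should trip early enough that a failing dependency does not consume that budget.
- Traffic volume: High-traffic services usually suit a time-based sliding window; low-traffic services suit a count-based window with a sensible minimum number of calls.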
- Recovery speed of the dependency: If the dependency recovers quickly, keep the reset timeout short.
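The parameters in the table can be grouped into one configuration object with a single trip decision. The sketch below is an assumption-laden illustration: the field names mirror the table and do not belong to any particular library, and the decision uses "greater than or equal to" for both thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    failure_rate_threshold: float = 50.0        # percent
    slow_call_rate_threshold: float = 80.0      # percent
    slow_call_duration_threshold: float = 2.0   # seconds
    minimum_number_of_calls: int = 10
    wait_duration_in_open_state: float = 30.0   # seconds
    permitted_calls_in_half_open: int = 3


def should_open(config, calls):
    """Decide whether to trip, given (failed, duration_seconds) per call in the window."""
    total = len(calls)
    if total < config.minimum_number_of_calls:
        return False  # too few samples to judge; avoids false trips
    failure_rate = 100.0 * sum(1 for failed, _ in calls if failed) / total
    slow_rate = 100.0 * sum(
        1 for _, duration in calls if duration > config.slow_call_duration_threshold
    ) / total
    return (failure_rate >= config.failure_rate_threshold
            or slow_rate >= config.slow_call_rate_threshold)
```

Keeping the configuration immutable (`frozen=True`) makes it safe to share one instance across several breakers.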
Q4. What is the difference between Circuit Breaker, Retry, and Timeout patterns?
Answer:
These three patterns are often used together and complement each other, but they solve different problems:
| Pattern | Purpose | When to Use |
|---|---|---|
| Timeout | Stop waiting after a set time for a response | Always -- prevents indefinite blocking |
| Retry | Automatically retry a failed operation a few times | For transient/temporary failures (e.g., brief network hiccup) |
| Circuit Breaker | Stop retrying after repeated failures; fail fast | For sustained/repeated failures (service is down) |
Key Difference -- Circuit Breaker vs Retry:
- The Retry pattern assumes failures are transient and temporary. It will keep retrying, which can amplify the load on an already-struggling service ("retry storm").
- The Circuit Breaker pattern recognizes when a service is consistently failing and stops sending requests altogether, giving the service time to recover.
- Best Practice: Use them together -- Retry for transient failures, Circuit Breaker to stop retrying when failures are persistent.
Example Flow:
Request → Timeout (2s) → Retry (up to 3 times) → Circuit Breaker (open after 5 failures) → Fallback
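The example flow can be sketched as one function. This is a simplified illustration under stated assumptions: `operation` is any callable that accepts a `timeout` argument (as most HTTP clients do), `SimpleBreaker` counts consecutive failures only, and the Half-Open recovery step is omitted for brevity.

```python
import time


class SimpleBreaker:
    """Opens after `threshold` consecutive failures (recovery omitted)."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.consecutive_failures = 0

    def is_open(self):
        return self.consecutive_failures >= self.threshold

    def record_failure(self):
        self.consecutive_failures += 1

    def record_success(self):
        self.consecutive_failures = 0


def call_with_resilience(operation, breaker, fallback, timeout=2.0,
                         max_attempts=3, backoff=0.1, sleep=time.sleep):
    for attempt in range(max_attempts):
        if breaker.is_open():
            break  # fail fast: never retry against a tripped breaker
        try:
            result = operation(timeout=timeout)
        except Exception:
            breaker.record_failure()
            if attempt < max_attempts - 1:
                sleep(backoff * 2 ** attempt)  # exponential backoff between attempts
            continue
        breaker.record_success()
        return result
    return fallback()
```

Checking the breaker before every attempt, not just before the first, is the design choice that prevents a retry storm: once the breaker trips mid-request, the remaining retries are skipped.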
Q5. What are fallback strategies in a Circuit Breaker?
Answer:
A fallback is the response or action taken when the circuit breaker is open and the primary service call is not attempted. Good fallback strategies include:
- Return a cached/stale response: Serve the last known good data. Good for read-heavy scenarios (e.g., product catalog, user profile).
- Return a default/safe response: Return a sensible default (e.g., empty list, zero, or a placeholder message).
- Call a backup/secondary service: Route to a simpler or degraded service that can still partially serve the request.
- Queue the request: Store the request and process it later when the service recovers (good for non-time-sensitive writes).
- Return a user-friendly error message: Inform the user that the feature is temporarily unavailable, rather than showing a cryptic error.
Which fallback to choose?
- Consider data freshness requirements -- if stale data is dangerous (e.g., stock price), don't cache it.
- Consider user experience -- silent degradation is better than a crash.
- Consider business impact -- a payment service needs a different fallback than a "recommended videos" widget.
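The first two strategies (cached response, then safe default) can be combined in a small wrapper. This is a sketch with assumed names; the cache is unbounded and has no expiry, so a real implementation would add a size limit and a maximum staleness.

```python
class CachedFallback:
    """Wraps a fetch function and serves the last good value on failure."""

    def __init__(self, fetch, default=None):
        self._fetch = fetch
        self._default = default
        self._cache = {}

    def get(self, key):
        """Return (value, source), where source is "live" or "fallback"."""
        try:
            value = self._fetch(key)
        except Exception:
            # Stale data if we have it, otherwise the safe default.
            return self._cache.get(key, self._default), "fallback"
        self._cache[key] = value  # remember the last known good response
        return value, "live"
```

Because a breaker in the Open state typically signals rejection by raising an exception, passing a breaker-wrapped function as `fetch` gives the same fallback path for both real failures and fast rejections.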
Q6. How does a Circuit Breaker differ from a Load Balancer?
Answer:
- A Load Balancer distributes incoming traffic across multiple instances of a service to balance the load. Many load balancers also remove unhealthy instances using health checks, but they do not stop callers from sending requests when the service as a whole is failing.
- A Circuit Breaker monitors the health of a downstream service and stops sending requests when that service is failing, regardless of how many instances exist.
- They are complementary: the load balancer distributes load, and the circuit breaker protects against a service that is consistently failing across all its instances.
Q7. How would you monitor a Circuit Breaker in production?
Answer:
Monitoring is critical. Key metrics to track:
- State transitions: How often does the breaker move from Closed → Open? Frequent transitions signal instability.
- Failure rate: What percentage of calls are failing? Rising failure rate is a leading indicator.
- Fallback invocation count: How often is the fallback being used? High count = the breaker is frequently open.
- Response time (P50, P95, P99): Slow calls can trigger breakers even without hard failures.
- Recovery time: How long does the breaker stay in the Open state before the service recovers?
Tools:
- Prometheus + Grafana: Standard for exposing and visualizing circuit breaker metrics.
- Resilience4j Actuator (Java): Exposes metrics via Spring Boot Actuator endpoints.
- Datadog / New Relic / Dynatrace: APM tools with built-in circuit breaker visibility.
- PagerDuty / OpsGenie: Alert on breaker state changes.
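Before wiring up any of these tools, it helps to decide what the breaker should record. The sketch below is a plain in-memory collector with assumed names; in practice each counter would be exported through whichever metrics client the team already uses.

```python
from collections import Counter


class BreakerMetrics:
    def __init__(self):
        self.transitions = Counter()  # (from_state, to_state) -> count
        self.calls = Counter()        # "success" / "failure" / "rejected" -> count
        self.fallbacks = 0

    def on_transition(self, old_state, new_state):
        self.transitions[(old_state, new_state)] += 1

    def on_call(self, outcome):
        self.calls[outcome] += 1

    def on_fallback(self):
        self.fallbacks += 1

    def failure_rate(self):
        # Rejected calls never reached the service, so they are excluded.
        attempted = self.calls["success"] + self.calls["failure"]
        return self.calls["failure"] / attempted if attempted else 0.0
```

Tracking rejected calls separately from failures matters: a spike in rejections with a flat failure count means the breaker is open and doing its job, not that the dependency is getting worse.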
Q8. Can you walk through a real-world scenario where a Circuit Breaker saved the day?
Answer:
Scenario: E-Commerce Checkout Service
An e-commerce platform has a Checkout Service that calls an external Payment Gateway API. During a Black Friday sale, the Payment Gateway starts experiencing heavy load and begins timing out.
Without Circuit Breaker:
- Every checkout attempt waits 30 seconds for the Payment Gateway to respond (or time out).
- Thousands of users are clicking "Pay", creating thousands of threads waiting.
- The Checkout Service runs out of threads (thread pool exhaustion).
- The Checkout Service itself crashes.
- Other services depending on Checkout (Order Management, Inventory) also fail.
- The entire platform goes down -- massive revenue loss.
With Circuit Breaker:
- The first 5 calls to the Payment Gateway time out → breaker opens.
- All subsequent payment requests fail fast (in milliseconds) with a friendly message: "Payment processing is temporarily unavailable. Please try again in a few minutes."
- The Checkout Service stays healthy and responsive.
- After 30 seconds, the breaker moves to Half-Open, tests a few calls.
- When the Payment Gateway recovers, the breaker closes and normal operation resumes.
- Result: The system degraded gracefully instead of crashing.
Q9. How does Circuit Breaker fit into microservices design principles?
Answer:
The Circuit Breaker Pattern directly supports several key microservices design principles:
- Fault Isolation: Failures in one service don't cascade to others.
- Resilience: The system continues to function (in a degraded mode) even when components fail.
- Fail Fast: Rather than waiting for timeouts, the system responds immediately.
- Graceful Degradation: Users still get a partial or fallback experience rather than a full outage.
- Observability: State transitions and metrics provide insight into system health.
It is a core part of resilience patterns alongside Bulkhead, Retry, Timeout, and Rate Limiter.
Q10. What is the Bulkhead Pattern and how does it relate to Circuit Breaker?
Answer:
The Bulkhead Pattern (named after ship hull compartments) isolates resources (thread pools, connection pools, semaphores) for different services or operations, so that a failure in one doesn't exhaust resources for others.
Relationship to Circuit Breaker:
- Circuit Breaker stops sending requests to a failing service.
- Bulkhead limits HOW MANY concurrent requests can go to a service at once.
- Together, they form a strong defense: Bulkhead prevents resource exhaustion; Circuit Breaker stops unnecessary calls.
Example:
- Without Bulkhead: A slow Payment service consumes all 200 threads in your app, leaving nothing for other services.
- With Bulkhead: Payment service is limited to 20 threads. If it's slow, only those 20 threads are affected. Other services still have 180 threads.
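A lightweight bulkhead can be sketched with a semaphore. Note the assumption: the example above describes dedicated thread pools, while this sketch caps concurrency on shared threads instead, which is a related but different variant. The names are illustrative.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when all slots for this dependency are in use."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately rather than queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot; rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Hypothetical limit matching the example above: at most 20 concurrent payment calls.
payment_bulkhead = Bulkhead(max_concurrent=20)
```

Rejections from the bulkhead can be counted as failures by a surrounding circuit breaker, so sustained saturation eventually trips the breaker as well.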
Common Pitfalls: Extensive Guide
Pitfall 1: Setting Thresholds Without Understanding Traffic Patterns
The Problem:
Many engineers set arbitrary thresholds (e.g., "open after 5 failures") without analyzing actual traffic patterns. In a high-traffic system, 5 failures per second might be normal noise; in a low-traffic system, 5 failures might mean a total outage.
Consequence:
- Too sensitive (low threshold): Breaker opens on minor, transient hiccups → users see errors even when the service is mostly fine. Known as "flapping" -- breaker constantly opens and closes.
- Too lenient (high threshold): Breaker takes too long to open → many users hit the failing service, wasting resources and degrading experience.
Solution:
- Use percentage-based thresholds (e.g., 50% failure rate) rather than absolute counts, combined with a minimum call volume requirement.
- Analyze baseline metrics (P99 response time, failure rate) before setting thresholds.
- Use sliding window evaluation (count-based or time-based) for more accurate measurement.
Pitfall 2: No Fallback Strategy (Failing Without Graceful Degradation)
The Problem:
The circuit breaker opens, all requests are rejected, but there is no fallback. The caller receives a raw exception or a cryptic 500 error.
Consequence:
Users see an ugly error page. The calling service may also fail, propagating the error upward. The purpose of the circuit breaker (protecting the system) is partially defeated because the upstream caller still breaks.
Solution:
Always define a fallback for every circuit breaker:
- Cached data, default response, secondary service, or a user-friendly degraded response.
- Treat fallback logic as a first-class citizen in your design, not an afterthought.
Pitfall 3: Using a Single Circuit Breaker for Everything
The Problem:
Some teams use one global circuit breaker for all outbound calls, or per-service without per-operation granularity.
Consequence:
If your User Service has both a getProfile() (fast, read-only) and a generateReport() (slow, heavy) endpoint, failures in generateReport() will trip the breaker and block getProfile() calls too -- even though getProfile() works fine.
Solution:
Use per-operation or per-endpoint circuit breakers, not just per-service. This gives finer control and prevents healthy operations from being blocked by failing ones.
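One way to get this granularity is a registry keyed by service and operation. The sketch below uses assumed names, and `OperationBreaker` is only a placeholder that counts failures; any real breaker class could be passed as the factory.

```python
import threading


class OperationBreaker:
    """Placeholder breaker that only tracks failures against a threshold."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.failures = 0

    def record_failure(self):
        self.failures += 1

    def is_open(self):
        return self.failures >= self.threshold


class BreakerRegistry:
    """Hands out one breaker per (service, operation) pair."""

    def __init__(self, factory=OperationBreaker):
        self._factory = factory
        self._breakers = {}
        self._lock = threading.Lock()

    def get(self, service, operation):
        key = (service, operation)
        with self._lock:  # avoid creating two breakers for the same key
            if key not in self._breakers:
                self._breakers[key] = self._factory()
            return self._breakers[key]
```

With this in place, `registry.get("user-service", "generateReport")` can trip without affecting `registry.get("user-service", "getProfile")`.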
Pitfall 4: Not Testing Circuit Breaker Behavior (Lack of Chaos Engineering)
The Problem:
Circuit breakers are configured but never actually tested in a realistic failure scenario. Teams assume the circuit breaker will work as expected.
Consequence:
When a real outage occurs, the breaker may not behave as expected -- incorrect configuration, wrong exception types being caught, or fallback logic that itself has bugs.
Solution:
- Practice Chaos Engineering (e.g., using tools like Chaos Monkey, Gremlin, or Toxiproxy) to simulate downstream failures in staging/pre-production.
- Write integration tests that specifically test circuit breaker state transitions.
- Regularly drill failure scenarios so the team understands how the system behaves.
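State-transition tests are easier to write when the breaker's clock can be injected, so the test never sleeps. The sketch below assumes a hypothetical `TinyBreaker` with an injectable clock; a real library may expose a different hook for this.

```python
class FakeClock:
    """Manually advanced clock so tests never sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


class TinyBreaker:
    """Hypothetical minimal breaker used only to demonstrate the test style."""

    def __init__(self, fail_max: int, reset_timeout: float, clock) -> None:
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, func):
        state = self.state()
        if state == "OPEN":
            raise RuntimeError("circuit open")
        try:
            result = func()
        except Exception:
            self.failures += 1
            if state == "HALF_OPEN" or self.failures >= self.fail_max:
                self.opened_at = self.clock()  # (re)open and restart the timer
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result


def failing():
    raise ConnectionError("simulated outage")


def attempt(breaker, func) -> str:
    """Call through the breaker, swallow the error, report the resulting state."""
    try:
        breaker.call(func)
    except Exception:
        pass
    return breaker.state()


clock = FakeClock()
breaker = TinyBreaker(fail_max=2, reset_timeout=30, clock=clock)
observed = [attempt(breaker, failing), attempt(breaker, failing)]  # trips on 2nd failure
clock.advance(30)
observed.append(breaker.state())                 # wait elapsed, probing allowed
observed.append(attempt(breaker, failing))       # probe fails, breaker reopens
clock.advance(30)
observed.append(attempt(breaker, lambda: "ok"))  # probe succeeds, breaker closes
```

The `observed` list records the full Closed, Open, Half-Open, Open, Closed cycle, which is exactly what an integration test would assert on.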
Pitfall 5: Ignoring Slow Calls (Only Watching for Errors)
The Problem:
Many implementations only trip the breaker on exceptions (hard failures). They ignore slow calls -- requests that succeed eventually but take 30 seconds to respond.
Consequence:
A service that is very slow (but not outright failing) can consume all thread pool resources, effectively causing an outage, while the circuit breaker remains happily closed because there are "no errors."
Solution:
Configure Slow Call Rate Threshold alongside failure rate. For example: "If more than 60% of calls take longer than 2 seconds, open the circuit." Libraries like Resilience4j support this natively.
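The rule quoted above can be approximated as follows. This is only a sketch of the idea, not how Resilience4j implements it; `SlowCallWindow` and the sample durations are made up for the example.

```python
from collections import deque


class SlowCallWindow:
    """Opens on too many slow calls, even if none of them raised an error."""

    def __init__(self, size: int, slow_duration_s: float, slow_rate_threshold: float):
        self.durations = deque(maxlen=size)
        self.slow_duration_s = slow_duration_s
        self.slow_rate_threshold = slow_rate_threshold  # percent, 0-100

    def record(self, duration_s: float) -> None:
        self.durations.append(duration_s)

    def slow_rate(self) -> float:
        if not self.durations:
            return 0.0
        slow = sum(1 for d in self.durations if d > self.slow_duration_s)
        return 100.0 * slow / len(self.durations)

    def should_open(self) -> bool:
        # Wait until the window is full before evaluating.
        window_full = len(self.durations) == self.durations.maxlen
        return window_full and self.slow_rate() > self.slow_rate_threshold


window = SlowCallWindow(size=5, slow_duration_s=2.0, slow_rate_threshold=60)
for duration in [0.3, 2.5]:
    window.record(duration)
decision_after_two = window.should_open()   # window not full yet

for duration in [3.1, 30.0, 4.2]:
    window.record(duration)
decision_after_five = window.should_open()  # 4 of 5 calls exceeded 2 seconds
```

Every call in this example "succeeded", yet the breaker would open because 80% of them were slower than the 2-second limit.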
Pitfall 6: Not Propagating the Correct Exceptions
The Problem:
By default, circuit breakers may catch all exceptions. However, some exceptions (like a 400 Bad Request, meaning the caller sent invalid data) should NOT count as failures for the circuit breaker -- they are caller errors, not service failures.
Consequence:
A misconfigured client that always sends bad requests trips the circuit breaker, blocking all other valid callers -- even though the service itself is perfectly healthy.
Solution:
Configure the circuit breaker to only record specific exceptions as failures (e.g., network timeouts, 5xx errors) and to ignore others (e.g., 4xx client errors, IllegalArgumentException). Resilience4j and other libraries support ignoreExceptions and recordExceptions configurations.
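The record/ignore distinction can be pictured with a small classifier. The class below is a hypothetical illustration of the semantics described above (ignored exceptions never count; otherwise only recorded ones count); it is not the Resilience4j API.

```python
from typing import Tuple, Type


class FailureClassifier:
    """Decides whether an exception should count against the breaker."""

    def __init__(
        self,
        record: Tuple[Type[BaseException], ...],
        ignore: Tuple[Type[BaseException], ...] = (),
    ):
        self.record = record
        self.ignore = ignore

    def counts_as_failure(self, exc: BaseException) -> bool:
        if isinstance(exc, self.ignore):
            return False  # caller errors never trip the breaker
        return isinstance(exc, self.record)


classifier = FailureClassifier(
    record=(ConnectionError, TimeoutError),  # service-side problems
    ignore=(ValueError,),                    # e.g. invalid input from the caller
)

timeout_counts = classifier.counts_as_failure(TimeoutError("no response"))
bad_input_counts = classifier.counts_as_failure(ValueError("amount must be positive"))
unlisted_counts = classifier.counts_as_failure(KeyError("missing field"))
```

With this split, a client that keeps sending invalid requests cannot open the breaker for everyone else.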
Pitfall 7: Not Considering Thread Safety and Distributed State
The Problem:
In a single-instance application, circuit breaker state is in-memory and straightforward. In a distributed environment (multiple instances of the same service), each instance has its own circuit breaker state.
Consequence:
Instance A might have the breaker open (it experienced failures), while Instance B still has it closed (it hasn't hit the threshold yet). Half the traffic is still hitting the failing service.
Solution:
- Accept that per-instance breakers are a valid approach in most cases -- eventually, all instances will trip.
- For stricter coordination, use a distributed circuit breaker that shares state across instances via a cache such as Redis.
- Service meshes (Istio, Linkerd): Implement circuit breaking at the network/proxy level, so the same policy is applied to every instance without code changes.
Pitfall 8: Over-relying on Circuit Breaker as a Silver Bullet
The Problem:
Teams implement circuit breakers and consider resilience "done."
Consequence:
A circuit breaker is only one tool in the resilience toolbox. Without proper timeout configuration, bulkheads, rate limiters, retries, and health checks, the system is still vulnerable.
Solution:
Adopt a layered resilience strategy:
- Timeout: Never wait forever.
- Retry: Handle transient failures.
- Bulkhead: Isolate resources.
- Circuit Breaker: Stop cascading failures.
- Rate Limiter: Prevent overloading.
- Fallback: Degrade gracefully.
- Health Checks & Monitoring: Know when things go wrong.
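As a rough illustration of layering, the sketch below composes just two of these layers, retry and fallback, as plain wrapper functions. The helper names `retrying` and `falling_back` are hypothetical, and a real retry layer would also add a backoff delay between attempts and sit alongside a timeout and a circuit breaker.

```python
from typing import Callable, Tuple, Type

calls = {"flaky": 0, "down": 0}  # counts how often each fake service was invoked


def retrying(func: Callable, attempts: int,
             retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,)) -> Callable:
    """Retry layer: re-invoke func up to `attempts` times (attempts >= 1)."""
    def wrapper():
        last_exc = None
        for _ in range(attempts):
            try:
                return func()
            except retry_on as exc:
                last_exc = exc  # transient failure, try again
        raise last_exc
    return wrapper


def falling_back(func: Callable, fallback: Callable) -> Callable:
    """Fallback layer: degrade gracefully once the inner layers give up."""
    def wrapper():
        try:
            return func()
        except Exception:
            return fallback()
    return wrapper


def flaky_service() -> str:
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise ConnectionError("transient glitch")
    return "fresh data"


def dead_service() -> str:
    calls["down"] += 1
    raise ConnectionError("still down")


recovered = retrying(flaky_service, attempts=3)()
degraded = falling_back(retrying(dead_service, attempts=3), lambda: "cached data")()
```

The retry layer absorbs the transient glitch, while the fallback layer only steps in after the retries are exhausted.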
Pitfall 9: Forgetting to Log and Alert on State Changes
The Problem:
The circuit breaker trips silently. Engineers don't know it's open until a user complains or a monitor detects elevated error rates.
Consequence:
Delayed incident response. Root cause analysis is harder without a timeline of breaker state changes.
Solution:
- Register event listeners on circuit breaker state transitions (most libraries support callbacks/events).
- Log every state change (Closed → Open, Open → Half-Open, Half-Open → Closed) with timestamps and failure details.
- Send alerts to on-call engineers when a breaker opens.
- Include breaker state in your health check endpoints (e.g., /actuator/health in Spring Boot shows circuit breaker status).
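The listener idea can be sketched independently of any library: the breaker notifies registered callbacks on every transition, and each callback decides whether to log or alert. `ObservableBreakerState`, `log_transition`, and `alert_on_open` are hypothetical names, and the `alerts` list stands in for a real paging system.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("circuit_breaker")
Listener = Callable[[str, str, str], None]  # (breaker name, old state, new state)


class ObservableBreakerState:
    """Holds the breaker state and notifies listeners on every change."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = "CLOSED"
        self._listeners: List[Listener] = []

    def on_transition(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def transition_to(self, new_state: str) -> None:
        if new_state == self.state:
            return  # not a transition, nothing to report
        old_state, self.state = self.state, new_state
        for listener in self._listeners:
            listener(self.name, old_state, new_state)


alerts: List[str] = []  # stand-in for a pager or chat notification


def log_transition(name: str, old: str, new: str) -> None:
    logger.warning("breaker %s: %s -> %s", name, old, new)


def alert_on_open(name: str, old: str, new: str) -> None:
    if new == "OPEN":
        alerts.append(f"{name} opened")


payment_state = ObservableBreakerState("payment")
payment_state.on_transition(log_transition)
payment_state.on_transition(alert_on_open)

payment_state.transition_to("OPEN")
payment_state.transition_to("HALF_OPEN")
payment_state.transition_to("CLOSED")
```

Every transition is logged, but only the move to OPEN raises an alert, which keeps on-call noise low.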
Pitfall 10: Applying Circuit Breaker to Internal, In-Process Calls
The Problem:
Some engineers apply circuit breakers to internal method calls or in-process communication within the same service.
Consequence:
Unnecessary overhead and complexity. Circuit breakers are designed for unreliable, external network calls. Internal method calls that fail should either be handled with standard error handling, or indicate a code bug that should be fixed -- not worked around with a circuit breaker.
Solution:
Reserve circuit breakers for external service calls -- HTTP APIs, database connections, message queues, external SDKs, file system operations to remote shares, etc.
Circuit Breaker in Java
Java has one of the most mature ecosystems for circuit breakers, driven largely by the microservices revolution and Netflix's early adoption.
Libraries & Frameworks
| Library | Description | Status |
|---|---|---|
| Resilience4j | Lightweight, functional, no external dependencies. Current industry standard. | ✅ Actively maintained |
| Spring Cloud Circuit Breaker | Abstraction layer; works with Resilience4j, Sentinel. Integrates with Spring ecosystem. | ✅ Actively maintained |
| Hystrix | Netflix's original library. Battle-tested at massive scale. | ⚠ Maintenance mode (no new features) |
| Sentinel | Alibaba's resilience library with circuit breaking + flow control. | ✅ Actively maintained |
| Failsafe | Lightweight, composable resilience library. | ✅ Actively maintained |
Example 1: Basic Resilience4j Circuit Breaker
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;

public class PaymentService {

    private final CircuitBreaker circuitBreaker;
    private final ExternalPaymentClient paymentClient;

    public PaymentService(ExternalPaymentClient paymentClient) {
        this.paymentClient = paymentClient;

        // Configure the circuit breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)                                        // Evaluate last 10 calls
            .failureRateThreshold(50)                                     // Open if 50%+ calls fail
            .slowCallRateThreshold(80)                                    // Open if 80%+ calls are slow
            .slowCallDurationThreshold(Duration.ofSeconds(2))             // "Slow" = > 2s
            .waitDurationInOpenState(Duration.ofSeconds(30))              // Stay open 30s
            .permittedNumberOfCallsInHalfOpenState(3)                     // 3 test calls in half-open
            .minimumNumberOfCalls(5)                                      // Need at least 5 calls to evaluate
            .recordExceptions(IOException.class, TimeoutException.class)  // These count as failures
            .ignoreExceptions(IllegalArgumentException.class)             // These do NOT count as failures
            .build();

        CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
        this.circuitBreaker = registry.circuitBreaker("paymentService");

        // Register event listeners to log state changes
        circuitBreaker.getEventPublisher()
            .onStateTransition(event ->
                System.out.println("Circuit Breaker state changed: " + event.getStateTransition()))
            .onError(event ->
                System.out.println("Circuit Breaker recorded failure: " + event.getThrowable().getMessage()))
            .onCallNotPermitted(event ->
                System.out.println("Circuit Breaker is OPEN -- call rejected immediately"));
    }

    public String processPayment(String orderId, double amount) {
        try {
            // Primary: call the external service through the circuit breaker.
            // executeCallable allows charge() to throw checked exceptions such as IOException.
            return circuitBreaker.executeCallable(() -> paymentClient.charge(orderId, amount));
        } catch (Exception e) {
            // Fallback: reached when the breaker is OPEN (CallNotPermittedException)
            // or when the call itself fails.
            return getFallbackResponse(orderId);
        }
    }

    private String getFallbackResponse(String orderId) {
        // Options: return cached data, queue for later, return friendly error
        return "Payment queued for order " + orderId + ". Will be processed shortly.";
    }
}
Example 2: Spring Boot with Resilience4j Annotations
The most common real-world Java usage -- declarative circuit breakers via annotations.
application.yml configuration:
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        sliding-window-type: COUNT_BASED
        sliding-window-size: 10
        failure-rate-threshold: 50
        slow-call-rate-threshold: 80
        slow-call-duration-threshold: 2s
        wait-duration-in-open-state: 30s
        permitted-number-of-calls-in-half-open-state: 3
        minimum-number-of-calls: 5
        register-health-indicator: true   # Exposes state via /actuator/health
      inventoryService:
        sliding-window-type: TIME_BASED
        sliding-window-size: 60           # Last 60 seconds
        failure-rate-threshold: 40
        wait-duration-in-open-state: 60s
Service class:
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final PaymentClient paymentClient;
    private final InventoryClient inventoryClient;

    public OrderService(PaymentClient paymentClient, InventoryClient inventoryClient) {
        this.paymentClient = paymentClient;
        this.inventoryClient = inventoryClient;
    }

    // @CircuitBreaker annotation wraps this method
    // "fallbackMethod" is called when the circuit is OPEN or the call fails
    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    public PaymentResult processPayment(Order order) {
        return paymentClient.charge(order.getId(), order.getAmount());
    }

    // Fallback method -- MUST have the same return type and same parameters + a Throwable
    public PaymentResult paymentFallback(Order order, Throwable throwable) {
        if (throwable instanceof CallNotPermittedException) {
            // Circuit is OPEN -- fail fast response
            return PaymentResult.queued("Payment service temporarily unavailable. Order saved.");
        }
        // The call was attempted (circuit CLOSED or HALF-OPEN) but failed
        return PaymentResult.failed("Payment failed. Please retry.");
    }

    @CircuitBreaker(name = "inventoryService", fallbackMethod = "inventoryFallback")
    public InventoryStatus checkInventory(String productId) {
        return inventoryClient.getStock(productId);
    }

    public InventoryStatus inventoryFallback(String productId, Throwable throwable) {
        // Return optimistic cached response -- assume in stock for better UX
        return InventoryStatus.assumeAvailable(productId);
    }
}
Example 3: Monitoring Circuit Breaker State in Spring Boot
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.HashMap;
import java.util.Map;

@RestController
public class HealthController {

    private final CircuitBreakerRegistry circuitBreakerRegistry;

    public HealthController(CircuitBreakerRegistry circuitBreakerRegistry) {
        this.circuitBreakerRegistry = circuitBreakerRegistry;
    }

    @GetMapping("/circuit-breaker-status")
    public Map<String, Object> getCircuitBreakerStatus() {
        Map<String, Object> status = new HashMap<>();
        circuitBreakerRegistry.getAllCircuitBreakers().forEach(cb -> {
            Map<String, Object> cbInfo = new HashMap<>();
            cbInfo.put("state", cb.getState());
            cbInfo.put("failureRate", cb.getMetrics().getFailureRate() + "%");
            cbInfo.put("slowCallRate", cb.getMetrics().getSlowCallRate() + "%");
            cbInfo.put("bufferedCalls", cb.getMetrics().getNumberOfBufferedCalls());
            cbInfo.put("failedCalls", cb.getMetrics().getNumberOfFailedCalls());
            cbInfo.put("successfulCalls", cb.getMetrics().getNumberOfSuccessfulCalls());
            status.put(cb.getName(), cbInfo);
        });
        return status;
    }
}
Example 4: Combining Circuit Breaker + Retry + Timeout (Full Resilience Stack)
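import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import org.springframework.stereotype.Service;

import java.util.concurrent.CompletableFuture;

@Service
public class ResilientExternalService {

    private final ExternalClient externalClient; // hypothetical client for the remote service

    public ResilientExternalService(ExternalClient externalClient) {
        this.externalClient = externalClient;
    }

    // Resilience4j's default aspect order (outermost to innermost) is
    // Retry → CircuitBreaker → TimeLimiter, regardless of the annotation order in source:
    // 1. Retry: re-invokes the call, up to 3 attempts, for transient failures
    // 2. CircuitBreaker: rejects calls while open and records each attempt's outcome
    // 3. TimeLimiter: fails the returned future if it takes > 2 seconds
    // The fallback sits on the outermost aspect (Retry) so it runs only after retries are exhausted;
    // a fallback on @CircuitBreaker would hide failures from the retry layer.
    @Retry(name = "externalService", fallbackMethod = "fallback")
    @CircuitBreaker(name = "externalService")
    @TimeLimiter(name = "externalService")
    public CompletableFuture<String> callExternalService(String request) {
        return CompletableFuture.supplyAsync(() -> externalClient.call(request));
    }

    public CompletableFuture<String> fallback(String request, Throwable t) {
        return CompletableFuture.completedFuture("Fallback: Service temporarily unavailable.");
    }
}
application.yml for the full stack: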
resilience4j:
  timelimiter:
    instances:
      externalService:
        timeout-duration: 2s
  retry:
    instances:
      externalService:
        max-attempts: 3
        wait-duration: 500ms
        retry-exceptions:
          - java.net.ConnectException
          - java.net.SocketTimeoutException
  circuitbreaker:
    instances:
      externalService:
        failure-rate-threshold: 50
        wait-duration-in-open-state: 30s
        sliding-window-size: 10
Maven Dependency (Resilience4j with Spring Boot)
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId> <!-- Required for annotations -->
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId> <!-- For health endpoints -->
</dependency>
Circuit Breaker in Python
Python's ecosystem offers both synchronous and asynchronous circuit breaker solutions, important for modern Python services (FastAPI, Django, async pipelines).
Libraries
| Library | Best For | Async Support |
|---|---|---|
| pybreaker | Synchronous services, simple use cases | ❌ |
| aiobreaker | Async Python (FastAPI, aiohttp, asyncio) | ✅ |
| tenacity | Retry library (no breaker states); usually paired with a circuit breaker | ✅ |
| circuitbreaker | Decorator-based, simple API | ❌ |
| Custom implementation | Full control, specific requirements | ✅ |
Example 1: pybreaker (Synchronous)
import logging
from datetime import datetime

import pybreaker
import requests

logging.basicConfig(level=logging.INFO)


# ─── Custom Listener for logging state transitions ───────────────────────────
class CircuitBreakerLogger(pybreaker.CircuitBreakerListener):
    def state_change(self, cb, old_state, new_state):
        logging.warning(
            f"[{datetime.now()}] Circuit Breaker '{cb.name}' changed: "
            f"{old_state.name} → {new_state.name}"
        )

    def failure(self, cb, exc):
        logging.error(f"Circuit Breaker '{cb.name}' recorded failure: {exc}")

    def success(self, cb):
        logging.info(f"Circuit Breaker '{cb.name}' recorded success.")


# ─── Configure Circuit Breaker ────────────────────────────────────────────────
payment_breaker = pybreaker.CircuitBreaker(
    fail_max=5,          # Open after 5 consecutive failures
    reset_timeout=30,    # Stay open for 30 seconds before trying again
    name="payment_service",
    listeners=[CircuitBreakerLogger()]
)


# ─── Protected function ───────────────────────────────────────────────────────
@payment_breaker
def call_payment_api(order_id: str, amount: float) -> dict:
    response = requests.post(
        "https://payment-gateway.example.com/charge",
        json={"order_id": order_id, "amount": amount},
        timeout=3  # Always set a timeout!
    )
    response.raise_for_status()  # Raises HTTPError for 4xx/5xx → counts as failure
    return response.json()


# ─── Service with fallback ────────────────────────────────────────────────────
def process_payment(order_id: str, amount: float) -> dict:
    try:
        return call_payment_api(order_id, amount)
    except pybreaker.CircuitBreakerError:
        # Circuit is OPEN -- fail fast
        logging.warning(f"Payment service circuit OPEN. Queuing order {order_id}.")
        return {
            "status": "queued",
            "message": "Payment service temporarily unavailable. Your order is saved.",
            "order_id": order_id
        }
    except requests.exceptions.Timeout:
        # Timeout happened (and counted as a failure by pybreaker)
        return {"status": "error", "message": "Payment request timed out."}
    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 400:
            # 400 Bad Request = client error, NOT a service failure
            # Note: pybreaker still counts this because raise_for_status() raises for all 4xx/5xx
            # In production, you'd filter 4xx exceptions so they do not count as failures
            # (for example via pybreaker's `exclude` option)
            return {"status": "error", "message": "Invalid payment request."}
        raise
Example 2: aiobreaker (Asynchronous -- FastAPI)
Modern Python services use async/await. aiobreaker is the async equivalent of pybreaker.
import logging
from datetime import timedelta

import aiobreaker
import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()

# ─── Configure Async Circuit Breaker ─────────────────────────────────────────
inventory_breaker = aiobreaker.CircuitBreaker(
    fail_max=5,
    timeout_duration=timedelta(seconds=30),  # aiobreaker takes a timedelta here, not reset_timeout
    name="inventory_service"
)


# ─── Async protected function ─────────────────────────────────────────────────
@inventory_breaker
async def fetch_inventory(product_id: str) -> dict:
    async with httpx.AsyncClient(timeout=2.0) as client:
        response = await client.get(
            f"http://inventory-service/api/products/{product_id}/stock"
        )
        response.raise_for_status()
        return response.json()


# ─── FastAPI endpoint with fallback ──────────────────────────────────────────
@app.get("/products/{product_id}/availability")
async def check_availability(product_id: str):
    try:
        inventory = await fetch_inventory(product_id)
        return {"product_id": product_id, "in_stock": inventory["quantity"] > 0}
    except aiobreaker.CircuitBreakerError:
        # Circuit is OPEN -- return optimistic fallback
        logging.warning(f"Inventory circuit OPEN for product {product_id}")
        return {
            "product_id": product_id,
            "in_stock": True,  # Optimistic assumption
            "note": "Availability estimate -- real-time check temporarily unavailable"
        }
    except Exception:
        raise HTTPException(status_code=503, detail="Inventory service unavailable")
Example 3: Building a Custom Circuit Breaker from Scratch (Python)
Understanding the internals by building one yourself -- great for interviews and learning:
import time
import threading
from enum import Enum
from functools import wraps
from typing import Callable, Optional, Type, Tuple


class CircuitState(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitBreakerOpenError(Exception):
    """Raised when the circuit breaker is open and call is not permitted."""
    pass


class CircuitBreaker:
    """
    A thread-safe implementation of the Circuit Breaker pattern.
    """

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 3,
        expected_exception: Tuple[Type[Exception], ...] = (Exception,)
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.expected_exception = expected_exception
        # State
        self._state = CircuitState.CLOSED
        self._failure_count = 0
        self._last_failure_time: Optional[float] = None
        self._half_open_calls = 0
        # Thread safety
        self._lock = threading.Lock()

    @property
    def state(self) -> CircuitState:
        with self._lock:
            # Check if we should transition from OPEN → HALF_OPEN
            if (self._state == CircuitState.OPEN and
                    self._last_failure_time is not None and
                    time.time() - self._last_failure_time >= self.recovery_timeout):
                self._state = CircuitState.HALF_OPEN
                self._half_open_calls = 0
                print(f"[CircuitBreaker] State → HALF_OPEN (testing recovery)")
            return self._state

    def _record_success(self):
        with self._lock:
            if self._state == CircuitState.HALF_OPEN:
                self._half_open_calls += 1
                if self._half_open_calls >= self.half_open_max_calls:
                    # Enough test calls succeeded -- fully recover
                    self._state = CircuitState.CLOSED
                    self._failure_count = 0
                    print(f"[CircuitBreaker] State → CLOSED (service recovered ✅)")
            elif self._state == CircuitState.CLOSED:
                self._failure_count = 0  # Reset on success

    def _record_failure(self):
        with self._lock:
            self._failure_count += 1
            self._last_failure_time = time.time()
            if self._state == CircuitState.HALF_OPEN:
                # Test call failed -- go back to OPEN
                self._state = CircuitState.OPEN
                print(f"[CircuitBreaker] State → OPEN (test call failed, service still down ❌)")
            elif (self._state == CircuitState.CLOSED and
                    self._failure_count >= self.failure_threshold):
                # Threshold exceeded -- trip the breaker
                self._state = CircuitState.OPEN
                print(f"[CircuitBreaker] State → OPEN "
                      f"(failure threshold {self.failure_threshold} exceeded ❌)")

    def call(self, func: Callable, *args, fallback: Callable = None, **kwargs):
        """
        Execute func with circuit breaker protection.
        If the circuit is open, call fallback (if provided) or raise CircuitBreakerOpenError.
        """
        current_state = self.state  # Property handles OPEN → HALF_OPEN transition
        if current_state == CircuitState.OPEN:
            print(f"[CircuitBreaker] Call REJECTED -- circuit is OPEN")
            if fallback:
                return fallback(*args, **kwargs)
            raise CircuitBreakerOpenError("Circuit breaker is OPEN. Service unavailable.")
        try:
            result = func(*args, **kwargs)
            self._record_success()
            return result
        except self.expected_exception as e:
            self._record_failure()
            if fallback and self.state == CircuitState.OPEN:
                return fallback(*args, **kwargs)
            raise

    def __call__(self, func: Callable) -> Callable:
        """Allow using as a decorator: @circuit_breaker"""
        @wraps(func)
        def wrapper(*args, **kwargs):
            return self.call(func, *args, **kwargs)
        return wrapper


# ─── Usage Example ────────────────────────────────────────────────────────────
import random

breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=10)


def flaky_service(request_id: int) -> str:
    """Simulates an unreliable external service."""
    if random.random() < 0.7:  # 70% failure rate
        raise ConnectionError(f"Connection refused for request {request_id}")
    return f"Success for request {request_id}"


def fallback_response(request_id: int) -> str:
    return f"Fallback: Using cached data for request {request_id}"


# Simulate requests
for i in range(15):
    try:
        result = breaker.call(flaky_service, i, fallback=fallback_response)
        print(f"Request {i}: {result} | State: {breaker.state.value}")
    except CircuitBreakerOpenError as e:
        print(f"Request {i}: BLOCKED -- {e}")
    except ConnectionError as e:
        print(f"Request {i}: FAILED -- {e} | State: {breaker.state.value}")
    time.sleep(0.5)
Example 4: Circuit Breaker for AI/LLM API Calls (Python + OpenAI)
A practical example using circuit breakers with AI APIs -- highly relevant in 2024+:
import logging
from functools import lru_cache

import httpx
import openai
import pybreaker

logging.basicConfig(level=logging.INFO)

# Circuit breaker for OpenAI API
openai_breaker = pybreaker.CircuitBreaker(
    fail_max=3,        # Open after 3 failures (API is expensive -- be conservative)
    reset_timeout=60,  # Wait 60 seconds before retrying
    name="openai_api"
)

# Circuit breaker for fallback model (e.g., local Ollama)
local_model_breaker = pybreaker.CircuitBreaker(
    fail_max=5,
    reset_timeout=30,
    name="local_model"
)


@openai_breaker
def call_openai(prompt: str) -> str:
    """Primary: Call OpenAI GPT-4."""
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        timeout=10
    )
    return response.choices[0].message.content


@local_model_breaker
def call_local_model(prompt: str) -> str:
    """Secondary fallback: Call local Ollama model."""
    response = httpx.post(
        "http://localhost:11434/api/generate",
        # stream=False asks Ollama for one JSON object instead of a streamed response
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=15
    )
    response.raise_for_status()
    return response.json()["response"]


@lru_cache(maxsize=100)
def get_cached_response(prompt_key: str) -> str:
    """Last resort: Return a cached generic response."""
    return "I'm temporarily unable to process your request. Please try again shortly."


def resilient_llm_call(prompt: str) -> str:
    """
    Multi-level fallback LLM call:
    1. Try OpenAI GPT-4
    2. If circuit open or failed → try local Ollama
    3. If local also down → return cached/static response
    """
    # Try primary (OpenAI)
    try:
        return call_openai(prompt)
    except pybreaker.CircuitBreakerError:
        logging.warning("OpenAI circuit OPEN. Falling back to local model.")
    except Exception as e:
        logging.error(f"OpenAI call failed: {e}")

    # Try secondary (Local model)
    try:
        return call_local_model(prompt)
    except pybreaker.CircuitBreakerError:
        logging.warning("Local model circuit OPEN. Using cached response.")
    except Exception as e:
        logging.error(f"Local model call failed: {e}")

    # Last resort: cached/static response
    return get_cached_response(prompt[:50])  # Use first 50 chars as cache key


# ─── Test it ──────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    prompts = [
        "Explain circuit breakers in simple terms.",
        "What is the difference between retry and circuit breaker?",
        "Give me an example of graceful degradation.",
    ]
    for prompt in prompts:
        response = resilient_llm_call(prompt)
        print(f"Prompt: {prompt[:50]}...")
        print(f"Response: {response[:100]}...\n")
Circuit Breaker in Cloud-Native & Service Meshes
As systems move to Kubernetes and cloud-native architectures, circuit breaking can be implemented at the infrastructure level, not just the application level.
Application-Level vs Infrastructure-Level Circuit Breakers
| Level | Where It Lives | Tools | Scope |
|---|---|---|---|
| Application-Level | Inside your code | Resilience4j, pybreaker | Per-service, per-operation |
| Infrastructure-Level | Network proxy / sidecar | Istio, Envoy, Linkerd | Cross-service, language-agnostic |
Istio (Service Mesh) Circuit Breaker
Istio is one of the most widely adopted service meshes. It uses Envoy proxy sidecars injected into every pod. Circuit breaking is configured with YAML - no code changes needed.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service-circuit-breaker
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
Advantages of Infrastructure-Level Circuit Breaking:
- Language-agnostic: Works for Java, Python, Go, Node.js equally.
- Zero code changes: Pure configuration.
- Consistent across all services automatically.
- Centralized management.
Best Practice: Use both levels - Istio for infrastructure protection, Resilience4j/pybreaker for application-level business logic and custom fallbacks.
Quick Reference Cheat Sheet
The 3 States
| State | One Sentence |
|---|---|
| CLOSED | Normal operation - all calls pass through and failures are counted. |
| OPEN | Fault detected - all calls are immediately rejected. |
| HALF-OPEN | Recovery testing - a few test calls allowed; success closes, failure reopens. |
Key Configuration Parameters
| Parameter | What It Controls | Typical Value |
|---|---|---|
| failureRateThreshold | % failures to open the circuit | 50% |
| slowCallRateThreshold | % slow calls to open the circuit | 80% |
| slowCallDurationThreshold | What counts as "slow" | 2 seconds |
| minimumNumberOfCalls | Minimum calls to evaluate | 10 |
| waitDurationInOpenState | How long stays open | 30 seconds |
| permittedNumberOfCallsInHalfOpenState | Test calls in half-open | 3 |
Common Pitfall Quick Reference
| Pitfall | Fix |
|---|---|
| Arbitrary thresholds | Use % + minimum call volume |
| No fallback | Define fallback as first-class requirement |
| One breaker for everything | Per-operation granularity |
| Never tested | Chaos engineering + integration tests |
| Only counting errors | Configure slow call thresholds too |
| Wrong exceptions counted | Use recordExceptions / ignoreExceptions |
| Silent failures | Log + alert on every state transition |
| Applied to internal calls | Only for external/network calls |
Pattern Comparison
| Pattern | Problem It Solves | Solution |
|---|---|---|
| Timeout | Waiting forever | Set a max wait time |
| Retry | Short random hiccup | Try again automatically |
| Circuit Breaker | Sustained repeated failure | Stop trying; wait and recover |
| Bulkhead | One service drains all resources | Isolate thread/connection pools |
| Rate Limiter | Too much outbound traffic | Limit requests per second |
| Fallback | Primary unavailable | Use secondary/cached/default |
Recommended Resources
- Book: "Release It!" by Michael T. Nygard
- Blog: Martin Fowler's Circuit Breaker article (martinfowler.com)
- Java: Resilience4j Documentation (resilience4j.readme.io)
- Python: pybreaker (github.com/danielfm/pybreaker)
- Cloud: Azure Architecture Center - Circuit Breaker Pattern
Conclusion
The Circuit Breaker Pattern is not just a coding technique - it is a philosophy of building resilient systems that accept failure as inevitable and prepare for it systematically.
Core Takeaways:
- Failure is normal. In distributed systems, services WILL fail. Design for it.
- Fail fast. A quick, informative failure is always better than a slow hang.
- Protect the whole. One failing service should never crash the entire system.
- Degrade gracefully. Users should always get something - even if it is a friendly degraded experience.
- Layer your defenses. Circuit breaker + timeout + retry + bulkhead + monitoring = truly resilient system.
- Observe and alert. A circuit breaker that opens silently is only half-useful.
- Test your failures. Practice chaos engineering before a real outage hits.
Whether you are building microservices in Java, AI pipelines in Python, cloud-native apps on Kubernetes, or integrating third-party APIs - the Circuit Breaker Pattern is one of the most valuable tools in your resilience engineering toolkit.
Remember the electrical analogy: You do not know how important your home circuit breaker is - until the day it saves your house from burning down.