One Slow Service Took Down Everything? Downstream Fault Isolation and Avalanche Prevention


Last year, during a flash sale, a client’s core transaction system crashed. The traffic wasn't the problem. The code wasn't the problem. The culprit was a non‑critical service – a user‑profile service – that had become slow.

Normally it responded in 20ms. During the sale, response time climbed to 800ms. The order service called it to get the user’s membership tier. The order service waited 800ms per call. Its thread pool filled up quickly. New requests queued. The services upstream of orders also began timing out. The whole checkout chain collapsed.

One non‑critical service, a few hundred milliseconds slower, and the entire site went down.

This is the under‑appreciated truth of microservice architectures: your system doesn’t always crash because of your own code. Often it crashes because of a downstream dependency you didn’t protect against.

Today, let’s talk about why a slow downstream can kill your service, and how to build defences that keep you standing when a dependency falters.

01 Slow Is More Dangerous Than Dead

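Many people assume that if a downstream service fails, your call will just error out. That holds when the service is dead: the call fails fast and the thread moves on. A slow downstream is far more dangerous, because every call keeps waiting.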

Slowness is the silent killer. Here’s how it works:

Threads get stuck. The downstream call now takes 800ms instead of 20ms. Each thread that makes that call is held for much longer than usual. Threads aren't released quickly.
The thread pool saturates. New requests arrive, but there are no free threads. They queue or are rejected. Your own service’s response time begins to increase.
The failure propagates. Your service becomes slow, which affects the services that call you. The problem climbs up the call chain to the very entry point.

That client’s order service had a thread pool limit of 200. At the normal downstream latency of 20ms, that was sufficient. When latency jumped to 800ms, each call held its thread 40 times longer, so the same request rate needed roughly 40 times as many threads. The pool saturated, and new orders couldn’t be processed.
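The arithmetic behind that saturation follows Little's law (threads in use = request rate × time each request holds a thread). The sketch below is only an illustration: the pool size and latencies come from the incident, while the request rate of 5,000 per second is a made-up figure, and the model assumes each request holds exactly one thread for the duration of the profile call.

```python
def threads_needed(requests_per_second: float, latency_ms: float) -> float:
    """Little's law: average threads in use = arrival rate x time held."""
    return requests_per_second * latency_ms / 1000


def max_throughput(pool_size: int, latency_ms: float) -> float:
    """Highest sustainable request rate when each request holds one thread."""
    return pool_size * 1000 / latency_ms


POOL_SIZE = 200        # thread pool limit from the incident
ASSUMED_RATE = 5_000   # requests per second; hypothetical, not from the incident

if __name__ == "__main__":
    for latency_ms in (20, 800):
        busy = threads_needed(ASSUMED_RATE, latency_ms)
        ceiling = max_throughput(POOL_SIZE, latency_ms)
        print(f"{latency_ms}ms: {busy:.0f} threads busy, "
              f"pool ceiling {ceiling:.0f} requests/s")
```

Under these assumptions the pool can sustain 10,000 requests per second at 20ms but only 250 at 800ms, and the assumed load would need 4,000 threads instead of 100.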

02 Timeout: Your First Line of Defence

Many developers set timeouts arbitrarily – 3 seconds, 5 seconds, 10 seconds. “A larger number can’t hurt.”

But a timeout that’s too long keeps threads waiting for too long. A timeout that’s too short kills healthy requests.

How to set it:

Start from your downstream’s normal response time, specifically its P99, and set the timeout to 2‑3× that value.

Normal P99 = 50ms → Timeout = 100‑150ms
Normal P99 = 200ms → Timeout = 400‑600ms
Normal P99 = 1 second → Timeout = 2‑3 seconds

That client’s profile service had a normal P99 of 30ms. Their timeout was 3 seconds. When the service slowed to 800ms, calls stayed within the timeout and threads continued to wait. Changing the timeout to 100ms made slow calls fail fast, so each thread was released after at most 100ms instead of 800ms.
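The 2‑3× rule can be derived from measured latencies rather than guessed. The following is a minimal sketch, assuming you can export a list of recent response times in milliseconds; the function names and the nearest-rank percentile method are choices made for this example, not part of any particular monitoring tool.

```python
def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of the observed latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # ceil(0.99 * n) in integer arithmetic, as a 1-based rank
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def timeout_range_ms(latencies_ms: list[float],
                     low_factor: float = 2.0,
                     high_factor: float = 3.0) -> tuple[float, float]:
    """Suggested timeout window: 2-3x the normal P99."""
    base = p99(latencies_ms)
    return base * low_factor, base * high_factor


if __name__ == "__main__":
    # Hypothetical sample: mostly 20ms, a 30ms tail, and one 900ms outlier.
    samples = [20.0] * 98 + [30.0, 900.0]
    print(p99(samples), timeout_range_ms(samples))
```

Measure the P99 during normal operation. Samples taken while the downstream is already degraded would inflate the timeout and defeat its purpose.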

03 Retry: A Double‑Edged Sword

A failed downstream call sometimes succeeds on a second attempt. But misused, retries amplify failures.

Retry storm: A downstream service slows down. Callers start timing out. Each timed‑out caller retries. The downstream receives double the traffic. It slows further. More timeouts, more retries. Vicious cycle.

Retry design rules:

Limit retry count: At most 1‑2 retries. Never infinite.
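Exponential backoff: Double the wait before each attempt, for example 100ms before the first retry and 200ms before the second.
Retry only on certain errors: Network timeouts and 5xx statuses. Do not retry on 4xx statuses; they usually mean the request itself is invalid (a client or business error), so repeating it won’t help.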
Pair with circuit breaker: After N consecutive failures, stop retrying and open the circuit.

That client’s service previously retried 3 times with no backoff – immediate retry. When the downstream slowed, the retry storm made it even slower. They changed to a single retry, exponential backoff, and a circuit breaker that opened after 5 consecutive failures.
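The retry rules above can be sketched in a few lines. This is an illustration under assumptions, not that client's code: `RetriableError` and `NonRetriableError` are hypothetical exception types standing in for "timeout or 5xx" and "4xx", and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time


class RetriableError(Exception):
    """Hypothetical marker for network timeouts and 5xx responses."""


class NonRetriableError(Exception):
    """Hypothetical marker for 4xx responses; never retried."""


def call_with_retry(operation, max_retries=1, base_delay_s=0.1, sleep=time.sleep):
    """Run operation(), retrying retriable failures with exponential backoff."""
    attempt = 0
    while True:
        try:
            return operation()
        except RetriableError:
            if attempt >= max_retries:
                raise  # retry budget exhausted: surface the failure
            # Wait 100ms, then 200ms, then 400ms, ... with the default base.
            sleep(base_delay_s * (2 ** attempt))
            attempt += 1
        # NonRetriableError (and anything else) propagates immediately.
```

In practice the operation passed in would itself be guarded by a circuit breaker, so that retries stop entirely once the dependency is known to be unhealthy.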

04 Circuit Breaker: Cut the Line Automatically

A circuit breaker protects your system from being dragged down by a failing dependency.

Three states:

Closed: Calls go through. Failure counts are tracked.
Open: Failure rate exceeds a threshold. Calls fail immediately without invoking the dependency. A fallback is used instead.
Half‑open: After a sleep window, a small number of test calls are allowed. If they succeed, the breaker closes. If they fail, it reopens and waits for another sleep window.

That client implemented a circuit breaker with a 50% failure threshold and a 10‑second statistical window. When the profile service slowed and calls began timing out, the failure rate quickly exceeded 50% and the breaker opened. The order service stopped calling the slow dependency and used a fallback instead. Threads were no longer held by slow calls, and the thread pool recovered.
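The three states can be expressed as a small state machine. The sketch below is deliberately simplified and makes several assumptions that are not from the incident: it is single-threaded (no locking), it treats any exception as a failure, it allows a single probe call in the half-open state, and the `min_calls` and `sleep_window_s` values are illustrative. Production systems would normally use an established library instead.

```python
import time
from collections import deque


class CircuitBreaker:
    """Minimal closed/open/half-open breaker over a rolling time window."""

    def __init__(self, failure_threshold=0.5, window_s=10.0,
                 min_calls=10, sleep_window_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.min_calls = min_calls            # assumption: avoid tripping on tiny samples
        self.sleep_window_s = sleep_window_s  # assumption: how long to stay open
        self.state = "closed"
        self._clock = clock
        self._outcomes = deque()              # (timestamp, succeeded) pairs
        self._opened_at = 0.0

    def call(self, operation, fallback):
        now = self._clock()
        if self.state == "open":
            if now - self._opened_at < self.sleep_window_s:
                return fallback()             # fail fast, dependency not invoked
            self.state = "half_open"          # sleep window over: allow a probe
        try:
            result = operation()
        except Exception:
            self._record_failure(now)
            return fallback()
        self._record_success(now)
        return result

    def _record_success(self, now):
        if self.state == "half_open":
            self.state = "closed"             # probe succeeded: resume normal calls
            self._outcomes.clear()
        self._outcomes.append((now, True))

    def _record_failure(self, now):
        if self.state == "half_open":
            self._trip(now)                   # probe failed: reopen
            return
        self._outcomes.append((now, False))
        # Drop outcomes that have fallen out of the statistical window.
        while self._outcomes and now - self._outcomes[0][0] > self.window_s:
            self._outcomes.popleft()
        total = len(self._outcomes)
        failures = sum(1 for _, ok in self._outcomes if not ok)
        if total >= self.min_calls and failures / total > self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self._opened_at = now
        self._outcomes.clear()
```

A caller would wrap the dependency call, for example `breaker.call(fetch_profile, lambda: "regular")`, where `fetch_profile` is a hypothetical client function that raises on timeout.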

05 Bulkhead: Stop One Failure from Spreading

Ships have bulkheads. If one compartment floods, the rest of the ship stays afloat. The same principle applies to software.

Thread pool isolation: Use separate thread pools for different downstream dependencies. Profile service calls use pool A. Order core calls use pool B. If profile service becomes slow, only pool A is affected. The core order path continues unaffected.

Semaphore isolation: Use a counter to limit concurrent calls – lightweight, no separate thread pool.

That client separated their thread pools: core dependencies (orders, inventory) into one pool, non‑core (profile, recommendations) into another. When the profile service slowed, only the non‑core pool was impacted. The order core path stayed fast.
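As an illustration of the lighter of the two approaches, here is a minimal semaphore bulkhead. It is a sketch under assumptions: the class and exception names are invented for this example, the limits of 20 and 100 are arbitrary, and a full bulkhead is rejected immediately rather than queued. The thread pool variant would instead give each dependency its own executor.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency already has its maximum concurrent calls."""


class Bulkhead:
    """Caps concurrent calls to one dependency using a counting semaphore."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Non-blocking acquire: reject at once instead of tying up the caller.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return operation()
        finally:
            self._slots.release()


# One bulkhead per dependency, so a slow profile service can occupy
# at most its own slots and never those reserved for inventory.
profile_bulkhead = Bulkhead("profile", max_concurrent=20)       # hypothetical limit
inventory_bulkhead = Bulkhead("inventory", max_concurrent=100)  # hypothetical limit
```

Rejected calls are a natural place to return a fallback value, since the caller learns immediately that the dependency is saturated.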

06 Fallback: Degraded Service Beats No Service

When a downstream dependency fails or slows, returning a degraded response is better than waiting for a timeout and returning an error.

Fallback strategies:

Cached data: Profile service slow? Return the last cached membership tier.
Default value: Profile unavailable? Assume the user is a regular member.
Disable non‑core features: During peak load, turn off “you might also like” and keep only checkout.

That client implemented a fallback: when the profile service was unavailable, the membership tier defaulted to “regular.” Some users who deserved a higher discount didn’t get it, but they could still complete their orders – far better than the site being down.
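The cached-data and default-value strategies combine naturally: try the live call, fall back to the last known value, and only then to the default. A minimal sketch follows, assuming a hypothetical `fetch_tier` function that returns a tier string or raises when the profile service is slow or unavailable. The in-memory dictionary stands in for whatever cache a real system would use.

```python
DEFAULT_TIER = "regular"


class TierLookup:
    """Membership tier lookup that degrades instead of failing."""

    def __init__(self, fetch_tier):
        self._fetch_tier = fetch_tier   # hypothetical client call: user_id -> tier
        self._last_known = {}           # user_id -> last tier fetched successfully

    def get_tier(self, user_id: str) -> str:
        try:
            tier = self._fetch_tier(user_id)
        except Exception:
            # Degraded path: last cached tier if we have one, else the default.
            return self._last_known.get(user_id, DEFAULT_TIER)
        self._last_known[user_id] = tier
        return tier
```

With this shape, a user whose tier was fetched earlier keeps their discount during an outage, and only users never seen before are treated as regular members.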

The Bottom Line

Your system doesn’t always crash because your code is bad. Often it crashes because you didn’t build defences against the services you depend on.

That client’s ops lead later coined a short mantra: “Keep timeouts short to avoid queuing. Retry with backoff, not floods. Circuit breakers stop the bleeding. Bulkheads keep failures contained. Fallbacks give you something instead of nothing.”

Look at your downstream calls today. Do you have timeouts set? Are retries safe? Do you have circuit breakers? Are dependencies isolated? Do you have fallbacks?

If not, fix them. Next time a downstream service slows, your system will stay on its feet.