Cloud Common Failure Patterns and Rapid Recovery Playbook: Follow the Steps, Don’t Panic
Create Time:2026-06-02 15:44:38


It’s 3 AM. Your phone buzzes. “Order service is down.” You stumble to your laptop, open the dashboard, and stare at a wall of red graphs. Your mind goes blank. Where do you start? Logs? CPU? Should you restart?

Every operations engineer has been there. The failure itself isn’t the worst part. Panic is.

What if you had a playbook? Open the right page, follow the steps, recover. No guessing. No panic.

This is a collection of common cloud failure patterns and their rapid recovery procedures. Not theory. Actionable steps. When a failure hits, open the relevant section and follow the list.

01 Failure: Application Hangs (Process Alive, No Response)

Symptoms: Health checks fail or time out. Monitoring shows no response, but the process is still running.

Quick diagnosis:

  1. Try to access the service (curl / browser) – no response or timeout.

  2. ps aux | grep java – process exists.

  3. curl localhost:port/health – no response.
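The curl check above can be scripted so that a hang (the process accepts the connection but never answers) is told apart from a refused connection. This is a minimal sketch, not a definitive probe: the function name `probe`, the status labels, and the 3-second default timeout are all assumptions made for illustration.

```python
import socket
import urllib.error
import urllib.request

# Ignore any proxy set in the environment; a local health check should not go through one.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout_s: float = 3.0) -> str:
    """Classify one health-check attempt: 'ok', 'unhealthy', 'timeout' or 'unreachable'."""
    try:
        with _OPENER.open(url, timeout=timeout_s) as resp:
            return "ok" if 200 <= resp.status < 300 else "unhealthy"
    except urllib.error.HTTPError:
        # The service answered, but with a 4xx/5xx status.
        return "unhealthy"
    except (TimeoutError, socket.timeout):
        # Connected, sent the request, got nothing back in time: the hang pattern.
        return "timeout"
    except urllib.error.URLError as exc:
        # urllib wraps connection-phase errors (including connect timeouts) in URLError.
        if isinstance(exc.reason, (TimeoutError, socket.timeout)):
            return "timeout"
        return "unreachable"
    except OSError:
        # Anything else at the socket level, e.g. a connection reset.
        return "unreachable"
```

Calling `probe("http://localhost:<port>/health")` with the service's real port returns "timeout" in the situation this section describes, and "unreachable" when the process is actually gone.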

Recovery steps:

  1. Restart the service immediately. In Kubernetes, the kubelet restarts the container automatically once a configured livenessProbe has failed enough times in a row.

  2. If restart doesn’t help, check logs for deadlocks or thread pool exhaustion.

  3. Remove the node from the load balancer so other nodes take traffic.

Verification: Health checks pass. Requests succeed.

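Post‑recovery: Analyse a thread dump (jstack) to find the deadlock or stuck code. The dump has to be captured while the process is still hung, because a restart discards that state; if there was no time during the incident, capture one at the next occurrence before restarting.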

02 Failure: Database Connection Pool Exhaustion

Symptoms: Application logs show “Timeout waiting for connection.” Response times increase or time out.

Quick diagnosis:

  1. Check connection pool metrics – active connections near the maximum.

  2. Check database connections – are they close to the limit?

Recovery steps:

  1. Temporarily increase the connection pool limit (e.g., 20 → 50). This is a stop‑gap.

  2. Restart the application to release stuck connections.

  3. If the database limit is hit, temporarily increase max_connections on the database.

Verification: Application recovers. Active connections drop.

Post‑recovery: Look for connection leaks (unclosed connections). Check for slow queries holding connections too long.
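A connection leak usually means some code path, often an error path, never returns its connection. The sketch below illustrates the pattern that prevents it, using a deliberately tiny pool. `TinyPool` and its methods are hypothetical names for this example, not the API of any real pool library; a real pool offers an equivalent scoped-acquire mechanism.

```python
import queue
from contextlib import contextmanager


class TinyPool:
    """A deliberately minimal pool: a queue of reusable connection objects."""

    def __init__(self, connections, acquire_timeout_s=1.0):
        self._idle = queue.Queue()
        for conn in connections:
            self._idle.put(conn)
        self._timeout = acquire_timeout_s

    @contextmanager
    def connection(self):
        """Borrow a connection and always give it back, even if the caller raises."""
        try:
            conn = self._idle.get(timeout=self._timeout)
        except queue.Empty:
            # Every connection is checked out: the symptom described above.
            raise TimeoutError("Timeout waiting for connection") from None
        try:
            yield conn
        finally:
            # Runs on success and on exceptions alike, so nothing leaks.
            self._idle.put(conn)

    def idle_count(self):
        return self._idle.qsize()
```

Code that takes a connection without a scoped block like `with pool.connection() as conn:` and skips the return on an exception is the first thing to look for in a leak hunt.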

03 Failure: Sustained High CPU

Symptoms: CPU >80% sustained. Response times increase.

Quick diagnosis:

  1. top – which process is consuming CPU?

  2. For Java, top -H -p <pid> to see which thread.
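To go from the hot thread in `top -H` to its stack, match the thread ID against the `nid=` field in a jstack dump. The sketch below assumes the dump's header lines carry `nid=` in hexadecimal (`nid=0x1a2b`), which is common; it also accepts a plain decimal `nid=` in case the JDK in use prints it that way. The function name `find_thread` is made up for this example.

```python
import re
from typing import Optional


def find_thread(dump_text: str, os_thread_id: int) -> Optional[str]:
    """Return the jstack header line whose nid matches a decimal thread ID from top -H."""
    hex_id = format(os_thread_id, "x")
    pattern = re.compile(
        r"\bnid=(?:0x0*" + hex_id + r"|" + str(os_thread_id) + r")\b",
        re.IGNORECASE,
    )
    for line in dump_text.splitlines():
        if pattern.search(line):
            return line
    return None
```

The stack frames printed under the returned header line show what that thread is executing.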

Recovery steps:

  1. If high CPU is due to normal business growth, scale out.

  2. If it’s abnormal (infinite loop, heavy computation), restart the service.

  3. Emergency workaround: temporarily lower logging verbosity (e.g., DEBUG → WARN) to cut logging overhead.

Verification: CPU drops. Response times return to normal.

Post‑recovery: Use a flame graph to locate the hottest code path.

04 Failure: Disk Full

Symptoms: Disk usage >90%. Services may fail because they cannot write logs.

Quick diagnosis: df -h

Recovery steps:

  1. du -sh /* | sort -rh | head -10 – find the largest directories.

  2. Clean logs: find /var/log -type f -mtime +7 -delete (run it once with -print in place of -delete to review what will be removed).

  3. Clean temporary files: /tmp, /var/tmp

  4. Expand the disk (most cloud providers support online expansion).
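Before deleting anything, it helps to see which old files would actually free space. The following is a rough dry-run sketch that only lists candidates, largest first. The function name `old_files` and the day-based cutoff are assumptions for illustration, and the cutoff only approximates the rounding that `find -mtime` applies.

```python
import time
from pathlib import Path
from typing import List, Optional, Tuple


def old_files(root: str, older_than_days: float,
              now: Optional[float] = None) -> List[Tuple[int, Path]]:
    """List (size_bytes, path) for regular files not modified within the cutoff, largest first."""
    cutoff = (time.time() if now is None else now) - older_than_days * 86400
    found = []
    for path in Path(root).rglob("*"):
        try:
            # Skip directories and symlinks; only real files hold space here.
            if path.is_symlink() or not path.is_file():
                continue
            st = path.stat()
        except OSError:
            # The file vanished or is unreadable; ignore it in a dry run.
            continue
        if st.st_mtime < cutoff:
            found.append((st.st_size, path))
    return sorted(found, key=lambda item: item[0], reverse=True)
```

Printing the first ten entries of `old_files("/var/log", 7)` gives a quick view of what a cleanup would reclaim. Nothing is deleted by this code.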

Verification: Disk usage drops. Services recover.

Post‑recovery: Check log rotation settings. Is logging too aggressive? Look for core dump files.

05 Failure: Memory Leak

Symptoms: Service becomes slow after running for a while. Restart fixes it temporarily, then it slows again.

Quick diagnosis:

  1. Memory usage rises continuously and never falls.

  2. After a restart, memory returns to normal, then climbs again.
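"Rises continuously and never falls" can be checked against a series of memory samples. The sketch below is one possible heuristic, not a standard algorithm: it requires a positive least-squares trend and a baseline (the lowest value) that is higher in the second half of the window than in the first, so a normal garbage-collection sawtooth is not flagged. The function names and the 1 MB-per-sample threshold are assumptions.

```python
from typing import Sequence


def slope_per_sample(samples: Sequence[float]) -> float:
    """Least-squares slope of the samples against their index."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(samples_mb: Sequence[float], min_slope_mb: float = 1.0) -> bool:
    """Flag a rising trend whose baseline never returns to its earlier level."""
    if len(samples_mb) < 4:
        raise ValueError("need at least 4 samples")
    half = len(samples_mb) // 2
    baseline_rose = min(samples_mb[half:]) > min(samples_mb[:half])
    return baseline_rose and slope_per_sample(samples_mb) >= min_slope_mb
```

For example, `[500, 520, 541, 560, 583, 600]` is flagged, while the sawtooth `[500, 600, 500, 600, 500, 600]` is not, even though its fitted slope is also positive.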

Recovery steps:

  1. Restart the service – immediate recovery (temporary).

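  2. Configure automatic restart. In Kubernetes, a container memory limit gets the container OOM‑killed and restarted when the leak reaches the limit; a livenessProbe restarts it only once the health check starts failing.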

  3. If automatic restart isn’t possible, increase memory limit as a temporary buffer.

Verification: After restart, memory returns to normal.

Post‑recovery: Capture a heap dump and analyse which objects are leaking.

06 Failure: Downstream Dependency Timeout

Symptoms: API response time increases. A distributed trace shows high latency in a downstream call.

Quick diagnosis:

  1. In the trace, identify the slow downstream call.

  2. Call the downstream service directly to confirm slowness.

Recovery steps:

  1. Temporarily reduce the timeout so calls fail fast.

  2. Open a circuit breaker and fall back to a default value or cached response.

  3. Notify the downstream team.
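The circuit-breaker step above can be sketched as a small wrapper around the downstream call. This is a simplified illustration, not a production implementation or any library's API: the class name, the thresholds, and the injectable `clock` parameter are all assumptions, and it is not thread-safe as written.

```python
import time


class CircuitBreaker:
    """After repeated failures, skip the downstream call and serve the fallback for a while."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: fail fast without touching the slow dependency.
                return fallback()
            # Reset period elapsed (half-open): let this one call through as a trial.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Here `func` would be the downstream call with its reduced timeout, and `fallback` would return the default value or cached response.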

Verification: API response time returns to normal (even if degraded data is returned).

Post‑recovery: Is the downstream service under‑provisioned? Network issue? Code bug?

07 Failure: Deadlock

Symptoms: Service is unresponsive but does not report errors. Threads appear stuck.

Quick diagnosis:

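  1. Capture several thread dumps (jstack) a few seconds apart. In a deadlock, the same threads stay BLOCKED in every dump, each waiting for a lock that another one holds; jstack also reports any Java‑level deadlock it detects at the end of the dump.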

  2. Logs show no errors, but requests never return.
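Comparing dumps by eye is error-prone at 3 AM. The sketch below extracts, from each dump, the threads that are waiting to lock a monitor and keeps only those stuck on the same monitor in every dump. It assumes the HotSpot text format, where a blocked thread's stack contains a line such as `- waiting to lock <0x...>`. It covers monitor locks only, and the function names are invented for this example.

```python
import re
from typing import Dict, Sequence

_HEADER = re.compile(r'^"([^"]+)"')
_WAITING = re.compile(r"- waiting to lock <(0x[0-9a-fA-F]+)>")


def blocked_threads(dump_text: str) -> Dict[str, str]:
    """Map thread name -> address of the monitor it is waiting to lock."""
    waiting = {}
    current = None
    for line in dump_text.splitlines():
        header = _HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        match = _WAITING.search(line)
        if match and current is not None:
            waiting[current] = match.group(1)
    return waiting


def stuck_in_all(dumps: Sequence[str]) -> Dict[str, str]:
    """Threads that wait on the same monitor in every one of the dumps."""
    per_dump = [blocked_threads(d) for d in dumps]
    if not per_dump:
        return {}
    first, rest = per_dump[0], per_dump[1:]
    return {
        name: lock
        for name, lock in first.items()
        if all(other.get(name) == lock for other in rest)
    }
```

Threads that appear in the result across dumps taken several seconds apart are the ones to inspect first. A thread blocked only briefly drops out.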

Recovery steps:

  1. Restart the service – the only fast remedy.

  2. If restart isn’t possible, try killing the blocked thread (not recommended – may leave inconsistent state, and the JVM offers no safe way to forcibly stop a single thread).

Verification: After restart, the service recovers.

Post‑recovery: Analyse the code’s lock ordering to prevent circular waiting.
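Circular waiting cannot occur when every code path acquires its locks in the same global order. The idea is language-independent; the sketch below shows it in Python for brevity, ordering locks by object identity. The helper name `acquire_in_order` is made up for this example, and a Java service would typically order by a stable key such as an account ID instead.

```python
import threading
from contextlib import contextmanager


@contextmanager
def acquire_in_order(*locks):
    """Acquire all locks in one fixed global order, regardless of argument order."""
    # De-duplicate by identity, then sort so every caller uses the same sequence.
    unique = {id(lock): lock for lock in locks}
    ordered = [unique[key] for key in sorted(unique)]
    acquired = []
    try:
        for lock in ordered:
            lock.acquire()
            acquired.append(lock)
        yield
    finally:
        # Release in reverse order of acquisition.
        for lock in reversed(acquired):
            lock.release()
```

Two threads calling `acquire_in_order(a, b)` and `acquire_in_order(b, a)` both take the locks in the same sequence, so neither can hold one lock while waiting for the other in the opposite order.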

The Bottom Line

Failures are not the problem. Not knowing what to do when they happen is the problem.

Keep this playbook somewhere accessible. When an alert fires, don’t panic. Find the matching pattern. Follow the steps. Recover first. Investigate later.

One client’s ops lead put this playbook on the team wiki homepage. After every incident, they updated it. He said: “We used to rely on tribal knowledge. Now even a new engineer can follow the steps and recover.”

Does your team have a failure playbook yet?