
Real-World SRE Case Studies


All links sourced directly from howtheysre — no summaries invented.


Airbnb

Key topics: incident management, Kubernetes scaling, security, data protection


Booking.com

Key topics: reliability/product collaboration, incident retrospectives, SLOs for data-intensive services


Capital One

Key topics: chaos engineering, canary deployments, cloud resiliency, incident security


Dropbox

Key topics: Python monolith to managed platform, build health automation, monitoring


eBay

Key topics: Kafka DR, JVM triage, zero-downtime deployments, fault injection


Etsy

Key topics: blameless postmortems (origin), on-call measurement, high-traffic preparation


GitHub

Key topics: availability reports (public), deployment reliability, ChatOps, on-call culture, OpenTelemetry

Engineering posts:

- Deployment reliability at GitHub
- Improving how we deploy GitHub
- Building On-Call Culture at GitHub
- Using ChatOps to help Actions on-call engineers
- Why (and how) GitHub is adopting OpenTelemetry
- Partitioning GitHub's relational databases to handle scale
- Reducing flaky builds by 18x
- MySQL High Availability at GitHub
- How we improved availability through iterative simplification
- How GitHub uses merge queue to ship hundreds of changes every day

Availability reports (public postmortems):

- February 28th DDoS Incident Report
- October 21 post-incident analysis
- February service disruptions post-incident analysis
- Monthly availability reports (2020–2024) — GitHub publishes these every month


Google

Key topics: SRE as a discipline, error budgets, SLOs, ML reliability, on-call scaling

SREcon talks:

- 📺 What's the Difference Between DevOps and SRE? — Seth Vargo & Liz Fong-Jones
- 📺 Risk and Error Budgets — Seth Vargo & Liz Fong-Jones
- 📺 Must Watch — Google SRE YouTube Playlist
- 📺 Zero Touch Prod: Towards Safer Production Environments
- 📺 Scaling SRE Organizations: The Journey from 1 to Many Teams
- 📺 The Map Is Not the Territory: How SLOs Lead Us Astray
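Error budgets, the theme of several talks above, come down to simple arithmetic: an availability SLO implies a fixed allowance of failure per window, and how much of that allowance remains tells a team whether to keep shipping or to stabilize. A minimal illustrative sketch — the function names and numbers are my own, not taken from the talks:

```python
# Illustrative error-budget arithmetic; not from any cited talk.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over a rolling window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left, given good/total request counts."""
    allowed_bad = total * (1.0 - slo_target)
    actual_bad = total - good
    # Guard against a zero budget (slo_target == 1.0 or total == 0).
    return 1.0 - (actual_bad / allowed_bad) if allowed_bad else 0.0

# A 99.9% monthly SLO allows ~43.2 minutes of downtime:
print(round(error_budget_minutes(0.999), 1))  # → 43.2
# 500 failed requests out of 1M against a 99.9% SLO: half the budget spent.
print(round(budget_remaining(0.999, good=999_500, total=1_000_000), 2))  # → 0.5
```

The point the talks make is organizational, not mathematical: once the budget is spent, feature launches pause, which turns reliability into a shared product decision rather than an engineering veto.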


Pinterest

Key topics: Kubernetes scaling, distributed tracing, auto-scaling, CI performance, observability


Shopify

Key topics: high-traffic event resiliency, capacity planning, game days, ChatOps incidents, DNS


Slack

Key topics: public incident reports, chaos engineering, deploy pipeline, observability cost


Spotify

Key topics: Kubernetes developer experience, incident response automation, tracing performance


Stripe

Key topics: canonical log lines, observability, secure builds, metrics aggregation
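Stripe's "canonical log lines" idea is to emit one wide, structured event per request — accumulating context as the request proceeds — instead of many scattered log statements. A minimal sketch of the pattern, assuming a JSON log sink; the class and field names here are illustrative, not Stripe's actual schema:

```python
# Sketch of the canonical-log-line pattern: one structured line per
# request carrying every fact you'd want when debugging it later.
# Field names are illustrative, not Stripe's real schema.
import json
import time

class CanonicalLogLine:
    def __init__(self, **fields):
        self.fields = dict(fields)
        self.start = time.monotonic()

    def set(self, **fields):
        """Accumulate context as the request is handled."""
        self.fields.update(fields)

    def emit(self) -> str:
        """Serialize the single wide event at the end of the request."""
        self.fields["duration_ms"] = round(
            (time.monotonic() - self.start) * 1000, 2
        )
        return json.dumps(self.fields, sort_keys=True)

# Usage: create at request start, enrich along the way, emit once.
line = CanonicalLogLine(method="POST", path="/v1/charges")
line.set(user="acct_123", status=201, db_queries=4)
print(line.emit())
```

Because every fact about the request lands on one line, questions like "which users saw 5xx responses with more than N database queries?" become a single query over one event stream.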


Twitter

Key topics: microservices infrastructure, logging at scale, load balancing, metrics DB


Uber

Key topics: Kafka DR multi-region, Jaeger + M3 observability, on-call culture, failover


Udemy

Key topics: blameless incident reviews, build engineering, monitoring as a service


Patterns That Emerge Across All Companies

Reading across the case studies, these appear consistently:

  1. Public postmortems build trust — GitHub, Slack, and eBay all publish detailed incident analyses, and users consistently respond well to that transparency.
  2. ChatOps is near-universal for incident response — whether Slack bots, Hubot, or custom tooling, nearly every company here eventually builds it.
  3. Blameless culture is a prerequisite — Etsy established the practice in 2012; companies that skip it tend to see the same incidents recur.
  4. SLOs require product buy-in, not just engineering — every SLO rollout story involves convincing non-engineers first.
  5. Chaos engineering scales with organizational maturity — start small with game days, then expand once the culture handles failure well.
  6. Observability investment pays back quickly — every company that invested in tracing, metrics, and structured logs reduced its mean time to detect (MTTD).
  7. On-call health is a leading indicator — poor on-call leads to alert fatigue, alert fatigue to missed signals, and missed signals to incidents.
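Pattern 5's "start small" advice can be as small as a single wrapped dependency: inject failures into one call path during a game day and verify that the caller's fallback actually engages. A hypothetical sketch — the functions, failure rate, and fallback value are all illustrative, not drawn from any company above:

```python
# Game-day-sized fault injection: wrap one dependency so a fraction of
# calls fail, then confirm the degraded path is actually exercised.
# All names and rates here are illustrative.
import random

def flaky(fn, failure_rate: float, rng: random.Random):
    """Wrap fn so that roughly failure_rate of calls raise an error."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapper

def fetch_price(item):
    # Stand-in for the "real" dependency (a pricing service, say).
    return {"item": item, "price": 100}

def price_with_fallback(fetch, item):
    try:
        return fetch(item)["price"]
    except ConnectionError:
        return 0  # degraded default the experiment should exercise

# Seeded RNG makes the game day repeatable.
rng = random.Random(42)
fetch = flaky(fetch_price, failure_rate=0.5, rng=rng)
results = [price_with_fallback(fetch, "sku-1") for _ in range(100)]
print(results.count(0), "of 100 calls hit the fallback")
```

The experiment succeeds not when nothing breaks, but when the observed behavior under injected failure matches what the team predicted beforehand — which is exactly the maturity the pattern describes.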