# Real-World SRE Case Studies
All links sourced directly from howtheysre — no summaries invented.
## Airbnb
Key topics: incident management, Kubernetes scaling, security, data protection
- Automated Incident Management Through Slack
- Alerting Framework at Airbnb
- When The Cloud Gets Dark — How Amazon's Outage Affected Airbnb
- Dynamic Kubernetes Cluster Scaling at Airbnb
- Production Secret Management at Airbnb
- Detecting Vulnerabilities With Vulnture
- Automating Data Protection at Scale, Part 1 · Part 2 · Part 3
## Booking.com
Key topics: reliability/product collaboration, incident retrospectives, SLOs for data-intensive services
- How Reliability and Product Teams Collaborate at Booking.com
- Incidents, fixes, and the day after
- Troubleshooting: A journey into the unknown
- 📺 SLOs for Data-Intensive Services — SREcon19
- 📺 Sailing the Database Seas: Applying SRE Principles at Scale — SREcon24
## Capital One
Key topics: chaos engineering, canary deployments, cloud resiliency, incident security
- The 3 R's of SREs: Resiliency, Recovery & Reliability
- 5 Steps to Getting Your App Chaos Ready
- Embrace the Chaos … Engineering
- 3 Lessons Learned From Implementing Chaos Engineering at Enterprise
- Continuous Chaos — Introducing Chaos Engineering into DevOps Practices
- Deploying with Confidence — Canary Deployments on AWS
- Architecting for Resiliency
- Capital One Data Breach — MIT case study (security incident post-analysis)
## Dropbox
Key topics: Python monolith to managed platform, build health automation, monitoring
- Atlas: Our journey from a Python monolith to a managed platform
- Monitoring server applications with Vortex
- Athena: Our automated build health management system
- SRE Career Framework
- 📺 Service Discovery Challenges at Scale — SREcon19
## eBay
Key topics: Kafka DR, JVM triage, zero-downtime deployments, fault injection
- Resiliency and Disaster Recovery with Kafka
- SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue
- SRE Case Study: Mysterious Traffic Imbalance
- Zero Downtime, Instant Deployment and Rollback
- How eBay's Notification Platform Used Fault Injection in New Ways
## Etsy
Key topics: blameless postmortems (origin), on-call measurement, high-traffic preparation
- Blameless PostMortems and a Just Culture — the founding post
- Etsy's Debriefing Facilitation Guide for Blameless Postmortems
- Opsweekly: Measuring on-call experience with alert classification
- How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020
- Measure Anything, Measure Everything — the original StatsD post (the wire protocol is sketched below)
- Demystifying Site Outages
- 📺 Velocity 09: 10+ Deploys Per Day — Dev and Ops Cooperation at Flickr/Etsy — the talk that launched DevOps
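
The Measure Anything, Measure Everything post above introduced StatsD. Its wire format is simple enough to sketch from scratch; this is a minimal illustration of the plain-UDP protocol, not Etsy's client library, and the metric names are made up:

```python
import socket
import time

# StatsD speaks a plain-text UDP protocol: "name:value|type", where the
# type is c (counter), ms (timer), or g (gauge).
STATSD_ADDR = ("127.0.0.1", 8125)  # assumed local StatsD daemon, default port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(name: str, value: int = 1) -> None:
    """Increment a counter, e.g. sends 'web.requests:1|c'."""
    sock.sendto(f"{name}:{value}|c".encode(), STATSD_ADDR)

def timing(name: str, millis: float) -> None:
    """Record a timer sample in milliseconds."""
    sock.sendto(f"{name}:{millis:.1f}|ms".encode(), STATSD_ADDR)

# "Measure everything": wrap each unit of work with a counter and a timer.
start = time.monotonic()
# ... handle a request here ...
incr("web.requests")
timing("web.request_time", (time.monotonic() - start) * 1000)
```

Because the transport is fire-and-forget UDP, instrumentation like this adds negligible latency to the measured path, which is why the post could argue for measuring everything.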
## GitHub
Key topics: availability reports (public), deployment reliability, ChatOps, on-call culture, OpenTelemetry
Engineering posts:
- Deployment reliability at GitHub
- Improving how we deploy GitHub
- Building On-Call Culture at GitHub
- Using ChatOps to help Actions on-call engineers
- Why (and how) GitHub is adopting OpenTelemetry
- Partitioning GitHub's relational databases to handle scale
- Reducing flaky builds by 18x
- MySQL High Availability at GitHub
- How we improved availability through iterative simplification
- How GitHub uses merge queue to ship hundreds of changes every day
Availability reports (public postmortems):
- February 28th DDoS Incident Report
- October 21 post-incident analysis
- February service disruptions post-incident analysis
- Monthly availability reports (2020–2024) — GitHub publishes these every month
## Google
Key topics: SRE as a discipline, error budgets, SLOs, ML reliability, on-call scaling
- SRE Practices & Processes
- How SRE teams are organized, and how to get started
- Three months, 30x demand: How we scaled Google Meet during COVID-19
- Accelerating incident response using generative AI
- Google site reliability using Go
SREcon talks:
- 📺 What's the Difference Between DevOps and SRE? — Seth Vargo & Liz Fong-Jones
- 📺 Risk and Error Budgets — Seth Vargo & Liz Fong-Jones
- 📺 Google SRE YouTube Playlist — must watch
- 📺 Zero Touch Prod: Towards Safer Production Environments
- 📺 Scaling SRE Organizations: The Journey from 1 to Many Teams
- 📺 The Map Is Not the Territory: How SLOs Lead Us Astray
## Pinterest
Key topics: Kubernetes scaling, distributed tracing, auto-scaling, CI performance, observability
- Scaling Kubernetes with Assurance at Pinterest
- Auto scaling Pinterest
- Distributed tracing at Pinterest with new open source tools
- How we designed our CI System to be more than 50% Faster
- Ensuring High Availability of Ads Realtime Streaming Services
- Upgrading Pinterest operational metrics
- 📺 Evolution of Observability Tools at Pinterest — SREcon19
## Shopify
Key topics: high-traffic event resiliency, capacity planning, game days, ChatOps incidents, DNS
- Resiliency Planning for High-Traffic Events
- Capacity Planning at Scale
- Four Steps to Creating Effective Game Day Tests
- Implementing ChatOps into our Incident Management Procedure
- Using DNS Traffic Management to Add Resiliency to Shopify's Services
- StatsD at Shopify
- 📺 Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures — SREcon19
- 📺 Advanced Napkin Math: Estimating System Performance from First Principles — SREcon19 (a worked example follows below)
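
The napkin math talk listed above argues for estimating system limits from first-principles base rates before building or benchmarking. A worked example in that style, where the throughput figure is a rough assumption rather than a measured number:

```python
# Napkin math: can a nightly job scan a 2 TB table inside a 15-minute window?
# The base rate below is a rough, commonly cited figure, not a measurement.
SSD_SEQ_READ_BYTES_PER_SEC = 1e9   # ~1 GB/s sequential SSD read (assumption)
TABLE_BYTES = 2e12                 # 2 TB
WINDOW_SEC = 15 * 60               # 900 s budget

scan_sec = TABLE_BYTES / SSD_SEQ_READ_BYTES_PER_SEC   # = 2000 s, ~33 min
print(f"full scan ~{scan_sec:.0f}s vs budget {WINDOW_SEC}s")
# 2000 s misses the 900 s budget by more than 2x, so the design fails the
# napkin test: it needs parallel reads, an index, or a smaller working set
# before any benchmarking is worth doing.
```

The value of the technique is that a wrong-by-2x answer still kills a wrong-by-10x design in minutes instead of weeks.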
## Slack
Key topics: public incident reports, chaos engineering, deploy pipeline, observability cost
- Slack's Incident on 2-22-22 — detailed public postmortem
- Slack's Outage on January 4th 2021 — detailed public postmortem
- A Terrible, Horrible, No-Good, Very Bad Day at Slack
- Disasterpiece Theater: Slack's process for approachable Chaos Engineering
- Deploys at Slack
- Infrastructure Observability for Changing the Spend Curve
- 📺 What Breaks Our Systems: A Taxonomy of Black Swans — SREcon19
## Spotify
Key topics: Kubernetes developer experience, incident response automation, tracing performance
- Automated Incident Response Infrastructure in GCP
- Designing a Better Kubernetes Experience for Developers
- Techbytes: What The Industry Misses About Incidents and What You Can Do
- 📺 Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance — SREcon19
## Stripe
Key topics: canonical log lines, observability, secure builds, metrics aggregation
- Fast and flexible observability with canonical log lines (the pattern is sketched below)
- Fast builds, secure builds. Choose two.
- Introducing Veneur: high performance and global aggregation for Datadog
- 📺 How Stripe Invests in Technical Infrastructure — SREcon19
- 📺 The AWS Billing Machine and Optimizing Cloud Costs — SREcon19
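
The canonical log lines post above describes emitting one wide, structured event per request that collects everything interesting about it in key=value form. A minimal sketch of the pattern, with field names that are illustrative rather than Stripe's actual schema:

```python
import time

def emit_canonical_line(fields: dict) -> None:
    # One wide, structured event per request, logfmt-style key=value pairs.
    print("canonical-log-line " + " ".join(f"{k}={v}" for k, v in fields.items()))

def handle_request(method: str, path: str, user_id: str) -> None:
    # Accumulate facts about the request as it is processed...
    fields = {"http_method": method, "http_path": path, "user_id": user_id}
    start = time.monotonic()
    try:
        # ... real handler work would happen here ...
        fields["http_status"] = 200
    except Exception as exc:
        fields["http_status"] = 500
        fields["error"] = type(exc).__name__
        raise
    finally:
        # ...then emit exactly one line per request, success or failure.
        fields["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        emit_canonical_line(fields)

handle_request("GET", "/v1/charges", "user_123")
```

Because every request produces exactly one line with a stable shape, these lines can be queried and aggregated almost like rows in a table.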
## Twitter
Key topics: microservices infrastructure, logging at scale, load balancing, metrics DB
- The Infrastructure Behind Twitter: Scale
- The infrastructure behind Twitter: efficiency and optimization
- Logging at Twitter: Updated
- MetricsDB: TimeSeries Database for storing metrics at Twitter
- Deterministic Aperture: A distributed, load balancing algorithm
- Deleting data distributed throughout your microservices architecture
## Uber
Key topics: Kafka DR multi-region, Jaeger + M3 observability, on-call culture, failover
- Disaster Recovery for Multi-Region Kafka at Uber
- Optimizing Observability with Jaeger, M3, and XYS at Uber
- Engineering Failover Handling in Uber's Mobile Networking Infrastructure
- Founding Uber SRE
- 📺 A Tale of Two Rotations: Building a Humane & Effective On-Call — SREcon19
- 📺 Testing in Production at Scale — SREcon19
- 📺 A History of SRE at Uber
## Udemy
Key topics: blameless incident reviews, build engineering, monitoring as a service
- Blameless Incident Reviews at Udemy
- How Udemy does Build Engineering
- 📺 How to Do SRE When You Have No SRE — SREcon19
## Patterns That Emerge Across All Companies
Reading across the case studies, these patterns appear consistently:
- Public postmortems build trust — GitHub, Slack, and eBay all publish detailed incident analyses, and users consistently respond well to the transparency.
- ChatOps is universal for incident response — Slack bots, Hubot, custom tooling; every company eventually builds this (a minimal handler is sketched after this list).
- Blameless culture is a prerequisite — Etsy established the practice in 2012, and companies that skip it tend to see the same incidents recur.
- SLOs require product buy-in, not just engineering — every SLO rollout story involves convincing non-engineers first (the error-budget arithmetic is sketched after this list).
- Chaos engineering scales with organizational maturity — start small with game days, then expand once the culture handles failure well.
- Observability investment pays back quickly — every company that invested in tracing, metrics, and structured logs reduced its mean time to detection (MTTD).
- On-call health is a leading indicator — poor on-call → alert fatigue → missed signals → incidents.
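
On the ChatOps point: the recurring shape across these companies is a chat command that opens an incident channel and pages the on-call in one step. A minimal sketch of that flow; the command syntax and the create_channel/page_oncall helpers are hypothetical stand-ins for a real chat and paging API, not any company's bot:

```python
import datetime

def create_channel(name: str) -> str:          # hypothetical chat-API wrapper
    print(f"[chat] created #{name}")
    return name

def page_oncall(team: str, message: str) -> None:  # hypothetical pager wrapper
    print(f"[pager] paging {team}: {message}")

def handle_command(text: str, sender: str) -> str:
    """Route '!incident <sev> <summary>' messages typed into chat."""
    parts = text.split(maxsplit=2)
    if parts[:1] != ["!incident"] or len(parts) < 3:
        return "usage: !incident <sev1|sev2> <summary>"
    sev, summary = parts[1], parts[2]
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
    channel = create_channel(f"inc-{stamp}-{sev}")
    page_oncall("sre-primary", f"{sev} declared by {sender}: {summary}")
    return f"incident open in #{channel}, on-call paged"

print(handle_command("!incident sev2 checkout latency spike", "alice"))
```

Keeping declaration, paging, and the timeline in one channel is what makes the pattern stick: responders coordinate where the evidence already is.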
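On the SLO point: an error budget turns the reliability target into a number that product and engineering can negotiate over. A minimal sketch of the arithmetic, with illustrative traffic numbers and an example 99.9% target:

```python
# Error-budget arithmetic for a request-based SLO (illustrative numbers).
SLO_TARGET = 0.999                 # 99.9% of requests succeed (example target)
requests_this_month = 50_000_000
failed_this_month = 32_000

budget = (1 - SLO_TARGET) * requests_this_month   # 50,000 failures allowed
burned = failed_this_month / budget               # fraction of budget spent

print(f"budget: {budget:,.0f} failed requests")
print(f"burned: {burned:.0%} of budget")          # 64%: still room to ship
# Example policy: if burned exceeds 100%, freeze risky deploys until the
# service is back within budget.
```

The same number serves both sides: product spends the remaining budget on feature velocity, engineering spends it on planned risk like migrations and chaos experiments.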
## Related Topics
- Incident Management
- Observability
- On-Call & Runbooks
- Scalability
- howtheysre — full company list (60+ companies)
- awesome-sre — curated blog posts and talks