# Real-World SRE Case Studies
All links sourced directly from howtheysre — no summaries invented.
## Airbnb
Key topics: incident management, Kubernetes scaling, security, data protection
- Automated Incident Management Through Slack
- Alerting Framework at Airbnb
- When The Cloud Gets Dark — How Amazon's Outage Affected Airbnb
- Dynamic Kubernetes Cluster Scaling at Airbnb
- Production Secret Management at Airbnb
- Detecting Vulnerabilities With Vulnture
- Automating Data Protection at Scale, Part 1 · Part 2 · Part 3
## Booking.com
Key topics: reliability/product collaboration, incident retrospectives, SLOs for data-intensive services
- How Reliability and Product Teams Collaborate at Booking.com
- Incidents, fixes, and the day after
- Troubleshooting: A journey into the unknown
- 📺 SLOs for Data-Intensive Services — SREcon19
- 📺 Sailing the Database Seas: Applying SRE Principles at Scale — SREcon24
## Capital One
Key topics: chaos engineering, canary deployments, cloud resiliency, incident security
- The 3 R's of SREs: Resiliency, Recovery & Reliability
- 5 Steps to Getting Your App Chaos Ready
- Embrace the Chaos … Engineering
- 3 Lessons Learned From Implementing Chaos Engineering at Enterprise
- Continuous Chaos — Introducing Chaos Engineering into DevOps Practices
- Deploying with Confidence — Canary Deployments on AWS
- Architecting for Resiliency
- Capital One Data Breach — MIT case study (security incident post-analysis)
## Dropbox
Key topics: Python monolith to managed platform, build health automation, monitoring
- Atlas: Our journey from a Python monolith to a managed platform
- Monitoring server applications with Vortex
- Athena: Our automated build health management system
- SRE Career Framework
- 📺 Service Discovery Challenges at Scale — SREcon19
## eBay
Key topics: Kafka DR, JVM triage, zero-downtime deployments, fault injection
- Resiliency and Disaster Recovery with Kafka
- SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue
- SRE Case Study: Mysterious Traffic Imbalance
- Zero Downtime, Instant Deployment and Rollback
- How eBay's Notification Platform Used Fault Injection in New Ways
## Etsy
Key topics: blameless postmortems (origin), on-call measurement, high-traffic preparation
- Blameless PostMortems and a Just Culture — the founding post
- Etsy's Debriefing Facilitation Guide for Blameless Postmortems
- Opsweekly: Measuring on-call experience with alert classification
- How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020
- Measure Anything, Measure Everything — the original StatsD post (the wire protocol is sketched below)
- Demystifying Site Outages
- 📺 Velocity 09: 10+ Deploys Per Day — Dev and Ops Cooperation at Flickr/Etsy — the talk that launched DevOps
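
The Measure Anything, Measure Everything post above introduced StatsD. Its wire format is simple enough to sketch from scratch; this is a minimal illustration of the plain-UDP protocol, not Etsy's client library, and the metric names are made up:

```python
import socket
import time

# StatsD speaks a plain-text UDP protocol: "name:value|type", where the
# type is c (counter), ms (timer), or g (gauge).
STATSD_ADDR = ("127.0.0.1", 8125)  # assumed local StatsD daemon, default port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(name: str, value: int = 1) -> None:
    """Increment a counter, e.g. sends 'web.requests:1|c'."""
    sock.sendto(f"{name}:{value}|c".encode(), STATSD_ADDR)

def timing(name: str, millis: float) -> None:
    """Record a timer sample in milliseconds."""
    sock.sendto(f"{name}:{millis:.1f}|ms".encode(), STATSD_ADDR)

# "Measure everything": wrap each unit of work with a counter and a timer.
start = time.monotonic()
# ... handle a request here ...
incr("web.requests")
timing("web.request_time", (time.monotonic() - start) * 1000)
```

Because the transport is fire-and-forget UDP, instrumentation like this adds negligible latency to the measured path, which is why the post could argue for measuring everything.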
## GitHub
Key topics: availability reports (public), deployment reliability, ChatOps, on-call culture, OpenTelemetry
Engineering posts:
- Deployment reliability at GitHub
- Improving how we deploy GitHub
- Building On-Call Culture at GitHub
- Using ChatOps to help Actions on-call engineers
- Why (and how) GitHub is adopting OpenTelemetry
- Partitioning GitHub's relational databases to handle scale
- Reducing flaky builds by 18x
- MySQL High Availability at GitHub
- How we improved availability through iterative simplification
- How GitHub uses merge queue to ship hundreds of changes every day
Availability reports (public postmortems):
- February 28th DDoS Incident Report
- October 21 post-incident analysis
- February service disruptions post-incident analysis
- Monthly availability reports (2020–2024) — GitHub publishes these every month
## Google
Key topics: SRE as a discipline, error budgets, SLOs, ML reliability, on-call scaling
- SRE Practices & Processes
- How SRE teams are organized, and how to get started
- Three months, 30x demand: How we scaled Google Meet during COVID-19
- Accelerating incident response using generative AI
- Google site reliability using Go
SREcon talks:
- 📺 What's the Difference Between DevOps and SRE? — Seth Vargo & Liz Fong-Jones
- 📺 Risk and Error Budgets — Seth Vargo & Liz Fong-Jones
- 📺 Google SRE YouTube Playlist — must watch
- 📺 Zero Touch Prod: Towards Safer Production Environments
- 📺 Scaling SRE Organizations: The Journey from 1 to Many Teams
- 📺 The Map Is Not the Territory: How SLOs Lead Us Astray
## Pinterest
Key topics: Kubernetes scaling, distributed tracing, auto-scaling, CI performance, observability
- Scaling Kubernetes with Assurance at Pinterest
- Auto scaling Pinterest
- Distributed tracing at Pinterest with new open source tools
- How we designed our CI System to be more than 50% Faster
- Ensuring High Availability of Ads Realtime Streaming Services
- Upgrading Pinterest operational metrics
- 📺 Evolution of Observability Tools at Pinterest — SREcon19
## Shopify
Key topics: high-traffic event resiliency, capacity planning, game days, ChatOps incidents, DNS
- Resiliency Planning for High-Traffic Events
- Capacity Planning at Scale
- Four Steps to Creating Effective Game Day Tests
- Implementing ChatOps into our Incident Management Procedure
- Using DNS Traffic Management to Add Resiliency to Shopify's Services
- StatsD at Shopify
- 📺 Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures — SREcon19
- 📺 Advanced Napkin Math: Estimating System Performance from First Principles — SREcon19 (a worked example follows below)
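
The napkin math talk listed above argues for estimating system limits from first-principles base rates before building or benchmarking. A worked example in that style, where the throughput figure is a rough assumption rather than a measured number:

```python
# Napkin math: can a nightly job scan a 2 TB table inside a 15-minute window?
# The base rate below is a rough, commonly cited figure, not a measurement.
SSD_SEQ_READ_BYTES_PER_SEC = 1e9   # ~1 GB/s sequential SSD read (assumption)
TABLE_BYTES = 2e12                 # 2 TB
WINDOW_SEC = 15 * 60               # 900 s budget

scan_sec = TABLE_BYTES / SSD_SEQ_READ_BYTES_PER_SEC   # = 2000 s, ~33 min
print(f"full scan ~{scan_sec:.0f}s vs budget {WINDOW_SEC}s")
# 2000 s misses the 900 s budget by more than 2x, so the design fails the
# napkin test: it needs parallel reads, an index, or a smaller working set
# before any benchmarking is worth doing.
```

The value of the technique is that a wrong-by-2x answer still kills a wrong-by-10x design in minutes instead of weeks.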
## Slack
Key topics: public incident reports, chaos engineering, deploy pipeline, observability cost
- Slack's Incident on 2-22-22 — detailed public postmortem
- Slack's Outage on January 4th 2021 — detailed public postmortem
- A Terrible, Horrible, No-Good, Very Bad Day at Slack
- Disasterpiece Theater: Slack's process for approachable Chaos Engineering
- Deploys at Slack
- Infrastructure Observability for Changing the Spend Curve
- 📺 What Breaks Our Systems: A Taxonomy of Black Swans — SREcon19
## Spotify
Key topics: Kubernetes developer experience, incident response automation, tracing performance
- Automated Incident Response Infrastructure in GCP
- Designing a Better Kubernetes Experience for Developers
- Techbytes: What The Industry Misses About Incidents and What You Can Do
- 📺 Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance — SREcon19
## Stripe
Key topics: canonical log lines, observability, secure builds, metrics aggregation
- Fast and flexible observability with canonical log lines (the pattern is sketched below)
- Fast builds, secure builds. Choose two.
- Introducing Veneur: high performance and global aggregation for Datadog
- 📺 How Stripe Invests in Technical Infrastructure — SREcon19
- 📺 The AWS Billing Machine and Optimizing Cloud Costs — SREcon19
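
The canonical log lines post above describes emitting one wide, structured event per request that collects everything interesting about it in key=value form. A minimal sketch of the pattern, with field names that are illustrative rather than Stripe's actual schema:

```python
import time

def emit_canonical_line(fields: dict) -> None:
    # One wide, structured event per request, logfmt-style key=value pairs.
    print("canonical-log-line " + " ".join(f"{k}={v}" for k, v in fields.items()))

def handle_request(method: str, path: str, user_id: str) -> None:
    # Accumulate facts about the request as it is processed...
    fields = {"http_method": method, "http_path": path, "user_id": user_id}
    start = time.monotonic()
    try:
        # ... real handler work would happen here ...
        fields["http_status"] = 200
    except Exception as exc:
        fields["http_status"] = 500
        fields["error"] = type(exc).__name__
        raise
    finally:
        # ...then emit exactly one line per request, success or failure.
        fields["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        emit_canonical_line(fields)

handle_request("GET", "/v1/charges", "user_123")
```

Because every request produces exactly one line with a stable shape, these lines can be queried and aggregated almost like rows in a table.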
## Twitter
Key topics: microservices infrastructure, logging at scale, load balancing, metrics DB
- The Infrastructure Behind Twitter: Scale
- The infrastructure behind Twitter: efficiency and optimization
- Logging at Twitter: Updated
- MetricsDB: TimeSeries Database for storing metrics at Twitter
- Deterministic Aperture: A distributed, load balancing algorithm
- Deleting data distributed throughout your microservices architecture
## Uber
Key topics: Kafka DR multi-region, Jaeger + M3 observability, on-call culture, failover
- Disaster Recovery for Multi-Region Kafka at Uber
- Optimizing Observability with Jaeger, M3, and XYS at Uber
- Engineering Failover Handling in Uber's Mobile Networking Infrastructure
- Founding Uber SRE
- 📺 A Tale of Two Rotations: Building a Humane & Effective On-Call — SREcon19
- 📺 Testing in Production at Scale — SREcon19
- 📺 A History of SRE at Uber
## Udemy
Key topics: blameless incident reviews, build engineering, monitoring as a service
- Blameless Incident Reviews at Udemy
- How Udemy does Build Engineering
- 📺 How to Do SRE When You Have No SRE — SREcon19
## Patterns That Emerge Across All Companies
Reading across the case studies, these patterns appear consistently:
- Public postmortems build trust — GitHub, Slack, and eBay all publish detailed incident analyses, and users consistently respond well to the transparency.
- ChatOps is universal for incident response — Slack bots, Hubot, custom tooling; every company eventually builds this (a minimal handler is sketched after this list).
- Blameless culture is a prerequisite — Etsy established the practice in 2012, and companies that skip it tend to see the same incidents recur.
- SLOs require product buy-in, not just engineering — every SLO rollout story involves convincing non-engineers first (the error-budget arithmetic is sketched after this list).
- Chaos engineering scales with organizational maturity — start small with game days, then expand once the culture handles failure well.
- Observability investment pays back quickly — every company that invested in tracing, metrics, and structured logs reduced its mean time to detection (MTTD).
- On-call health is a leading indicator — poor on-call → alert fatigue → missed signals → incidents.
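
On the ChatOps point: the recurring shape across these companies is a chat command that opens an incident channel and pages the on-call in one step. A minimal sketch of that flow; the command syntax and the create_channel/page_oncall helpers are hypothetical stand-ins for a real chat and paging API, not any company's bot:

```python
import datetime

def create_channel(name: str) -> str:          # hypothetical chat-API wrapper
    print(f"[chat] created #{name}")
    return name

def page_oncall(team: str, message: str) -> None:  # hypothetical pager wrapper
    print(f"[pager] paging {team}: {message}")

def handle_command(text: str, sender: str) -> str:
    """Route '!incident <sev> <summary>' messages typed into chat."""
    parts = text.split(maxsplit=2)
    if parts[:1] != ["!incident"] or len(parts) < 3:
        return "usage: !incident <sev1|sev2> <summary>"
    sev, summary = parts[1], parts[2]
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")
    channel = create_channel(f"inc-{stamp}-{sev}")
    page_oncall("sre-primary", f"{sev} declared by {sender}: {summary}")
    return f"incident open in #{channel}, on-call paged"

print(handle_command("!incident sev2 checkout latency spike", "alice"))
```

Keeping declaration, paging, and the timeline in one channel is what makes the pattern stick: responders coordinate where the evidence already is.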
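On the SLO point: an error budget turns the reliability target into a number that product and engineering can negotiate over. A minimal sketch of the arithmetic, with illustrative traffic numbers and an example 99.9% target:

```python
# Error-budget arithmetic for a request-based SLO (illustrative numbers).
SLO_TARGET = 0.999                 # 99.9% of requests succeed (example target)
requests_this_month = 50_000_000
failed_this_month = 32_000

budget = (1 - SLO_TARGET) * requests_this_month   # 50,000 failures allowed
burned = failed_this_month / budget               # fraction of budget spent

print(f"budget: {budget:,.0f} failed requests")
print(f"burned: {burned:.0%} of budget")          # 64%: still room to ship
# Example policy: if burned exceeds 100%, freeze risky deploys until the
# service is back within budget.
```

The same number serves both sides: product spends the remaining budget on feature velocity, engineering spends it on planned risk like migrations and chaos experiments.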
## Related Topics
- Incident Management
- Observability
- On-Call & Runbooks
- Scalability
- howtheysre — full company list (60+ companies)
- awesome-sre — curated blog posts and talks