SRE: Site Reliability Engineering
Site Reliability Engineering applies software engineering principles to infrastructure and operations problems. The goal: build and run systems that are scalable, reliable, and efficient.
Topics
| Topic | What you'll learn |
|---|---|
| Fundamentals | SRE vs DevOps, error budgets, toil |
| SLOs / SLIs / SLAs | Defining and measuring reliability |
| Observability | Metrics, logs, traces, dashboards |
| Incident Management | Detection, response, postmortems |
| On-Call & Runbooks | Rotations, alerting, runbook design |
| Scalability | Capacity planning, load testing, patterns |
| Case Studies | How Netflix, Google, Uber, and Airbnb practice SRE |
Learning Path
Beginner: Fundamentals → SLOs/SLIs/SLAs → Observability
Intermediate: Incident Management → On-Call practices
Advanced: Scalability → Chaos Engineering → Error Budget policy
Key Resources
- awesome-sre — Curated reading list, tools, conference talks
- howtheysre — Real-world SRE at Netflix, Google, Uber, etc.
- sre-collection — Interview prep and job resources
- devops-exercises — Hands-on Q&A: Linux, Kubernetes, networking
- awesome-scalability — System design patterns
Essential Reading
- Google SRE Book (free online)
- Google SRE Workbook (free online)