SRE: Site Reliability Engineering

Site Reliability Engineering applies software engineering principles to infrastructure and operations problems. The goal: build and run systems that are scalable, reliable, and efficient.

Topics

Topic	What you'll learn
Fundamentals	SRE vs DevOps, error budgets, toil
SLOs / SLIs / SLAs	Defining and measuring reliability
Observability	Metrics, logs, traces, dashboards
Incident Management	Detection, response, postmortems
On-Call & Runbooks	Rotations, alerting, runbook design
Scalability	Capacity planning, load testing, patterns
Case Studies	How Netflix, Google, Uber, Airbnb do SRE
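
As a small taste of the SLO and error-budget topics listed above, the arithmetic behind an error budget can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `error_budget_minutes` and `budget_remaining`, the 30-day default window, and the request-based SLI are all assumptions made for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLI.

    Returns 1.0 when nothing has been spent, 0.0 when the budget is exactly
    used up, and a negative value when the SLO has been violated.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        return 1.0  # no traffic yet, so no budget has been consumed
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```

For a 99.9% availability target over 30 days, the budget works out to roughly 43.2 minutes of allowed downtime. In the second call, 500 failed requests out of 1,000,000 consume about half of the budget.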

Learning Path

Beginner: Fundamentals → SLOs/SLIs/SLAs → Observability
Intermediate: Incident Management → On-Call practices
Advanced: Scalability → Chaos Engineering → Error Budget policy

Key Resources

awesome-sre — Curated reading list, tools, conference talks
howtheysre — Real-world SRE at Netflix, Google, Uber, etc.
sre-collection — Interview prep and job resources
devops-exercises — Hands-on Q&A: Linux, Kubernetes, networking
awesome-scalability — System design patterns

Essential Reading

Google SRE Book (free online)
Google SRE Workbook (free online)
