SRE — Site Reliability Engineering

Site Reliability Engineering applies software engineering principles to infrastructure and operations problems. The goal: build and run systems that are scalable, reliable, and efficient.


Topics

Topic                 What you'll learn
Fundamentals          SRE vs DevOps, error budgets, toil
SLOs / SLIs / SLAs    Defining and measuring reliability
Observability         Metrics, logs, traces, dashboards
Incident Management   Detection, response, postmortems
On-Call & Runbooks    Rotations, alerting, runbook design
Scalability           Capacity planning, load testing, patterns
Case Studies          How Netflix, Google, Uber, Airbnb do SRE

Learning Path

Beginner: Fundamentals → SLOs/SLIs/SLAs → Observability
Intermediate: Incident Management → On-Call practices
Advanced: Scalability → Chaos Engineering → Error Budget policy

Key Resources

Essential Reading

