SRE Fundamentals
What is SRE? [B]
SRE is what you get when you treat operations as a software engineering problem. Coined at Google, it replaces the traditional ops/dev wall with shared ownership of reliability.
Key principle: Reliability is a feature. It must be engineered, measured, and traded off against velocity.
SRE vs DevOps
| | SRE | DevOps |
|---|---|---|
| Origin | Google | Community/industry |
| Focus | Reliability metrics, error budgets | Culture, collaboration, automation |
| Role | Distinct SRE team | Embedded or shared responsibility |
| Prescriptiveness | High (specific practices) | Low (principles) |
SRE is an opinionated implementation of DevOps.
The Four Golden Signals [B]
Monitor these for any user-facing service:
- Latency: how long requests take (track successful and failed requests separately, since fast failures can hide slow successes)
- Traffic: how much demand the service receives (RPS, QPS, concurrent users)
- Errors: rate of failed requests, both explicit (HTTP 5xx) and implicit (a successful response carrying wrong data)
- Saturation: how "full" the service is (CPU, memory, queue depth)
→ See Observability for how to instrument these.
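As a rough illustration, the four signals can be derived from a batch of request records collected over one observation window. This is a minimal sketch, not a monitoring implementation: the `Request` record, the `golden_signals` function, the choice of the 99th percentile, and the use of CPU utilization as the saturation measure are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def percentile(samples, pct):
    """Nearest-rank percentile; returns None when there are no samples."""
    if not samples:
        return None
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s, cpu_utilization):
    """Summarize the four golden signals for one observation window.

    Only explicit errors (status >= 500) are counted here; implicit errors
    such as wrong data need application-level checks.
    """
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    total = len(requests)
    return {
        # Latency, reported separately for successful and failed requests
        "latency_p99_ok_ms": percentile(ok, 99),
        "latency_p99_failed_ms": percentile(failed, 99),
        # Traffic, as requests per second over the window
        "traffic_rps": total / window_s,
        # Errors, as the fraction of requests that failed
        "error_rate": len(failed) / total if total else 0.0,
        # Saturation, assumed here to be supplied by the host as a 0..1 value
        "saturation": cpu_utilization,
    }
```

In practice these values usually come from a metrics system rather than from raw request lists, but the definitions stay the same.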
Error Budgets [B]
If your SLO is 99.9% availability, your error budget is the remaining 0.1%: about 43 minutes of downtime in a 30-day month (0.001 × 43,200 min ≈ 43.2 min).
- Budget remaining → ship features faster, take risks
- Budget exhausted → freeze releases, focus on reliability
Why this matters: It turns reliability into a shared business decision, not a blame game.
→ See SLOs / SLIs / SLAs for how to define and track these.
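The arithmetic and the ship/freeze rule above can be sketched in a few lines. The function names and the 30-day window are assumptions for illustration; real policies often use a rolling window and more than two outcomes.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def release_policy(slo, downtime_minutes, window_days=30):
    """Return 'ship' while budget remains, 'freeze' once it is exhausted."""
    if budget_remaining(slo, downtime_minutes, window_days) > 0:
        return "ship"
    return "freeze"


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))       # budget for a 99.9% SLO
    print(release_policy(0.999, downtime_minutes=10))  # budget left
    print(release_policy(0.999, downtime_minutes=50))  # budget spent
```

Tightening the SLO to 99.99% shrinks the same 30-day budget to roughly 4.3 minutes, which is why each additional "nine" is a deliberate trade-off against release velocity.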
Toil [I]
Toil = manual, repetitive, automatable operational work that scales with traffic.
Characteristics of toil:
- Manual
- Repetitive
- Automatable
- Tactical (not strategic)
- No enduring value
- Grows as the service grows

SRE goal: keep toil below 50% of each engineer's time. The rest goes to engineering work that reduces future toil.
Common toil examples:
- Manually restarting pods/services
- Responding to false-positive alerts
- Manual certificate rotations
- Hand-editing config files per deployment
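One way to check the 50% goal is to log hours per task and compute the share spent on toil. The sketch below assumes a simple hand-maintained time log; the task names, hour values, and the `toil_fraction` helper are hypothetical.

```python
TOIL_CEILING = 0.5  # the "below 50% of time" goal


def toil_fraction(hours_by_task, toil_tasks):
    """Share of logged hours spent on tasks classified as toil."""
    total = sum(hours_by_task.values())
    if total == 0:
        return 0.0
    toil = sum(h for task, h in hours_by_task.items() if task in toil_tasks)
    return toil / total


# Hypothetical week for one engineer (hours)
week = {
    "restart pods": 6,
    "certificate rotation": 4,
    "false-positive alerts": 12,
    "automation project": 18,
}
toil_tasks = {"restart pods", "certificate rotation", "false-positive alerts"}

fraction = toil_fraction(week, toil_tasks)
over_ceiling = fraction >= TOIL_CEILING  # True here: 22 of 40 hours is toil
```

Tracking this per engineer rather than per team helps surface cases where one person absorbs most of the toil.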
Reliability Hierarchy [I]
Before worrying about features, nail these in order:
1. Monitoring: know when things break
2. Incident response: fix things fast
3. Postmortems: learn from failures (blameless)
4. Testing & release: catch problems before prod
5. Capacity planning: don't run out of runway
6. Efficiency: do more with less
→ Incident Management | On-Call
Cognitive Load & On-Call Health [I]
Signs of an unhealthy SRE practice:
- Alert fatigue (> 5 pages/shift)
- No time for project work
- Incidents repeat without postmortems
- On-call means firefighting, not engineering
→ See On-Call & Runbooks
Chaos Engineering [A]
Intentionally inject failures to build confidence in the system's resilience.
Principles:
1. Define steady state (measurable normal behavior)
2. Hypothesize that steady state continues in both the control and experiment groups
3. Introduce realistic failure variables (kill a pod, inject latency, drop a region)
4. Try to disprove the hypothesis by looking for a difference in steady state between the two groups
Tools: Chaos Monkey, Litmus, Gremlin, AWS Fault Injection Simulator
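The experiment loop can be illustrated with a toy model that needs no real infrastructure. Everything below is an assumption for the sketch: replicas are booleans, the client tries replicas round-robin with a fixed number of retries, "killing a pod" means marking one replica unhealthy, and steady state is the request success rate. None of it reflects the API of the tools listed above.

```python
def success_rate(replicas_healthy, retries, requests=100):
    """Toy client: round-robin over replicas, retrying on the next replica."""
    n = len(replicas_healthy)
    ok = 0
    for i in range(requests):
        # A request succeeds if any of its (1 + retries) attempts
        # lands on a healthy replica.
        if any(replicas_healthy[(i + k) % n] for k in range(retries + 1)):
            ok += 1
    return ok / requests


def run_experiment(replicas=3, retries=1, tolerance=0.01):
    """Run one chaos experiment against the toy service."""
    # 1. Steady state: success rate with every replica healthy (control group)
    control = [True] * replicas
    baseline = success_rate(control, retries)

    # 3. Failure variable: kill one replica in the experiment group
    experiment = [True] * replicas
    experiment[0] = False
    observed = success_rate(experiment, retries)

    # 2 + 4. Hypothesis: the experiment group stays within tolerance of the
    # control group. A larger drop disproves it.
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_holds": baseline - observed <= tolerance,
    }


with_retry = run_experiment(retries=1)     # retry masks the dead replica
without_retry = run_experiment(retries=0)  # every third request fails
```

With one retry the hypothesis holds, because each request that hits the dead replica succeeds on the next one. Without retries the success rate drops to 0.66, the hypothesis is disproved, and the experiment has exposed a resilience gap.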
→ Related: Scalability