SRE Fundamentals¶

What is SRE? `[B]`¶

Site Reliability Engineering (SRE) is what you get when you treat operations as a software engineering problem. The term was coined at Google, and the practice replaces the traditional ops/dev wall with shared ownership of reliability.

Key principle: Reliability is a feature. It must be engineered, measured, and traded off against velocity.

SRE vs DevOps¶

| Aspect | SRE | DevOps |
|---|---|---|
| Origin | Google | Community/industry |
| Focus | Reliability metrics, error budgets | Culture, collaboration, automation |
| Role | Distinct SRE team | Embedded or shared responsibility |
| Prescriptiveness | High (specific practices) | Low (principles) |

SRE is an opinionated implementation of DevOps.

The Four Golden Signals `[B]`¶

Monitor these for any user-facing service:

Latency — how long requests take (distinguish successful vs failed)
Traffic — how much demand (RPS, QPS, concurrent users)
Errors — rate of failed requests (explicit 5xx, implicit wrong data)
Saturation — how "full" the service is (CPU, memory, queue depth)
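
As a rough illustration of the four signals, the sketch below summarizes one observation window of requests. The `Request` record, the `golden_signals` function, and the use of queue depth as the saturation measure are assumptions made for this example, not part of any particular monitoring stack. It only counts explicit 5xx errors; implicit errors such as wrong data would need content checks.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record for this example."""
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile; returns None when there are no samples."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s, queue_depth, queue_capacity):
    """Summarize one observation window into the four golden signals."""
    # Latency is tracked separately for successful and failed requests.
    ok = [r.latency_ms for r in requests if r.status < 500]
    failed = [r.latency_ms for r in requests if r.status >= 500]
    total = len(requests)
    return {
        "latency_ok_p95_ms": percentile(ok, 95),
        "latency_failed_p95_ms": percentile(failed, 95),
        "traffic_rps": total / window_s,
        "error_rate": len(failed) / total if total else 0.0,
        "saturation": queue_depth / queue_capacity,
    }


window = [Request(100, 200), Request(200, 200), Request(300, 200), Request(50, 500)]
signals = golden_signals(window, window_s=2, queue_depth=8, queue_capacity=10)
print(signals)
```

In practice these values usually come from a metrics system rather than raw request lists; the point here is only how each signal is defined.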

→ See Observability for how to instrument these.

Error Budgets `[B]`¶

If your SLO is 99.9% availability → you have a 0.1% error budget ≈ 43 min of downtime per 30-day month (0.001 × 43,200 min).

Budget remaining → ship features faster, take risks
Budget exhausted → freeze releases, focus on reliability

Why this matters: It turns reliability into a shared business decision, not a blame game.
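
The arithmetic and the ship/freeze rule above can be sketched in a few lines. The function names and the 30-day period are assumptions for illustration; real policies usually track the budget over a rolling window and may have intermediate states.

```python
def error_budget_minutes(slo, period_days=30):
    """Allowed downtime in minutes for an availability SLO given as a fraction."""
    total_minutes = period_days * 24 * 60
    return (1 - slo) * total_minutes


def release_policy(slo, downtime_minutes, period_days=30):
    """Return 'ship' while budget remains, 'freeze' once it is exhausted."""
    remaining = error_budget_minutes(slo, period_days) - downtime_minutes
    return "ship" if remaining > 0 else "freeze"


budget = error_budget_minutes(0.999)      # about 43.2 minutes
print(round(budget, 1))
print(release_policy(0.999, downtime_minutes=10))  # budget remaining
print(release_policy(0.999, downtime_minutes=50))  # budget exhausted
```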

→ See SLOs / SLIs / SLAs for how to define and track these.

Toil `[I]`¶

Toil = manual, repetitive, automatable operational work that scales linearly as the service grows.

Characteristics of toil:

- Manual
- Repetitive
- Automatable
- Tactical (not strategic)
- No enduring value
- Grows as service grows

SRE goal: Keep toil below 50% of each SRE's time. The rest = engineering work (reducing future toil).

Common toil examples:

- Manually restarting pods/services
- Responding to false-positive alerts
- Manual certificate rotations
- Hand-editing config files per deployment
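
One way to check the 50% target is to compute the share of logged hours spent on toil. This is a minimal sketch: the category names and the idea of a per-category hours log are assumptions for the example, and how a team classifies work as toil is a judgment call.

```python
def toil_fraction(hours_by_category, toil_categories):
    """Fraction of logged hours that fall into categories labeled as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for c, h in hours_by_category.items() if c in toil_categories)
    return toil / total


# Hypothetical one-week time log for a single engineer.
week = {"manual restarts": 6, "alert triage": 10, "automation project": 24}
fraction = toil_fraction(week, toil_categories={"manual restarts", "alert triage"})
print(f"toil: {fraction:.0%}, within target: {fraction < 0.5}")
```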

Reliability Hierarchy `[I]`¶

Before worrying about features, nail these in order:

Monitoring — know when things break
Incident response — fix things fast
Postmortems — learn from failures (blameless)
Testing & release — catch problems before prod
Capacity planning — don't run out of runway
Efficiency — do more with less

→ Incident Management | On-Call

Cognitive Load & On-Call Health `[I]`¶

Signs of an unhealthy SRE practice:

- Alert fatigue (> 5 pages/shift)
- No time for project work
- Incidents repeat without postmortems
- On-call == firefighting, not engineering
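
The alert-fatigue threshold above is straightforward to check from a paging log. The sketch below assumes a hypothetical log with one shift identifier per page received; the function name and shift labels are illustrative only.

```python
from collections import Counter


def noisy_shifts(pages, max_pages_per_shift=5):
    """Return shifts whose page count exceeds the threshold.

    pages: iterable of shift identifiers, one entry per page received.
    """
    counts = Counter(pages)
    return {shift: n for shift, n in counts.items() if n > max_pages_per_shift}


# Hypothetical paging log: six pages on one shift, fewer on the others.
log = ["mon-day"] * 6 + ["mon-night"] * 2 + ["tue-day"] * 5
print(noisy_shifts(log))
```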

→ See On-Call & Runbooks

Chaos Engineering `[A]`¶

Intentionally inject failures to build confidence in the system's resilience.

Principles:

1. Define steady state (normal behavior)
2. Hypothesize it continues in both control and experiment groups
3. Introduce realistic failure variables (kill a pod, inject latency, drop a region)
4. Try to disprove the hypothesis by comparing steady state between the two groups
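
The four principles can be sketched against a toy, fully simulated service. Everything here is an assumption for illustration: the round-robin routing model, the `failover` flag, and the 1% tolerance. Real experiments run against actual infrastructure using tools like the ones listed in this section.

```python
def run_requests(replicas, n_requests, failover):
    """Route requests round-robin over replicas; return the success fraction.

    replicas: list of booleans, True when the replica is healthy.
    failover: when True, a request to a dead replica is retried on a healthy one.
    """
    ok = 0
    for i in range(n_requests):
        target = i % len(replicas)
        if replicas[target]:
            ok += 1
        elif failover and any(replicas):
            ok += 1  # retried successfully on a healthy replica
    return ok / n_requests


def chaos_experiment(n_replicas, n_requests, failover, tolerance=0.01):
    """Compare steady state (success rate) between control and experiment."""
    # 1. Steady state: success rate with all replicas healthy (control group).
    control = run_requests([True] * n_replicas, n_requests, failover)
    # 3. Inject a realistic failure: kill one replica in the experiment group.
    injected = [True] * n_replicas
    injected[0] = False
    experiment = run_requests(injected, n_requests, failover)
    # 2 and 4. The hypothesis is that steady state holds; try to disprove it.
    hypothesis_holds = abs(control - experiment) <= tolerance
    return control, experiment, hypothesis_holds


print(chaos_experiment(3, 300, failover=True))   # hypothesis survives
print(chaos_experiment(3, 300, failover=False))  # hypothesis disproved
```

A disproved hypothesis is the useful outcome: it points at a resilience gap (here, missing failover) before a real outage does.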

Tools: Chaos Monkey, Litmus, Gremlin, AWS Fault Injection Simulator

→ Related: Scalability
