Architecture Best Practices — Blast Radius, Least Privilege, Scalability¶

Blast Radius Minimization¶

Blast radius = the maximum impact of a single failure. Good architecture makes blast radius small, predictable, and bounded before any failure occurs.

Account / Project Isolation¶

The strongest blast radius boundary in the cloud is the account boundary. A security incident, runaway cost, IAM misconfiguration, or quota exhaustion in one account stays in that account unless you have explicitly granted cross-account access.

❌ Everything in one account
   → A misconfigured IAM policy → access to all systems
   → A runaway Lambda → bill shock for the entire company
   → A compromised key → full blast radius

✅ One account per production workload
   prod-payments  /  prod-orders  /  prod-notifications
   → Compromise of prod-payments ≠ access to prod-orders
   → Quota exhaustion in one account doesn't starve another

Cell-Based Architecture¶

Divide the workload into identical, independent cells. Each cell serves a subset of users, so a cell failure impacts only that cell's users.

Reference: AWS Well-Architected — Use cell-based architecture · AWS Builders' Library — Reliability and constant work

                 ┌──── Load Balancer ────┐
                 │    (routes by hash)   │
                 └──────────┬───────────┘
                            │
          ┌─────────────────┼─────────────────┐
          │                 │                 │
    ┌─────┴──────┐   ┌──────┴─────┐   ┌──────┴─────┐
    │   Cell A   │   │   Cell B   │   │   Cell C   │
    │ users 0-33%│   │users 33-66%│   │users 66-99%│
    │            │   │            │   │            │
    │ App + DB   │   │ App + DB   │   │ App + DB   │
    │ Cache      │   │ Cache      │   │ Cache      │
    └────────────┘   └────────────┘   └────────────┘

Used by Amazon (documented in AWS Builders' Library) and Netflix. A bad deployment to Cell A — caught by canary metrics — doesn't roll to B or C.

Routing strategy:

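import hashlib

def get_cell(user_id: str, total_cells: int = 3) -> int:
    # MD5 serves only as a stable, evenly distributed hash here, not for security
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % total_cells

# Same user always → same cell (sticky routing)
# Cell failure → only the users mapped to that cell are affected
# Changing total_cells remaps most users, so treat adding a cell as a migration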
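To make the blast radius of a single cell concrete, the sketch below reuses the hash routing above and measures what share of users lands on one cell. The user IDs and the `affected_fraction` helper are assumptions for illustration, not part of any cell-routing library.

```python
import hashlib
from collections import Counter


def get_cell(user_id: str, total_cells: int = 3) -> int:
    """Map a user to a cell with a stable hash (same scheme as the routing function above)."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % total_cells


def affected_fraction(user_ids: list[str], failed_cell: int, total_cells: int = 3) -> float:
    """Fraction of users routed to the failed cell, i.e. the blast radius of that failure."""
    hit = sum(1 for u in user_ids if get_cell(u, total_cells) == failed_cell)
    return hit / len(user_ids)


# Hypothetical user IDs, for illustration only.
users = [f"user-{i}" for i in range(3000)]
per_cell = Counter(get_cell(u) for u in users)
blast = affected_fraction(users, failed_cell=0)
```

With three cells, `blast` should come out near one third. Adding cells shrinks the share of users that a single cell failure can reach.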

Bulkheads¶

Reference: Azure Architecture Center — Bulkhead Pattern · Release It! — Michael T. Nygard

Isolate resource pools per consumer type. A slow consumer can then exhaust only its own thread pool or connection pool and cannot starve other consumers.

❌ Shared connection pool
   → Reporting query holds 100 connections → API requests starve → outage

✅ Separate pools per workload class
   API requests:      pool size 50, timeout 100ms
   Background jobs:   pool size 20, timeout 30s
   Reporting queries: pool size 5,  timeout 300s

from concurrent.futures import ThreadPoolExecutor

# Separate thread pools per workload
api_executor = ThreadPoolExecutor(max_workers=50)
batch_executor = ThreadPoolExecutor(max_workers=10)
report_executor = ThreadPoolExecutor(max_workers=3)

# If report_executor is saturated → only reports degrade, API is unaffected
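A bulkhead can also be expressed as a concurrency limit that rejects work once it is full. The following is a minimal sketch under that assumption. The `Bulkhead` class and its method names are hypothetical, and the limits mirror the thread pool sizes above.

```python
import threading


class Bulkhead:
    """Caps concurrent calls for one workload class; rejects instead of queueing when full."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_enter(self) -> bool:
        # Non-blocking acquire: a full bulkhead fails fast rather than waiting.
        return self._slots.acquire(blocking=False)

    def leave(self) -> None:
        self._slots.release()


api = Bulkhead("api", max_concurrent=50)
reports = Bulkhead("reports", max_concurrent=3)

# Saturate the reporting bulkhead with three slow queries.
held = [reports.try_enter() for _ in range(3)]
fourth_report_admitted = reports.try_enter()   # no slot left in the reports bulkhead
api_admitted = api.try_enter()                 # the API bulkhead is unaffected
```

Rejecting the fourth report immediately keeps the failure visible and local. Queueing it would hide the saturation until timeouts start to cascade.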

Progressive Delivery (Canary / Ring Deployments)¶

Reference: Martin Fowler — Canary Release · AWS — Automating safe, hands-off deployments

Release to a small percentage of users first. Limit the blast radius of a bad deployment.

Ring 0: internal users only (1% of traffic)
  → validate for 30 minutes
Ring 1: 5% of users
  → validate for 1 hour
Ring 2: 20% of users
  → validate for 2 hours
Ring 3: 100% of users

# Argo Rollouts — canary with automatic analysis
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5      # 5% of traffic to new version
        - pause: {duration: 10m}
        - analysis:         # auto-rollback if error rate > 1%
            templates:
              - templateName: error-rate
        - setWeight: 20
        - pause: {duration: 20m}
        - setWeight: 100
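The promotion decision between rings can be sketched as a small pure function. The ring names and traffic percentages follow the ring list above, and the 1% error threshold follows the rollout comment. The `Ring` type and the `next_action` function are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ring:
    name: str
    traffic_percent: int


RINGS = [Ring("ring-0", 1), Ring("ring-1", 5), Ring("ring-2", 20), Ring("ring-3", 100)]


def next_action(current: int, error_rate: float, max_error_rate: float = 0.01) -> str:
    """Decide what to do once the validation window of RINGS[current] has elapsed."""
    if error_rate > max_error_rate:
        return "rollback"                      # stop before the blast radius grows
    if current == len(RINGS) - 1:
        return "done"                          # already serving 100% of users
    nxt = RINGS[current + 1]
    return f"promote to {nxt.name} ({nxt.traffic_percent}%)"
```

For example, `next_action(0, 0.002)` promotes to ring 1, while `next_action(1, 0.03)` rolls back, so only the 5% of users in ring 1 ever saw the bad release.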

Feature Flags¶

Reference: Martin Fowler — Feature Toggles

Decouple deployment from release. Ship code to 100% of servers, enable for 0% of users. Roll out gradually, roll back without a deployment.

# LaunchDarkly / Unleash / Flagsmith
if feature_flags.is_enabled("new-checkout-flow", user_id=user.id):
    return new_checkout_flow(cart)
else:
    return legacy_checkout_flow(cart)

Blast radius of a bad feature:
- With flags: disable the flag → near-instant rollback, no redeploy
- Without flags: roll back the deployment → minutes of exposure and downtime risk
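Gradual rollout is commonly implemented by hashing each user into a stable bucket and comparing it to the rollout percentage. The sketch below assumes that approach. `rollout_bucket` and `is_enabled` are hypothetical helpers, not the API of LaunchDarkly, Unleash, or Flagsmith.

```python
import hashlib


def rollout_bucket(flag: str, user_id: str) -> int:
    """Stable bucket in [0, 100) derived from the flag name and the user ID."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """A user is in the rollout when their bucket falls below the percentage."""
    return rollout_bucket(flag, user_id) < rollout_percent


# Hypothetical user IDs, for illustration only.
users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if is_enabled("new-checkout-flow", u, 10)}
at_50 = {u for u in users if is_enabled("new-checkout-flow", u, 50)}
```

Because the bucket is stable, raising the percentage only adds users. Everyone enabled at 10% stays enabled at 50%, and setting the percentage to 0 disables the feature for all users without a deployment.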

Circuit Breakers¶

Reference: Martin Fowler — Circuit Breaker · Azure Architecture Center — Circuit Breaker Pattern

Stop calling a failing dependency. Give it time to recover instead of hammering it with failing requests.

from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=30, expected_exception=TimeoutError)
def call_payment_service(payload):
    return payment_client.charge(payload)

# After 5 timeouts in a row:
# → Circuit OPEN: calls fail fast (no timeout wait) for 30 seconds
# → Circuit HALF-OPEN: one test call
# → Circuit CLOSED: back to normal if test succeeds

States:

CLOSED (healthy) → failure_threshold exceeded → OPEN (fast fail)
                                                    ↓ recovery_timeout
                                               HALF-OPEN (test call)
                                                ↓              ↓
                                            success         failure
                                            CLOSED          OPEN
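The state machine above can be sketched in a few lines. This is an illustrative implementation under simple assumptions (consecutive-failure counting, a single trial call, an injectable clock), not the internals of the `circuitbreaker` package.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "HALF-OPEN"
        return "OPEN"

    def call(self, fn: Callable[[], object]) -> object:
        if self.state == "OPEN":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call in HALF-OPEN, or too many failures in CLOSED, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None          # success closes the circuit
        return result


def failing_call() -> object:
    raise TimeoutError("dependency too slow")


now = [0.0]                              # fake clock so the example is deterministic
breaker = CircuitBreaker(failure_threshold=2, recovery_timeout=30.0, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(failing_call)
    except TimeoutError:
        pass
state_after_failures = breaker.state     # circuit has opened
now[0] = 31.0
state_after_timeout = breaker.state      # recovery timeout elapsed
breaker.call(lambda: "ok")
state_after_success = breaker.state      # trial call succeeded
```

Injecting the clock keeps the recovery timeout testable without sleeping.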

Least Privilege¶

Every identity — human or machine — gets exactly the permissions it needs for exactly as long as it needs them.

IAM Principles¶

❌ "just give it admin for now, we'll fix it later"
   → You won't. And the blast radius is maximum.

✅ Start with nothing. Add only what's needed. Verify with Access Analyzer.

Role per service, not shared roles:

# ✅ Each service gets its own role with minimal permissions
resource "aws_iam_role" "payments_service" {
  name = "payments-service-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRoleWithWebIdentity"
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.eks.arn }
      Condition = {
        StringEquals = {
          # The condition key is the issuer URL without the https:// scheme
          "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:sub" = "system:serviceaccount:payments:payments-api"
        }
      }
    }]
  })
}

# Only what payments needs
resource "aws_iam_role_policy" "payments_service" {
  role = aws_iam_role.payments_service.id
  policy = jsonencode({
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:UpdateItem"]
        Resource = "arn:aws:dynamodb:*:*:table/payments-*"
      },
      {
        Effect   = "Allow"
        Action   = ["secretsmanager:GetSecretValue"]
        Resource = "arn:aws:secretsmanager:*:*:secret:payments/*"
      }
    ]
  })
}

No long-lived keys — use roles everywhere:

AWS:   IAM Roles for EC2 / IRSA for EKS / execution roles for Lambda
GCP:   Workload Identity Federation — no service account keys
Azure: Managed Identity — no client secrets in code

Resource-level permissions, not *:

// ❌ Way too broad
{ "Action": "s3:*", "Resource": "*" }

// ✅ Scoped to exactly what's needed
{ "Action": ["s3:GetObject", "s3:PutObject"],
  "Resource": "arn:aws:s3:::company-payments-uploads/*" }

Permission Boundaries:

# Prevent delegated admins from escalating beyond their boundary
resource "aws_iam_policy" "developer_boundary" {
  name = "developer-permission-boundary"
  policy = jsonencode({
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ec2:*", "s3:*", "lambda:*", "dynamodb:*"]
        Resource = "*"
      },
      {
        # Even if developer creates a role, it can't have IAM admin
        Effect   = "Deny"
        Action   = ["iam:CreateUser", "iam:AttachRolePolicy", "organizations:*"]
        Resource = "*"
      }
    ]
  })
}
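The "resource-level permissions, not *" rule from this section can be checked mechanically, for example in CI. The sketch below is a minimal linter over a policy document expressed as a Python dict. `find_wildcards` is a hypothetical helper and deliberately ignores `NotAction`, `NotResource`, and conditions.

```python
def find_wildcards(policy: dict) -> list[str]:
    """Return one finding per wildcard action or resource in Allow statements."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):          # a single statement may appear without a list
        statements = [statements]
    for index, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        for action in actions:
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {index}: wildcard action {action}")
        if "*" in resources:
            findings.append(f"statement {index}: wildcard resource *")
    return findings


too_broad = {"Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}
scoped = {"Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::company-payments-uploads/*",
}]}
broad_findings = find_wildcards(too_broad)
scoped_findings = find_wildcards(scoped)
```

A check like this complements IAM Access Analyzer rather than replacing it, since it only catches the most obvious wildcards.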

Just-in-Time Access¶

Humans should not have persistent elevated access. Elevate when needed, revoke automatically.

AWS: IAM Identity Center → temporary role assumption with time-limited sessions
GCP: Privileged Access Manager → just-in-time grants with approval workflow
Azure: Privileged Identity Management (PIM) → activate role for N hours, then expires
HashiCorp Vault: dynamic secrets → short-lived credentials generated on-demand

# Vault dynamic AWS credentials — expires in 1 hour, never stored
vault read aws/creds/prod-readonly

# Key         Value
# lease_duration  1h
# access_key  ASIAXXX...
# secret_key  xxx...
# (automatically revoked after 1h)
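The core idea behind all of these tools is a grant that carries its own expiry. The sketch below models that with an in-memory broker. `JitAccess` and its methods are hypothetical and stand in for whichever service actually issues the credentials.

```python
import time
from typing import Callable


class JitAccess:
    """In-memory stand-in for a just-in-time access broker (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._grants: dict[tuple[str, str], float] = {}   # (user, role) -> expiry timestamp

    def grant(self, user: str, role: str, ttl_seconds: float) -> None:
        self._grants[(user, role)] = self._clock() + ttl_seconds

    def is_allowed(self, user: str, role: str) -> bool:
        expiry = self._grants.get((user, role))
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._grants[(user, role)]    # expired grants are revoked on first check
            return False
        return True


now = [0.0]                                    # fake clock so the example is deterministic
broker = JitAccess(clock=lambda: now[0])
broker.grant("alice", "prod-readonly", ttl_seconds=3600)
allowed_during = broker.is_allowed("alice", "prod-readonly")
now[0] = 3600.0
allowed_after = broker.is_allowed("alice", "prod-readonly")
```

No one has to remember to revoke anything: once the TTL elapses, access simply stops working.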

Audit Everything¶

Least privilege only works if you know what's being used and can detect misuse.

✅ CloudTrail / GCP Cloud Audit Logs / Azure Activity Log — log every management API call (enable data-level logging where you need it)
✅ AWS IAM Access Analyzer — flags policies that allow public or cross-account access
✅ AWS Access Advisor — shows when each service was last accessed, so you can prune anything unused for 90+ days
✅ GCP Policy Analyzer / Policy Simulator — check what a principal can access and simulate a policy change before applying it
✅ Alert on: root login, new IAM user created, policy attached to user (not role), SCP changed

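# List when each service was last accessed by the role (use to prune unused permissions)
# The report is generated asynchronously: re-run the second command while JobStatus is IN_PROGRESS
JOB_ID=$(aws iam generate-service-last-accessed-details \
  --arn arn:aws:iam::123456789012:role/payments-service \
  --query 'JobId' --output text)
aws iam get-service-last-accessed-details --job-id "$JOB_ID"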
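Once the last-accessed report is available, picking the pruning candidates is a simple filter. The sketch below assumes each entry carries `ServiceNamespace` and an optional `LastAuthenticated` timestamp, already parsed into timezone-aware `datetime` objects. The sample data is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def unused_services(services_last_accessed: list[dict], now: datetime, days: int = 90) -> list[str]:
    """Service namespaces never accessed, or not accessed within the last `days` days."""
    cutoff = now - timedelta(days=days)
    unused = []
    for svc in services_last_accessed:
        last = svc.get("LastAuthenticated")    # missing when the service was never accessed
        if last is None or last < cutoff:
            unused.append(svc["ServiceNamespace"])
    return sorted(unused)


# Made-up report entries, for illustration only.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
report = [
    {"ServiceNamespace": "dynamodb", "LastAuthenticated": datetime(2024, 5, 20, tzinfo=timezone.utc)},
    {"ServiceNamespace": "s3", "LastAuthenticated": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"ServiceNamespace": "ec2"},
]
candidates = unused_services(report, now)
```

Here `s3` and `ec2` are candidates for removal from the role, while `dynamodb` is still in active use.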

Scalability Patterns¶

Stateless Services¶

State is the enemy of horizontal scale. A stateless service can run N copies with zero coordination.

❌ Stateful: session data stored in process memory
   → User hits server A → session exists
   → Load balancer sends next request to server B → session not found → 401

✅ Stateless: session data in Redis
   → Server A writes session to Redis
   → Server B reads session from Redis
   → Any server can handle any request

Rule: If you can kill and replace any instance without a user noticing, your service is stateless.
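The difference is easy to see in a toy model. In the sketch below, `SessionStore` is a plain dictionary standing in for Redis, and `Server` keeps no session data of its own. Both classes are assumptions made for illustration.

```python
from typing import Optional


class SessionStore:
    """Stand-in for Redis: a shared key-value store that lives outside any server process."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def set(self, session_id: str, session: dict) -> None:
        self._data[session_id] = session

    def get(self, session_id: str) -> Optional[dict]:
        return self._data.get(session_id)


class Server:
    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store                      # no session state is kept on the instance

    def login(self, session_id: str, user: str) -> None:
        self.store.set(session_id, {"user": user})

    def handle(self, session_id: str) -> int:
        return 200 if self.store.get(session_id) is not None else 401


store = SessionStore()
server_a = Server("A", store)
server_b = Server("B", store)
server_a.login("s1", "alice")
status_on_b = server_b.handle("s1")             # B can serve a session created on A
del server_a                                    # "kill" instance A
status_after_kill = server_b.handle("s1")       # the session survives
```

The session outlives the instance that created it, which is exactly what the kill-and-replace rule requires.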

Horizontal vs. Vertical Scaling¶

Vertical (scale up):   t3.medium → t3.xlarge → t3.2xlarge
  ✅ Simple, no code changes
  ❌ Single point of failure, has a ceiling, expensive

Horizontal (scale out): 2 instances → 10 instances → 100 instances
  ✅ No ceiling, fault-tolerant, cost-linear with load
  ❌ Requires stateless design, needs load balancer

Always design for horizontal. Vertical is a short-term fix.

Auto-Scaling¶

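# AWS Auto Scaling Group (launch template and subnets omitted for brevity)
resource "aws_autoscaling_group" "api" {
  min_size         = 2      # always at least 2 for HA
  max_size         = 20
  desired_capacity = 4
}

# Target tracking is configured in a separate scaling policy, not inside the ASG

# Scale on CPU — simple and reliable
resource "aws_autoscaling_policy" "api_cpu" {
  name                   = "api-cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.api.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60.0   # keep average CPU near 60%
  }
}

# Or scale on custom metric (SQS queue depth, request rate, etc.)
resource "aws_autoscaling_policy" "api_queue_depth" {
  name                   = "api-queue-depth-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.api.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    customized_metric_specification {
      metric_name = "ApproximateNumberOfMessagesVisible"
      namespace   = "AWS/SQS"
      statistic   = "Average"

      metric_dimension {
        name  = "QueueName"
        value = "orders"   # hypothetical queue name
      }
    }
    target_value = 100  # scale to keep queue depth ~100
  }
}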
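Target tracking can be reasoned about with simple proportional arithmetic: capacity scales with the ratio of the observed metric to the target. The function below is a simplified model of that idea, not the exact algorithm AWS runs, which also involves alarms, warm-up, and cooldown behavior. The bounds match the `min_size` and `max_size` above.

```python
import math


def desired_capacity(current_instances: int, current_metric: float, target_value: float,
                     min_size: int = 2, max_size: int = 20) -> int:
    """Proportional estimate of the capacity needed to bring the metric back to target."""
    raw = math.ceil(current_instances * current_metric / target_value)
    return max(min_size, min(max_size, raw))


scale_out = desired_capacity(4, current_metric=90.0, target_value=60.0)    # 4 * 90 / 60 = 6
scale_in = desired_capacity(4, current_metric=30.0, target_value=60.0)     # 4 * 30 / 60 = 2
capped = desired_capacity(20, current_metric=95.0, target_value=60.0)      # clamped to max_size
```

The last case is the one to watch in load tests: when the estimate exceeds `max_size`, the group cannot add capacity and the metric stays above target.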

CQRS — Command Query Responsibility Segregation¶

Separate read and write paths. Reads scale independently from writes.

Write path:  POST /orders  → Primary DB (strong consistency, ACID)
Read path:   GET  /orders  → Read replica / ElastiCache (eventual consistency OK)

                    ┌──── Write ────┐     ┌──── Read ────┐
App Server A ──────►│  Primary DB   │────►│  Replica 1   │◄─── App Server B
                    │  (1 instance) │     │  (N instances)│
                    └───────────────┘     │  Replica 2   │
                                          │  ElastiCache  │
                                          └──────────────┘
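At the application layer, the split shown above usually comes down to a routing decision per request. The sketch below assumes HTTP semantics (safe methods go to replicas, everything else to the primary) and round-robin replica selection. `QueryRouter` and the endpoint names are hypothetical.

```python
import itertools


class QueryRouter:
    """Sends writes to the primary and spreads reads across replicas round-robin."""

    READ_METHODS = {"GET", "HEAD"}

    def __init__(self, primary: str, replicas: list[str]) -> None:
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, method: str) -> str:
        if method.upper() in self.READ_METHODS:
            return next(self._replicas)        # eventual consistency is acceptable here
        return self.primary                    # writes need the strongly consistent primary


router = QueryRouter("primary", ["replica-1", "replica-2"])
read_targets = [router.route("GET") for _ in range(3)]
write_target = router.route("POST")
```

Reads that must observe a write the same user just made should be pinned to the primary, because replicas may lag behind it.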

Queue-Based Load Leveling¶

Don't let spikes hit your database or downstream services directly. Queue absorbs the spike; workers drain at a controlled rate.

❌ Direct call under load spike
   10,000 req/s → DB → DB overwhelmed → timeouts cascade

✅ Queue absorbs spike
   10,000 req/s → SQS Queue → 100 workers → DB at 100 req/s
   Queue depth grows during spike, drains when load normalizes

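import json
import boto3

sqs = boto3.client("sqs")
# queue_url, order, and process_order are placeholders defined elsewhere

# Producer: fast, just enqueues
sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(order))

# Consumer: controlled pace with backpressure
while True:
    messages = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in messages.get('Messages', []):
        process_order(msg)               # controlled DB write rate
        # Delete only after successful processing so failed messages are redelivered
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])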
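The leveling effect can be checked with a second-by-second simulation. The sketch below uses small made-up numbers (a two-second spike of 500 requests per second, drained at 100 per second) and a hypothetical `simulate_queue` helper.

```python
def simulate_queue(arrivals_per_second: list[int], drain_rate: int) -> list[int]:
    """Queue depth at the end of each second, given arrivals and a fixed drain rate."""
    depth = 0
    depths = []
    for arrived in arrivals_per_second:
        depth += arrived
        depth -= min(depth, drain_rate)     # workers never process more than drain_rate
        depths.append(depth)
    return depths


# Two seconds of spike, then eight quiet seconds.
arrivals = [500, 500, 0, 0, 0, 0, 0, 0, 0, 0]
depths = simulate_queue(arrivals, drain_rate=100)
```

The depth peaks at 800 after the spike and reaches 0 eight seconds later, while the database never sees more than 100 writes per second. The cost is latency: the last request of the spike waits about eight seconds before it is processed.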

Caching Strategy¶

Layer 1 — Browser / CDN (edge)
  ✅ Static assets: JS, CSS, images — long TTL (1 year with content hash)
  ✅ Public API responses: Cache-Control: max-age=60

Layer 2 — Application cache (Redis / Memcached)
  ✅ Hot database rows: user profiles, product catalog
  ✅ Expensive computed results: recommendation scores, aggregates
  ✅ Sessions

Layer 3 — Database (buffer cache and indexes)
  ✅ PostgreSQL: pg_stat_statements to identify hot queries
  ✅ Covering indexes allow index-only scans and cut disk reads

Cache invalidation strategies:
  TTL-based: simple, may serve stale data
  Write-through: update cache on every write (consistent, more writes)
  Cache-aside: app reads from cache; on miss, loads from DB and populates cache
  Event-driven: Kafka/SNS event → Lambda invalidates cache entry on change
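Cache-aside combined with a TTL and explicit invalidation can be sketched as follows. The `CacheAside` class, the loader function, and the injectable clock are assumptions for illustration. In production the dictionary would be Redis or Memcached.

```python
import time
from typing import Any, Callable


class CacheAside:
    def __init__(self, load: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[Any, float]] = {}   # key -> (value, expires_at)

    def get(self, key: str) -> Any:
        entry = self._entries.get(key)
        if entry is not None and self._clock() < entry[1]:
            return entry[0]                                  # cache hit
        value = self._load(key)                              # miss or expired: read from the DB
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)                         # e.g. called from a change event


db_reads: list[str] = []


def load_user(key: str) -> dict:
    db_reads.append(key)                                     # count simulated database reads
    return {"id": key}


now = [0.0]                                                  # fake clock for determinism
cache = CacheAside(load_user, ttl_seconds=60, clock=lambda: now[0])
cache.get("u1")
cache.get("u1")
reads_after_hit = len(db_reads)          # second call was served from the cache
now[0] = 61.0
cache.get("u1")
reads_after_expiry = len(db_reads)       # TTL expired, so the DB was read again
cache.invalidate("u1")
cache.get("u1")
reads_after_invalidate = len(db_reads)   # explicit invalidation forces a reload
```

The TTL bounds how stale an entry can get, and `invalidate` gives event-driven code a way to shorten that window for keys that just changed.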

Database Read Scaling¶

Single DB → Primary + Read Replica → Primary + N Replicas → Connection Pool
                                                                    ↓
                                                           PgBouncer / ProxySQL
                                                           pools connections (ProxySQL can also route reads to replicas)

Scale order:
1. Add read replicas (enough for most read-heavy apps)
2. Add connection pool (handles connection exhaustion)
3. Add cache layer (handles hot-spot reads)
4. Partition large tables (handles storage / write throughput)
5. Only then: consider sharding

Design Principles Summary¶

Principle	Implementation
Blast radius	Account isolation, cell architecture, bulkheads, circuit breakers, canary deployments
Least privilege	Role per service, no wildcards, IRSA/Workload Identity, permission boundaries, JIT access, access advisor
Stateless	Sessions in Redis, config from env/secrets manager, no local disk state
Horizontal scale	Auto-scaling groups, stateless services, read replicas, queue-based leveling
Immutable infrastructure	Replace, don't modify — new AMI/image per deploy, no SSH into prod
Fail fast	Circuit breakers, timeouts on every external call, health checks
Observability first	Metrics, logs, traces in place before launch — not after the first incident
Automate everything	If you do it twice, automate it. If it's manual, it will be wrong eventually.

Pre-Launch Architecture Checklist¶

Blast Radius
- [ ] Production workloads in dedicated accounts/projects
- [ ] Canary / ring deployment strategy defined
- [ ] Circuit breakers on all external service calls
- [ ] Feature flags in place for risky new features
- [ ] Cell or shard boundary identified if needed

Least Privilege
- [ ] No wildcard * resource ARNs in production IAM policies
- [ ] No long-lived access keys for any human or service
- [ ] Service-specific roles — not shared admin roles
- [ ] IAM Access Analyzer run — zero public or cross-account findings
- [ ] Root account secured, access keys deleted

Scalability
- [ ] Services are stateless — verified by killing an instance during load test
- [ ] Auto-scaling configured with tested scale-out and scale-in behavior
- [ ] Cache layer in front of high-read database queries
- [ ] Connection pooling in place (PgBouncer / RDS Proxy / ProxySQL)
- [ ] Load test run at 3× expected peak — system survived

Operational
- [ ] Runbook exists for every alert
- [ ] Rollback procedure documented and tested
- [ ] Incident response process documented
- [ ] On-call rotation staffed

Well-Architected — pillar framework
Landing Zones — account isolation in practice
SRE Scalability — CAP theorem, load patterns
Platform Security — IAM, secrets, hardening
DBRE Scaling — database-specific scale patterns
Messaging Best Practices — queue-based load leveling
system-design-primer — comprehensive system design reference
awesome-system-design — curated system design resources