Cloud Infrastructure¶
Cloud Fundamentals [B]¶
The big three:
| Provider | Kubernetes | Managed DB | Object Storage | Serverless |
|---|---|---|---|---|
| AWS | EKS | RDS, Aurora | S3 | Lambda |
| GCP | GKE | Cloud SQL, AlloyDB | GCS | Cloud Functions |
| Azure | AKS | Azure SQL, Cosmos | Blob Storage | Azure Functions |
Multi-cloud reality: Most companies pick one cloud. Tooling (Terraform, K8s, OTEL) provides portability where it matters.
Networking Fundamentals [B]¶
VPC / Virtual Network¶
VPC: 10.0.0.0/16
├── Public subnet: 10.0.1.0/24 (internet-facing: ALB, NAT GW)
├── Public subnet: 10.0.2.0/24 (AZ-2)
├── Private subnet: 10.0.10.0/24 (app servers, K8s nodes)
├── Private subnet: 10.0.11.0/24 (AZ-2)
├── Private subnet: 10.0.20.0/24 (databases)
└── Private subnet: 10.0.21.0/24 (databases AZ-2)
Key rules: - Databases NEVER in public subnets - App servers in private subnets, accessed via load balancer - Use NAT Gateway for outbound internet from private subnets - Use Security Groups (instance-level) + NACLs (subnet-level) for access control
DNS¶
- Route53 (AWS), Cloud DNS (GCP), Azure DNS
- Private hosted zones for internal service discovery
- Health check routing — failover to healthy endpoint automatically
Compute [I]¶
EC2 / VM Best Practices¶
- Use Launch Templates (not Launch Configurations)
- Auto Scaling Groups for horizontal scaling
- Spot/Preemptible instances for cost savings on non-critical workloads (40-90% cheaper)
- Always use instance metadata service v2 (IMDSv2) for security
Container Options¶
| Option | Good for | Not good for |
|---|---|---|
| EKS/GKE/AKS | Complex microservices, large teams | Simple single services |
| ECS (AWS) | AWS-native, simpler than K8s | Multi-cloud, complex scheduling |
| Cloud Run (GCP) | Serverless containers, variable load | Long-running jobs |
| Fargate (AWS) | Serverless containers on ECS/EKS | Cost-sensitive high-throughput |
Serverless¶
Lambda/Cloud Functions: great for event-driven, bursty, short-lived work.
Cold start problem: First invocation is slow. Mitigations: provisioned concurrency, keep-alive pings, optimize package size.
Storage [I]¶
Object Storage (S3/GCS)¶
resource "aws_s3_bucket" "data" {
bucket = "my-app-data-prod"
}
resource "aws_s3_bucket_versioning" "data" {
bucket = aws_s3_bucket.data.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
bucket = aws_s3_bucket.data.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
}
}
}
Always enable: versioning, encryption, access logging, block public access.
Block Storage (EBS/Persistent Disk)¶
- gp3 is default for most workloads (better than gp2, same cost or cheaper)
- io2 for high-IOPS databases
- Snapshots for backups → see DBRE: Backup & Recovery
File Storage (EFS/Filestore)¶
- Shared filesystem across multiple instances/pods
- Use for: shared configuration, content uploads, legacy apps
- More expensive than S3; only use when POSIX filesystem is needed
Load Balancing [I]¶
| AWS | Type | Use case |
|---|---|---|
| ALB | Application (L7) | HTTP/HTTPS, path routing, microservices |
| NLB | Network (L4) | TCP, UDP, extreme performance, fixed IP |
| CLB | Classic (deprecated) | Legacy only |
ALB features: - Path-based and host-based routing - Target groups (instances, IPs, Lambda, K8s pods) - Native integration with WAF, Shield, Cognito - HTTP/2, WebSocket support
Cost Management [I]¶
Common Cost Mistakes¶
- Oversized instances (right-size with CloudWatch metrics)
- Unattached EBS volumes after instance termination
- Old snapshots never deleted
- Data transfer costs ignored (same-region: free; cross-region: $$$)
- NAT Gateway overuse (expensive — $0.045/GB processed)
- Unused Elastic IPs
Cost Optimization Strategies¶
- Reserved Instances / Savings Plans — commit 1-3 years, save 40-60%
- Spot Instances — 60-90% discount, handle interruptions
- S3 Intelligent-Tiering — auto moves data to cheaper tiers
- Right-sizing — use AWS Compute Optimizer recommendations
- Lifecycle policies — delete old logs, move to Glacier
Tagging for Cost Attribution¶
locals {
tags = {
Team = "platform"
Service = "my-app"
Environment = "prod"
CostCenter = "eng-platform"
}
}
Enable AWS Cost Allocation Tags → filter billing by team/service.
High Availability Patterns [A]¶
Multi-AZ¶
Run in at least 2 (prefer 3) Availability Zones: - Separate failure domains (power, network, physical) - ALB routes away from unhealthy AZ - RDS Multi-AZ: automatic failover in 1-2 min
Multi-Region¶
For disaster recovery or global user base: - Active-Active: both regions serve traffic (complex, most resilient) - Active-Passive: primary region, failover to secondary (simpler, more downtime) - Cross-region replication: S3, DynamoDB, RDS read replicas
Recovery Objectives¶
| Metric | Definition | Target |
|---|---|---|
| RTO | Recovery Time Objective — max acceptable downtime | Hours for tier-2, minutes for tier-1 |
| RPO | Recovery Point Objective — max acceptable data loss | Hours for tier-2, near-zero for critical |
→ See DBRE: Backup & Recovery for database-specific recovery.
Related Topics¶
- Terraform & IaC — provision all of the above as code
- Kubernetes — EKS/GKE setup
- Security & Hardening — IAM, VPC security
- DBRE: Backup & Recovery
- SRE: Scalability
- book-of-secret-knowledge — AWS/cloud CLI one-liners