Messaging & Queues¶

← Home

Message brokers, event streaming, and queue systems are the connective tissue of distributed systems. This section covers every major system, when to use each, and how to operate them reliably.

Topics¶

Topic	What you'll learn
Kafka	Deep dive — partitions, consumer groups, exactly-once, operations
Best Practices	Do's and don'ts across Kafka, RabbitMQ, SQS, and general messaging
External Links	Official docs, design guides, benchmark references

System Comparison¶

System	Model	Ordering	Retention	Throughput	Best For
Apache Kafka	Log / pub-sub	Per partition	Configurable (days/forever)	Millions/s	Event streaming, audit log, CDC
RabbitMQ	Message queue	Per queue	Until consumed	Tens of thousands/s	Task queues, routing, RPC
Amazon SQS	Queue	Best-effort (FIFO: per group)	Up to 14 days	Unlimited (managed)	Decoupling AWS services, serverless
Amazon SNS	Pub/sub fanout	None	None (fire-and-forget)	Unlimited (managed)	Fan-out to SQS/Lambda/HTTP
Apache Pulsar	Log + queue	Per partition	Configurable	Millions/s	Multi-tenant, geo-replication, Kafka alternative
NATS	Pub/sub / queue	Per stream (JetStream)	JetStream only	Very high	Low-latency microservices, IoT
Google Pub/Sub	Pub/sub	Best-effort (ordering key for strict)	7 days	Unlimited (managed)	GCP event pipelines
Azure Service Bus	Queue + topic	Per session	Up to 14 days	Managed	Azure ecosystem, enterprise messaging
Redis Streams	Log	Per stream	Configurable	High	Lightweight streaming, in-process events

Messaging Patterns¶

Point-to-Point (Queue)¶

One producer, one consumer processes each message. Work distribution. Used by SQS, RabbitMQ queues.

Producer → [Queue] → Consumer A (processes message, removed from queue)
                   → Consumer B (waiting — gets next message)

One producer, many consumers each get a copy. Used by Kafka topics, SNS, Pub/Sub.

Producer → [Topic] → Consumer Group A (analytics)
                   → Consumer Group B (notifications)
                   → Consumer Group C (audit log)

SNS delivers to multiple SQS queues simultaneously. Standard AWS decoupling pattern.

SNS Topic → SQS Queue A (email service)
          → SQS Queue B (audit service)
          → SQS Queue C (analytics)
          → Lambda (real-time processor)

Dead Letter Queue (DLQ)¶

Failed messages go to a DLQ after N retries. Prevents poison messages from blocking queues.

Producer → Queue → Consumer (fails 3×) → DLQ → Alert / Manual review

Outbox Pattern¶

Prevents the dual-write problem: write to DB and message broker atomically.

1. Write to orders table + outbox table in one transaction
2. Outbox worker reads new rows, publishes to Kafka/SQS
3. Delete from outbox after publish confirmed

Eliminates race conditions between DB commit and broker publish.

Claim Check (Large Payload)¶

Kafka/SQS have message size limits (Kafka: 1MB default, SQS: 256KB). Store payload in S3, send only a pointer.

Producer → S3 (full payload) → Kafka message: { "s3_key": "events/2024/order-123.json" }
Consumer → reads Kafka message → fetches from S3

Saga Pattern (Distributed Transactions)¶

Sequence of local transactions, each publishing an event that triggers the next step. On failure: compensating transactions.

Order Service → OrderCreated event
  → Payment Service → PaymentProcessed event
    → Inventory Service → InventoryReserved event
      → Shipping Service → ShipmentCreated event

On failure at any step → compensating events (PaymentRefunded, InventoryReleased)

When to Use What¶

Scenario	Recommended	Why
High-throughput event streaming (millions/s)	Kafka	Durable log, replay, consumer groups scale independently
Audit log / event sourcing	Kafka (compacted topic)	Immutable, replayable, retains history
Change Data Capture (CDC) from DB	Kafka + Debezium	Reads DB binlog, streams changes as events
Task queue — background jobs	RabbitMQ or SQS	Work distribution, one consumer per message
Complex routing (topic/header/fanout)	RabbitMQ	Exchange types give fine-grained routing control
Serverless / AWS-native workloads	SQS + SNS	Zero ops, triggers Lambda, pay-per-use
Fan-out to multiple AWS services	SNS → SQS	One publish, N independent queues with own retry/DLQ
Ordered processing per entity	Kafka (keyed) or SQS FIFO	Kafka: per partition; SQS FIFO: per MessageGroupId
Low-latency microservice messaging (< 1ms)	NATS	In-memory, no persistence overhead by default
Multi-region active-active	Kafka (MirrorMaker 2) or Pulsar	Pulsar has native geo-replication; Kafka needs MirrorMaker
GCP-native event pipeline	Google Pub/Sub	Serverless, integrates with Dataflow, BigQuery
Azure-native messaging	Azure Service Bus	Sessions, DLQ, transactions in the Azure ecosystem
Real-time stream processing + SQL	Kafka + ksqlDB or Flink	Continuous queries over live event streams
Lightweight in-app event bus	Redis Streams	Already have Redis? Streams add durable pub/sub
Request-reply (synchronous)	gRPC / HTTP	Don't use a broker for sync — wrong tool
Small team, no ops capacity	SQS / SNS or managed Kafka (Confluent/MSK)	Remove operational complexity entirely

Fan-out Patterns¶

Fan-out = one event published once, delivered to multiple independent consumers. Critical for decoupling services.

Each downstream service owns its SQS queue. Failures in one service don't affect others.

                    ┌─ SQS: email-service     (retries 3×, DLQ: email-dlq)
                    │
SNS: order.created ─┼─ SQS: inventory-service (retries 5×, DLQ: inventory-dlq)
                    │
                    ├─ SQS: analytics-service (retries 1×, no DLQ — best-effort)
                    │
                    └─ Lambda: fraud-check    (immediate, synchronous fan-out leg)

Why SQS in front of Lambda (not SNS → Lambda directly)? - SQS buffers: Lambda throttling doesn't lose messages - SQS retries independently per queue - SQS DLQ catches Lambda failures; SNS → Lambda has no DLQ

# Subscribe SQS queues to SNS topic (Terraform)
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.order_created.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.email_service.arn
}

# SNS filter policy — email service only gets paid orders
resource "aws_sns_topic_subscription" "email_filtered" {
  filter_policy = jsonencode({
    status = ["paid", "refunded"]
  })
}

Kafka Consumer Groups (fan-out with replay)¶

Each consumer group reads the full topic independently. Unlike SNS/SQS, messages persist — new consumer groups can replay from the beginning.

Kafka Topic: orders
  │
  ├─ Consumer Group: payments-service     (at offset 10,240)
  ├─ Consumer Group: analytics-service    (at offset 10,195 — 45 behind)
  ├─ Consumer Group: notification-service (at offset 10,240)
  └─ Consumer Group: audit-service        (at offset 0 — replaying history)

One team adding a new consumer group never touches other groups. No SNS subscription management needed.

RabbitMQ Fanout Exchange¶

Delivers to all bound queues unconditionally — no routing key used.

                    ┌─ Queue: email-worker
                    │
Fanout Exchange ────┼─ Queue: sms-worker
                    │
                    └─ Queue: push-notification-worker

# Declare fanout exchange
channel.exchange_declare('order.notifications', 'fanout', durable=True)

# Each service binds its own queue — routing key ignored
channel.queue_bind('email-worker',  'order.notifications', routing_key='')
channel.queue_bind('sms-worker',    'order.notifications', routing_key='')
channel.queue_bind('push-worker',   'order.notifications', routing_key='')

# Publish once → all three queues receive a copy
channel.basic_publish(exchange='order.notifications', routing_key='', body=msg)

RabbitMQ Topic Exchange (selective fan-out)¶

Routing key pattern matching. More control than fanout — queues receive only matching events.

Topic Exchange: events
  │
  ├─ Queue: payments  (binding: orders.#)        ← gets orders.created, orders.paid, orders.refunded
  ├─ Queue: shipping  (binding: orders.paid)     ← gets only orders.paid
  └─ Queue: all-events (binding: #)              ← gets everything

channel.exchange_declare('events', 'topic', durable=True)
channel.queue_bind('payments', 'events', routing_key='orders.#')
channel.queue_bind('shipping', 'events', routing_key='orders.paid')

# Only shipping and payments receive this:
channel.basic_publish(exchange='events', routing_key='orders.paid', body=msg)

Fan-out Anti-patterns¶

❌ Shared queue for multiple consumer types
   → one consumer starves the others; tight coupling

❌ SNS → Lambda directly for high-volume fan-out
   → no buffering, Lambda throttling = dropped messages

❌ Broadcasting large payloads to all consumers
   → use claim check: put payload in S3, fan-out the S3 key

❌ Fan-out with a synchronous call chain
   → if one downstream is slow, the whole fan-out blocks
   → async fan-out only; fire and forget per leg

Cloud-Managed vs. Self-Managed¶

	Self-Managed (Kafka / RabbitMQ)	Cloud-Managed (SQS / SNS / Pub/Sub)
Setup	Cluster config, ZooKeeper/KRaft, networking	Zero setup — API only
Ops burden	Broker tuning, partition rebalancing, upgrades	None
Cost	Infra only; high engineering time	Pay per message/GB — can be expensive at scale
Ordering	Strong (per partition)	Weak (SQS standard); strong (SQS FIFO, per group)
Retention	Unlimited	14 days max (SQS), 7 days (Pub/Sub)
Replay	Yes — seek to any offset	No — once consumed, gone
Throughput ceiling	Very high — your infra limit	Effectively unlimited (managed)
Lock-in	None — open source	High — especially Kinesis/Pub/Sub

Learning Path¶

[B] Patterns (pub/sub, queue, fan-out) → SQS/SNS (managed, zero ops)
[I] RabbitMQ (routing, DLX, confirms) → Kafka basics (topics, partitions, consumer groups)
[A] Kafka operations → Exactly-once semantics → Schema Registry → Kafka Streams / Flink

Key Resources¶

awesome-kafka — curated Kafka tools, clients, and resources
awesome-scalability — messaging patterns at scale

← DBRE | ← SRE | ← Platform