Queues, DLQ, retries
Queue semantics, retry strategies, poison message handling, ordering tradeoffs, and the difference between Kafka, SQS, RabbitMQ in practice.
Queues are the load-bearing wall of async systems. They look simple: producer puts, consumer gets. The complexity is in the failure modes. Every production queue I have operated has a story about a poison message that took it down, an ordering bug that corrupted data, or a DLQ that grew silently for weeks. Here is what I would teach a new engineer.
Semantics first
A queue has three delivery guarantees, pick one:
- At-most-once. Fire and forget. Message may be lost. Fastest. Used for telemetry where some loss is acceptable.
- At-least-once. Message delivered one or more times. Standard for SQS, Pub/Sub, RabbitMQ. Requires idempotent consumers.
- Exactly-once. Marketing claim. Almost always implemented as at-least-once + idempotent consumer. Kafka has "exactly-once semantics" via transactional writes, but only within Kafka itself.
Plan for at-least-once. The retries section explains why.
The retry algorithm
A transient failure should be retried. A permanent failure should not. The hard part is telling them apart.
Transient: network timeout, 503, lock conflict, Redis briefly unavailable.
Permanent: validation error, 404 on a deleted resource, malformed JSON, business rule violation.
Code the consumer to distinguish:
def process(msg):
try:
do_work(msg)
except TransientError:
raise # let queue redeliver
except PermanentError as e:
send_to_dlq(msg, reason=str(e))
ack(msg) # do not redeliverWithout this distinction, your DLQ fills with transient failures (useless noise) or your queue fills with permanent failures (consumer poison).
Exponential backoff with jitter
The pattern:
delay = min(MAX_DELAY, BASE * 2 ** attempt)
delay = delay + random.uniform(0, BASE) # jitterWhy jitter? Without it, 1000 consumers all retry at exactly the same time after a downstream blip, creating a thundering herd that takes the downstream down again. Jitter spreads the load.
The AWS Builders Library article on "timeouts, retries, and backoff with jitter" is the canonical reference. They tested several jitter strategies and "full jitter" (random between 0 and the computed delay) won.
After 5-10 attempts, give up and DLQ. Beyond 10 attempts you are very likely retrying a permanent failure that just looks transient.
Visibility timeout vs ack/nack
Two queue paradigms:
Pull + visibility timeout (SQS, Pub/Sub). Consumer polls, gets a message, queue hides it for T seconds. Consumer must call DeleteMessage within T. If consumer dies, message reappears.
Push + ack (RabbitMQ, AMQP). Broker pushes message, expects ack. If channel closes without ack, message is requeued.
Pull is simpler to reason about. Push has lower latency. SQS is pull. Kafka is pull. Most modern systems use pull.
Set visibility timeout to 3x your p99 processing time. Too short and slow processing causes duplicates. Too long and crashes cause big delays.
Poison messages and DLQ
A poison message is one your consumer cannot process. Maybe:
- A code bug where a specific field crashes the deserializer.
- A reference to a deleted parent resource.
- An auth token that expired between enqueue and processing.
Without a DLQ, poison messages block the queue. The consumer pulls, fails, requeues, pulls again, fails again. In an unordered queue this just wastes resources. In an ordered queue (Kafka per-partition), it blocks all subsequent messages forever.
DLQ rules:
- Bound retries. SQS auto-DLQs after N receives. Configure N=5 or N=10.
- Alert on growth. A non-zero DLQ is a bug. Alert when count > 0 for >10 minutes.
- Triage workflow. A human looks at the DLQ daily. Decides: replay, fix and replay, or discard.
- Replay tool. A script that pulls from DLQ and re-enqueues to the main queue. Run after the fix is deployed.
A DLQ without a triage process is a swept-under-rug failure. Engineers learn to ignore it.
Ordering
Three ordering models:
No ordering (SQS standard, Pub/Sub). Messages may arrive in any order. Highest throughput. Use when your consumer is idempotent and order does not matter.
Per-key ordering (Kafka, SQS FIFO, Pub/Sub ordering keys). Messages with the same key are processed in order. Different keys are independent. Use when "all events for user X must be in order, but X and Y are independent."
Global ordering. All messages in one stream, processed serially by one consumer. Lowest throughput. Rare. Useful for write-ahead logs.
Most systems need per-key ordering. Pick the partition key carefully (usually tenant ID or user ID). A bad key choice creates hot partitions where one consumer can't keep up.
Backpressure and load shedding
When a queue grows, something is wrong. Either:
- Producers are too fast.
- Consumers are too slow.
- Consumers are failing.
The queue absorbs short bursts. But an unbounded queue with sustained imbalance is just delayed failure. The producer hits "queue full" or runs out of memory eventually.
Strategies:
- Bounded queue size with rejection. Producer gets "queue full" and decides: drop, retry later, alert.
- Load shedding. Drop low-priority messages when queue depth exceeds threshold. Telemetry first, payments last.
- Autoscale consumers. Scale consumer count on queue depth. Works when consumers are stateless and the bottleneck is consumer CPU.
The AWS Builders Library article on load shedding is required reading.
Kafka vs SQS vs RabbitMQ
SQS. Managed, pay-per-message, no ops. Unordered (or FIFO at lower throughput). Use when you want zero ops and per-key ordering is enough.
Kafka. Self-managed (or MSK / Confluent). Ordered per-partition. Multiple consumers can read the same topic (consumer groups). Retains messages for days, enabling replay. Use when you need high throughput, replay, or multiple consumers.
RabbitMQ. AMQP semantics, push-based, complex routing (exchanges, bindings). Use when you need traditional message broker features (priorities, dead letter exchanges, fanout). Heavier to operate than SQS.
Redis Streams. Lightweight, in-memory with persistence. Good for dev or low-volume production. Don't use for mission-critical durability.
NATS / NATS JetStream. Fast, simple, low-latency. Growing in popularity for cloud-native systems.
Pick based on:
- Volume. <10k/sec: anything. >100k/sec: Kafka or NATS.
- Durability. Critical: Kafka, SQS, RabbitMQ with persistence. Best-effort: NATS core.
- Replay. Needed: Kafka. Not needed: anything.
- Ops appetite. None: SQS. Some: managed Kafka. Lots: self-host anything.
Idempotent consumers in practice
Every consumer must handle being called twice for the same message. Patterns:
Conditional UPDATE. UPDATE orders SET status = 'shipped' WHERE id = $1 AND status = 'paid'. If 0 rows affected, already done.
Insert with ON CONFLICT. INSERT INTO events (id, ...) VALUES (...) ON CONFLICT DO NOTHING. The first insert wins.
Optimistic concurrency. Compare version field. If stale, retry or skip.
Side-effect log. Before calling external service, write to a pending_calls table. After, mark complete. On retry, check table.
The pattern depends on what you're doing. Always have one.
Observability
Track:
- Queue depth. Trending up = problem.
- Oldest message age. Spikes = consumers stuck.
- Consumer lag (Kafka). Per consumer group.
- DLQ depth. Non-zero = investigate.
- Per-message processing latency. Histogram. p99 should be far below visibility timeout.
- Retry count distribution. High retries = downstream flaky.
Alert thresholds depend on your SLA. For most apps: depth > 10x normal, oldest > 5x normal processing time, DLQ > 0 for 10 min.
Operational checklist
Before going live:
- DLQ configured with maxReceiveCount of 5-10.
- Alert on DLQ depth > 0.
- Alert on queue depth > N.
- Visibility timeout = 3x p99 processing time.
- Consumer is idempotent. Tested by manually redelivering messages.
- Retry backoff with jitter on transient errors.
- Permanent errors go to DLQ immediately, not retried.
- DLQ replay tool exists and is documented.
- Runbook for "queue is growing."
Half the production queue incidents I have seen failed one of these checks.
A final opinion
Queues are simple to start, hard to operate well. If you can avoid one, do. A direct synchronous call is easier to debug than a queue. Use queues when you need durability across consumer failures, want to decouple producer and consumer rates, or need to process at a different scale than you receive. Otherwise, just call the function.
Learn more
- Docs
- ArticleAWS Builders Library: timeouts, retries, and backoff with jitterAWS Builders Library
- DocsKafka: the definitive guideApache Kafka
- DocsRabbitMQ: dead letter exchangesRabbitMQ docs
- ArticleDesigning Data-Intensive Applications, ch 11Martin Kleppmann