Queues, DLQ, retries

Async work needs durable queues, bounded retries, and a dead letter queue for the messages that will never succeed.

A queue is how you decouple a producer from a consumer when the consumer is slow, flaky, or down. It buys you durability, smoothing, and back-pressure. It also introduces three problems: retries, ordering, and poison messages.

The retry rule

Retry transient failures with exponential backoff and jitter. After N attempts (typically 5-10), move the message to a dead letter queue (DLQ). Never retry forever.

Without bounded retries, one poison message (a message your code cannot process, such as malformed JSON or a reference to a deleted resource) will retry forever, tie up consumer workers, and back up the queue behind it.
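A minimal in-memory sketch of the rule, not tied to any real broker. The names `drain`, `TransientError`, and `MAX_ATTEMPTS` are hypothetical, and the backoff delay between attempts is left out to keep the control flow visible. Sending non-transient errors straight to the DLQ is one reasonable reading of "retry transient failures", not the only one.

```python
from collections import deque

MAX_ATTEMPTS = 5  # assumed bound; the text suggests 5-10


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def drain(queue, dlq, handler, max_attempts=MAX_ATTEMPTS):
    """Process every (message, attempts) pair; park repeat failures in dlq."""
    while queue:
        message, attempts = queue.popleft()
        try:
            handler(message)
        except TransientError:
            attempts += 1
            if attempts >= max_attempts:
                dlq.append(message)  # bounded: give up after max_attempts
            else:
                queue.append((message, attempts))  # requeue behind other work
        except Exception:
            # Assumption: anything else (e.g. malformed JSON) will never
            # succeed, so retrying it only wastes worker time.
            dlq.append(message)
```

Because a failed message is requeued at the back, healthy messages keep flowing while the flaky one waits for its next attempt.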

Bounded retries with DLQ as the safety net

Backoff math

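Start with 1 second. Double each attempt. Add jitter to prevent a thundering herd. After 5 attempts the delays are 1, 2, 4, 8, 16 seconds (31 seconds total). After 10 attempts with no cap: about 17 minutes total (1023 seconds). Cap each delay at a maximum (60 seconds is common); with that cap, 10 attempts take about 5 minutes (303 seconds) before jitter.

delay = min(MAX, base * 2^attempt) + random(0, base), with attempt counted from 0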
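The formula translates almost directly into code. This is a sketch under two assumptions: `attempt` is counted from 0 (which matches the 1, 2, 4, ... sequence), and the jitter is drawn uniformly. The helper `schedule` is hypothetical and exists only to check the arithmetic without randomness.

```python
import random

BASE = 1.0        # seconds, first delay
MAX_DELAY = 60.0  # seconds, cap on the exponential part


def backoff_delay(attempt, base=BASE, max_delay=MAX_DELAY):
    """Delay before retry `attempt` (0-based): capped doubling plus jitter."""
    capped = min(max_delay, base * 2 ** attempt)
    # Jitter is added after the cap, so the result can exceed max_delay by
    # up to `base` seconds, exactly as the formula is written.
    return capped + random.uniform(0, base)


def schedule(attempts, base=BASE, max_delay=MAX_DELAY):
    """Jitter-free delays for the first `attempts` retries."""
    return [min(max_delay, base * 2 ** n) for n in range(attempts)]
```

With these defaults, `schedule(5)` gives the 1, 2, 4, 8, 16 sequence, and summing `schedule(10)` with and without the cap shows how much the cap shortens the total wait.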

What goes in the DLQ

Messages that failed N times.
Messages older than a TTL (the producer's deadline has passed).
Messages from a deleted or invalidated tenant.

Treat DLQ as an alert source. A growing DLQ is a bug, not normal traffic.
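The three DLQ criteria above can be sketched as a single routing check. The `Message` fields, the `dlq_reason` function, and the reason strings are all hypothetical; a real system would read attempts and deadlines from broker metadata or message headers.

```python
import time
from dataclasses import dataclass


@dataclass
class Message:
    tenant_id: str
    attempts: int    # failed processing attempts so far
    deadline: float  # epoch seconds; assumed to be set by the producer


def dlq_reason(msg, active_tenants, max_attempts=5, now=None):
    """Return why msg belongs in the DLQ, or None if it should be processed."""
    now = time.time() if now is None else now
    if msg.attempts >= max_attempts:
        return "max_attempts"
    if now > msg.deadline:
        return "expired"
    if msg.tenant_id not in active_tenants:
        return "tenant_gone"
    return None
```

Recording the reason alongside the dead-lettered message makes the alert actionable: a spike in "max_attempts" points at a code or dependency bug, while "expired" points at a consumer that is falling behind.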

Ordering

If you need ordering, there are two options:

FIFO queues (SQS FIFO, Kafka per-partition). Order is guaranteed only within a partition key. Slower, more expensive.
Resequence in the consumer. Sort by sequence number, buffer briefly.
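Resequencing in the consumer can be sketched as a small buffer keyed by sequence number. This assumes the producer stamps each message with a gap-free sequence number; the `Resequencer` class is hypothetical, and it omits the timeout a real "buffer briefly" policy would need to skip a message that never arrives.

```python
class Resequencer:
    """Release messages in sequence order, holding early arrivals."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq  # next sequence number to deliver
        self.pending = {}          # seq -> payload, for out-of-order arrivals

    def push(self, seq, payload):
        """Accept one message; return payloads that are now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []  # already delivered or already buffered
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

A message that arrives early is held until the gap before it is filled, then everything contiguous is released at once.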

Most apps do not need order. Idempotent processing handles the rest.
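One way to get idempotent processing is to deduplicate on a message ID. The sketch below keeps the seen IDs in an in-memory set, which is an assumption for illustration only; a real consumer would use a durable store so duplicates are still caught after a restart. `make_idempotent` is a hypothetical helper.

```python
def make_idempotent(handler, seen=None):
    """Wrap handler so each message_id is processed at most once."""
    seen = set() if seen is None else seen

    def wrapped(message_id, payload):
        if message_id in seen:
            return False  # duplicate delivery: skip the side effect
        handler(payload)
        # Record the ID only after success, so a failed attempt can retry.
        seen.add(message_id)
        return True

    return wrapped
```

With this wrapper, a redelivered or retried message is a no-op, which is what makes retries and at-least-once delivery safe.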

My default stack

SQS for AWS shops. Kafka for high-throughput event streams. Redis Streams for low-latency dev work. Always: bounded retries, DLQ, idempotent consumer, monitoring on queue depth and DLQ depth.
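For the SQS case, bounded retries and the DLQ are wired together with the queue's `RedrivePolicy` attribute. The helper below only builds that attribute; `redrive_attributes` and the example ARN are hypothetical, and applying it (for instance through boto3's `set_queue_attributes`) needs real credentials and queue URLs, so that call is shown only as a comment.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build SQS queue attributes that dead-letter after N receives."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        })
    }


# Assumed usage with a boto3 SQS client (not executed here):
#   sqs.set_queue_attributes(QueueUrl=main_queue_url,
#                            Attributes=redrive_attributes(dlq_arn))
```

Here `maxReceiveCount` plays the role of N from the retry rule: once a message has been received that many times without being deleted, SQS moves it to the dead-letter queue.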

Learn more

AWS SQS: Dead-letter queues (AWS docs)
AWS Builders Library: Avoiding overload using load shedding (article)
RabbitMQ: Retry patterns (RabbitMQ docs)