Queues, DLQ, retries
Async work needs durable queues, bounded retries, and a dead letter queue for the messages that will never succeed.
A queue is how you decouple a producer from a consumer when the consumer is slow, flaky, or down. It buys you durability, smoothing, and back-pressure. It also introduces three problems: retries, ordering, and poison messages.
The retry rule
Retry transient failures with exponential backoff and jitter. After N attempts (typically 5-10), move the message to a dead letter queue (DLQ). Never retry forever.
Without bounded retries, one poison message (a message your code cannot process, like malformed JSON or a reference to a deleted resource) will retry forever, tie up consumer workers, and starve the healthy messages queued behind it.
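A minimal in-memory sketch of the rule, assuming a hypothetical message shape (a dict with "body" and "attempts") and a MAX_ATTEMPTS of 5. A real broker tracks the receive count for you and delays each redelivery; this only shows the bound and the hand-off to the DLQ.

```python
import json
from collections import deque

MAX_ATTEMPTS = 5  # assumed bound; the rule above suggests 5-10


def drain(queue, dead_letters, handler):
    """Process every message, allowing at most MAX_ATTEMPTS tries each."""
    while queue:
        message = queue.popleft()
        try:
            handler(message["body"])
        except Exception as exc:
            message["attempts"] += 1
            if message["attempts"] >= MAX_ATTEMPTS:
                # Retries exhausted: park the message instead of looping forever.
                message["error"] = repr(exc)
                dead_letters.append(message)
            else:
                # A real broker would redeliver after a backoff delay.
                queue.append(message)


# One healthy message and one poison message (malformed JSON).
queue = deque([
    {"body": '{"order_id": 1}', "attempts": 0},
    {"body": "{not json", "attempts": 0},
])
dead_letters = []
drain(queue, dead_letters, json.loads)
```

After the run the queue is empty and the poison message sits in dead_letters with 5 recorded attempts, so it no longer competes with healthy traffic.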
Backoff math
Start with 1 second and double the delay on each attempt. Add jitter to prevent a thundering herd. The first 5 delays are 1, 2, 4, 8, and 16 seconds. Left uncapped, 10 attempts add up to about 17 minutes of waiting (1,023 seconds), so cap each delay at a maximum (60 seconds is common).
delay = min(MAX, base * 2^attempt) + random(0, base)
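The formula can be checked with a few lines of Python. This is a sketch under two assumptions: attempts are counted from 0 (so the first delay equals the base), and the names backoff_delay and schedule_without_jitter are illustrative, not from any library.

```python
import random

BASE_SECONDS = 1.0   # first delay
MAX_SECONDS = 60.0   # cap on the exponential part


def backoff_delay(attempt, base=BASE_SECONDS, max_delay=MAX_SECONDS):
    """Delay before retry number `attempt` (0-based): capped exponential plus jitter."""
    exponential = min(max_delay, base * 2 ** attempt)
    jitter = random.uniform(0, base)  # spreads out clients that failed together
    return exponential + jitter


def schedule_without_jitter(attempts, base=BASE_SECONDS, max_delay=MAX_SECONDS):
    """The deterministic part of the schedule, useful for sanity-checking totals."""
    return [min(max_delay, base * 2 ** n) for n in range(attempts)]


first_five = schedule_without_jitter(5)                           # [1.0, 2.0, 4.0, 8.0, 16.0]
uncapped_total = sum(schedule_without_jitter(10, max_delay=float("inf")))  # 1023.0 seconds
capped_total = sum(schedule_without_jitter(10))                   # 303.0 seconds
```

With the 60-second cap, the same 10 attempts wait about 5 minutes in total instead of 17, which is the practical reason for capping.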
What goes in the DLQ
- Messages that failed N times.
- Messages older than a TTL (the producer's deadline has passed).
- Messages from a deleted or invalidated tenant.
Treat the DLQ as an alert source. A growing DLQ is a bug, not normal traffic.
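The three conditions above can be expressed as one routing check. This is a sketch only: the Message fields, the reason strings, and the dlq_reason helper are hypothetical names chosen for illustration. Recording a reason alongside each dead-lettered message makes the resulting alerts easier to triage.

```python
import time
from dataclasses import dataclass
from typing import Optional

MAX_ATTEMPTS = 5  # assumed bound


@dataclass
class Message:
    tenant_id: str
    attempts: int
    deadline: float  # Unix timestamp after which the producer no longer cares


def dlq_reason(msg: Message, active_tenants: set, now: Optional[float] = None) -> Optional[str]:
    """Return why `msg` belongs in the DLQ, or None if it should be processed."""
    now = time.time() if now is None else now
    if msg.attempts >= MAX_ATTEMPTS:
        return "max_attempts_exceeded"
    if now > msg.deadline:
        return "ttl_expired"
    if msg.tenant_id not in active_tenants:
        return "tenant_invalid"
    return None
```

A consumer would call dlq_reason before doing any work and, on a non-None result, forward the message and the reason to the DLQ.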
At-least-once is the only honest semantic
Exactly-once is mostly a lie. Real queues guarantee at-least-once: a message will be delivered one or more times. Your consumer must be idempotent. This is the connection to idempotency keys.
If a worker crashes after doing the work but before acking, the message reappears. Without idempotency, you double-process.
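A minimal sketch of an idempotent consumer, assuming each message carries a stable ID (here a hypothetical message_id string). The in-memory set stands in for a durable store such as a database table with a unique constraint.

```python
class IdempotentConsumer:
    """Runs the handler at most once per message ID, even if the queue redelivers."""

    def __init__(self, handler):
        self._handler = handler
        self._processed = set()  # stand-in for a durable store shared by all workers

    def handle(self, message_id, body):
        """Return True if the work ran, False if this was a duplicate delivery."""
        if message_id in self._processed:
            return False  # already done: ack and skip
        self._handler(body)
        self._processed.add(message_id)
        return True


# Simulate a redelivery after a crash-before-ack.
charges = []
consumer = IdempotentConsumer(charges.append)
first = consumer.handle("msg-1", {"amount": 10})        # True
redelivered = consumer.handle("msg-1", {"amount": 10})  # False
```

In production the side effect and the "processed" record should be committed together (for example in one database transaction). Otherwise a crash between the two steps reopens the same gap.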
Visibility timeout
When a consumer pulls a message, the queue hides it for a "visibility timeout" (SQS default 30s). If the consumer doesn't ack within that window, the message reappears for another consumer.
Set this longer than your typical processing time but not so long that crashes cause big delays. For 10s processing: 60s visibility timeout.
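To make the mechanism concrete, here is a toy in-memory queue with a visibility timeout. It is an illustration only, not the SQS API: the VisibilityQueue class and its methods are hypothetical, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
class VisibilityQueue:
    """Toy queue: a received message stays hidden until it is acked or times out."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message_id -> (body, time at which it becomes visible)

    def send(self, message_id, body):
        self._messages[message_id] = (body, 0.0)

    def receive(self, now):
        """Return the first visible (message_id, body) and hide it, or None."""
        for message_id, (body, visible_at) in list(self._messages.items()):
            if visible_at <= now:
                self._messages[message_id] = (body, now + self.visibility_timeout)
                return message_id, body
        return None

    def ack(self, message_id):
        self._messages.pop(message_id, None)


q = VisibilityQueue(visibility_timeout=60)
q.send("m1", "resize image")
first = q.receive(now=0)         # ("m1", "resize image"); hidden until t=60
hidden = q.receive(now=30)       # None: still inside the visibility window
redelivered = q.receive(now=61)  # ("m1", "resize image"): no ack arrived in time
q.ack("m1")
after_ack = q.receive(now=200)   # None: acked messages are gone for good
```

The redelivery at t=61 is the cost of a crash: the longer the timeout, the longer a message from a dead worker waits before another worker can pick it up.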
Ordering
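Most queues do NOT guarantee order by default. SQS standard queues, Pub/Sub without ordering keys, and RabbitMQ with competing consumers or requeued messages can all deliver out of order. If you need order:
- Ordered queues (SQS FIFO, Kafka per-partition). Order holds only within a message group or partition key. Slower, more expensive.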
- Resequence in the consumer. Sort by sequence number, buffer briefly.
Most apps do not need order. Idempotent processing handles the rest.
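For the cases that do need order, the resequencing option might look like the sketch below. It assumes the producer stamps each message with a gap-free sequence number starting at 0; the Resequencer class is a hypothetical name.

```python
class Resequencer:
    """Buffers out-of-order items and releases them in sequence-number order."""

    def __init__(self, first_seq=0):
        self._next = first_seq  # next sequence number we are allowed to release
        self._pending = {}      # seq -> item, held until its turn comes

    def push(self, seq, item):
        """Accept one item; return the list of items that are now deliverable."""
        if seq < self._next:
            return []  # duplicate of something already released
        self._pending[seq] = item
        ready = []
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready


r = Resequencer()
out_early = r.push(1, "b")      # []: still waiting for seq 0
out_caught_up = r.push(0, "a")  # ["a", "b"]
out_next = r.push(2, "c")       # ["c"]
out_dup = r.push(1, "b")        # []: redelivery of an already released item
```

A production version also needs a timeout for gaps that never fill, which is the "buffer briefly" part: after waiting a bounded time, either skip the missing number or send the stalled items to the DLQ.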
My default stack
SQS for AWS shops. Kafka for high-throughput event streams. Redis Streams for low-latency dev work. Always: bounded retries, DLQ, idempotent consumer, monitoring on queue depth and DLQ depth.
Learn more
- AWS SQS: Dead-letter queues (AWS docs)
- AWS Builders Library: Avoiding overload using load shedding (article)
- RabbitMQ: Retry patterns (RabbitMQ docs)