
Queues, DLQ, retries

Async work needs durable queues, bounded retries, and a dead letter queue for the messages that will never succeed.

A queue is how you decouple a producer from a consumer when the consumer is slow, flaky, or down. It buys you durability, smoothing, and back-pressure. It also introduces three problems: retries, ordering, and poison messages.

The retry rule

Retry transient failures with exponential backoff and jitter. After N attempts (typically 5-10), move the message to a dead letter queue (DLQ). Never retry forever.

Without bounded retries, one poison message (a message your code cannot process, such as malformed JSON or a reference to a deleted resource) will retry forever, tie up consumer workers, and delay the healthy messages queued behind it.
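The retry rule can be sketched as a small loop. This is a minimal in-memory illustration, not a real queue client: consume, handle, and the dead_letters list are hypothetical names, the default of 5 attempts is an assumption, and jitter is left out to keep the loop short.

```python
import json


def handle(body: str) -> dict:
    # Stand-in handler: malformed JSON raises json.JSONDecodeError.
    return json.loads(body)


def consume(body, handler, dead_letters, max_attempts=5, sleep=lambda seconds: None):
    """Try the handler up to max_attempts times, then dead-letter the message.

    `sleep` defaults to a no-op so the sketch runs instantly;
    pass time.sleep to actually wait between attempts.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return handler(body)
        except Exception as exc:  # a real consumer would retry only transient errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(min(60, 2 ** attempt))  # exponential backoff, capped at 60s
    # Out of attempts: park the message instead of retrying forever.
    dead_letters.append(
        {"body": body, "error": repr(last_error), "attempts": max_attempts}
    )
    return None
```

A poison message costs a fixed number of attempts and then lands in dead_letters, so the worker is free to move on to the next message.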

Figure: Bounded retries with DLQ as the safety net.

Backoff math

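Start with a 1-second base delay and double it on each attempt. Add jitter so that consumers that failed together do not all retry at the same instant (the thundering herd). Cap the delay at a maximum (60 seconds is common). The first 5 delays are 1, 2, 4, 8, 16 seconds. Without a cap, 10 attempts add up to 1,023 seconds, about 17 minutes; with a 60-second cap they add up to about 5 minutes, plus jitter. In the formula, attempt counts from 0.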

delay = min(MAX, base * 2^attempt) + random(0, base)
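The formula above translates almost directly into code. A minimal sketch, assuming attempt is 0-based and that base and max_delay are in seconds; the function names backoff_delay and schedule are illustrative.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), with jitter."""
    capped = min(max_delay, base * (2 ** attempt))
    return capped + random.uniform(0, base)


def schedule(attempts: int, base: float = 1.0, max_delay: float = 60.0) -> list[float]:
    """The deterministic part of the delays (no jitter), handy for sanity checks."""
    return [min(max_delay, base * (2 ** n)) for n in range(attempts)]


print(schedule(5))                               # [1.0, 2.0, 4.0, 8.0, 16.0]
print(sum(schedule(10)))                         # 303.0 (60s cap)
print(sum(schedule(10, max_delay=float("inf")))) # 1023.0 (no cap)
```

Keeping the jitter outside the min means the actual wait can exceed the cap by up to base seconds, which matches the formula as written.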

What goes in the DLQ

  • Messages that failed N times.
  • Messages older than a TTL (the producer's deadline has passed).
  • Messages from a deleted or invalidated tenant.

Treat DLQ as an alert source. A growing DLQ is a bug, not normal traffic.
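The three DLQ conditions above can be expressed as one routing check that runs before each processing attempt. This is a sketch under assumptions: the Message fields, the active_tenants set, and the reason strings are all hypothetical, and the order of the checks is a design choice rather than a rule.

```python
import time
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    tenant_id: str
    attempts: int      # deliveries so far
    expires_at: float  # Unix time of the producer's deadline


def dlq_reason(msg, active_tenants, max_attempts=5, now=None):
    """Return why a message belongs in the DLQ, or None if it should be processed."""
    now = time.time() if now is None else now
    if msg.tenant_id not in active_tenants:
        return "tenant_invalid"
    if now >= msg.expires_at:
        return "expired"
    if msg.attempts >= max_attempts:
        return "max_attempts"
    return None
```

Recording the reason alongside the dead-lettered message makes the DLQ more useful as an alert source, because a spike of "expired" points at a slow consumer while a spike of "max_attempts" points at a bug.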

At-least-once is the only honest semantic

Exactly-once delivery is mostly a lie. Real queues guarantee at-least-once: a message will be delivered one or more times. So your consumer must be idempotent: processing the same message twice must have the same effect as processing it once. This is the connection to idempotency keys.

If a worker crashes after doing the work but before acking, the message reappears. Without idempotency, you double-process.
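A minimal sketch of an idempotent consumer, assuming each message carries a unique ID that can serve as the idempotency key. The class name, the in-memory set, and the balance side effect are illustrative; a real consumer would keep the processed IDs in a durable store and record the ID in the same transaction as the side effect.

```python
class IdempotentConsumer:
    def __init__(self):
        self.processed = set()  # assumption: stands in for a durable store
        self.balance = 0        # the side effect we must not apply twice

    def handle(self, message_id: str, amount: int) -> bool:
        """Apply the message once; return False for a redelivery."""
        if message_id in self.processed:
            return False  # already done: safe to ack and skip
        self.balance += amount
        self.processed.add(message_id)
        return True


consumer = IdempotentConsumer()
consumer.handle("m1", 10)  # first delivery: applied
consumer.handle("m1", 10)  # redelivery after a lost ack: skipped
print(consumer.balance)    # 10
```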

Visibility timeout

When a consumer pulls a message, the queue hides it from other consumers for a "visibility timeout" (the SQS default is 30 seconds). If the consumer doesn't ack within that window (in SQS, by deleting the message), the message reappears for another consumer.

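Set this longer than your typical processing time, but not so long that a crash leaves the message hidden for minutes before another consumer can retry it. For processing that takes about 10 seconds, a 60-second visibility timeout is a reasonable starting point.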
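To make the mechanism concrete, here is a toy in-memory queue with a visibility timeout. It is a simulation for illustration only, not the SQS API: the VisibilityQueue class and its method names are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class VisibilityQueue:
    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # msg_id -> (body, time at which it becomes visible)

    def send(self, msg_id, body):
        self._messages[msg_id] = (body, 0.0)

    def receive(self, now: float):
        """Return the first visible message and hide it for the timeout."""
        for msg_id, (body, visible_at) in self._messages.items():
            if visible_at <= now:
                self._messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def ack(self, msg_id):
        """Delete the message so it is never delivered again."""
        self._messages.pop(msg_id, None)


q = VisibilityQueue(visibility_timeout=60)
q.send("m1", "resize image")
print(q.receive(now=0))   # ('m1', 'resize image')
print(q.receive(now=30))  # None: still hidden, the first worker may be busy
print(q.receive(now=60))  # ('m1', 'resize image'): no ack arrived, so it is redelivered
```

The gap between the crash and the redelivery at 60 seconds is the delay the timeout setting trades against premature redelivery.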

Ordering

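Most queues do NOT guarantee order end to end. SQS standard offers only best-effort ordering, Pub/Sub is unordered unless you use ordering keys, and RabbitMQ, although FIFO within a single queue, loses order once you add competing consumers or requeues. If you need order: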

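  • Ordered queues or logs (SQS FIFO, Kafka per partition). Order holds only within a message group (SQS FIFO) or a partition (Kafka), not across the whole queue. Expect lower throughput and, for SQS FIFO, a higher price.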
  • Resequence in the consumer. Sort by sequence number, buffer briefly.
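Resequencing in the consumer can be sketched as a small buffer that releases messages only when the next expected sequence number has arrived. This assumes the producer stamps each message with a gap-free sequence number starting at a known value; the Resequencer class is illustrative, and a real one would also need a timeout so a lost message cannot stall the buffer forever.

```python
class Resequencer:
    def __init__(self, first_seq: int = 0):
        self.next_seq = first_seq
        self.pending = {}  # seq -> item, held until its turn comes

    def push(self, seq: int, item):
        """Accept one message; return the items that can now be released in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate delivery: ignore
        self.pending[seq] = item
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


r = Resequencer()
print(r.push(1, "b"))  # []: waiting for seq 0
print(r.push(0, "a"))  # ['a', 'b']
print(r.push(2, "c"))  # ['c']
```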

Most apps do not need order. Idempotent processing handles the rest.

My default stack

SQS for AWS shops. Kafka for high-throughput event streams. Redis Streams for low-latency dev work. Always: bounded retries, DLQ, idempotent consumer, monitoring on queue depth and DLQ depth.
