Pub-sub patterns
Topics, consumer groups, schemas, ordering, replay, and the organizational implications of event-driven architectures.
Pub-sub is the substrate of modern distributed systems. It is also the source of more architectural confusion than any other pattern I have encountered. Engineers conflate topics with queues, fan-out with work distribution, and event-driven with eventually consistent. This is the longer version of "what is actually going on."
Three messaging models in one diagram
The third model is what most production systems actually use. Kafka invented it. Pub/Sub copied it. SQS+SNS approximates it.
Topics vs queues, precisely
Queue. A message is delivered to exactly one consumer in the consumer set. Use when work needs to be distributed.
Topic. A message is delivered to every subscriber. Use when work needs to be fanned out.
Consumer group on a topic. Within a group, the topic behaves like a queue. Across groups, like a topic. Lets you have "all email-service pods share the work, but the email service and analytics service both see everything."
Push vs pull delivery
Push. The broker initiates delivery. Webhook-style. Lower latency. Requires the subscriber to be available; if not, broker retries or DLQs.
Pull. The subscriber initiates. Polls broker for new messages. Higher latency (poll interval). Subscriber controls flow control, can autoscale.
Modern systems mostly use pull. Push is brittle: subscriber down = message stuck in retry loop. Pull is robust: subscriber down = backlog grows, recovers when subscriber returns.
The exception is webhooks to external systems where you cannot run a pull-based consumer in their infra.
Ordering
Three ordering levels:
No ordering. Any consumer can get any message in any order. Maximum throughput. SQS standard, NATS core.
Per-key ordering. Messages with the same key are processed in order. Different keys are independent. Kafka per partition, Pub/Sub ordering keys.
Total ordering. All messages globally ordered. Single partition, single consumer. Rare; only for very low-throughput audit logs.
Pick per-key. The key is usually the entity ID (user ID, order ID, tenant ID). All events for one entity arrive in order; different entities are processed in parallel.
Key choice matters. A poorly chosen key creates hot partitions where one consumer is overwhelmed while others sit idle.
Schemas: the missing discipline
Pub-sub couples producer and consumer through the message format. With no schema, a producer change silently breaks subscribers.
Real story: team A renames customer_id to cust_id. Tests pass. Deploy. Team B's analytics pipeline silently drops every event because the field name no longer matches. Days later someone notices.
Defense:
- Schema registry. Confluent's is the standard. Producers must register schemas. Consumers fetch schemas to deserialize.
- Format with built-in schema. Avro and Protobuf. JSON has no schema; you need JSON Schema bolt-on.
- Compatibility rules. Backward (new consumer reads old data), forward (old consumer reads new data), or full (both). Most teams choose backward or full.
Compatibility rules in practice:
- Adding a field with a default = backward compatible.
- Removing a field = breaks backward compatible.
- Renaming a field = breaks everything.
- Changing a type = breaks everything.
The registry enforces this on publish. Reject incompatible changes at deploy time, not at runtime.
Replay
A killer feature of log-based pub-sub (Kafka): messages are retained for days or weeks. A new consumer can start from the beginning and process the entire history.
Use cases:
- Bootstrapping a new service. New analytics service reads from offset 0, builds historical state.
- Backfilling after a bug. Bug in consumer mis-processed events. Fix, reset offset, replay.
- A/B testing data pipelines. Run new pipeline on old data, compare to current.
This is what makes Kafka different from a queue. A queue is delete-on-consume. Kafka is append-only log with consumer-owned offsets.
Event-driven architecture
A style where services communicate by emitting events rather than calling each other directly.
Old way: Order Service calls Email Service, Analytics Service, Inventory Service synchronously.
New way: Order Service publishes order.created. All three services subscribe.
Pros:
- Loose coupling. Add a new subscriber without changing the publisher.
- Resilience. Subscriber down = backlog, not error.
- Audit log. The event stream is a complete history.
Cons:
- Hard to reason about. "What happens when an order is created?" requires checking every subscriber.
- Eventual consistency. Synchronous requests are easy: it either succeeded or didn't. Async events: it succeeded somewhere, might still be processing somewhere else.
- Schema discipline becomes critical. Without it, you have chaos.
- Debugging is harder. Trace IDs through async events need work.
EDA is right for some systems and wrong for others. It is not a default. It is a tool.
Event sourcing vs event-driven
Often confused. They are different.
Event-driven architecture. Services use events for communication. State still lives in databases.
Event sourcing. The event log IS the state. State is rebuilt by replaying events. See the event sourcing section.
You can do EDA without event sourcing. You can do event sourcing without exposing events to other services. Most teams should start with EDA only.
At-least-once delivery + idempotent consumers
Pub-sub systems guarantee at-least-once. Consumers will see the same message twice during failovers. The consumer must be idempotent (see idempotency section).
This applies even to "exactly-once" systems. Kafka EOS is exactly-once within Kafka (producer write + consumer read + commit are atomic). Once your consumer side-effects leave Kafka (database write, HTTP call), you're back to at-least-once. Plan for it.
Consumer lag
The most important metric for pub-sub:
- Lag = latest offset - committed offset. How many unprocessed messages behind.
- Lag time = time since the message at committed offset was produced. How fresh is the consumer.
Spiking lag means the consumer can't keep up. Causes:
- Consumer slow (downstream bottleneck).
- Consumer down (pods crashed).
- Producer burst (sudden spike).
- Repartitioning (consumer was kicked from group, rejoining).
Alert on lag > threshold. The threshold depends on your SLA. For real-time systems: lag > 10s = alarm. For analytics: lag > 5min = alarm.
Operational considerations
DLQ for poison messages
Same as queues. Consumer fails N times on a message? DLQ it. Don't block the partition.
Kafka complication: DLQ is just another topic. You need to write the DLQ consumer separately.
Partition rebalancing
When a consumer joins or leaves a group, Kafka rebalances partition assignments. During rebalance, no consumption happens. Frequent rebalancing = throughput drops.
Causes:
- Consumer crashes (kicks them out, triggers rebalance).
- New consumer pod (rolling deploy).
- Consumer takes too long between polls (heartbeat timeout).
Tune: long enough heartbeat to survive GC pauses, short enough to detect dead consumers quickly. session.timeout.ms = 30s is a common starting point.
Cross-region
Kafka Mirror Maker 2 replicates topics across regions. Pub/Sub has cross-region subscribers. Cost: latency, $ for bandwidth.
Default to single-region. Add cross-region only when you have a real need (DR, geographic users).
When NOT to use pub-sub
- Simple request-response. Use HTTP.
- Strong consistency needed. Use distributed transactions or saga (see saga section).
- Single consumer, no fan-out. Use a queue, not a topic.
- Tiny scale. Cron job calling a function is simpler.
A common anti-pattern: "we'll use Kafka for everything." Then you have 50 topics with one consumer each, all of which could have been HTTP calls. Now you have Kafka ops to deal with.
Tools comparison
Kafka. Best for: high throughput, ordered per partition, replay, multiple consumer groups. Cost: ops are real. Self-hosted is hard. Managed (MSK, Confluent Cloud) is expensive but worth it.
Google Pub/Sub. Managed Kafka-alike. Less throughput. Easier ops. Good default for GCP shops.
AWS SNS + SQS. SNS fans out, SQS queues. Compose for "pub-sub with consumer-group semantics." Per-message pricing. Good for low-medium scale on AWS.
NATS / NATS JetStream. Lightweight, fast. Growing. Good for cloud-native low-latency.
Redis Streams / Redis Pub/Sub. Cheap, simple. Not durable enough for serious work. Good for dev or coordination signals.
Apache Pulsar. Newer than Kafka, similar feature set, better multi-tenancy. Less mature ecosystem.
What I would recommend
For a new system at small to medium scale:
- Start with HTTP for synchronous calls and SNS+SQS for events.
- Add a schema registry (or at minimum, Protobuf in all messages).
- Define DLQs.
- Monitor lag.
Move to Kafka when:
- You need replay.
- You need >10k msg/sec sustained.
- You have multiple consumer groups for the same event stream.
Premature Kafka adoption is one of the most common architectural mistakes. The ops cost is real. Don't pay it before you need to.
Learn more
- DocsKafka definitive guideApache Kafka
- DocsGoogle Pub/Sub overviewGoogle Cloud docs
- DocsConfluent: schema registryConfluent docs
- TalkMartin Kleppmann: a log-structured history of computingMartin Kleppmann
- ArticleDesigning Data-Intensive Applications, ch 11Martin Kleppmann