Webhook ingestion (Spur question)

How to build a production webhook receiver that handles 10k events/sec, verifies signatures, dedupes, and survives provider retries.

Webhooks look like a free lunch. The provider tells you when something changes; you avoid polling; everyone wins. Then you ship it to production and discover ten failure modes nobody mentioned. This is the playbook I would write for any team building a webhook receiver from scratch.

The architecture in one sentence

A thin synchronous ingestion tier verifies and enqueues; a fat asynchronous processor handles business logic. Anything else is a bug waiting to happen.

Two tiers: stateless verify/enqueue and stateful async processing

Why two tiers

Webhook providers retry aggressively. In live mode, Stripe retries failed deliveries for up to 3 days with exponential backoff. If your handler takes 8 seconds because you're calling a third-party API and the provider's delivery timeout is shorter than that, the provider marks the call failed and retries, even though your handler eventually finished. You process the same event several times. Without dedupe, every retry duplicates the side effect.

The fix is to do nothing slow in the handler. Verify the signature (fast), persist the raw payload (fast), return 200. The processor does the slow work later, asynchronously, with its own retry semantics.

Signature verification

Most webhook providers sign the payload. The signature is typically an HMAC-SHA256 of the raw request body (sometimes prefixed with a timestamp), computed with a shared secret. The header carries the signature, often alongside the timestamp and a scheme version.

Stripe's header looks like: Stripe-Signature: t=1718956800,v1=abc123...

Verification:

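import hashlib
import hmac
import time

def verify(secret, raw_body, header, max_age=300):
    """Check a Stripe-style "t=<unix ts>,v1=<hex sig>" header against the raw body bytes."""
    if isinstance(secret, str):
        secret = secret.encode()  # hmac.new() requires a bytes key
    try:
        # split('=', 1) so a value containing '=' does not break parsing
        pairs = [item.strip().split('=', 1) for item in header.split(',')]
        timestamp = int(next(v for k, v in pairs if k == 't'))
        # During secret rotation the header can carry more than one v1 entry
        signatures = [v for k, v in pairs if k == 'v1']
    except (AttributeError, ValueError, StopIteration):
        return False  # missing or malformed header

    # 1. Reject timestamps more than max_age seconds (5 minutes) from now
    if abs(time.time() - timestamp) > max_age:
        return False

    # 2. Compute expected signature over "<timestamp>.<raw body>" as bytes
    signed_payload = str(timestamp).encode() + b'.' + raw_body
    expected = hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()

    # 3. Constant-time compare against each candidate signature
    return any(hmac.compare_digest(expected.encode(), s.encode()) for s in signatures)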

Three things to never get wrong:

Use the raw body bytes. JSON re-serialization will change whitespace and break the signature. Read the raw body before any parsing middleware touches it.
Use hmac.compare_digest. Regular == leaks timing information. An attacker can guess the signature byte-by-byte by measuring response times.
Verify the timestamp. Without it, an attacker who captures one valid payload can replay it forever. The 5-minute window is Stripe's default.
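
To make the raw-bytes point concrete, here is a small sketch that signs a body using the "timestamp.body" scheme described above, then checks both the original bytes and a re-serialized copy. The secret, the payload, and the sign helper are made up for illustration.

```python
import hashlib
import hmac
import json

def sign(secret: bytes, timestamp: int, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 over "<timestamp>.<raw body>" (hypothetical helper)."""
    signed_payload = str(timestamp).encode() + b"." + raw_body
    return hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()

secret = b"whsec_example"  # made-up secret, not a real key
timestamp = 1718956800
# Note the double space after the comma: providers are free to format JSON as they like.
raw_body = b'{"id": "evt_123",  "type": "invoice.paid"}'

signature = sign(secret, timestamp, raw_body)

# Verifying against the exact bytes received succeeds.
ok_raw = hmac.compare_digest(sign(secret, timestamp, raw_body), signature)

# Parsing and re-serializing normalizes whitespace, so the bytes differ
# and the recomputed signature no longer matches.
reserialized = json.dumps(json.loads(raw_body)).encode()
ok_reserialized = hmac.compare_digest(sign(secret, timestamp, reserialized), signature)

print(ok_raw, ok_reserialized)
```

The payload is semantically identical in both cases; only the byte representation changed, which is all HMAC cares about.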

Persist before ack

The 200 response tells the provider "I have this, don't retry." If you return 200 before durably persisting, and your server crashes 10ms later, the event is lost forever. There is no get-historic-webhook API for most providers.

The persistence target must be durable across pod restarts. Options:

Managed queue (SQS, Pub/Sub, Kafka). Best for high scale. Built-in durability, retries, DLQ.
Postgres raw_events table. Simpler. Just one table, append-only, processed by a worker reading with FOR UPDATE SKIP LOCKED.
Disk + cron. Don't.

I default to SQS or Kafka. Postgres works up to a few hundred events per second.

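# Assumes Flask 2.0+; SECRET, Q and the boto3 sqs client are defined elsewhere
from flask import Flask, Response, request

app = Flask(__name__)

@app.post("/webhook/stripe")
def ingest():
    raw = request.get_data()  # raw bytes, read before any JSON parsing
    sig = request.headers.get("Stripe-Signature", "")

    if not verify(SECRET, raw, sig):
        return Response(status=401)

    try:
        # SQS message bodies must be str, not bytes
        sqs.send_message(QueueUrl=Q, MessageBody=raw.decode("utf-8"))
    except Exception:
        # Queue down, force provider retry
        return Response(status=503)

    return Response(status=200)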

Verify, enqueue, ack. Three steps. Each typically takes a few milliseconds, so a single pod can sustain hundreds of requests per second.

Deduplication

Nearly every webhook provider sends an event ID. Stripe puts it in the payload as id: evt_.... GitHub sends it in a header, X-GitHub-Delivery: <GUID>. Use it.

In the worker, before doing the business logic:

INSERT INTO processed_events (event_id, processed_at)
VALUES ($1, now())
ON CONFLICT (event_id) DO NOTHING
RETURNING event_id;

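If the INSERT returns nothing (conflict), skip. If it returns the ID, process. Run the INSERT in the same transaction as the business write, so that a crash mid-processing rolls back the dedupe row too and the redelivered event is not wrongly skipped.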

This handles two failure modes:

The provider retried even though you returned 200 (race in their system).
Your worker crashed after processing but before acknowledging the queue message, and the queue redelivered.

Combined with idempotent business logic, you get effectively-once semantics from the application's perspective: delivery is still at-least-once, but each event's side effects land only once.
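
A minimal sketch of the worker-side check, using Python's built-in sqlite3 as a stand-in for Postgres so it runs anywhere. The process_once helper, the credits table, and the handler are hypothetical. Since the sketch skips RETURNING, it detects the conflict through rowcount instead. It also puts the dedupe insert and the business write in one transaction, so a failed handler leaves no dedupe row behind.

```python
import sqlite3

def process_once(conn, event_id, handler):
    """Run handler(conn) at most once per event_id; return True if it ran."""
    with conn:  # one transaction: commit on success, rollback on exception
        cur = conn.execute(
            "INSERT INTO processed_events (event_id, processed_at) "
            "VALUES (?, datetime('now')) ON CONFLICT (event_id) DO NOTHING",
            (event_id,),
        )
        if cur.rowcount == 0:  # conflict: this event was already processed
            return False
        handler(conn)          # business write shares the transaction
        return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY, processed_at TEXT)")
conn.execute("CREATE TABLE credits (amount INTEGER)")

def add_credit(c):
    # Hypothetical side effect that must not be applied twice
    c.execute("INSERT INTO credits (amount) VALUES (10)")

first = process_once(conn, "evt_123", add_credit)
second = process_once(conn, "evt_123", add_credit)  # simulated redelivery
total = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM credits").fetchone()[0]
print(first, second, total)
```

The redelivered event is skipped and the credit is applied once. If the handler raises, the whole transaction rolls back and the next delivery gets a clean attempt.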

Ordering

Webhooks are not ordered. You will receive customer.subscription.updated for version 5 before version 3 because of retries, parallel delivery, network jitter. Plan for this.

Two patterns:

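Version field. Use whatever monotonic marker the payload gives you: an explicit version number, an updated_at on the object, or failing that the event's own creation timestamp. Worker checks: if event version <= current DB version, skip.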

UPDATE subscriptions
SET data = $1, version = $2
WHERE id = $3 AND version < $2;

If 0 rows updated, the event was stale. Skip.

Reorder buffer. Stash events for a short window (say, 30 seconds), process in timestamp order. Adds latency, rarely worth it.

I always use the version field approach. It is simpler and works at any scale.
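
Here is a runnable sketch of the version guard handling an out-of-order pair, again with sqlite3 standing in for Postgres. The apply_event helper and the sample data are illustrative. The sketch assumes the row already exists; the first event for a new object would need an insert or upsert, which is left out.

```python
import sqlite3

def apply_event(conn, sub_id, version, data):
    """Apply the event only if it is newer than the stored row; return True if applied."""
    with conn:
        cur = conn.execute(
            "UPDATE subscriptions SET data = ?, version = ? "
            "WHERE id = ? AND version < ?",
            (data, version, sub_id, version),
        )
        return cur.rowcount == 1  # 0 rows updated means the event was stale

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscriptions (id TEXT PRIMARY KEY, data TEXT, version INTEGER)")
conn.execute("INSERT INTO subscriptions VALUES ('sub_1', 'trialing', 1)")
conn.commit()

results = [
    apply_event(conn, "sub_1", 5, "canceled"),  # newest event arrives first
    apply_event(conn, "sub_1", 3, "active"),    # older event arrives late and is skipped
]
final = conn.execute(
    "SELECT data, version FROM subscriptions WHERE id = 'sub_1'"
).fetchone()
print(results, final)
```

The late version-3 event changes nothing, so the row ends at version 5 regardless of arrival order.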

Provider retry policies

Stripe: up to 3 days in live mode, exponential backoff.

GitHub: no automatic retries. Failed deliveries have to be redelivered manually or through the API.

Twilio: 3 attempts over 1 hour.

Most providers retry on any non-2xx response. Some also retry when they get no response within a timeout (10 seconds, for example). Read the docs for every provider.

Implications:

A 5-minute outage of your ingestor results in retries, not lost events. Good.
A bug that returns 200 on a malformed event swallows it forever. Bad.
A 3-day outage might lose old events depending on provider. Plan for backfills.

Backfills and reconciliation

Even with perfect ingestion, you will eventually need to reconcile. Some webhooks will be lost (provider bug, your config wrong, signature secret rotated). For any provider you depend on, build a reconciliation job that pulls the source of truth nightly.

For Stripe: page through GET /v1/events (Stripe only keeps events for 30 days, so don't let the job lapse) and diff against your processed_events table. Replay missing events through your worker.

This is the "trust but verify" pattern. Webhooks are the hot path. Reconciliation is the cold backstop.
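
The diff step itself is small. A sketch, assuming you have already fetched the provider's event IDs into a list and loaded your processed IDs into a set; find_missing is a hypothetical helper, and the fetching and replaying are left out.

```python
from typing import Iterable, List, Set

def find_missing(provider_event_ids: Iterable[str], processed_ids: Set[str]) -> List[str]:
    """Return provider event IDs we never processed, in provider order, without repeats."""
    seen: Set[str] = set()
    missing: List[str] = []
    for event_id in provider_event_ids:
        if event_id not in processed_ids and event_id not in seen:
            seen.add(event_id)
            missing.append(event_id)
    return missing

# Illustrative data: what the provider says happened vs. what we recorded
provider_ids = ["evt_1", "evt_2", "evt_3", "evt_4"]
processed = {"evt_1", "evt_3"}

to_replay = find_missing(provider_ids, processed)
print(to_replay)
```

Feeding the result back through the normal worker path means the replay goes through the same dedupe and ordering checks as live traffic.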

Scaling to 10k events/sec

This is the Spur question. The architecture:

CDN / WAF. Cloudflare in front. Drops obvious garbage (no signature header, wrong content type).
Stateless ingest pods. 20-50 pods behind an ALB. Each handles 200-500 req/s. Memory: minimal. CPU: HMAC is fast, ~50 microseconds per request.
Kafka or SQS. Durable buffer. SQS standard queues handle 10k/sec out of the box (FIFO queues have much lower throughput limits). Kafka scales further.
Worker autoscaler. Scale on queue depth. If backlog grows, spin up workers. Each worker handles 50-200 events/sec depending on business logic complexity.
Postgres for state. processed_events table partitioned by week. subscriptions table updated transactionally with dedupe insert.

Bottlenecks usually hit at the database. Mitigate by batching writes, partitioning hot tables, using read replicas for the reconciliation queries.
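
A quick sanity check of the sizing above, as arithmetic. The per-instance rates are the ones quoted in this section. The backlog scenario at the end (600,000 queued events, a 5-minute drain target, 100 events/sec per worker) is a made-up example, not a recommendation.

```python
import math

TARGET_RATE = 10_000  # events/sec

def instances_needed(target_rate: float, per_instance_rate: float) -> int:
    """Instances required to sustain target_rate at a given per-instance throughput."""
    return math.ceil(target_rate / per_instance_rate)

# Ingest pods at 500 and 200 req/s each
ingest_pods = (instances_needed(TARGET_RATE, 500), instances_needed(TARGET_RATE, 200))

# Workers at 200 and 50 events/s each
workers = (instances_needed(TARGET_RATE, 200), instances_needed(TARGET_RATE, 50))

def workers_to_drain(backlog: int, incoming_rate: float,
                     per_worker_rate: float, drain_seconds: float) -> int:
    """Workers needed to keep up with incoming traffic AND clear the backlog in time."""
    required_rate = incoming_rate + backlog / drain_seconds
    return math.ceil(required_rate / per_worker_rate)

catch_up_workers = workers_to_drain(600_000, TARGET_RATE, 100, 300)
print(ingest_pods, workers, catch_up_workers)
```

The numbers line up with the 20-50 pod range, and they show why the worker tier is the one that needs autoscaling: at the slow end of the per-worker range it takes 200 workers just to keep pace.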

Security threats

Signature replay. Mitigated by timestamp window.

Timing attack on HMAC. Mitigated by constant-time compare.

Compromised secret. Rotate quarterly. Support two valid secrets during rotation windows.

Endpoint enumeration. Use a long random URL like /webhook/stripe/8f3a9c.... Defense in depth.

DoS via valid signatures. A compromised provider account could send millions of valid events. Rate limit per provider account.
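
Supporting two valid secrets during a rotation window can be sketched as below. The matches_any_secret helper and the secrets are hypothetical, and the signed payload is assumed to be built the same way as in the verification section.

```python
import hashlib
import hmac
from typing import Iterable

def matches_any_secret(secrets: Iterable[bytes], signed_payload: bytes, signature: str) -> bool:
    """True if the signature was produced with any currently valid secret."""
    ok = False
    for secret in secrets:
        expected = hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()
        # compare_digest runs for every secret; no early return on the first match
        ok = hmac.compare_digest(expected.encode(), signature.encode()) or ok
    return ok

old_secret = b"whsec_old_example"  # made-up values
new_secret = b"whsec_new_example"
payload = b'1718956800.{"id": "evt_123"}'

# A delivery that the provider signed before it picked up the new secret
sig_from_old = hmac.new(old_secret, payload, hashlib.sha256).hexdigest()

during_rotation = matches_any_secret([new_secret, old_secret], payload, sig_from_old)
after_rotation = matches_any_secret([new_secret], payload, sig_from_old)
print(during_rotation, after_rotation)
```

Once the rotation window closes, drop the old secret from the list and anything still signed with it is rejected.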

Observability

Track:

Ingest success rate (signature valid / total). Drops mean rotated secret or attack.
Queue depth and consumer lag. Rising depth means processor too slow.
Event age (now - timestamp at processing). Spikes indicate downstream issues.
Dedupe rate (skipped / total). High means provider retries or your acks are slow.

Alert on all four.
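
Three of the four are simple ratios or differences over counters you already have; queue depth comes straight from the queue. A sketch with hypothetical counter names and sample numbers, leaving the metrics backend and alert thresholds to you:

```python
from dataclasses import dataclass

@dataclass
class WebhookCounters:
    requests_total: int    # every request that hit the ingest endpoint
    signatures_valid: int  # requests that passed signature verification
    events_processed: int  # events the worker actually handled
    events_deduped: int    # events skipped by the dedupe insert

    def ingest_success_rate(self) -> float:
        """signature valid / total; treated as 1.0 when there is no traffic."""
        if self.requests_total == 0:
            return 1.0
        return self.signatures_valid / self.requests_total

    def dedupe_rate(self) -> float:
        """skipped / total events seen by the worker."""
        seen = self.events_processed + self.events_deduped
        return self.events_deduped / seen if seen else 0.0

def event_age_seconds(event_timestamp: float, now: float) -> float:
    """now - timestamp at processing, clamped at zero for clock skew."""
    return max(0.0, now - event_timestamp)

counters = WebhookCounters(requests_total=1000, signatures_valid=990,
                           events_processed=950, events_deduped=40)
print(counters.ingest_success_rate(), counters.dedupe_rate(),
      event_age_seconds(1718956800, 1718956830))
```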

What I'd build today

For a SaaS receiving Stripe + a few internal webhooks:

Cloudflare → Go-based ingest service (lower per-request overhead than a Python HTTP stack; the HMAC itself is cheap in either language) → SQS → Python workers → Postgres.
Dedupe table partitioned by week, 90-day retention.
Nightly reconciliation job pulling GET /events from each provider.
Grafana dashboard for the four metrics above.
Runbook for the most common failures: secret rotation gone wrong, queue full, processor too slow.

Boring, predictable, scales. That is the whole point.
