
Grafana and Observability - Deep Dive

Prometheus data model, PromQL essentials, log aggregation patterns, distributed tracing, SLO math, and the alerting philosophy that lets you sleep.

Observability is an engineering discipline more than a tooling choice. The tools are commodities. The discipline is knowing what to measure, what to alert on, and how to debug an incident at 2 am. This is the playbook.

Metrics: Prometheus data model

A metric is a name + a set of key-value labels + a numeric value over time. Each unique combination of (name, labels) is a separate time series.

http_requests_total{method="GET", status="200", endpoint="/api/users"} 4823
http_requests_total{method="GET", status="500", endpoint="/api/users"} 12
http_requests_total{method="POST", status="201", endpoint="/api/users"} 982
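
A minimal sketch of that identity rule in plain Python, not tied to any Prometheus client. The `series_key` helper and the dict-based `store` are illustrative assumptions: the point is only that the name plus the label set is the key, so changing any label value creates a new series while label order does not matter.

```python
def series_key(name: str, labels: dict[str, str]) -> tuple:
    """Identity of a time series: metric name plus its sorted label pairs."""
    return (name, tuple(sorted(labels.items())))


store: dict[tuple, float] = {}  # hypothetical in-memory store, one slot per series


def record(name: str, labels: dict[str, str], value: float) -> None:
    store[series_key(name, labels)] = value


record("http_requests_total", {"method": "GET", "status": "200", "endpoint": "/api/users"}, 4823)
record("http_requests_total", {"method": "GET", "status": "500", "endpoint": "/api/users"}, 12)
# Same labels in a different order: this updates the first series, it does not add one.
record("http_requests_total", {"status": "200", "endpoint": "/api/users", "method": "GET"}, 4824)

print(len(store))  # number of distinct series
```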

Metric types:

Counter: monotonically increasing, resetting to zero only when the process restarts (request count, errors, bytes sent). Use rate() to get a per-second rate.
Gauge: arbitrary value that goes up and down (memory in use, queue depth, temperature).
Histogram: buckets a distribution (request latency). Auto-generates _bucket, _sum, _count series. Compute percentiles with histogram_quantile().
Summary: percentiles computed client-side. Cheaper queries, but the precomputed quantiles cannot be aggregated across pods. Prefer histograms.
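
To make the histogram entry concrete, here is a simplified sketch of the interpolation idea behind histogram_quantile(). It is not Prometheus's implementation: it assumes classic cumulative buckets ending in a +Inf bucket, a positive total count, and 0 < q <= 1. The bucket values are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]          # the +Inf bucket holds the total count
    rank = q * total                # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls past the last finite bucket: report that bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds: (le, cumulative count).
latency = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.90, latency))
```

The estimate is only as good as the bucket layout: a quantile can never be resolved more finely than the bucket it lands in, which is why bucket boundaries should bracket your latency targets.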

PromQL essentials

# request rate per service over 5 min
sum by (service) (rate(http_requests_total[5m]))
 
# error rate (errors / total)
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (service) (rate(http_requests_total[5m]))
 
# p95 latency from histogram
histogram_quantile(0.95,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)
 
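# memory used per pod (container!="" skips the aggregate cgroup series to avoid double counting)
sum by (pod) (container_memory_working_set_bytes{namespace="prod", container!=""})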

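The mistake everyone makes: picking a rate window that is too short. On a 15-second scrape interval, rate(counter[1m]) sees only about four samples, so a single missed scrape makes the result noisy, and much shorter windows can return no data at all. Treat 4x your scrape interval as the minimum rate window, and prefer a longer one (such as the 5m used above) for dashboards and alerts.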
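
A rough sketch of why short windows misbehave. `simple_rate` is a hypothetical helper that mimics the core of rate(): it sums increases between consecutive samples, treats a drop as a counter reset, and divides by the time covered. It deliberately skips the boundary extrapolation that the real rate() performs.

```python
from typing import Optional


def simple_rate(
    samples: list[tuple[float, float]], window_start: float, window_end: float
) -> Optional[float]:
    """Per-second increase of a counter from (timestamp_seconds, value) samples."""
    in_window = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(in_window) < 2:
        return None  # fewer than two points: no slope can be computed
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop means the process restarted and the counter began again at 0.
        increase += cur if cur < prev else cur - prev
    return increase / (in_window[-1][0] - in_window[0][0])


# Scrapes every 15 s; the counter resets between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]

print(simple_rate(scrapes, 0, 60))   # five samples: a stable average
print(simple_rate(scrapes, 40, 60))  # two samples: driven by one interval
print(simple_rate(scrapes, 50, 60))  # one sample: no result at all
```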

Cardinality, the silent killer

Each unique label combination is a separate time series, and each active series costs RAM in Prometheus (roughly 3-4 KB resident). 1 million series is 3-4 GB of memory before you run a single query.

Bad labels:

user_id: if you have 1M users, each metric has 1M series.
request_id: every request is a new series. Boom.
path for paths with IDs (/api/users/42): use /api/users/:id instead.
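
One way to fix the path case, sketched under assumptions: `normalize_path` is a hypothetical helper that collapses numeric and UUID-shaped segments into a placeholder. When your web framework exposes the matched route template, using that directly is usually the more reliable option.

```python
import re

# Hypothetical rule set: numeric IDs and UUIDs are treated as identifiers.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def normalize_path(path: str) -> str:
    """Collapse ID-like path segments so the endpoint label stays a small fixed set."""
    segments = []
    for seg in path.split("/"):
        if seg.isdigit() or _UUID.match(seg):
            segments.append(":id")
        else:
            segments.append(seg)
    return "/".join(segments)


print(normalize_path("/api/users/42"))
print(normalize_path("/api/orders/123e4567-e89b-12d3-a456-426614174000/items"))
```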

Good labels:

endpoint: fixed enum.
method: small set.
status_code: small set.
service: small set.

Aim for under 100K active series per Prometheus instance. Beyond that, look at Thanos, Cortex, or Mimir for horizontal scaling.
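
A back-of-the-envelope estimator for the numbers above. The label counts are invented for illustration, and the per-series cost is the midpoint of the 3-4 KB figure, so treat the output as an order-of-magnitude check rather than a capacity plan. The product is a worst case: real series counts are lower when not every label combination occurs.

```python
from math import prod

BYTES_PER_SERIES = 3.5 * 1024  # assumption: midpoint of the 3-4 KB rule of thumb


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def estimated_ram_gb(series: int) -> float:
    return series * BYTES_PER_SERIES / 1024**3


# Hypothetical label sets for a single metric.
good = worst_case_series({"endpoint": 30, "method": 5, "status_code": 10, "service": 20})
bad = worst_case_series({"endpoint": 30, "method": 5, "status_code": 10, "user_id": 1_000_000})

print(f"good labels: {good:,} series, ~{estimated_ram_gb(good):.2f} GB")
print(f"with user_id: {bad:,} series, ~{estimated_ram_gb(bad):,.0f} GB")
```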

Logs: structured, sampled, indexed by metadata

Plain text logs are dead. Use structured JSON: every log line has timestamp, level, service, trace_id, user_id, message, plus context fields.

{"ts":"2026-06-21T12:34:56Z","level":"error","service":"payments","trace_id":"abc123","user_id":"u_42","msg":"stripe webhook failed","err":"signature mismatch"}
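
A minimal sketch of producing that shape with Python's standard logging module. The `JsonFormatter` class and its allow-list of context fields are assumptions for illustration; libraries such as structlog or python-json-logger cover the same ground with more features.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "ts": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": self.service,
            "msg": record.getMessage(),
        }
        # Context fields arrive via `extra=`; this allow-list is a hypothetical choice.
        for field in ("trace_id", "user_id", "err"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="payments"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.error(
    "stripe webhook failed",
    extra={"trace_id": "abc123", "user_id": "u_42", "err": "signature mismatch"},
)
```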

Loki indexes only labels (service, level, region), not log content. This is cheap to operate, but full-text search becomes a sequential scan, so keep queries to short time windows. Elasticsearch makes the opposite trade: it indexes everything (expensive) and gives you instant full-text search.

For high-volume logs: sample. Keep 100 percent of errors, 10 percent of info. Use head sampling at the source.
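
The sampling rule could be sketched as a standard-library logging filter. `SamplingFilter` is a hypothetical name, and the injectable random generator exists only to make the behavior testable. Everything below ERROR is sampled at the same rate here; a real setup might treat warnings separately.

```python
import logging
import random
from typing import Optional


class SamplingFilter(logging.Filter):
    """Keep every record at ERROR or above; keep a fraction of everything else."""

    def __init__(self, info_rate: float = 0.10, rng: Optional[random.Random] = None):
        super().__init__()
        self.info_rate = info_rate
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.ERROR:
            return True  # never drop errors
        return self.rng.random() < self.info_rate


# Attach at the source so dropped lines are never formatted or shipped.
app_logger = logging.getLogger("checkout")
app_logger.addFilter(SamplingFilter(info_rate=0.10))
```

Sampling per line like this breaks up the story of a single request. If that matters, sample on a hash of trace_id instead so a request's lines are kept or dropped together.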

Traces: OpenTelemetry essentials

A trace is a tree of spans. Each span has a name, start/end time, parent span ID, attributes (key-value tags). The root span is the entry point (HTTP request, queue message). Child spans are the work inside.

OpenTelemetry's instrumentation libraries auto-instrument HTTP, gRPC, database drivers, and Kafka clients. Add custom spans for business logic:

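from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_invoice(amount: float, currency: str) -> None:
    # Becomes a child of whatever span is current, e.g. the auto-instrumented HTTP request.
    with tracer.start_as_current_span("process_invoice") as span:
        span.set_attribute("invoice.amount", amount)
        span.set_attribute("invoice.currency", currency)
        do_work()  # placeholder for the business logic being traced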

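Sampling: keep all traces with errors, sample 1-10 percent of successful ones. Head sampling decides when the trace starts, before the outcome is known, so it cannot single out errors. Tail sampling (decide after the trace completes) can, which gives you better coverage of slow/error traces at the cost of buffering spans until the decision is made.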
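
A sketch of a tail-sampling decision in plain Python. In practice this logic usually lives in a collector rather than in application code. The `Span` shape, the 500 ms threshold, and the 5 percent baseline are assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    error: bool = False


def keep_trace(spans: list[Span], slow_ms: float = 500.0, baseline_rate: float = 0.05) -> bool:
    """Decide whether to keep a trace once all of its spans have arrived."""
    if any(s.error for s in spans):
        return True  # always keep traces that contain an error
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True  # always keep slow traces (the longest span is normally the root)
    # Otherwise keep a deterministic fraction keyed on the trace ID, so every
    # replica making this decision agrees on the same trace.
    digest = hashlib.sha256(spans[0].trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate


failed = [Span("t-err", "POST /pay", 120.0), Span("t-err", "stripe.charge", 80.0, error=True)]
slow = [Span("t-slow", "GET /report", 900.0)]
print(keep_trace(failed), keep_trace(slow))
```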

Correlation: the magic

Every log line includes trace_id. Every metric has a service label. Grafana lets you click from a metric panel into the logs of that service in the same time window, then jump from a log line to the trace it belongs to. This three-way correlation is what observability really buys you.

To make it work:

Propagate trace context (W3C traceparent header) across all service boundaries.
Log the trace_id in every structured log.
Tag metrics with consistent labels (service, env, version).
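
For the first step, the traceparent header has four dash-separated hex fields: version, trace ID (32 characters), parent span ID (16 characters), and flags (2 characters). The sketch below parses and re-emits that layout by hand to show what is propagated. The helper names are hypothetical, and in a real service the OpenTelemetry propagators do this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse an incoming traceparent header; return None if it is malformed."""
    m = _TRACEPARENT.match(header.strip())
    if not m:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID as the new parent."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{ctx.trace_id}-{span_id}-{'01' if ctx.sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
print(ctx.trace_id)            # the value to write into every log line
print(child_traceparent(ctx))  # the header to send downstream
```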

SLOs and error budgets

Define what success means: "99.9 percent of HTTP requests complete in under 500 ms with status 2xx or 3xx, measured over a 28-day window."

That gives you an error budget: 0.1 percent of requests may miss the target, whether by failing or by running slow. At 10M requests per window, that is 10,000 bad requests allowed.

Burn rate alerts (Google SRE multi-window-multi-burn-rate):

2 percent of budget burned in 1 hour: page (fast burn, real incident).
5 percent of budget burned in 6 hours: page (slower burn, still real).
10 percent burned in 3 days: ticket (chronic issue, fix in business hours).

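The math: burn rate is the observed error rate divided by the error rate the SLO allows. One hour at a burn rate of 14.4 consumes 14.4 x 1 h / 720 h = 2 percent of a 30-day budget. For a 28-day window (672 hours), the same 2 percent in one hour corresponds to a burn rate of 13.44.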
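
The same arithmetic as a small sketch, assuming a 30-day (720-hour) window, which is what the 14.4 figure implies. The function names are illustrative. A burn rate of 1 means the budget runs out exactly at the end of the window.

```python
def burn_rate(budget_fraction: float, alert_window_h: float, slo_window_h: float) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within `alert_window_h`."""
    return budget_fraction * slo_window_h / alert_window_h


def error_ratio_threshold(slo: float, rate: float) -> float:
    """Observed error ratio at which an alert for this burn rate should fire."""
    return (1 - slo) * rate


SLO = 0.999
WINDOW_H = 30 * 24  # assumption: 30-day window

for fraction, hours in [(0.02, 1), (0.05, 6), (0.10, 72)]:
    rate = burn_rate(fraction, hours, WINDOW_H)
    threshold = error_ratio_threshold(SLO, rate)
    print(f"{fraction:.0%} of budget in {hours} h: burn rate {rate:.1f}, error ratio {threshold:.2%}")
```

So the fast-burn page fires when about 1.44 percent of requests are bad over the last hour, and the three-day ticket fires when the service is merely on track to spend its whole budget.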

Alert philosophy

Alerts come in three flavors:

Page: wake someone up. Reserved for active user-visible incidents.
Ticket: file a Jira; address within a business day.
Email/Slack: informational, no action required.

Anti-patterns:

Alerting on causes (high CPU) instead of symptoms (high latency).
Alerting on single data points (a 200 ms latency spike) instead of sustained windows.
Pages without runbooks. If the page does not link to a "here is how to investigate," it should not page.
Pages that have fired more than 10 times without resolution. Either fix the underlying issue or change the alert.
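
The single-data-point anti-pattern is easy to show in a few lines. This hypothetical check mirrors the idea behind a "for" duration on an alert rule: the condition must hold across several consecutive evaluations before anything fires. The latency values and threshold are invented.

```python
def sustained_breach(values: list[float], threshold: float, min_consecutive: int) -> bool:
    """True only if the most recent `min_consecutive` evaluations all exceed the threshold."""
    if len(values) < min_consecutive:
        return False
    return all(v > threshold for v in values[-min_consecutive:])


# Hypothetical p95 latency in ms, one value per evaluation interval.
one_spike = [120, 130, 900, 125]
real_problem = [120, 620, 700, 810]

print(sustained_breach(one_spike, threshold=500, min_consecutive=3))
print(sustained_breach(real_problem, threshold=500, min_consecutive=3))
```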

Dashboards that work

A dashboard has a purpose. Common templates:

Service overview: RED metrics (rate, errors, duration) for one service, with annotations for deploys.
Cluster overview: capacity, saturation, error rates across all services.
Incident triage: latency by endpoint, error rate by code, recent deploys, traffic source breakdown.
SLO compliance: burn rate, error budget remaining, time to exhaustion.

Anti-pattern: the 40-panel dashboard. Nobody reads it. Three to seven panels per dashboard, each answering one question.

On-call hygiene

The on-call shift is a feedback loop. Every page should result in:

Acknowledged within minutes.
Diagnosed using dashboards, logs, traces.
Mitigated (rollback, scale up, flag off).
Postmortem within 48 hours if user-impacting.
Action items to prevent recurrence.

The postmortem is blameless and records: timeline, impact, root cause, contributing factors, what went well, what went wrong, action items with owners.

What I actually shipped at Binocs

Per-service dashboards with RED metrics, deploy annotations, error log feed embedded.
A "production health" dashboard for the morning standup: SLO status for all services, error budget remaining, recent incidents.
Burn rate alerts piped to Slack #ops and PagerDuty.
A runbook wiki linked from every alert.
Trace-log-metric correlation working end-to-end (took two weeks of plumbing).

The mental shift

Observability is not "more graphs." It is the ability to debug a problem you have never seen before, with the data you already collect. Test it by playing chaos engineer once a month: kill a pod, throttle a network link, fail a dependency. Can you diagnose it from dashboards alone? If not, add the missing telemetry.

Learn more

Docs: Prometheus Documentation (Prometheus)
Docs: Grafana Documentation (Grafana)
Article: Google SRE Book (Google SRE)
Article: Google SRE Workbook (Google SRE)
Docs: OpenTelemetry Documentation (OpenTelemetry)
Article: USE Method (Brendan Gregg)