Grafana and Observability

The three pillars (metrics, logs, traces), the stack I used at Binocs, and how to actually know when production is broken.

Observability is the ability to ask new questions about your system without deploying new code. The three pillars are metrics (numerical aggregates), logs (discrete events), and traces (causal request flow). You need all three.

The stack at Binocs

Metrics: Prometheus scraping /metrics from every pod, federated across clusters.
Logs: Fluent Bit forwarding pod logs to Loki, structured JSON.
Traces: OpenTelemetry SDK in each service, exported to Tempo.
Visualization and alerting: Grafana on top of all three.
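To make "structured JSON" concrete, here is a minimal sketch of a service emitting one JSON object per log line using only Python's standard logging module. The service name, the field names, and the trace_id field are assumptions for illustration, not the actual Binocs schema. A real pod would write to stdout for Fluent Bit to collect; the in-memory buffer here only makes the output easy to inspect.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical field: lets a log line be joined to its trace.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("invoice-service")  # hypothetical service name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stand-in for stdout in this sketch
log = build_logger(buffer)
log.info("invoice processed", extra={"trace_id": "abc123"})
line = json.loads(buffer.getvalue())
```

Carrying a trace identifier on every log line is one common way to jump from a log entry to the matching trace in Grafana.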

The four golden signals

Every service exposes:

Latency: p50, p95, p99 per endpoint.
Traffic: requests per second per endpoint.
Errors: error rate per endpoint, broken down by status.
Saturation: CPU, memory, connection pool, queue depth.

These four cover roughly 90 percent of "is it working?" questions. Add custom business metrics for what your service actually does (invoices processed per minute, payment success rate, and so on).
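As a rough illustration of what the first three signals mean numerically, the sketch below derives them from a window of raw request records. The Request type, the nearest-rank percentile helper, and the sample data are all assumptions made for this example; a production setup would normally compute these from Prometheus histograms and counters rather than raw samples. Saturation is left out because it comes from resource gauges, not request records.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Request:
    endpoint: str
    status: int
    duration_ms: float


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with pct percent of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s):
    """Summarize latency, traffic, and errors for one endpoint over one window."""
    if not requests:
        return None
    durations = [r.duration_ms for r in requests]
    errors = Counter(r.status for r in requests if r.status >= 500)
    return {
        "latency_ms": {p: percentile(durations, p) for p in (50, 95, 99)},
        "traffic_rps": len(requests) / window_s,
        "error_rate": sum(errors.values()) / len(requests),
        "errors_by_status": dict(errors),
    }


def sample_status(i):
    # Synthetic data: a few 503s and 500s among mostly 200s.
    if i % 50 == 0:
        return 503
    if i % 20 == 0:
        return 500
    return 200


# 100 requests in a 10 second window, with durations of 1..100 ms.
window = [Request("/invoices", sample_status(i), float(i)) for i in range(1, 101)]
signals = golden_signals(window, window_s=10)
```

Percentiles are used instead of the mean because a handful of very slow requests barely moves an average but shows up clearly at p99.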

RED method for request-driven services

Rate
Errors
Duration

Track these three metrics per endpoint and dashboard them per service. If something is wrong with a service, RED tells you immediately.
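On the instrumentation side, RED boils down to three counters per endpoint. The sketch below is a hypothetical in-process recorder (the RedRecorder class and its track method are invented for illustration); in practice a Prometheus client library would own these counters and expose them on /metrics.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RedRecorder:
    """Accumulates request count, error count, and total duration per endpoint."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.duration_s = defaultdict(float)

    @contextmanager
    def track(self, endpoint):
        start = time.perf_counter()
        try:
            yield
        except Exception:
            # Count the failure, then let the caller see the exception.
            self.errors[endpoint] += 1
            raise
        finally:
            # Every request counts toward rate and duration, failed or not.
            self.requests[endpoint] += 1
            self.duration_s[endpoint] += time.perf_counter() - start


red = RedRecorder()

with red.track("/invoices"):
    pass  # a successful request handler would run here

try:
    with red.track("/invoices"):
        raise ValueError("simulated handler failure")
except ValueError:
    pass
```

Rate is then the change in the request counter over time, and the error ratio is the error counter divided by the request counter over the same interval.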

USE method for resources

Utilization (percent busy)
Saturation (queue depth)
Errors (per resource)

Apply it to each resource: CPU, memory, disk, network, and DB connections.
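A small sketch of USE applied to a single resource, assuming a database connection pool. The ResourceSample type, its field names, and the numbers are hypothetical; the point is only that the three values are reported separately for each resource.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str
    busy: float      # units in use, e.g. connections checked out
    capacity: float  # total units available
    queued: int      # work waiting for the resource
    errors: int      # failures attributed to this resource


def use_report(sample: ResourceSample) -> dict:
    """Return utilization, saturation, and errors for one resource."""
    return {
        "resource": sample.name,
        "utilization_pct": 100 * sample.busy / sample.capacity,
        "saturation": sample.queued,
        "errors": sample.errors,
    }


# Hypothetical pool: 18 of 20 connections busy, 7 callers waiting, 2 timeouts.
pool = ResourceSample("db_connections", busy=18, capacity=20, queued=7, errors=2)
report = use_report(pool)
```

High utilization alone can be fine; a non-zero queue alongside it is the sign that requests are actually waiting.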

Alerts, the right way

Alert on symptoms (user-visible), not causes. "p99 latency > 1s for 5 minutes" wakes me up. "CPU > 80 percent" does not, unless it correlates with user impact.
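The "for 5 minutes" part matters as much as the threshold: it keeps a single slow scrape from paging anyone. Below is a minimal sketch of that rule, assuming p99 samples arrive as (unix seconds, p99 in seconds) pairs, oldest first. The should_page function is hypothetical and only mimics the idea of a hold duration; it is not how Grafana or Prometheus evaluate rules internally.

```python
def should_page(samples, threshold_s=1.0, hold_s=300):
    """Fire only if p99 has stayed above the threshold for at least hold_s seconds."""
    if not samples:
        return False
    now = samples[-1][0]
    breach_start = None
    for ts, p99 in samples:
        if p99 > threshold_s:
            if breach_start is None:
                breach_start = ts  # start of the current unbroken breach
        else:
            breach_start = None    # any healthy sample resets the clock
    return breach_start is not None and now - breach_start >= hold_s


# One sample per minute for five minutes.
sustained = [(t, 1.2) for t in range(0, 301, 60)]
blip = [(0, 1.2), (60, 1.3), (120, 0.4), (180, 1.2), (240, 1.2), (300, 1.2)]

page_sustained = should_page(sustained)
page_blip = should_page(blip)
```

The sustained breach pages, while the series with a healthy sample in the middle does not, because the breach clock restarts after the recovery.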

Define SLOs, for example: 99.9 percent of requests under 500 ms, 99.95 percent successful. Burn rate alerts (the multi-window approach from Google SRE) catch fast burns (something broke now) and slow burns (we will exhaust the budget this month) separately.
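The arithmetic behind burn rate alerts can be sketched as follows. The burn rate is the observed error ratio divided by the error budget (1 minus the SLO), so a burn rate of 1 spends the budget exactly over the SLO window. The 30-day window, the 14.4 and 1.0 thresholds, and the window pairings (1 hour with 5 minutes, 3 days with 6 hours) are assumptions taken from commonly cited Google SRE guidance, not values stated in this post.

```python
SLO = 0.999        # 99.9 percent of requests should succeed
WINDOW_DAYS = 30   # assumed SLO window


def error_budget_minutes(slo=SLO, window_days=WINDOW_DAYS):
    """Minutes of total failure the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(error_ratio, slo=SLO):
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1 - slo)


def fast_burn(err_1h, err_5m, slo=SLO, threshold=14.4):
    # Both windows must agree: the long one proves the burn is significant,
    # the short one proves it is still happening right now.
    return burn_rate(err_1h, slo) >= threshold and burn_rate(err_5m, slo) >= threshold


def slow_burn(err_3d, err_6h, slo=SLO, threshold=1.0):
    return burn_rate(err_3d, slo) >= threshold and burn_rate(err_6h, slo) >= threshold


budget = error_budget_minutes()             # about 43.2 minutes per 30 days
outage_now = fast_burn(0.02, 0.03)          # 2 percent errors over 1h, 3 percent over 5m
already_fixed = fast_burn(0.02, 0.0005)     # the 5m window has recovered
creeping = slow_burn(0.0015, 0.002)         # low but persistent error ratio
```

At a burn rate of 14.4, one hour consumes 2 percent of a 30-day budget (14.4 × 1 / 720), which is why that threshold is typically treated as page-worthy, while a sustained burn rate just above 1 is usually a ticket rather than a page.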

Learn more

Docs: Grafana Documentation (Grafana)
Docs: Prometheus Documentation (Prometheus)
Article: Google SRE Book: Service Level Objectives (Google SRE)