Grafana and Observability
The three pillars (metrics, logs, traces), the stack I used at Binocs, and how to actually know when production is broken.
Observability is the ability to ask new questions about your system without deploying new code. The three pillars are metrics (numerical aggregates), logs (discrete events), and traces (causal request flow). You need all three.
The stack at Binocs
- Metrics: Prometheus scraping /metrics from every pod, federated across clusters.
- Logs: Fluent Bit forwarding pod logs to Loki, structured JSON.
- Traces: OpenTelemetry SDK in each service, exported to Tempo.
- Visualization and alerting: Grafana on top of all three.
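To make the "structured JSON" part of the logging pipeline concrete, here is a minimal sketch using only the Python standard library. It is not the actual Binocs code: the logger name, the `trace_id` and `endpoint` fields, and the `build_logger` helper are all hypothetical, chosen to show how a log line can carry a trace identifier so that logs and traces can be joined later.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical context fields, passed via logger.info(..., extra={...}).
        for key in ("trace_id", "endpoint"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# In a pod this would be sys.stdout; a buffer is used here so the result can be inspected.
buffer = io.StringIO()
log = build_logger("invoice-service", buffer)
log.info("invoice processed", extra={"trace_id": "abc123", "endpoint": "/invoices"})
line = json.loads(buffer.getvalue())
```

One JSON object per line keeps every field queryable downstream without regex parsing, which is the main reason to bother with structured logs at all.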
The four golden signals
Every service exposes:
- Latency: p50, p95, p99 per endpoint.
- Traffic: requests per second per endpoint.
- Errors: error rate per endpoint, broken down by status.
- Saturation: CPU, memory, connection pool, queue depth.
These four cover roughly 90 percent of "is it working?" questions. Add custom business metrics for what your service actually does (invoices processed per minute, payment success rate, etc.).
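The latency signal is reported as percentiles rather than an average because a few slow requests dominate user experience. The sketch below computes p50, p95, and p99 from raw samples using the nearest-rank definition. The sample values are made up, and this is only an illustration: a metrics backend typically estimates percentiles from histogram buckets instead of raw samples.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Made-up request latencies in milliseconds for one endpoint.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]

summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
mean_ms = sum(latencies_ms) / len(latencies_ms)
```

Here the median is 14 ms while p95 and p99 are both 900 ms, and the mean of 126.1 ms describes no real request. That gap is why the tail percentiles get their own panels.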
RED method for request-driven services
- Rate
- Errors
- Duration
That is three metrics per endpoint, dashboarded per service. If something is wrong with a service, RED tells you immediately.
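As a rough illustration of what "three metrics per endpoint" means in code, the sketch below keeps RED counters in memory. The `RedRecorder` class and its method names are hypothetical, and counting only 5xx responses as errors is an assumption; a real service would export these as counters and histograms for Prometheus to scrape rather than summarizing them in-process.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RedStats:
    requests: int = 0
    errors: int = 0
    total_seconds: float = 0.0


class RedRecorder:
    """Accumulates Rate, Errors, and Duration per endpoint."""

    def __init__(self):
        self.by_endpoint = defaultdict(RedStats)

    def observe(self, endpoint: str, status: int, seconds: float) -> None:
        stats = self.by_endpoint[endpoint]
        stats.requests += 1
        stats.total_seconds += seconds
        if status >= 500:  # assumption: only server errors count against the service
            stats.errors += 1

    def summary(self, endpoint: str, window_seconds: float) -> dict:
        s = self.by_endpoint[endpoint]
        if s.requests == 0:
            return {"rate_rps": 0.0, "error_ratio": 0.0, "mean_duration_s": 0.0}
        return {
            "rate_rps": s.requests / window_seconds,
            "error_ratio": s.errors / s.requests,
            "mean_duration_s": s.total_seconds / s.requests,
        }


recorder = RedRecorder()
for status, seconds in [(200, 0.25), (200, 0.25), (500, 0.5), (200, 1.0)]:
    recorder.observe("/invoices", status, seconds)

red = recorder.summary("/invoices", window_seconds=2)
```

With four requests observed over a two-second window, `red` reports 2 requests per second, an error ratio of 0.25, and a mean duration of 0.5 seconds.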
USE method for resources
- Utilization (percent busy)
- Saturation (queue depth)
- Errors (per resource)
Apply these to CPU, memory, disk, network, and DB connections.
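A small sketch of USE applied to one resource, a database connection pool. The `PoolSnapshot` fields are hypothetical; real pools expose similar numbers under different names. The point it illustrates is that utilization and saturation are separate readings: a pool can be 100 percent busy with or without callers queued behind it.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    size: int     # maximum number of connections
    in_use: int   # connections currently checked out
    waiting: int  # callers queued for a connection
    errors: int   # failed checkouts or connection errors in the interval


def use_report(snapshot: PoolSnapshot) -> dict:
    """Summarize a pool snapshot as Utilization, Saturation, and Errors."""
    return {
        "utilization_pct": 100.0 * snapshot.in_use / snapshot.size,
        "saturation": snapshot.waiting,
        "errors": snapshot.errors,
    }


busy = use_report(PoolSnapshot(size=20, in_use=20, waiting=0, errors=0))
overloaded = use_report(PoolSnapshot(size=20, in_use=20, waiting=7, errors=2))
```

Both snapshots show 100 percent utilization, but only the second has a queue of 7 waiters and errors, which is the one that users will feel.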
Alerts, the right way
Alert on symptoms (user-visible), not causes. "p99 latency > 1s for 5 minutes" wakes me up. "CPU > 80 percent" does not, unless it correlates with user impact.
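The "for 5 minutes" part matters as much as the threshold: a single slow scrape should not page anyone. Below is a minimal sketch of that hold-down logic, assuming p99 latency samples arrive with a timestamp in seconds. The `SustainedAlert` class is hypothetical and only approximates what an alerting rule with a duration clause does.

```python
from typing import Optional


class SustainedAlert:
    """Fires only after the value has stayed above the threshold for hold_seconds."""

    def __init__(self, threshold: float, hold_seconds: float):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self.breach_started: Optional[float] = None

    def update(self, now: float, value: float) -> bool:
        if value <= self.threshold:
            # Condition cleared: reset, so the next breach starts a fresh timer.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.hold_seconds


# p99 latency > 1 s for 5 minutes (300 s).
alert = SustainedAlert(threshold=1.0, hold_seconds=300)

# (timestamp in seconds, p99 latency in seconds), made-up samples.
samples = [(0, 1.4), (120, 1.6), (300, 1.5), (360, 0.8), (420, 1.3)]
fired = [alert.update(t, p99) for t, p99 in samples]
```

The alert fires only at t=300, once the breach has lasted the full five minutes, and the dip at t=360 resets the timer so the breach at t=420 starts counting from zero again.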
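Define SLOs, for example: 99.9 percent of requests under 500 ms and 99.95 percent successful. The error budget is whatever the SLO leaves over (0.05 percent of requests may fail), and the burn rate is how fast you are spending it relative to the SLO period. Burn rate alerts (the multi-window approach from Google's SRE material) catch fast burns (something broke now) and slow burns (we will exhaust the budget this month) separately.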
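To show the arithmetic behind burn rate alerts, here is a sketch using the 99.95 percent success SLO. The burn rate is the observed error ratio divided by the error budget, so a burn rate of 1 spends exactly the whole budget over the SLO period. The thresholds and window pairs follow the commonly cited values from the Google SRE Workbook and assume a 30-day SLO period; the `evaluate` function and the input dictionaries are hypothetical.

```python
SLO_TARGET = 0.9995            # 99.95 percent of requests successful
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail

# (severity, long window, short window, burn-rate threshold), assuming a 30-day SLO period.
POLICIES = [
    ("page", "1h", "5m", 14.4),   # about 2% of the monthly budget gone in 1 hour
    ("page", "6h", "30m", 6.0),   # about 5% of the monthly budget gone in 6 hours
    ("ticket", "3d", "6h", 1.0),  # about 10% of the monthly budget gone in 3 days
]


def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / ERROR_BUDGET


def evaluate(error_ratio_by_window: dict) -> list:
    """Return the policies whose long AND short windows both exceed the threshold."""
    fired = []
    for severity, long_w, short_w, threshold in POLICIES:
        long_burn = burn_rate(error_ratio_by_window[long_w])
        short_burn = burn_rate(error_ratio_by_window[short_w])
        # The short window confirms the problem is still happening right now.
        if long_burn > threshold and short_burn > threshold:
            fired.append((severity, long_w, threshold))
    return fired


# Made-up error ratios per lookback window.
fast_burn = {"5m": 0.02, "30m": 0.015, "1h": 0.01, "6h": 0.002, "3d": 0.0003}
slow_burn = {"5m": 0.0008, "30m": 0.0008, "1h": 0.0008, "6h": 0.0007, "3d": 0.0006}

fast_result = evaluate(fast_burn)
slow_result = evaluate(slow_burn)
```

In the fast case only the 1-hour page fires (burn rate about 20 over the last hour and about 40 over the last 5 minutes). In the slow case nothing pages, but the 3-day ticket fires at a burn rate of about 1.2, which is the "we will exhaust the budget this month" signal.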
Learn more
- Grafana Documentation (docs)
- Prometheus Documentation (docs)