Full-Stack Observability with OpenTelemetry, Prometheus, and Grafana
Monitoring vs. Observability
Monitoring tells you when a system is broken. Observability tells you why it's broken.
In a distributed system, a single user request might touch 5 different microservices and 3 databases. If it's slow, where is the bottleneck? We solve this with OpenTelemetry.
Distributed Tracing
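OpenTelemetry's instrumentation automatically injects the trace context, including the trace_id, into outgoing HTTP headers (by default the W3C traceparent header), and each downstream service continues the same trace. This allows us to visualize the entire lifecycle of a request in Grafana, backed by a trace store such as Jaeger.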
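// Initializing OpenTelemetry in Node.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  // Send spans to the collector's OTLP/HTTP endpoint
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // Auto-instrument common libraries (HTTP, Express, database clients, ...)
  instrumentations: [getNodeAutoInstrumentations()],
});

// Start the SDK before the rest of the application is loaded so modules get patched
sdk.start();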
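To make the header propagation described above more concrete, here is a minimal sketch of what the W3C Trace Context traceparent header carries and how a service could continue a trace. The SDK does this for you; the helper names parseTraceparent and childTraceparent are hypothetical and not part of the OpenTelemetry API, and the sketch only handles the version 00 header layout.

```typescript
// Parsed form of a W3C Trace Context `traceparent` header (version 00 layout):
//   <version>-<trace-id: 32 hex>-<parent-id: 16 hex>-<flags: 2 hex>
interface TraceContext {
  version: string;
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

const TRACEPARENT_RE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns null when the header is malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid per the specification.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // The lowest flag bit marks the trace as sampled.
  const sampled = (parseInt(flags, 16) & 0x01) === 1;
  return { version, traceId, parentSpanId, sampled };
}

// Hypothetical helper: builds the header for an outgoing request.
// The trace ID stays the same; only the span ID changes per hop.
function childTraceparent(parent: TraceContext, newSpanId: string): string {
  const flags = parent.sampled ? "01" : "00";
  return `${parent.version}-${parent.traceId}-${newSpanId}-${flags}`;
}

// Example header taken from the W3C Trace Context specification.
const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const ctx = parseTraceparent(incoming);
if (ctx !== null) {
  console.log(ctx.traceId);
  console.log(childTraceparent(ctx, "b7ad6b7169203331"));
}
```

Because every hop reuses the same trace ID and only swaps in its own span ID, the tracing backend can stitch the spans from all services into a single request timeline.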
Metrics with Prometheus
Alongside traces, we export metrics (CPU usage, memory, active database connections) to Prometheus. We build Grafana dashboards and alerts that notify our engineering team if the P99 latency of an API rises above acceptable thresholds.
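As a rough illustration of what a P99 alert checks, the sketch below computes a nearest-rank percentile over raw latency samples and compares it to a threshold. This is an assumption-laden simplification: Prometheus typically estimates quantiles from histogram buckets rather than raw samples, and the function names and the 500 ms threshold are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to keep the rank exact for integer p.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Hypothetical alert rule: fire when P99 latency exceeds the threshold.
function shouldAlert(latenciesMs: number[], thresholdMs: number): boolean {
  return percentile(latenciesMs, 99) > thresholdMs;
}

// 200 requests: 197 fast ones at 100 ms and 3 slow ones at 900 ms.
const latencies = [
  ...new Array<number>(197).fill(100),
  ...new Array<number>(3).fill(900),
];
console.log(percentile(latencies, 99)); // 900
console.log(shouldAlert(latencies, 500)); // true
```

Note that the average latency here is still close to 100 ms; the P99 surfaces the slow tail that an average would hide, which is why the alert is defined on it.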
This stack reduces our Mean Time to Resolution (MTTR) from hours of hunting through logs to minutes of inspecting a trace.