OpenTelemetry Integration: Distributed Tracing, Metrics, and Logging for AI Systems
What is OpenTelemetry and Why It Matters Now
OpenTelemetry (OTel) is an open, vendor-neutral standard for observability in distributed systems. It covers the three pillars — traces, metrics, and logs — through a unified API and SDK.
In microservice and AI-agent systems, it's hard to answer questions like "where is the slow request?" or "which LLM call is eating our costs?" OpenTelemetry makes those questions answerable.
Backend options:
- Jaeger: OSS distributed tracing (self-hosted)
- Grafana Tempo + Prometheus: Metrics + traces integration
- Datadog / Honeycomb: Managed services
- Signoz: OSS full-stack observability
Basic Setup: Instrumentation for Node.js
// src/instrumentation.ts - Must run before app starts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";
const resource = new Resource({
  "service.name": "my-ai-service",
  "service.version": process.env.npm_package_version ?? "0.0.0",
  "deployment.environment": process.env.NODE_ENV ?? "development",
});

export const sdk = new NodeSDK({
  resource,
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_ENDPOINT ?? "http://localhost:4318/v1/traces",
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
    exportIntervalMillis: 15_000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
process.on("SIGTERM", () => sdk.shutdown().finally(() => process.exit(0)));
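The comment above says the instrumentation must run before the app starts: auto-instrumentation works by patching modules as they are imported, so loading it late means those modules go untraced. One way to guarantee the ordering (the file paths here are assumptions about your build layout) is Node's preload flags:

```shell
# CommonJS build: preload the SDK before any application module
node --require ./dist/instrumentation.js ./dist/server.js

# ESM build (Node 20.6+): use --import instead
node --import ./dist/instrumentation.js ./dist/server.js
```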
Custom Traces: Instrumenting LLM Calls
import { trace, SpanStatusCode, SpanKind } from "@opentelemetry/api";
import Anthropic from "@anthropic-ai/sdk";
const tracer = trace.getTracer("llm-service", "1.0.0");
async function tracedLLMCall(prompt: string, model = "claude-sonnet-4-5"): Promise<string> {
  return tracer.startActiveSpan("llm.call", {
    kind: SpanKind.CLIENT,
    attributes: { "llm.model": model, "llm.prompt_length": prompt.length, "llm.provider": "anthropic" },
  }, async (span) => {
    try {
      const client = new Anthropic();
      const startTime = Date.now();
      const response = await client.messages.create({
        model,
        max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
      });
      span.setAttributes({
        "llm.input_tokens": response.usage.input_tokens,
        "llm.output_tokens": response.usage.output_tokens,
        "llm.latency_ms": Date.now() - startTime,
      });
      span.setStatus({ code: SpanStatusCode.OK });
      // content blocks are a union type (text, tool_use, ...), so narrow before reading .text
      const first = response.content[0];
      return first.type === "text" ? first.text : "";
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}
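Token counts alone don't show spend. A small helper can convert them to an estimated cost to record as an extra span attribute; the pricing table and the `estimateCostUSD` name below are illustrative assumptions, so check your provider's current rates:

```typescript
// Hypothetical pricing table: USD per million tokens (verify against current rates)
const PRICING: Record<string, { input: number; output: number }> = {
  "claude-sonnet-4-5": { input: 3, output: 15 },
};

// Estimate the cost of one call from its token usage; returns 0 for unknown models
export function estimateCostUSD(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICING[model];
  if (!p) return 0;
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```

Inside `tracedLLMCall`, this could be recorded next to the token counts, e.g. `span.setAttribute("llm.cost_usd", estimateCostUSD(model, response.usage.input_tokens, response.usage.output_tokens))`, making cost sortable per trace in the backend UI.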
Custom Metrics: Measuring Business KPIs
import { metrics, ValueType } from "@opentelemetry/api";
const meter = metrics.getMeter("ai-service", "1.0.0");
const requestCounter = meter.createCounter("api.requests.total", {
  description: "Total number of API requests",
});

const latencyHistogram = meter.createHistogram("api.latency.ms", {
  description: "API request latency in milliseconds",
  unit: "ms",
  advice: {
    explicitBucketBoundaries: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
  },
});
// `pool` is assumed to be a pg.Pool (or similar) created elsewhere in the app
const activeConnectionsGauge = meter.createObservableGauge("db.connections.active");
activeConnectionsGauge.addCallback((result) => {
  result.observe(pool.totalCount - pool.idleCount, { db: "primary" });
});
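To feed the histogram, each request has to be timed somewhere. A tiny wrapper keeps that logic out of the handlers — the name `timed` is my own, not an OTel API:

```typescript
// Time an async operation and return both its result and elapsed milliseconds,
// ready to pass to latencyHistogram.record(ms, attributes)
export async function timed<T>(fn: () => Promise<T>): Promise<[T, number]> {
  const start = performance.now();
  const result = await fn();
  return [result, performance.now() - start];
}
```

Usage in a handler might look like: `const [reply, ms] = await timed(() => handleChat(req)); latencyHistogram.record(ms, { route: "/chat" }); requestCounter.add(1, { route: "/chat" });`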
Correlated Logs and Traces
import { trace } from "@opentelemetry/api";
import pino from "pino";
const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const ctx = span.spanContext();
    return { traceId: ctx.traceId, spanId: ctx.spanId };
  },
});
// Now logs and traces are correlated
// Search by trace ID in Jaeger to find corresponding logs
logger.info({ userId: 123 }, "User logged in");
Docker Compose: OTel Collector + Jaeger
version: "3.8"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
  jaeger:
    image: jaegertracing/all-in-one:1.55
    ports:
      - "16686:16686" # Jaeger UI
  prometheus:
    image: prom/prometheus:v2.50.0
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:10.3.0
    ports:
      - "3000:3000"
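The compose file mounts an `otel-collector-config.yaml` that it doesn't show. A minimal sketch of what it might contain — the pipeline layout and the `jaeger:4317` endpoint are assumptions based on the service names above, and Jaeger 1.35+ accepts OTLP directly:

```yaml
# otel-collector-config.yaml — minimal sketch, not production-tuned
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```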
OpenTelemetry makes "what's slow" and "what's eating costs" visible. Instrumenting LLM calls in particular is a small investment that pays off directly in AI system optimization.
This article is from the Claude Code Complete Guide (7 chapters) on note.com.
myouga (@myougatheaxo) - VTuber axolotl. Sharing practical AI development tips.