OpenTelemetry Integration: Distributed Tracing, Metrics, and Logging for AI Systems

What is OpenTelemetry and Why It Matters Now

OpenTelemetry (OTel) is a vendor-neutral open standard for observability in distributed systems. It covers the three pillars of observability (traces, metrics, and logs) through a unified set of APIs and SDKs.

In microservice and AI agent systems, it is hard to answer questions like "where is this request slow?" and "which LLM call is eating costs?". OpenTelemetry gives you the instrumentation to answer them.

Backend options:

  • Jaeger: OSS distributed tracing (self-hosted)
  • Grafana Tempo + Prometheus: Metrics + traces integration
  • Datadog / Honeycomb: Managed services
  • Signoz: OSS full-stack observability

Basic Setup: Instrumentation for Node.js

// src/instrumentation.ts - must be loaded before the app starts,
// e.g. node --require ./dist/instrumentation.js dist/app.js
import { NodeSDK } from "@opentelemetry/sdk-node";  
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";  
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";  
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";  
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";  
import { Resource } from "@opentelemetry/resources";  
  
const resource = new Resource({  
  "service.name": "my-ai-service",  
  "service.version": process.env.npm_package_version ?? "0.0.0",  
  "deployment.environment": process.env.NODE_ENV ?? "development",  
});  
  
export const sdk = new NodeSDK({  
  resource,  
  traceExporter: new OTLPTraceExporter({  
    url: process.env.OTEL_ENDPOINT ?? "http://localhost:4318/v1/traces",  
  }),  
  metricReader: new PeriodicExportingMetricReader({  
    exporter: new OTLPMetricExporter(),  
    exportIntervalMillis: 15_000,  
  }),  
  instrumentations: [getNodeAutoInstrumentations()],  
});  
  
sdk.start();  
process.on("SIGTERM", () => sdk.shutdown().finally(() => process.exit(0)));  
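
Instead of hardcoding the exporter URL, the SDK can be configured through its standard environment variables. A sketch, assuming the app is compiled to dist/:

```shell
# Standard OTel SDK environment variables (names per the OpenTelemetry spec);
# the SDK picks these up automatically at startup.
export OTEL_SERVICE_NAME="my-ai-service"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.1"   # sample 10% of root traces

# Load the instrumentation file before the application entry point:
node --require ./dist/instrumentation.js dist/app.js
```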

Custom Traces: Instrumenting LLM Calls

import { trace, SpanStatusCode, SpanKind } from "@opentelemetry/api";  
import Anthropic from "@anthropic-ai/sdk";  
  
const tracer = trace.getTracer("llm-service", "1.0.0");  
  
async function tracedLLMCall(prompt: string, model = "claude-sonnet-4-5"): Promise<string> {  
  return tracer.startActiveSpan("llm.call", {  
    kind: SpanKind.CLIENT,  
    attributes: { "llm.model": model, "llm.prompt_length": prompt.length, "llm.provider": "anthropic" },  
  }, async (span) => {  
    try {  
      const client = new Anthropic();  
      const startTime = Date.now();  
  
      const response = await client.messages.create({  
        model, max_tokens: 1024,  
        messages: [{ role: "user", content: prompt }],  
      });  
  
      span.setAttributes({  
        "llm.input_tokens": response.usage.input_tokens,  
        "llm.output_tokens": response.usage.output_tokens,  
        "llm.latency_ms": Date.now() - startTime,  
      });  
  
      span.setStatus({ code: SpanStatusCode.OK });  
      // content blocks are a union type; only "text" blocks carry .text
      const block = response.content[0];
      return block.type === "text" ? block.text : "";
    } catch (error) {  
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });  
      span.recordException(error as Error);  
      throw error;  
    } finally {  
      span.end();  
    }  
  });  
}  
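
With token counts on the span, cost attribution becomes possible. As an illustration, an `llm.cost_usd` attribute could be derived from the usage numbers before the span ends; the per-million-token rates below are placeholders, not real pricing:

```typescript
// Placeholder per-1M-token rates in USD — NOT real Anthropic pricing;
// substitute your provider's actual rates.
const RATES: Record<string, { input: number; output: number }> = {
  "claude-sonnet-4-5": { input: 3, output: 15 },
};

// Estimate the cost of a call from its token usage; unknown models cost 0.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const rate = RATES[model];
  if (!rate) return 0;
  return (inputTokens * rate.input + outputTokens * rate.output) / 1_000_000;
}
```

Inside `tracedLLMCall`, this would be attached alongside the token counts, e.g. `span.setAttribute("llm.cost_usd", estimateCostUsd(model, response.usage.input_tokens, response.usage.output_tokens))`.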

Custom Metrics: Measuring Business KPIs

import { metrics, ValueType } from "@opentelemetry/api";  
  
const meter = metrics.getMeter("ai-service", "1.0.0");  
  
const requestCounter = meter.createCounter("api.requests.total", {  
  description: "Total number of API requests",  
});  
  
const latencyHistogram = meter.createHistogram("api.latency.ms", {  
  description: "API request latency in milliseconds",  
  unit: "ms",  
  advice: {  
    explicitBucketBoundaries: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000],  
  },  
});  
  
// `pool` is assumed to be a pg Pool (or similar) created elsewhere in the app
const activeConnectionsGauge = meter.createObservableGauge("db.connections.active");
activeConnectionsGauge.addCallback((result) => {
  result.observe(pool.totalCount - pool.idleCount, { db: "primary" });
});
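
The `explicitBucketBoundaries` advice determines which bucket each recorded latency lands in: OTel histogram buckets are upper-bound-inclusive, with a final overflow bucket for values above the last boundary. A small sketch of that classification, for intuition only (the SDK does this internally):

```typescript
// The same boundaries passed to the histogram advice above.
const BOUNDARIES = [10, 25, 50, 100, 250, 500, 1000, 2500, 5000];

// A value falls into the first bucket whose upper bound is >= the value;
// values above the last boundary land in the final (+Inf) overflow bucket.
function bucketIndex(valueMs: number, boundaries: number[] = BOUNDARIES): number {
  const i = boundaries.findIndex((b) => valueMs <= b);
  return i === -1 ? boundaries.length : i;
}
```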

Correlated Logs and Traces

import { trace } from "@opentelemetry/api";  
import pino from "pino";  
  
const logger = pino({  
  mixin() {  
    const span = trace.getActiveSpan();  
    if (!span) return {};  
    const ctx = span.spanContext();  
    return { traceId: ctx.traceId, spanId: ctx.spanId };  
  },  
});  
  
// Now logs and traces are correlated  
// Search by trace ID in Jaeger to find corresponding logs  
logger.info({ userId: 123 }, "User logged in");  
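
Correlation works across services because the same IDs propagate in the W3C `traceparent` HTTP header, which the auto-instrumentations inject and extract for you. A minimal sketch of its format, for illustration only:

```typescript
// W3C Trace Context header: "version-traceId-spanId-flags"
// traceId is 16 bytes (32 hex chars), spanId is 8 bytes (16 hex chars),
// and the flags byte marks whether the trace is sampled.
function buildTraceparent(traceId: string, spanId: string, sampled = true): string {
  if (!/^[0-9a-f]{32}$/.test(traceId) || !/^[0-9a-f]{16}$/.test(spanId)) {
    throw new Error("invalid trace or span id");
  }
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}
```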

Docker Compose: OTel Collector + Jaeger

version: "3.8"  
  
services:  
  otel-collector:  
    image: otel/opentelemetry-collector-contrib:0.96.0  
    command: ["--config=/etc/otel-collector-config.yaml"]  
    volumes:  
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml  
    ports:  
      - "4317:4317"  # OTLP gRPC  
      - "4318:4318"  # OTLP HTTP  
  
  jaeger:  
    image: jaegertracing/all-in-one:1.55  
    environment:  
      COLLECTOR_OTLP_ENABLED: "true"  # accept OTLP spans from the collector  
    ports:  
      - "16686:16686"  # Jaeger UI  
  
  prometheus:  
    image: prom/prometheus:v2.50.0  
    ports:  
      - "9090:9090"  
  
  grafana:  
    image: grafana/grafana:10.3.0  
    ports:  
      - "3000:3000"  
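
The compose file mounts otel-collector-config.yaml but does not show it. A minimal sketch of that config (exporter names and endpoints are assumptions to adapt; Prometheus would additionally need a scrape job pointing at the collector's 8889 port):

```yaml
# otel-collector-config.yaml — minimal sketch
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # forward traces to Jaeger's OTLP receiver
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889  # expose metrics for Prometheus to scrape

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```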

Implementing OpenTelemetry makes "what is slow" and "what is eating costs" visible. Instrumenting LLM calls in particular is a small, high-leverage investment that pays off directly in AI system optimization.

This article is from the Claude Code Complete Guide (7 chapters) on note.com.
myouga (@myougatheaxo) - VTuber axolotl. Sharing practical AI development tips.