The Problem
In the Entegrasys project, 10 shipping carriers and 4 ERP systems are managed through a single orchestration layer. When an order is placed, the request passes through an average of 7 services, including an API gateway, auth service, order service, ERP connector, and carrier adapter.
When something slowed down or errored in this chain, pinpointing the exact location could take hours, because each service used its own log format. OpenTelemetry solved this.
What OpenTelemetry Is (and Isn't)
OpenTelemetry (OTel) is a vendor-agnostic standard for producing, collecting, and exporting telemetry data — traces, metrics, logs. It is not a storage or visualization tool; it operates on the production side of the pipeline.
The practical implication: a service instrumented with OTel can send traces to Jaeger, Grafana Tempo, or Datadog. Switching vendors requires no code changes.
Instrumentation Strategy
Auto vs Manual Instrumentation
For Node.js services, OTel's auto-instrumentation package automatically instruments HTTP, gRPC, database queries, and many popular libraries:
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';

const sdk = new NodeSDK({
  // Export spans over OTLP to the endpoint configured in the environment
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  // Instrument HTTP, gRPC, database clients, and other supported libraries
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```
This single file turns everything from Express endpoints to Prisma queries into spans automatically.
Manual instrumentation is necessary to add visibility into business logic:
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service', '1.0.0');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('order.process', async (span) => {
    // Attach business attributes so traces can be searched by order
    span.setAttribute('order.id', orderId);
    span.setAttribute('order.source', 'api');
    try {
      const result = await doWork(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      // Record the exception on the span and mark the span as failed
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```
Context Propagation
Context propagation is the lifeblood of distributed tracing. As a request moves from one service to another, the trace ID and span ID travel in HTTP headers (traceparent, tracestate).
OTel handles this propagation automatically, but if any layer in the chain (a reverse proxy, a message queue, a third-party SDK) drops those headers, the trace is severed at that point.
Common Pitfall
Async operations over message queues (Kafka, RabbitMQ) do not automatically carry trace context. You must write explicit code to attach traceparent to message headers and restore it on the consumer side.
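To make the mechanics concrete, here is a dependency-free sketch of what carrying trace context through message headers looks like. It models the W3C `traceparent` header format (`version-traceId-spanId-flags`) by hand; in real code you would call `propagation.inject(context.active(), message.headers)` on the producer and `propagation.extract(ROOT_CONTEXT, message.headers)` from `@opentelemetry/api` on the consumer. The function and variable names here are illustrative, not part of any library.

```typescript
// Simplified model of the W3C traceparent header that OTel's
// W3CTraceContextPropagator reads and writes: version-traceId-spanId-flags.
interface TraceContext {
  traceId: string; // 32 hex chars
  spanId: string;  // 16 hex chars
  sampled: boolean;
}

// Producer side: attach the current trace context to message headers
// before publishing to Kafka/RabbitMQ.
function injectTraceparent(ctx: TraceContext, headers: Record<string, string>): void {
  headers['traceparent'] = `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

// Consumer side: restore the trace context from message headers so the
// consumer's spans join the producer's trace instead of starting a new one.
function extractTraceparent(headers: Record<string, string>): TraceContext | null {
  const parts = (headers['traceparent'] ?? '').split('-');
  if (parts.length !== 4 || parts[1].length !== 32 || parts[2].length !== 16) return null;
  return { traceId: parts[1], spanId: parts[2], sampled: parts[3] === '01' };
}

// Round trip: what survives a hop through a message queue.
const headers: Record<string, string> = {};
injectTraceparent(
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7', sampled: true },
  headers,
);
console.log(extractTraceparent(headers));
```

The key design point is that the context travels *inside the message itself*: unlike HTTP, there is no request/response pair for OTel to hook, so the producer must serialize the context into headers and the consumer must deserialize it before starting its first span.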
Collector Configuration
The OTel Collector receives traces, metrics, and logs from services, processes them, and forwards them to backend storage. In Entegrasys we built this pipeline:
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
Sampling Strategy
Storing every trace in production is both expensive and unnecessary. We defined a tiered sampling strategy:
- Head-based sampling: 10% random sampling as baseline for all services
- Tail-based sampling: All requests over 500ms are stored at 100%
- Error sampling: All traces containing errors are stored at 100%
- Business sampling: Critical paths (payments, orders) sampled at 100%
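The head-based baseline works because the decision is a pure function of the trace ID: every service that sees the same trace reaches the same verdict with no coordination. The OTel SDKs ship this as `TraceIdRatioBasedSampler` (typically wrapped in a `ParentBasedSampler`); the sketch below is a simplified model of the idea, not the library's exact algorithm.

```typescript
// Simplified sketch of head-based, trace-ID-ratio sampling: interpret a
// prefix of the random trace ID as a fraction of its maximum value and
// keep the lowest `ratio` share of IDs. Deterministic per trace ID, so
// all participating services agree without talking to each other.
function shouldSample(traceId: string, ratio: number): boolean {
  const prefix = parseInt(traceId.slice(0, 8), 16); // first 8 hex chars
  return prefix / 0x100000000 < ratio;
}

// Trace IDs are uniformly random, so ~10% of them fall below the cutoff.
console.log(shouldSample('00000001aaaaaaaaaaaaaaaaaaaaaaaa', 0.1)); // true  (kept)
console.log(shouldSample('ffffffffaaaaaaaaaaaaaaaaaaaaaaaa', 0.1)); // false (dropped)
```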
Tail-based sampling was applied at the OTel Collector level, so the full trace information is available before making the decision.
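The latency and error rules above map onto the Collector's `tail_sampling` processor (available in the collector-contrib distribution). A sketch of what such a configuration could look like, with thresholds mirroring our policy and illustrative policy names:

```yaml
# Sketch: tail-based sampling in the Collector (contrib distribution).
# Policy names are illustrative; types are the processor's built-ins.
processors:
  tail_sampling:
    decision_wait: 10s          # wait for the full trace before deciding
    policies:
      - name: slow-requests     # keep 100% of traces slower than 500ms
        type: latency
        latency: { threshold_ms: 500 }
      - name: errors            # keep 100% of traces containing errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: baseline          # 10% random sample of everything else
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
```

Business-critical paths can be matched similarly with an attribute-based policy. The trade-off of tail-based sampling is buffering: the Collector must hold every span in memory for the `decision_wait` window, which is why the `memory_limiter` processor matters in this pipeline.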
Visualization with Grafana Tempo
We used Grafana Tempo for trace storage. Its native integration with Prometheus lets us display traces and metrics side by side in Grafana.
The most valuable feature: jumping from a Prometheus alert directly to the relevant trace. When an endpoint's P99 latency spikes, a single click takes you from the graph to the trace behind it.
Results
After taking OTel to production:
- Average incident detection time dropped from 47 minutes to 8 minutes
- Slow ERP connectors identified from real data; N+1 query problems found in 3 carrier adapters
- New developer onboarding time reduced: system behavior is now readable from traces
Conclusion
OpenTelemetry is a mature, production-ready standard. It can be set up without vendor lock-in, and its broad ecosystem support means you can get started quickly. If you work with microservices or distributed systems, we recommend moving OTel from your "future improvements" list into today's infrastructure.

