Bring distributed tracing to your infrastructure
This article is also available in French.
Navigating microservices without visibility is like sailing the ocean without a compass.
To avoid sinking, precise transaction mapping is essential. This is where the strategic alliance of two giants comes in: OpenTelemetry (OTel) and NGINX. Together, they form the backbone of robust distributed tracing, providing DevOps teams with unprecedented clarity to monitor and debug their applications.
Based on the official OpenTelemetry and NGINX module documentation, this article explores how to orchestrate this synergy. We will review its decisive advantages and limitations, and provide concrete examples to solidify your observability strategy.
Examining the OTel Module for NGINX
OpenTelemetry is an open-source tool that collects traces, metrics, and logs to make your systems transparent. It standardizes observability with APIs, SDKs, and a Collector that integrates with over 40 backends. The NGINX OpenTelemetry module (ngx_otel_module) lets NGINX generate traces compliant with the W3C Trace Context standard (notably via the traceparent and tracestate headers) and send them over OTLP/gRPC. It does not expose metrics directly, but trace attributes (e.g., http.status_code) can fill that role.
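Concretely, the W3C Trace Context travels in two HTTP headers. Here is what they look like (the IDs below are illustrative values, not real ones):
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
The traceparent fields are: version, a 16-byte trace ID, an 8-byte parent span ID, and trace flags (01 means the trace is sampled, which matters once you enable sampling in Step 4). tracestate carries optional vendor-specific data.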
Why This Combo?
Combining OTel with NGINX is like giving your proxy X-ray glasses. You can follow requests end-to-end, even through a maze of microservices.
The Advantages:
- Distributed Traces: Visualize the journey of requests through NGINX and backend services.
- W3C Standard: Ensures interoperability with other OTel-compatible tools.
- Customization: Add specific data (e.g., latency) for tailored monitoring.
- Flexible Ecosystem: Export to Jaeger, Prometheus (via transformations), and more.
But beware of the pitfalls: the module must be compiled (or installed separately), a Collector has to be configured, and tracing every request without sampling adds overhead.
Implementation: NGINX and OTel in Action
Let's take a typical case: NGINX relays requests to a backend, and you want to know why some are slow.
Step 1: Install the OTel Module
The module is not included by default. Compile NGINX with:
./configure --with-compat --add-dynamic-module=/path/to/ngx_otel_module
make && make install
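If you would rather not compile, prebuilt packages exist for recent NGINX releases. For instance, with the official nginx.org repository configured on Debian/Ubuntu, something like this should do (package names vary by distribution and NGINX version):
sudo apt install nginx-module-otel
Either way, the module still has to be loaded explicitly.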
Add this to nginx.conf:
load_module modules/ngx_otel_module.so;
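A quick sanity check before moving on (assuming the nginx binary is on your PATH):
nginx -t && nginx -s reload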
Step 2: Configure Tracing
Here is a sample configuration to trace /api requests and export them to an OTel Collector:
http {
    otel_exporter {
        endpoint    localhost:4317;   # OTel Collector
        batch_size  512;
        batch_count 4;
        interval    5s;
    }

    server {
        listen      80;
        server_name example.com;

        location /api {
            otel_trace         on;
            otel_trace_context propagate;              # Link with the backend
            otel_span_attr     latency $request_time;  # Custom latency attribute
            proxy_pass         http://backend:8080;
        }
    }
}
- otel_trace on; enables tracing for the location.
- otel_trace_context propagate; links NGINX spans to the backend via the W3C headers.
- otel_span_attr latency $request_time; attaches the request latency as a custom span attribute.
- Traces are sent in batches to avoid overhead.
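One addition worth considering: the module's otel_service_name directive, which defaults to unknown_service:nginx and determines how your spans are labeled in the tracing backend. Placed in the http block alongside otel_exporter (the name below is just an example):
otel_service_name nginx-gateway;   # shown as the service name in Jaeger; default is unknown_service:nginx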
Step 3: Deploy an OTel Collector
The Collector processes and exports traces. Example configuration (config.yaml):
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s

exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true   # plain gRPC to the local Jaeger OTLP endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
Traces flow from NGINX to the Collector, then to Jaeger for visualization.
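For a quick local test, one way to bring up Jaeger and the Collector is with Docker. A minimal sketch, assuming Docker is installed and config.yaml sits in the current directory (recent Jaeger images accept OTLP on 4317 out of the box; older ones need the COLLECTOR_OTLP_ENABLED=true environment variable):
docker network create tracing
docker run -d --name jaeger --network tracing -p 16686:16686 jaegertracing/all-in-one:latest
docker run -d --name otelcol --network tracing -p 4317:4317 \
    -v "$(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml" \
    otel/opentelemetry-collector-contrib:latest
NGINX then points at localhost:4317 (the Collector), and the Collector forwards to jaeger:4317 inside the Docker network, matching the config.yaml above. The Jaeger UI is available at http://localhost:16686.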
Step 4: Optimize with Sampling
For high-traffic sites, limit tracing to 10% of requests:
http {
    split_clients "${remote_addr}AAA" $otel_trace_on {
        10%  on;
        *    off;
    }

    server {
        location /api {
            otel_trace         $otel_trace_on;
            otel_trace_context propagate;
            proxy_pass         http://backend:8080;
        }
    }
}
This reduces load while still collecting relevant data.
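If you prefer to sample at the Collector instead of at NGINX, the contrib distribution of the Collector ships a probabilistic_sampler processor. A sketch of the pipeline change, keeping the rest of the earlier config.yaml as-is (the percentage is just an example):
processors:
  batch:
    timeout: 5s
  probabilistic_sampler:
    sampling_percentage: 10   # keep roughly 10% of traces

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp]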
Example: Tracking Latency
Your /api endpoint is slow. Using Jaeger, you analyze the traces:
- NGINX Span: http.status_code=200, latency=0.300s.
- Backend Span: latency of 0.250s.
You add otel_span_attr upstream_time $upstream_response_time; and discover that NGINX itself adds 0.050s. Perhaps a configuration or network issue? You adjust and monitor for improvement.
Why Add This Attribute?
Adding upstream_time (from $upstream_response_time) as a span attribute enriches traces with specific information about the backend server's performance. This lets you monitor upstream latency in tracing tools like Jaeger, Zipkin, or Grafana Tempo and correlate it with other metrics or logs to diagnose problems.
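Putting it together, the location block from Step 2 with both attributes might look like this (backend:8080 is still the placeholder upstream):
location /api {
    otel_trace         on;
    otel_trace_context propagate;
    otel_span_attr     latency       $request_time;            # total time seen by NGINX
    otel_span_attr     upstream_time $upstream_response_time;  # time spent waiting for the backend
    proxy_pass         http://backend:8080;
}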
When to Use It?
This combo shines in:
- Microservices architectures with NGINX as a gateway (a nod to NGINX users and to those running the NGINX Ingress Controller).
- Scenarios requiring precise performance debugging.
- Teams wanting a unified observability pipeline.
Conclusion
NGINX and OpenTelemetry form a powerful duo that enhances visibility at strategic points in your infrastructure. With a few configuration lines and a collector, you gain access to insights that usually require significant resources and advanced DevOps skills. This can help you move from blurry to crystal clear when it comes to understanding your requests. Test it with the examples above and empower your NGINX service!
