End-to-End Ingress Request Tracing for Multi-Tenant SaaS

Why Multi-Tenant Tracing Is Harder Than It Looks

In a single-tenant system, distributed tracing is almost plug-and-play. Instrument your services with OpenTelemetry, point the collector at Jaeger or Tempo, and you’re done. In a multi-tenant SaaS platform, you have a harder problem: a single slow trace could belong to any of your hundreds of tenants, and when Tenant A calls your support line at 2 AM, you need to find their traces instantly — not wade through a sea of spans from everyone else.

The root issue is that HTTP requests don’t inherently carry tenant identity through distributed systems. By the time a request reaches your payment service four hops deep, the tenant ID from the original JWT has usually vanished. This guide walks through a concrete pattern for solving that: injecting tenant context at the ingress layer, propagating it via W3C Baggage through every hop, and surfacing it in tenant-scoped dashboards.

Step 1: Inject Tenant ID at the Ingress

Your ingress controller — whether that’s NGINX, Envoy, or a cloud-native API gateway — is the right place to extract and forward tenant identity. At this layer you have access to the full request: JWT claims, custom headers, subdomain, or API key. Extract the tenant ID here and add it to two places:

  • A span attribute on the root trace span
  • A W3C Baggage header (baggage: tenant_id=acme-corp) so downstream services can read it without re-parsing the original JWT

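Here’s a minimal Envoy Lua filter that reads an X-Tenant-ID header and injects it into the baggage. It assumes the header was set by an earlier, trusted auth step rather than by the client:

function envoy_on_request(request_handle)
  -- x-tenant-id is expected to come from a trusted auth filter, not the client
  local tenant = request_handle:headers():get("x-tenant-id")
  if tenant then
    -- Append to any existing baggage instead of overwriting it
    local baggage = request_handle:headers():get("baggage") or ""
    if baggage ~= "" then
      baggage = baggage .. ","
    end
    request_handle:headers():replace("baggage", baggage .. "tenant_id=" .. tenant)
  end
end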

If your tenants are identified by subdomain (e.g., acme.yourapp.com), parse the Host header instead. The key principle: one place, one source of truth for tenant identity injection. Validate the tenant against a signed JWT or API key lookup at the ingress — never trust a client-supplied header directly.
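
As a rough illustration of the subdomain case, the sketch below derives a tenant slug from the Host header and merges it into an existing baggage value, dropping any tenant_id entry the client sent. The function names tenant_from_host and merge_baggage and the base domain yourapp.com are hypothetical, and a real ingress would still validate the tenant against a signed token or key lookup before trusting it.

```python
from typing import Optional
from urllib.parse import quote


def tenant_from_host(host: str, base_domain: str = "yourapp.com") -> Optional[str]:
    """Return the tenant slug from a Host like 'acme.yourapp.com', else None."""
    hostname = host.split(":", 1)[0].lower()  # drop an optional port
    suffix = "." + base_domain
    if not hostname.endswith(suffix):
        return None
    slug = hostname[: -len(suffix)]
    # Accept exactly one label in front of the base domain.
    return slug if slug and "." not in slug else None


def merge_baggage(existing: str, tenant: str) -> str:
    """Append tenant_id to a baggage value, replacing any client-supplied entry."""
    kept = []
    for entry in existing.split(","):
        entry = entry.strip()
        if not entry:
            continue
        key = entry.split("=", 1)[0].strip()
        if key != "tenant_id":
            kept.append(entry)
    # Percent-encode the value so it is safe inside the header.
    kept.append("tenant_id=" + quote(tenant, safe=""))
    return ",".join(kept)
```

Replacing the client's tenant_id entry, rather than appending a second one, keeps the ingress as the single source of truth that the paragraph above calls for.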

Step 2: Configure the OpenTelemetry Collector

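The OTel Collector sits between your services and your observability backend (Tempo, Jaeger, Datadog, etc.). One caveat: baggage is propagation context, not span data, so it is not part of what services export over OTLP, and the collector cannot read the baggage header itself. Each service's SDK has to copy the entry onto its spans first, either with a baggage span processor or with the middleware in Step 3. What the collector can do in one processor block is normalize that attribute, so tenant ID is filterable under a single name across your entire backend.

Add a transform processor to your collector config. This version assumes the SDK wrote the raw baggage key tenant_id as a span attribute:

processors:
  transform/tenant:
    trace_statements:
      - context: span
        statements:
          # Copy the SDK-written tenant_id attribute to tenant.id if it is not already set
          - set(attributes["tenant.id"], attributes["tenant_id"]) where attributes["tenant.id"] == nil and attributes["tenant_id"] != nil

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform/tenant, batch]
      exporters: [otlphttp]

With this block in place, every span that arrives with the tenant recorded on it carries tenant.id as a first-class attribute, whichever of the two names the service used. Services that already propagate W3C trace context and baggage need only the small span-writing step, not changes to business logic.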

Cardinality note: Tenant ID is a safe span attribute because your tenant count is bounded. Be careful about promoting unbounded values such as user IDs or session tokens, especially where span attributes feed metric labels and every distinct value creates a new series. Keep request bodies out of span attributes entirely.
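
One way to keep that rule enforceable is an explicit allowlist of baggage keys that may become span attributes. The sketch below is illustrative only: ALLOWED_BAGGAGE_KEYS and promotable_attributes are hypothetical names, and the input is assumed to be baggage already parsed into a plain dict.

```python
# Hypothetical allowlist: baggage key -> span attribute name.
ALLOWED_BAGGAGE_KEYS = {"tenant_id": "tenant.id"}


def promotable_attributes(baggage_entries: dict) -> dict:
    """Return only the allowlisted baggage entries, renamed to attribute keys."""
    return {
        ALLOWED_BAGGAGE_KEYS[key]: value
        for key, value in baggage_entries.items()
        if key in ALLOWED_BAGGAGE_KEYS and value
    }
```

Anything not named in the allowlist, such as a user ID a teammate later adds to baggage, is dropped instead of silently becoming a searchable attribute.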

Step 3: Instrument Downstream Services

For services you control, read the baggage value and write it explicitly to the current span. This records the tenant on the service's own spans, so they stay searchable even if a later hop strips the baggage header.

Go:

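import (
    "context"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/baggage"
    "go.opentelemetry.io/otel/trace"
)

func HandleRequest(ctx context.Context) {
    // Read the tenant from the W3C Baggage carried in the request context.
    tenantID := baggage.FromContext(ctx).Member("tenant_id").Value()
    if tenantID == "" {
        return // no tenant in baggage; leave the span untouched
    }

    // Record it on the current span so the backend can filter by tenant.id.
    trace.SpanFromContext(ctx).SetAttributes(attribute.String("tenant.id", tenantID))
}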

Python:

from opentelemetry import baggage, trace

def handle_request():
    # Reads from the current context; returns None if the entry is absent.
    tenant_id = baggage.get_baggage("tenant_id")
    if tenant_id is not None:
        trace.get_current_span().set_attribute("tenant.id", str(tenant_id))

The pattern is identical across languages: read from baggage context, write to span attributes. Ship a small shared middleware in each language your org uses so engineers don’t re-implement this per service.

Handling Async Boundaries

Message queues silently break trace context propagation. When a service publishes to Kafka or RabbitMQ, the baggage needs to travel in message headers, not the HTTP layer. Use the OTel propagator API:

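from opentelemetry import context, trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Publisher (assumes a kafka-python-style client: header values must be bytes)
carrier = {}
inject(carrier)  # Injects traceparent, baggage, etc. from the current context
producer.send("orders", value=payload,
              headers=[(k, v.encode("utf-8")) for k, v in carrier.items()])

# Consumer
carrier = {k: v.decode("utf-8") for k, v in msg.headers}
ctx = extract(carrier)
token = context.attach(ctx)  # Make the extracted baggage current, not just the parent span
try:
    with tracer.start_as_current_span("process-order"):
        process(msg.value)
finally:
    context.detach(token)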

This keeps the trace stitched together across async boundaries and delivers tenant ID intact on the consumer side.

Step 4: Build Tenant-Scoped Dashboards

Once tenant.id is a span attribute in your backend, filtering is one line of query language.

Grafana + Tempo

In TraceQL, filter by tenant instantly:

{ span.tenant.id = "acme-corp" }

Build a dashboard variable $tenant_id driven by a label query or a static list, then parameterize every panel. On-call engineers switch tenants from a dropdown rather than editing queries at 2 AM.

Jaeger

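Jaeger’s UI supports tag-based filtering from the search screen. Enter tenant.id=acme-corp in the Tags field. For programmatic access, the query service's HTTP API (the same one the UI uses, so treat it as unversioned) takes tags as a URL-encoded JSON object, here {"tenant.id":"acme-corp"}:

GET /api/traces?service=api-gateway&tags=%7B%22tenant.id%22%3A%22acme-corp%22%7D&limit=20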
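
Hand-encoding that query string is error-prone, so a small helper can build it. This is a sketch under assumptions: jaeger_traces_url is a hypothetical name, the base URL is a placeholder for your own Jaeger query service, and the tags parameter is assumed to take a URL-encoded JSON object.

```python
import json
from urllib.parse import urlencode


def jaeger_traces_url(base_url: str, service: str, tenant_id: str, limit: int = 20) -> str:
    """Build a Jaeger query URL that filters traces by the tenant.id tag."""
    params = {
        "service": service,
        # Compact JSON object, e.g. {"tenant.id":"acme-corp"}
        "tags": json.dumps({"tenant.id": tenant_id}, separators=(",", ":")),
        "limit": str(limit),
    }
    return base_url.rstrip("/") + "/api/traces?" + urlencode(params)
```

For example, jaeger_traces_url("http://jaeger:16686", "api-gateway", "acme-corp") yields a URL whose tags value is the percent-encoded form of {"tenant.id":"acme-corp"}.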

Tenant-Aware Alerting

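Wire latency alerts to your highest-tier tenants specifically. The query assumes your request-duration histogram carries a tenant_id label, with enterprise-tier standing in for the tenant you want to watch: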

histogram_quantile(0.99,
  sum by (le, tenant_id) (
    rate(http_request_duration_seconds_bucket{tenant_id="enterprise-tier"}[5m])
  )
) > 2.0

Tie alert routing to tenant tier metadata so enterprise customers get faster SLA response than free-tier users.
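
A minimal sketch of that routing decision, assuming tier metadata is available as a simple lookup. The tenant names, receiver names, and acknowledgement targets below are all hypothetical placeholders, not values from any real alerting system.

```python
# Hypothetical tier metadata; in practice this would come from your tenant database.
TENANT_TIERS = {"acme-corp": "enterprise", "hobby-shop": "free"}

# Hypothetical routing table: tier -> (receiver, minutes allowed before acknowledgement).
ROUTES = {
    "enterprise": ("pagerduty-oncall", 5),
    "free": ("email-digest", 240),
}


def route_alert(tenant_id: str) -> tuple:
    """Pick a receiver and response target from the tenant's tier (unknown -> free)."""
    tier = TENANT_TIERS.get(tenant_id, "free")
    return ROUTES[tier]
```

Defaulting unknown tenants to the lowest tier is a design choice; a stricter setup might page on unknown tenants instead, since they can indicate broken propagation.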

Common Pitfalls

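Baggage lost at the service mesh layer. Istio and Linkerd sidecars do not copy headers from a service's inbound request onto its outbound calls; the application or its OTel instrumentation has to do that. Verify that baggage and traceparent survive every hop, including any gateway or proxy that uses a header allowlist, or you’ll lose context silently and never know why.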

Leaving tenant ID only in baggage. Baggage is propagation context, not span data, so a value that is never copied onto a span attribute won’t appear in attribute searches in your backend. Make sure every service (or the shared middleware from Step 3) writes tenant.id to its spans, and keep the transform/tenant step in the collector pipeline.

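PII in baggage. Baggage travels as a plaintext header through every proxy on the path, can end up in access logs, and is sent along to any third-party service your code calls unless you strip it on egress. Stick to non-sensitive identifiers like a tenant slug or UUID. Never put email addresses or customer names in baggage.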
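
A guard at the point where baggage is written can catch obvious mistakes. The sketch below accepts only a lowercase slug or a UUID; the function name, the slug pattern, and the 64-character cap are assumptions chosen for illustration.

```python
import re
import uuid

# Lowercase letters and digits, optionally separated by single hyphens.
_SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_safe_tenant_identifier(value: str) -> bool:
    """Return True if value looks like a tenant slug or a UUID."""
    if not value or len(value) > 64:
        return False
    if _SLUG_RE.fullmatch(value):
        return True
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True
```

An email address or a display name such as "Acme Corp" fails both checks, so it never reaches the header.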

Assuming propagation works without testing. Instrument a test tenant, fire a request, and trace it all the way through your stack before declaring success. Silent propagation failures are common and easy to miss in code review.
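
That check can be automated once the test tenant's trace has been fetched from the backend. The sketch below assumes each span has already been converted to a plain dict with name and attributes keys; that shape and the function name are hypothetical, so adapt them to whatever your backend's API returns.

```python
from typing import Dict, List


def spans_missing_tenant(spans: List[Dict], expected_tenant: str) -> List[str]:
    """Return names of spans whose tenant.id attribute is absent or wrong."""
    missing = []
    for span in spans:
        attributes = span.get("attributes", {})
        if attributes.get("tenant.id") != expected_tenant:
            missing.append(span.get("name", "<unnamed>"))
    return missing
```

An empty result means every hop recorded the tenant; any names returned point at the service where propagation broke.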

The Full Flow

Put it all together and the request lifecycle looks like this:

  1. Client request arrives → ingress validates auth, extracts tenant ID from JWT
  2. Ingress sets baggage: tenant_id=acme-corp and creates a root span with tenant.id
  3. Request traverses services → each reads baggage, writes tenant.id to its local span
  4. Async messages carry context in headers → consumers extract and continue the trace
  5. OTel Collector receives all spans, runs transform/tenant, forwards to backend
  6. On-call engineer opens Grafana, selects tenant from dropdown, sees the full trace

The investment is a few hours of configuration and one small shared library per language. The return is the ability to answer “what is Tenant X experiencing right now?” in seconds rather than minutes — which is exactly what enterprise SaaS customers expect when they’re paying for uptime guarantees.
