Skip to content

Add opentelemetry to Armada#4973

Open
nikola-jokic wants to merge 5 commits into
masterfrom
nikola-jokic/otel
Open

Add opentelemetry to Armada#4973
nikola-jokic wants to merge 5 commits into
masterfrom
nikola-jokic/otel

Conversation

@nikola-jokic

Copy link
Copy Markdown
Contributor

What type of PR is this?

Enhancement

What this PR does / why we need it

Improve observability of armada services and their interactions

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
@greptile-apps

greptile-apps Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds OpenTelemetry distributed tracing to all Armada services (server, scheduler, executor, binoculars, and the various ingesters) using the OTLP exporter, and introduces an HTTP and gRPC gateway instrumentation layer alongside a span attribute policy that enforces allow-lists, deny-lists, and cardinality guardrails to prevent PII leakage.

  • Core observability package (internal/common/observability): new InitOTel/ShutdownOTel lifecycle functions, ObservabilityConfig with validation and defaults, SpanAttributePolicy for attribute sanitization, and a sync.RWMutex-protected global tracer provider.
  • Service wiring: every main.go and app entry point gets a consistent InitOTel + deferred ShutdownOTel block; gRPC servers gain otelgrpc.NewServerHandler() and REST gateways are wrapped with otelhttp; Logger.WithContext() threads trace/span IDs into structured log fields.
  • Test infrastructure: InMemoryTracerCollector and TraceEvidence types added to the testsuite package so end-to-end trace coverage can be validated and written as evidence files.

Confidence Score: 4/5

The change is broadly safe to merge; the only concrete defect is that the server's internal gRPC client connection registers two OTel stats handlers simultaneously, producing duplicate client spans.

Every call made by the server's internal gRPC client (createApiConnection) will produce two client-side OTel spans per RPC because grpc.WithStatsHandler(otelgrpc.NewClientHandler()) is set both inside CreateApiConnectionWithCallOptions (for all callers) and again as an explicit additional option in server.go. gRPC v1.56+ invokes all registered stats handlers, so both fire. This pollutes traces with spurious duplicates and may skew sampling counts in production.

internal/server/server.go (duplicate stats handler in createApiConnection) and pkg/client/connection.go (new unconditional handler that callers need to be aware of).

Important Files Changed

Filename Overview
internal/common/observability/lifecycle.go Initializes global OTel tracer provider; now properly switches on protocol (grpc vs http/protobuf) and protects the global provider with a RWMutex — previous concerns addressed.
internal/common/observability/config.go Defines ObservabilityConfig with validation; sampler, protocol, and endpoint checks are thorough. WithDefaults fills in non-zero defaults before validating.
internal/common/observability/attribute_policy.go Implements allow-list, deny-list, and cardinality guardrails for span attributes. Post-start attribute additions are logged at OnEnd but cannot be mutated — limitation is clearly documented.
pkg/client/connection.go Unconditionally adds otelgrpc.NewClientHandler() to all connections created via CreateApiConnectionWithCallOptions; callers who also pass WithStatsHandler as an additional option will register duplicate handlers.
internal/server/server.go createApiConnection passes grpc.WithStatsHandler(otelgrpc.NewClientHandler()) as an additional dial option; combined with the same handler added inside CreateApiConnectionWithCallOptions, every RPC creates two client spans.
internal/testsuite/trace_evidence.go New trace evidence collection for the test suite; NewInMemoryTracerCollector is the canonical constructor with NewInMemoryTraceCollector kept as a deprecated alias.
internal/common/grpc/grpc.go Adds otelgrpc.NewServerHandler() as a stats handler on the gRPC server; also migrates prometheus.MustRegister to prometheus.DefaultRegisterer.MustRegister for testability.
internal/common/grpc/gateway.go Wraps REST gateway handler with otelhttp.NewHandler; refactors the stripPrefix / non-stripPrefix branches into a single mux.Handle call.
internal/common/logging/logger.go Adds WithContext() to extract trace_id, span_id, and x-request-id from the OTel span context and propagate them into log fields.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Main as Service main()
    participant InitOTel as observability.InitOTel
    participant TP as sdktrace.TracerProvider
    participant GrpcSrv as gRPC Server (otelgrpc handler)
    participant Gateway as REST Gateway (otelhttp)
    participant Collector as OTLP Collector
    participant Jaeger as Jaeger UI

    Main->>InitOTel: InitOTel(cfg)
    InitOTel->>TP: NewTracerProvider(SpanAttributePolicyProcessor, batcher, sampler)
    InitOTel->>TP: otel.SetTracerProvider(tp)

    GrpcSrv->>TP: Start span (otelgrpc.NewServerHandler)
    GrpcSrv->>TP: End span → SpanAttributePolicyProcessor.OnEnd (violations logged)
    Gateway->>TP: Start/End span (otelhttp.NewHandler)

    TP-->>Collector: BatchSpanProcessor → OTLP export (http/protobuf or grpc)
    Collector-->>Jaeger: Forward via otlp/jaeger exporter

    Main->>Main: defer ShutdownOTel(ctx)
    Main->>TP: tp.Shutdown (flush pending spans)
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Main as Service main()
    participant InitOTel as observability.InitOTel
    participant TP as sdktrace.TracerProvider
    participant GrpcSrv as gRPC Server (otelgrpc handler)
    participant Gateway as REST Gateway (otelhttp)
    participant Collector as OTLP Collector
    participant Jaeger as Jaeger UI

    Main->>InitOTel: InitOTel(cfg)
    InitOTel->>TP: NewTracerProvider(SpanAttributePolicyProcessor, batcher, sampler)
    InitOTel->>TP: otel.SetTracerProvider(tp)

    GrpcSrv->>TP: Start span (otelgrpc.NewServerHandler)
    GrpcSrv->>TP: End span → SpanAttributePolicyProcessor.OnEnd (violations logged)
    Gateway->>TP: Start/End span (otelhttp.NewHandler)

    TP-->>Collector: BatchSpanProcessor → OTLP export (http/protobuf or grpc)
    Collector-->>Jaeger: Forward via otlp/jaeger exporter

    Main->>Main: defer ShutdownOTel(ctx)
    Main->>TP: tp.Shutdown (flush pending spans)
Loading

Reviews (3): Last reviewed commit: "fix path" | Re-trigger Greptile

Comment thread internal/common/observability/lifecycle.go Outdated
Comment thread internal/common/observability/attribute_policy.go
Comment thread internal/testsuite/trace_evidence.go
Comment thread internal/common/observability/lifecycle.go Outdated
@datadog-armadaproject

datadog-armadaproject Bot commented Jun 19, 2026

Copy link
Copy Markdown

Pipelines

⚠️ Warnings

🚦 2 Pipeline jobs failed

CI | All jobs succeeded   View in Datadog   GitHub Actions

CI | test / Golang Integration Tests   View in Datadog   GitHub Actions

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 53aefc4 | Docs | Give us feedback!

Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant