Add opentelemetry to Armada #4973
Signed-off-by: Nikola Jokic <jokicnikola07@gmail.com>
f0f80ca to 39e0cbb
Greptile Summary

This PR adds OpenTelemetry distributed tracing to all Armada services (server, scheduler, executor, binoculars, and the various ingesters) using the OTLP exporter, and introduces an HTTP and gRPC gateway instrumentation layer alongside a span attribute policy that enforces allow-lists, deny-lists, and cardinality guardrails to prevent PII leakage.
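To make the attribute policy idea concrete, here is a minimal standard-library sketch of how an allow-list, a deny-list, and a per-key cardinality cap might be checked together. It is not the PR's `SpanAttributePolicyProcessor`: the type `AttributePolicy`, its fields, the `Check` method, and the attribute keys used in `main` are all hypothetical names chosen for illustration. Like the processor described in the review, the sketch only reports violations; it does not strip attributes.

```go
package main

import (
	"fmt"
	"sort"
)

// AttributePolicy is a hypothetical model of a span attribute policy.
// It is an illustration only, not the type used in this PR.
type AttributePolicy struct {
	allowed     map[string]bool            // keys that may appear on spans
	denied      map[string]bool            // keys that must never appear (for example PII)
	maxDistinct int                        // cap on distinct values per allowed key
	seen        map[string]map[string]bool // values observed so far, per key
}

// NewAttributePolicy builds a policy from plain key lists.
func NewAttributePolicy(allowed, denied []string, maxDistinct int) *AttributePolicy {
	p := &AttributePolicy{
		allowed:     map[string]bool{},
		denied:      map[string]bool{},
		maxDistinct: maxDistinct,
		seen:        map[string]map[string]bool{},
	}
	for _, k := range allowed {
		p.allowed[k] = true
	}
	for _, k := range denied {
		p.denied[k] = true
	}
	return p
}

// Check returns one violation string per offending attribute.
// Keys are visited in sorted order so the result is deterministic.
func (p *AttributePolicy) Check(attrs map[string]string) []string {
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var violations []string
	for _, k := range keys {
		switch {
		case p.denied[k]: // the deny-list takes precedence over the allow-list
			violations = append(violations, "denied: "+k)
		case !p.allowed[k]:
			violations = append(violations, "not allowed: "+k)
		default:
			values := p.seen[k]
			if values == nil {
				values = map[string]bool{}
				p.seen[k] = values
			}
			v := attrs[k]
			if !values[v] && len(values) >= p.maxDistinct {
				// A new value would exceed the cardinality cap for this key.
				violations = append(violations, "cardinality: "+k)
			} else {
				values[v] = true
			}
		}
	}
	return violations
}

func main() {
	p := NewAttributePolicy([]string{"rpc.method", "queue"}, []string{"user.email"}, 2)
	fmt.Println(p.Check(map[string]string{
		"rpc.method": "Submit",
		"user.email": "someone@example.com",
	}))
}
```

A real span processor would run a check like this when a span ends and log the result, which is the behavior the sequence diagram attributes to `OnEnd`.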
Confidence Score: 4/5

The change is broadly safe to merge; the only concrete defect is that the server's internal gRPC client connection registers two OTel stats handlers simultaneously, producing duplicate client spans.

Every call made by the server's internal gRPC client (createApiConnection) will produce two client-side OTel spans per RPC because grpc.WithStatsHandler(otelgrpc.NewClientHandler()) is set both inside CreateApiConnectionWithCallOptions (for all callers) and again as an explicit additional option in server.go. gRPC v1.56+ invokes all registered stats handlers, so both fire. This pollutes traces with spurious duplicates and may skew sampling counts in production.

Files needing attention: internal/server/server.go (duplicate stats handler in createApiConnection) and pkg/client/connection.go (new unconditional handler that callers need to be aware of).
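The duplicate-span defect can be reproduced in miniature without gRPC. The sketch below is a standard-library model only: `statsHandler`, `dialOption`, `createConnection`, and `spansPerCall` are hypothetical stand-ins for the gRPC stats handler interface, `grpc.DialOption`, the shared connection helper, and a caller such as `server.go`. It assumes, as the review states, that every registered handler is invoked on each RPC.

```go
package main

import "fmt"

// statsHandler is a simplified stand-in for a gRPC stats handler.
type statsHandler interface {
	HandleRPC(method string)
}

// spanRecorder appends one entry per invocation, modelling one client span.
type spanRecorder struct {
	spans *[]string
}

func (r spanRecorder) HandleRPC(method string) {
	*r.spans = append(*r.spans, method)
}

// conn and dialOption model a client connection and its options.
type conn struct {
	handlers []statsHandler
}

type dialOption func(*conn)

func withStatsHandler(h statsHandler) dialOption {
	return func(c *conn) { c.handlers = append(c.handlers, h) }
}

// createConnection models a shared helper that always installs the tracing
// handler and then applies whatever extra options the caller passes.
func createConnection(tracing statsHandler, extra ...dialOption) *conn {
	c := &conn{}
	opts := append([]dialOption{withStatsHandler(tracing)}, extra...)
	for _, opt := range opts {
		opt(c)
	}
	return c
}

// invoke calls every registered handler, so duplicates fire twice.
func (c *conn) invoke(method string) {
	for _, h := range c.handlers {
		h.HandleRPC(method)
	}
}

// spansPerCall reports how many spans one RPC produces, depending on whether
// the caller registers the tracing handler again on top of the helper's own.
func spansPerCall(callerAlsoRegisters bool) int {
	var spans []string
	tracing := spanRecorder{spans: &spans}
	var extra []dialOption
	if callerAlsoRegisters {
		extra = append(extra, withStatsHandler(tracing))
	}
	c := createConnection(tracing, extra...)
	c.invoke("/example.Service/Method") // illustrative method name
	return len(spans)
}

func main() {
	fmt.Println("caller registers again:", spansPerCall(true))
	fmt.Println("helper only:", spansPerCall(false))
}
```

Under this model the fix is to register the handler in exactly one place: either the shared helper owns it and callers stop passing it, or the helper leaves it out and each caller opts in.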
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Main as Service main()
participant InitOTel as observability.InitOTel
participant TP as sdktrace.TracerProvider
participant GrpcSrv as gRPC Server (otelgrpc handler)
participant Gateway as REST Gateway (otelhttp)
participant Collector as OTLP Collector
participant Jaeger as Jaeger UI
Main->>InitOTel: InitOTel(cfg)
InitOTel->>TP: NewTracerProvider(SpanAttributePolicyProcessor, batcher, sampler)
InitOTel->>TP: otel.SetTracerProvider(tp)
GrpcSrv->>TP: Start span (otelgrpc.NewServerHandler)
GrpcSrv->>TP: End span → SpanAttributePolicyProcessor.OnEnd (violations logged)
Gateway->>TP: Start/End span (otelhttp.NewHandler)
TP-->>Collector: BatchSpanProcessor → OTLP export (http/protobuf or grpc)
Collector-->>Jaeger: Forward via otlp/jaeger exporter
Main->>Main: defer ShutdownOTel(ctx)
Main->>TP: tp.Shutdown (flush pending spans)
Reviews (3): Last reviewed commit: "fix path"
What type of PR is this?
Enhancement
What this PR does / why we need it
Improve observability of Armada services and their interactions.