
feat(plane-enterprise): native OpenTelemetry APM support (v2.5.1)#241

Open
pratapalakshmi wants to merge 1 commit into master from chore/add/support/otel

@pratapalakshmi pratapalakshmi commented Jun 3, 2026

Collaborator

Summary

Adds first-class OpenTelemetry APM support to the plane-enterprise chart so self-hosters can enable backend tracing/metrics/log-correlation via values instead of hand-rolled extraEnv. Pairs with the feat/otel-api-observability work in plane-ee (which adds the configure_otel() bootstrap to the Django backend). Chart version bumped 2.5.0 → 2.5.1.

What changed

  • values.yaml — new observability.otel block (off by default) with a nested collector sub-block for an optional bundled OTLP collector.
  • templates/config-secrets/app-env.yaml — when observability.otel.enabled, injects OTEL_* into the backend -app-vars ConfigMap. Scoped to the six workloads that envFrom it (api, worker, beat-worker, automation-consumer, outbox-poller, migrator) — no frontend pods touched. Auth headers (if any) go into -app-secrets. When endpoint is blank and the bundled collector is enabled, the backend auto-targets the in-cluster collector Service.
  • templates/observability/otel-collector.yaml — bundled collector (ConfigMap/Service/Deployment), gated on observability.otel.collector.enabled, with an overridable config defaulting to an OTLP-in → debug-out pipeline.
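The "OTLP-in → debug-out" default mentioned above can be pictured with a collector configuration along these lines. This is a minimal sketch and not copied from the chart: the listen addresses, the `batch` processor settings, and the `detailed` verbosity are assumptions, and only a traces pipeline is shown.

```yaml
# Sketch of an OTLP-in -> debug-out collector pipeline (assumed shape, not the chart's actual default).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # OTLP gRPC
      http:
        endpoint: 0.0.0.0:4318   # OTLP HTTP
processors:
  batch: {}                      # group spans before export
exporters:
  debug:
    verbosity: detailed          # print received telemetry to the collector's stdout
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```

With a pipeline like this, spans only show up in the collector pod's logs, which is why it suits verification rather than production use. Metrics and logs would each need their own pipeline entry.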

How to enable

observability:
  otel:
    enabled: true
    serviceName: plane-api
    endpoint: ""          # blank → auto-targets the bundled collector below
    tracesSamplerArg: "1.0"
    collector:
      enabled: true       # deploy the in-cluster OTLP collector

To use an external collector instead, set observability.otel.endpoint and leave collector.enabled: false.
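For the external-collector case, the values might look roughly like the following. The endpoint URL is hypothetical, and the lower `tracesSamplerArg` assumes a ratio-based sampler is in effect.

```yaml
observability:
  otel:
    enabled: true
    serviceName: plane-api
    # Hypothetical address of an existing OTLP gRPC endpoint outside this chart.
    endpoint: "http://otel-gateway.observability.svc.cluster.local:4317"
    tracesSamplerArg: "0.1"   # assumption: interpreted as a 10% sampling ratio
    collector:
      enabled: false          # do not deploy the bundled collector
```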

Testing

Deployed to a live EKS cluster (namespace gpotel, isolated DB, backend image built from feat/otel-api-observability) and drove real traffic.

Render gating — helm template + helm lint:

| Scenario | Result |
| --- | --- |
| helm lint | 1 chart(s) linted, 0 chart(s) failed |
| OTEL disabled (chart defaults) | 0 collector resources, 0 OTEL_* env |
| OTEL on + collector off | no collector resources; OTEL_EXPORTER_OTLP_ENDPOINT: "" (external-collector mode) |
| OTEL on + collector on | collector deployed; backend endpoint auto-wired to http://<release>-otel-collector.<ns>.svc.cluster.local:4317 |

Live spans received by the bundled collector (kubectl logs deploy/plane-gpotel-otel-collector):

# HTTP server spans (DjangoInstrumentor) from browsing the app
Name: GET   -> http.method: Str(GET)   -> http.status_code: Int(200)
Name: POST  -> http.method: Str(POST)

# Celery spans — apply_async (enqueue) linked to run (execution) via
# traceparent propagated through RabbitMQ
celery.action: Str(apply_async)  celery.task_name: ...batched_search_update_task...
celery.action: Str(run)          celery.task_name: ...batched_search_update_task...

# Postgres child spans nested under the above
Name: SELECT   -> db.system: Str(postgresql)

# all tagged
service.name: Str(plane-api)

Log ↔ trace correlation: worker/Celery logs carry populated trace_id (32-hex), confirming the TraceContextFilter path. (Follow-up, app-side not chart-side: the API request access log currently emits empty trace_id/span_id because it logs outside the active span context — tracked against the plane-ee branch, not this chart.)

Notes

  • Bundled collector defaults to the debug exporter (verification only) — set observability.otel.collector.config to export to a real backend (Tempo/Datadog/Honeycomb), or disable it and use observability.otel.endpoint.
  • Collector image pinned to otel/opentelemetry-collector-contrib:0.115.1.
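As an illustration of the first note, an override that forwards traces to a real backend could look something like this. It is a sketch under two assumptions: that `observability.otel.collector.config` accepts a complete collector configuration as a YAML map, and that a Tempo-style OTLP gRPC endpoint exists at the hypothetical address shown.

```yaml
observability:
  otel:
    enabled: true
    collector:
      enabled: true
      # Assumption: `config` replaces the default OTLP-in -> debug-out pipeline wholesale.
      config:
        receivers:
          otlp:
            protocols:
              grpc:
                endpoint: 0.0.0.0:4317
              http:
                endpoint: 0.0.0.0:4318
        processors:
          batch: {}
        exporters:
          otlp:
            endpoint: tempo.monitoring.svc.cluster.local:4317   # hypothetical backend address
            tls:
              insecure: true   # plaintext inside the cluster; configure TLS for anything else
        service:
          pipelines:
            traces:
              receivers: [otlp]
              processors: [batch]
              exporters: [otlp]
```

Vendor backends such as Datadog or Honeycomb typically need their own exporter settings and credentials, which are outside what this PR describes.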

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added OpenTelemetry observability integration with optional in-cluster collector support for enhanced monitoring capabilities.
  • Chores

    • Bumped Helm chart version to 2.5.1.

Add first-class OTEL configuration to the backend instead of relying on
ad-hoc extraEnv. Bumps chart version to 2.5.1.

- values.yaml: new `observability.otel` block (off by default) with a
  nested `collector` sub-block for an optional bundled OTLP collector.
- config-secrets/app-env.yaml: when `observability.otel.enabled`, inject
  OTEL_* into the backend `-app-vars` ConfigMap (scoped to the six
  workloads that envFrom it: api, worker, beat-worker, automation-consumer,
  outbox-poller, migrator). Auth headers go into `-app-secrets`. When the
  endpoint is blank and the bundled collector is enabled, the backend
  auto-targets the in-cluster collector Service.
- templates/observability/otel-collector.yaml: bundled collector
  (ConfigMap/Service/Deployment), gated on
  `observability.otel.collector.enabled`, with an overridable config that
  defaults to an OTLP-in -> debug-out pipeline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
coderabbitai Bot commented Jun 3, 2026

Contributor


Walkthrough

This PR adds OpenTelemetry observability support to the Plane Enterprise Helm chart. It introduces OTEL configuration defaults, environment variable injection for application services, and an optional in-cluster OpenTelemetry Collector deployment with OTLP gRPC/HTTP endpoints. The chart version is bumped to 2.5.1.

Changes

OpenTelemetry Observability Integration

| Layer / File(s) | Summary |
| --- | --- |
| Chart version and OTEL configuration schema: charts/plane-enterprise/Chart.yaml, charts/plane-enterprise/values.yaml | Chart version incremented to 2.5.1. New observability.otel configuration block added to values with defaults for OTLP endpoint, protocol, trace sampling (traces sampler and sampler arg), optional resource attributes and auth headers, and bundled collector settings (enable flag, image, replica count, resources, and Kubernetes scheduling metadata). |
| Application OTEL environment variables: charts/plane-enterprise/templates/config-secrets/app-env.yaml | Secret now conditionally includes OTEL_EXPORTER_OTLP_HEADERS when headers are configured. ConfigMap now conditionally populates OTEL environment variables (OTEL_ENABLED, service name, endpoint selection between direct or collector, protocol, trace sampling settings, optional resource attributes) for Django services when observability is enabled. |
| Bundled OpenTelemetry Collector deployment: charts/plane-enterprise/templates/observability/otel-collector.yaml | New template conditionally renders a complete in-cluster Collector deployment: ConfigMap with user-supplied or default OTLP receiver/batching/debug-exporter configuration, Service exposing OTLP gRPC (4317) and HTTP (4318) endpoints, and Deployment running the collector with configurable image (otel/opentelemetry-collector-contrib:0.115.1), replicas, resource requests/limits, and pod scheduling. |
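To make the environment-variable injection concrete, the rendered backend ConfigMap might contain entries roughly like the ones below when both OTEL and the bundled collector are enabled. This is an illustrative sketch and not actual chart output: the release name `plane`, the namespace `plane-ns`, the `OTEL_ENABLED` value format, and the `OTEL_SERVICE_NAME` and `OTEL_TRACES_SAMPLER_ARG` key names (standard OpenTelemetry SDK variables) are assumptions.

```yaml
# Hypothetical rendering for release "plane" in namespace "plane-ns".
apiVersion: v1
kind: ConfigMap
metadata:
  name: plane-app-vars
  namespace: plane-ns
data:
  OTEL_ENABLED: "1"                  # assumed value format
  OTEL_SERVICE_NAME: "plane-api"     # assumed key name
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://plane-otel-collector.plane-ns.svc.cluster.local:4317"
  OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
  OTEL_TRACES_SAMPLER_ARG: "1.0"     # assumed key name
```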

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🐰 Traces dance through Kubernetes skies,
OTEL collectors open watchful eyes,
From endpoint to sampler, all configured right,
Observability blooms—your apps shine bright!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding native OpenTelemetry APM support to the plane-enterprise Helm chart with a version bump to 2.5.1. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@charts/plane-enterprise/templates/config-secrets/app-env.yaml`:
- Line 122: The OTEL_EXPORTER_OTLP_ENDPOINT value is hardcoded to
"cluster.local" which breaks clusters with custom domains; update the
OTEL_EXPORTER_OTLP_ENDPOINT entry to construct the service FQDN using the same
pattern used elsewhere by interpolating {{ .Release.Name }}, {{
.Release.Namespace }} and the parameterized cluster domain via {{
.Values.env.default_cluster_domain | default "cluster.local" }} so the endpoint
resolves correctly for custom cluster domains while preserving the collector
host and port.
- Around line 121-126: The OTEL_EXPORTER_OTLP_ENDPOINT currently hardcodes port
4317 (gRPC) regardless of .Values.observability.otel.protocol, causing
protocol/port mismatch; update the template that sets
OTEL_EXPORTER_OTLP_ENDPOINT to choose port based on the protocol
(OTEL_EXPORTER_OTLP_PROTOCOL) — use 4317 for "grpc" and 4318 for "http/protobuf"
(or map other protocol values accordingly) when
.Values.observability.otel.collector.enabled is true, keeping the same host
formation ({{ .Release.Name }}-otel-collector.{{ .Release.Namespace
}}.svc.cluster.local) so the exporter speaks the correct port for the configured
protocol.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dc26bcbe-f3cf-4ff4-8e2b-d3198a45510e

📥 Commits

Reviewing files that changed from the base of the PR and between 8ad559d and e286fd4.

📒 Files selected for processing (4)
  • charts/plane-enterprise/Chart.yaml
  • charts/plane-enterprise/templates/config-secrets/app-env.yaml
  • charts/plane-enterprise/templates/observability/otel-collector.yaml
  • charts/plane-enterprise/values.yaml

Comment on lines +121 to +126
{{- else if .Values.observability.otel.collector.enabled }}
OTEL_EXPORTER_OTLP_ENDPOINT: "http://{{ .Release.Name }}-otel-collector.{{ .Release.Namespace }}.svc.cluster.local:4317"
{{- else }}
OTEL_EXPORTER_OTLP_ENDPOINT: ""
{{- end }}
OTEL_EXPORTER_OTLP_PROTOCOL: {{ .Values.observability.otel.protocol | default "grpc" | quote }}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Auto-target endpoint hardcodes gRPC port 4317, ignoring protocol.

When endpoint is empty and the bundled collector is enabled, the backend is always wired to port 4317 (gRPC). If the user sets observability.otel.protocol: http/protobuf, the exporter (line 126 emits OTEL_EXPORTER_OTLP_PROTOCOL) will speak HTTP/protobuf against the gRPC port and exports will fail silently. The collector listens on both ports, so the client must select the matching one.

🐛 Proposed fix: pick port by protocol
   {{- else if .Values.observability.otel.collector.enabled }}
-  OTEL_EXPORTER_OTLP_ENDPOINT: "http://{{ .Release.Name }}-otel-collector.{{ .Release.Namespace }}.svc.cluster.local:4317"
+  {{- if eq (.Values.observability.otel.protocol | default "grpc") "grpc" }}
+  OTEL_EXPORTER_OTLP_ENDPOINT: "http://{{ .Release.Name }}-otel-collector.{{ .Release.Namespace }}.svc.cluster.local:4317"
+  {{- else }}
+  OTEL_EXPORTER_OTLP_ENDPOINT: "http://{{ .Release.Name }}-otel-collector.{{ .Release.Namespace }}.svc.cluster.local:4318"
+  {{- end }}
   {{- else }}

{{- if .Values.observability.otel.endpoint }}
OTEL_EXPORTER_OTLP_ENDPOINT: {{ .Values.observability.otel.endpoint | quote }}
{{- else if .Values.observability.otel.collector.enabled }}
OTEL_EXPORTER_OTLP_ENDPOINT: "http://{{ .Release.Name }}-otel-collector.{{ .Release.Namespace }}.svc.cluster.local:4317"


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Hardcoded cluster.local breaks custom cluster domains.

Elsewhere in this same template the cluster domain is parameterized (e.g. Line 36 and Line 97 use {{ .Values.env.default_cluster_domain | default "cluster.local" }}). The auto-target endpoint here pins cluster.local, so clusters with a custom DNS domain won't resolve the collector Service.

♻️ Align with the existing pattern
-  OTEL_EXPORTER_OTLP_ENDPOINT: "http://{{ .Release.Name }}-otel-collector.{{ .Release.Namespace }}.svc.cluster.local:4317"
+  OTEL_EXPORTER_OTLP_ENDPOINT: "http://{{ .Release.Name }}-otel-collector.{{ .Release.Namespace }}.svc.{{ .Values.env.default_cluster_domain | default "cluster.local" }}:4317"
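The two review suggestions touch the same line, so they could be combined. One possible shape is sketched below. It is not the chart's actual code: the local variables `$otel`, `$domain`, and `$port` are hypothetical, and any protocol other than `grpc` is assumed to use the HTTP port.

```yaml
  {{- /* Sketch only: picks the port by protocol and parameterizes the cluster domain. */}}
  {{- $otel := .Values.observability.otel }}
  {{- $domain := .Values.env.default_cluster_domain | default "cluster.local" }}
  {{- /* Sprig ternary: first value when the condition is true, second otherwise. */}}
  {{- $port := ternary "4317" "4318" (eq ($otel.protocol | default "grpc") "grpc") }}
  {{- if $otel.endpoint }}
  OTEL_EXPORTER_OTLP_ENDPOINT: {{ $otel.endpoint | quote }}
  {{- else if $otel.collector.enabled }}
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://{{ .Release.Name }}-otel-collector.{{ .Release.Namespace }}.svc.{{ $domain }}:{{ $port }}"
  {{- else }}
  OTEL_EXPORTER_OTLP_ENDPOINT: ""
  {{- end }}
  OTEL_EXPORTER_OTLP_PROTOCOL: {{ $otel.protocol | default "grpc" | quote }}
```

Computing the port once keeps the host pattern in a single place, so a later change to the Service name or domain handling does not need to be repeated per protocol branch.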
