Fix node-exporter port conflict by templatizing hardcoded port 9100#739

Open
silokimmo wants to merge 1 commit into main from fix-node-exporter-port-template

@silokimmo

Summary

  • Replaces 4 hardcoded 9100 port references in the node-exporter DaemonSet template with {{ .Values.services.nodeExporter.metrics }}
  • Default port remains 9100 (no change for clusters without conflicts)
  • Clusters with a port conflict (e.g. AMD fleet-observability or systemd prometheus-node-exporter occupying 9100) can now override via cluster-values:
apps:
  otel-lgtm-stack:
    helmParameters:
      - name: services.nodeExporter.metrics
        value: "9101"
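
For illustration, the templated DaemonSet might look roughly like the excerpt below. This is a sketch only: the PR states that four hardcoded `9100` references were replaced, but the specific fields shown here (listen address, `containerPort`, `hostPort`, probe port), the use of `hostNetwork`, and the container name are assumptions, not the actual contents of the chart.

```yaml
# Hypothetical excerpt of the node-exporter DaemonSet template.
# Field layout is assumed; only the value path comes from this PR.
spec:
  template:
    spec:
      hostNetwork: true  # assumed: explains the clash with a host-level node_exporter
      containers:
        - name: node-exporter
          args:
            # node_exporter's listen flag, now driven by the chart value
            - "--web.listen-address=:{{ .Values.services.nodeExporter.metrics }}"
          ports:
            - name: metrics
              containerPort: {{ .Values.services.nodeExporter.metrics }}
              hostPort: {{ .Values.services.nodeExporter.metrics }}
          livenessProbe:
            httpGet:
              path: /
              port: {{ .Values.services.nodeExporter.metrics }}
```

With the override above applied, every occurrence would render as `9101`, so the DaemonSet and the Service stay on the same port.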

Background

On nodes where a host-level node_exporter already occupies port 9100 (AMD fleet-observability stack, Ubuntu prometheus-node-exporter systemd package), the cluster-forge DaemonSet pod enters CrashLoopBackOff immediately after bootstrap. Because the DaemonSet template hardcoded 9100 in 4 places, a per-cluster port override reached the Service but never the DaemonSet, so the override had no effect on the conflict.

Affected environments: rck-g03-mi350x, workload-dev/tw016.

Test plan

  • Deployed on ephemeral OCI test VM with bloom (fix-node-exporter-port-template branch)
  • Replicated port conflict using netcat to reserve port 9100 → confirmed CrashLoopBackOff
  • Applied helmParameters override in cluster-values → node-exporter came up Running on port 9101
  • Validated on workload-dev (real environment, tw016 had 1974 restarts) → both pods Running after DaemonSet delete + ArgoCD sync

Jira: EAI-6663

🤖 Generated with Claude Code

Lines 98, 106, 113, 124 had hardcoded 9100 while the Service (lines
37-38) already used .Values.services.nodeExporter.metrics. This caused
CrashLoopBackOff on clusters where AMD fleet-observability or a host
systemd node_exporter pre-occupies port 9100 (rck-g03, workload-dev
tw016). The cluster-values port override had no effect on the DaemonSet.

Default in values.yaml remains 9100 — no change to other clusters.
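
As a rough sketch, the corresponding default in values.yaml would have this shape. The key path is taken from the template reference `.Values.services.nodeExporter.metrics`; any sibling keys and the exact formatting in the real file are not shown in this PR and are omitted here.

```yaml
# Assumed shape of the chart default; only this key path is confirmed by the PR.
services:
  nodeExporter:
    metrics: 9100  # unchanged default; override per cluster only where 9100 is taken
```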

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@silokimmo silokimmo requested a review from a team as a code owner June 9, 2026 08:09