[CONTP-1448] Add Windows node support #3154

Open: zhuminyi wants to merge 1 commit into main from minyi/windows-operator-support

@zhuminyi zhuminyi commented Jun 17, 2026

What does this PR do?

Adds opt-in Windows node Agent support via spec.override.windowsNodeAgent. When configured, the Operator creates a Windows-targeted Agent DaemonSet alongside the existing Linux Agent DaemonSet.

Motivation

Enable Datadog Agent deployment on mixed Linux/Windows Kubernetes clusters managed by the Datadog Operator.

Key Changes:

  1. internal/controller/datadogagent/component/agent/windows.go
    New Windows Agent DaemonSet builder and Windows-specific pod sanitization. Handles Windows node targeting, init config, safe container allowlisting, Linux-only field stripping, -servercore image handling, non-local APM/DogStatsD traffic, and Windows log collection mounts.

  2. internal/controller/datadogagentinternal/controller_reconcile_windows_agent.go
    New Windows Agent reconciler. Handles opt-in behavior, unsupported FIPS/EDS cleanup, disabled cleanup, Windows container selection, feature filtering, pod sanitization, image normalization, intake reachability, log mounts, and DaemonSet create/update.

  3. internal/controller/datadogagentinternal/controller_reconcile_agent.go
    Chains Windows reconciliation into the existing node Agent flow so Linux and Windows DaemonSets stay in sync across normal, disabled, and EDS paths. reconcileV2WindowsAgent is called on every DDAI reconcile; for a Linux-only DDA, this means ensureWindowsDaemonSetAbsent is called on each reconcile.

  4. API/status generated files
    Adds status.agentWindows and the agent-windows printer column for both DatadogAgent and DatadogAgentInternal. Also fixes status handling so AgentWindows is preserved and compared correctly.

  5. examples/datadogagent/datadog-agent-with-windows-nodes.yaml
    Adds an example Windows configuration and documents current limitations, especially that local APM/DogStatsD services do not yet route to Windows pods.
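
The opt-in and cleanup behavior described for the Windows reconciler can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the code in this PR: the type `windowsInputs`, its field names, the `windowsAction` enum, and the reason strings are all hypothetical. Only the listed conditions (opt-in, disabled, FIPS, EDS) come from the description.

```go
package main

// windowsAction is a hypothetical enum used only in this sketch.
type windowsAction int

const (
	ensureAbsent   windowsAction = iota // delete the Windows DaemonSet if it exists
	createOrUpdate                      // build and apply the Windows DaemonSet
)

// windowsInputs collects the conditions named in the PR description.
// The field names are illustrative assumptions.
type windowsInputs struct {
	optedIn  bool // spec.override.windowsNodeAgent is present
	disabled bool // the Windows component is explicitly disabled
	useFIPS  bool // global.useFIPSAgent is set
	usesEDS  bool // ExtendedDaemonSet is in use
}

// decideWindowsAction returns what the reconciler should do with the
// Windows DaemonSet, plus a short reason when it must be absent.
func decideWindowsAction(in windowsInputs) (windowsAction, string) {
	switch {
	case !in.optedIn:
		return ensureAbsent, "not opted in"
	case in.disabled:
		return ensureAbsent, "disabled"
	case in.useFIPS:
		return ensureAbsent, "FIPS is not supported"
	case in.usesEDS:
		return ensureAbsent, "EDS is not supported"
	default:
		return createOrUpdate, ""
	}
}
```

With this shape, a Linux-only DDA always falls into the first case, which matches the note that the cleanup path runs on every reconcile for such clusters.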

Gaps addressed

| # | Gap | Root cause | Addressed by |
|---|-----|------------|--------------|
| 1 | Windows taint not tolerated | The DaemonSet ships tolerations: []. Windows nodes are commonly tainted with node.kubernetes.io/os=windows:NoSchedule (GKE, for example, applies this taint automatically), so no Agent pod is scheduled on them. | NewDefaultWindowsAgentPodTemplateSpec adds the node.kubernetes.io/os=windows:NoSchedule toleration and the kubernetes.io/os=windows nodeSelector (windows.go). |
| 2 | No Windows DaemonSet | The operator generates a single DaemonSet with no OS awareness. Even if the taint were tolerated, the Linux pod spec would fail on all remaining gaps. | New reconcileV2WindowsAgent builds a dedicated datadog-agent-windows DaemonSet, chained after the Linux DaemonSet on every reconcile path (controller_reconcile_windows_agent.go, controller_reconcile_agent.go). |
| 3 | Linux-only container image | The operator always uses agent:X.Y.Z, a Linux binary. Windows requires a separate image (agent:X.Y.Z-servercore) compiled for Windows APIs. No Windows image reference exists anywhere in the codebase. | GetLatestWindowsAgentImage + EnsureWindowsServercoreImage coerce the default agent image to the -servercore tag (pkg/images/images.go, windows.go). |
| 4 | Linux-specific securityContext | Linux capabilities (SYS_ADMIN, NET_RAW, etc.), seccomp profiles, and readOnlyRootFilesystem are all rejected by the Windows container runtime. | StripLinuxOnlySettings (allowlist) clears capabilities, seccomp, SELinux, AppArmor, readOnlyRootFilesystem, and hostPID/hostIPC, and uses a nil pod SecurityContext (windows.go). |
| 5 | Linux-only hostPath volumes | /proc, /sys/fs/cgroup, /etc/passwd, and Unix sockets at /var/run do not exist on Windows. The pod would fail to start. | The StripLinuxOnlySettings allowlist drops every mount/volume with a /-prefixed path and every `*SOCKET*`/DOCKER_HOST/DD_VSOCK_ADDR env var; AddWindowsLogCollectionVolumes adds the Windows C:/ log hostPaths (windows.go). |
| 6 | system-probe / security-agent always injected | Both use eBPF (Linux kernel only). They are unconditionally added when features like liveProcessCollection or CSPM are enabled. | The windowsContainersFromFeatures + windowsSupportedFeatures allowlist keeps only core/trace/process-agent; the strip allowlist removes any other container (controller_reconcile_windows_agent.go, windows.go). |
| 7 | Linux init containers | The three init containers use bash and Linux paths. Windows has no bash. The trace-agent also requires a config file at C:\ProgramData\Datadog\datadog.yaml, which the Windows image does not include. | A single PowerShell init-config-windows init container creates C:\ProgramData\Datadog\datadog.yaml and the auth dir; all Linux init containers are stripped (windowsInitContainers in windows.go). |
| 8 | No Windows API surface | No field in the DatadogAgent CRD to opt in to Windows or configure a Windows-specific image, resources, or tolerations. | New spec.override.windowsNodeAgent component (WindowsNodeAgentComponentName) with override fields (image, resources, etc.) (api/datadoghq/v2alpha1/datadogagent_types.go). |
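
As an illustration of gap 1, the scheduling constraints for the Windows pod template might look like the sketch below. It uses small local stand-in types rather than the Kubernetes core/v1 structs so that it has no dependencies; `toleration`, `podScheduling`, and `windowsScheduling` are hypothetical names, and only the selector and taint values are taken from the table.

```go
package main

// toleration is a minimal stand-in for the Kubernetes core/v1 Toleration
// fields used here; it is defined locally to keep the sketch dependency-free.
type toleration struct {
	Key, Operator, Value, Effect string
}

// podScheduling holds only the two pod spec fields this sketch cares about.
type podScheduling struct {
	NodeSelector map[string]string
	Tolerations  []toleration
}

// windowsScheduling returns the node selector and the toleration for the
// node.kubernetes.io/os=windows:NoSchedule taint described in gap 1.
func windowsScheduling() podScheduling {
	return podScheduling{
		NodeSelector: map[string]string{"kubernetes.io/os": "windows"},
		Tolerations: []toleration{{
			Key:      "node.kubernetes.io/os",
			Operator: "Equal",
			Value:    "windows",
			Effect:   "NoSchedule",
		}},
	}
}
```

The selector keeps the pod off Linux nodes, and the toleration lets it land on tainted Windows nodes; both are needed for the one-pod-per-node behavior the PR describes.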

Additional Notes

Windows support is currently limited to the core Agent, trace Agent, and process Agent. Linux-only containers, mounts, security settings, and socket/env configuration are stripped from the Windows DaemonSet.
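
A minimal sketch of the container allowlist idea, assuming the container names shown in the test plan output (agent, trace-agent, process-agent). The helper `keepWindowsContainers` and the variable `windowsAllowedContainers` are hypothetical and operate on plain names rather than on the real pod spec types.

```go
package main

// windowsAllowedContainers lists the container names that the PR's test
// output shows on the Windows DaemonSet. Anything else is dropped.
var windowsAllowedContainers = map[string]bool{
	"agent":         true,
	"trace-agent":   true,
	"process-agent": true,
}

// keepWindowsContainers returns only the allowlisted names, preserving
// the input order.
func keepWindowsContainers(names []string) []string {
	kept := make([]string, 0, len(names))
	for _, n := range names {
		if windowsAllowedContainers[n] {
			kept = append(kept, n)
		}
	}
	return kept
}
```

An allowlist fails closed: a Linux-only container introduced by a future feature is dropped by default instead of needing a new deny rule.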

Current limitations:

  • Not supported with ExtendedDaemonSet.
  • Not supported with global.useFIPSAgent.
  • APM/DogStatsD require hostPort for Windows workload traffic because the existing local service selects Linux Agent pods only.

Describe your test plan

  1. Added unit tests for Windows DaemonSet construction, image selection, pod sanitization, status propagation, and cleanup behavior.
  2. E2E test on a cluster with a Linux node pool (operator, cluster-agent) and a Windows node pool. Deploy a DatadogAgent (DDA) that opts into Windows:
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata: {name: datadog, namespace: datadog}
spec:
  global:
    credentials: {apiSecret: {secretName: datadog-secret, keyName: api-key}}
    site: datadoghq.com
  features:
    apm: {enabled: true}
  override:
    windowsNodeAgent: {}   # opt in
  3. Verify the two DaemonSets and the Windows pod
kubectl get ds -n datadog
#   datadog-agent           <linux node count>   NODE SELECTOR: <none>
#   datadog-agent-windows   <win node count>     NODE SELECTOR: kubernetes.io/os=windows

kubectl get pods -n datadog -o wide
#   datadog-agent-windows-xxxx   2/2 Running   <on the Windows node>
Expect: the Windows pod is 2/2 Running (agent + trace-agent) on the Windows node; the Linux DaemonSet is unchanged.
  4. Verify the status field & printer column
kubectl get datadogagent datadog -n datadog -o wide
# AGENT  AGENT-WINDOWS  CLUSTER-AGENT ...
# Running (n/n/n)  Running (m/m/m)  Running (1/1/1)
Expect: distinct AGENT and AGENT-WINDOWS columns; the Linux AGENT status is not overwritten.
  5. Verify the allowlist strip with eBPF/Linux features enabled (key test)
    Enable Linux-only features, then inspect the Windows DaemonSet; the allowlist guarantees it stays clean:
kubectl patch datadogagent datadog -n datadog --type merge -p \
  '{"spec":{"features":{"npm":{"enabled":true},"cspm":{"enabled":true},"sbom":{"enabled":true,"host":{"enabled":true}}}}}'
kubectl get ds datadog-agent-windows -n datadog -o json | jq '{
  containers: [.spec.template.spec.containers[].name],
  hostPathVolumes: [.spec.template.spec.volumes[].hostPath.path] | map(select(.)),
  hostPID: .spec.template.spec.hostPID,
  hostIPC: .spec.template.spec.hostIPC,
  linuxEnvLeaks: [.spec.template.spec.containers[].env[]? | select(.name|test("SOCKET|DOCKER_HOST|KUBELET_CLIENT_CA|VSOCK")) | .name]
}'

Result:
{
  "containers": ["agent", "trace-agent", "process-agent"],
  "hostPathVolumes": [],
  "hostPID": null,
  "hostIPC": null,
  "linuxEnvLeaks": []
}
  6. Verify cleanup on opt-out
kubectl patch datadogagent datadog -n datadog --type=json \
  -p='[{"op":"remove","path":"/spec/override/windowsNodeAgent"}]'
kubectl get ds datadog-agent-windows -n datadog        # → NotFound (deleted)
kubectl get datadogagent datadog -n datadog -o jsonpath='{.status.agentWindows}'   # → empty
kubectl get pods -n datadog   # → Linux Agent pods still Running

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

datadog-prod-us1-5 Bot commented Jun 17, 2026

Code Coverage

🛑 Gate Violations

🎯 1 Code Coverage issue detected

A Patch coverage percentage gate may be blocking this PR.

Patch coverage: 72.84% (threshold: 80.00%)

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 72.84%
Overall Coverage: 44.89% (+0.63%)

Commit SHA: 2dd7823

@zhuminyi zhuminyi force-pushed the minyi/windows-operator-support branch 2 times, most recently from 8ae634f to 59aec8a Compare June 17, 2026 14:33
@zhuminyi zhuminyi changed the title Minyi/windows operator support windows operator support Jun 17, 2026
@zhuminyi zhuminyi force-pushed the minyi/windows-operator-support branch from 59aec8a to baecd3a Compare June 23, 2026 02:56
@zhuminyi zhuminyi added enhancement New feature or request qa/skip-qa ddqa and removed qa/skip-qa ddqa labels Jun 23, 2026
@zhuminyi zhuminyi added this to the v1.29.0 milestone Jun 23, 2026
@zhuminyi zhuminyi force-pushed the minyi/windows-operator-support branch 9 times, most recently from 1dc1a99 to d8864f6 Compare June 23, 2026 11:53
@zhuminyi zhuminyi force-pushed the minyi/windows-operator-support branch 6 times, most recently from e8fb9a5 to ede06c4 Compare June 23, 2026 13:58
@zhuminyi zhuminyi force-pushed the minyi/windows-operator-support branch 4 times, most recently from 2852d18 to 390f49d Compare June 23, 2026 15:13
@zhuminyi zhuminyi changed the title windows operator support [CONTP-1448] Add Windows node support Jun 23, 2026
@zhuminyi zhuminyi force-pushed the minyi/windows-operator-support branch from 390f49d to aae4ee0 Compare June 23, 2026 15:26
@zhuminyi zhuminyi marked this pull request as ready for review June 23, 2026 15:27
@zhuminyi zhuminyi requested a review from a team June 23, 2026 15:27
@zhuminyi zhuminyi requested a review from a team as a code owner June 23, 2026 15:27

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aae4ee0813

Comment thread internal/controller/datadogagent/component/agent/windows.go
@zhuminyi zhuminyi force-pushed the minyi/windows-operator-support branch 10 times, most recently from dc3aa57 to 02298c2 Compare June 23, 2026 18:40
@zhuminyi zhuminyi requested a review from a team as a code owner June 23, 2026 18:40
When spec.override.windowsNodeAgent is present in the DatadogAgent CR, the
operator creates a second DaemonSet (datadog-agent-windows) targeting Windows
nodes alongside the existing Linux one. Each DaemonSet targets only its own OS
via nodeSelector + the node.kubernetes.io/os=windows:NoSchedule toleration, so
in a mixed cluster every node gets exactly one agent pod of the right type. The
Linux DaemonSet is unchanged; the feature is opt-in (absent key = no-op).

API / image:
- api: add WindowsNodeAgentComponentName = "windowsNodeAgent"
- api: add AgentWindows *DaemonSetStatus to DDAI + DDA status, with an
  agent-windows printer column (regenerated CRDs, deepcopy, openapi)
- images: add GetLatestWindowsAgentImage() -> agent:X.Y.Z-servercore
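
The -servercore coercion can be illustrated with a small string helper. This is a sketch under assumptions, not the PR's GetLatestWindowsAgentImage or EnsureWindowsServercoreImage: the function name `ensureServercoreTag` is hypothetical, and leaving untagged or digest-pinned images untouched is a choice made here for illustration only.

```go
package main

import "strings"

const servercoreSuffix = "-servercore"

// ensureServercoreTag appends "-servercore" to an image's tag when it is
// missing. Images without an explicit tag, or pinned by digest, are
// returned unchanged (an assumption of this sketch).
func ensureServercoreTag(image string) string {
	if strings.Contains(image, "@") {
		return image // pinned by digest
	}
	slash := strings.LastIndex(image, "/")
	colon := strings.LastIndex(image, ":")
	if colon <= slash {
		return image // no tag; a colon before the last slash is a registry port
	}
	if strings.HasSuffix(image, servercoreSuffix) {
		return image // already a servercore tag
	}
	return image + servercoreSuffix
}
```

For example, a hypothetical `agent:7.60.0` becomes `agent:7.60.0-servercore`, while `localhost:5000/agent` is left alone because its only colon belongs to the registry port.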

Windows DaemonSet builder (component/agent/windows.go):
- nodeSelector kubernetes.io/os=windows + Windows taint toleration
- servercore image; core agent + trace agent (+ process agent) only
- PowerShell init container creates an empty datadog.yaml + auth/ dir in a
  shared emptyDir mounted at C:/ProgramData/Datadog by all containers, so the
  IPC auth token written by the core agent is visible to the trace agent
- no Linux securityContext; DD_AUTH_TOKEN_FILE_PATH overridden to a Windows path
- StripLinuxOnlySettings (allowlist): keeps only core/trace/process containers
  and the Windows init container; drops any volume mount with a Linux ("/") path,
  unreferenced volumes, Unix-socket env vars, Linux securityContext fields,
  hostPID/hostIPC, and AppArmor annotations for removed containers
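
The mount and env var rules above can be sketched as two filters. This is a simplified illustration, not StripLinuxOnlySettings itself: the `mount` type and the helper names are hypothetical, and the env var match follows the PR description's pattern (any name containing SOCKET, plus DOCKER_HOST and DD_VSOCK_ADDR).

```go
package main

import "strings"

// mount is a minimal stand-in for a volume mount: a name and a mount path.
type mount struct {
	Name, Path string
}

// isLinuxOnlyEnv reports whether an env var name matches the Linux-only
// patterns listed in the PR description.
func isLinuxOnlyEnv(name string) bool {
	return strings.Contains(name, "SOCKET") ||
		name == "DOCKER_HOST" ||
		name == "DD_VSOCK_ADDR"
}

// stripLinuxMounts drops every mount whose path starts with "/".
func stripLinuxMounts(mounts []mount) []mount {
	kept := make([]mount, 0, len(mounts))
	for _, m := range mounts {
		if !strings.HasPrefix(m.Path, "/") {
			kept = append(kept, m)
		}
	}
	return kept
}

// stripLinuxEnv drops every env var name that isLinuxOnlyEnv matches.
func stripLinuxEnv(names []string) []string {
	kept := make([]string, 0, len(names))
	for _, n := range names {
		if !isLinuxOnlyEnv(n) {
			kept = append(kept, n)
		}
	}
	return kept
}
```

A Windows path such as C:/ProgramData/Datadog survives the mount filter because it does not start with a slash, while something like /host/proc is dropped.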

Reconciler (controller_reconcile_windows_agent.go):
- OS-aware feature gating: only Windows-supported features run ManageNodeAgent
- EnsureWindowsIntakeReachable forces APM/DogStatsD non-local traffic AFTER the
  feature loop so it isn't clobbered (Windows has no Unix socket)
- guards: FIPS and EDS are unsupported and surface a WindowsAgentReconcile
  condition; Linux-disabled still reconciles Windows
- ensureWindowsDaemonSetAbsent cleans up the Windows DS (owner-scoped by
  component + part-of labels) on opt-out / Disabled / FIPS / EDS

Tests: builder, strip (allowlist + socket/hostPID/AppArmor), image, and
reconciler (no-op, FIPS, EDS, disable, owner-scoped cleanup, status routing).
Example manifest: examples/datadogagent/datadog-agent-with-windows-nodes.yaml

Validated on GKE (Windows Server 2019, WINDOWS_LTSC_CONTAINERD): core + trace
agent Running, status surfaced, no Linux artifacts leak even with NPM/CSPM/SBOM
enabled.

Known limitation: the component=agent local APM/DogStatsD services do not route
to Windows pods (labeled agent-windows); Windows workload->agent traffic needs
hostPort until a Windows-specific local service is added. LogCollection is not
yet supported (needs Windows host log-path mounts).
@zhuminyi zhuminyi force-pushed the minyi/windows-operator-support branch from 02298c2 to 2dd7823 Compare June 23, 2026 18:45