[PMON HLD] update get_reboot_cause mechanism and add get_midplane_dow…#2385

Open
chartsai-nvidia wants to merge 1 commit into
sonic-net:masterfrom
chartsai-nvidia:chartsai/pmon-hld-update

@chartsai-nvidia chartsai-nvidia commented Jun 11, 2026

Why I did it

Refines the SmartSwitch PMON HLD for DPU reboot-cause and midplane-down handling:

  • The old design assumed the NPU could read a DPU reboot-cause while the DPU was dead, triggered
    only by an offline→online transition. Now the cause is captured only when the midplane is online,
    using a per-boot boot_id so chassisd reliably detects a real DPU reboot.
  • Adds a get_midplane_down_reason() platform API and documents planned vs. unplanned midplane-down
    reasons.
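
The planned vs. unplanned split described above can be sketched roughly as follows. This is a minimal illustration, not the chassisd implementation: `FakeDpuModule`, `classify_midplane_down`, `MidplaneDownKind`, the `transition_in_progress` argument, and the string return type of `get_midplane_down_reason()` are all assumptions made for the example. Only the API name itself comes from this PR.

```python
from enum import Enum


class MidplaneDownKind(Enum):
    """Hypothetical labels for the two cases named in the PR description."""
    PLANNED = "planned"
    UNPLANNED = "unplanned"


class FakeDpuModule:
    """Hypothetical stand-in for a platform module object."""

    def __init__(self, down_reason):
        self._down_reason = down_reason

    def get_midplane_down_reason(self):
        # Assumption: the real API's return values are not spelled out in
        # this PR text, so a free-form string is used here.
        return self._down_reason


def classify_midplane_down(module, transition_in_progress):
    """Return (kind, reason) for an up->down midplane transition."""
    if transition_in_progress:
        # Planned: a transition flag was already set when the midplane
        # dropped, so the platform is not queried in this sketch.
        return MidplaneDownKind.PLANNED, None
    # Unplanned: ask the platform why the midplane went down.
    return MidplaneDownKind.UNPLANNED, module.get_midplane_down_reason()
```

Keeping the classification in a pure function like this would make the two branches easy to unit-test without a real platform object.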
Work item tracking
  • Microsoft ADO (number only): N/A

How I did it

  • Reworked the DPU Reboot Cause flow around boot_id: the DPU publishes a fresh per-boot UUID into
    CHASSIS_STATE_DB; the NPU chassisd compares it to the last persisted value and calls
    get_reboot_cause() only on a real reboot with midplane up.
  • Added boot_id to the REBOOT_CAUSE and DPU_STATE schema examples.
  • Documented up→down midplane handling (planned via transition flag vs. unplanned via
    get_midplane_down_reason()) and added the new API definition.
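
As a rough sketch of the boot_id comparison described in the first bullet, the decision could look something like the function below. The function name, its parameters, and the returned tuple are hypothetical and chosen only for illustration. The actual chassisd code reads these values from CHASSIS_STATE_DB and its persisted state, which is not modeled here.

```python
from typing import Callable, Optional, Tuple


def maybe_capture_reboot_cause(
    reported_boot_id: Optional[str],
    persisted_boot_id: Optional[str],
    midplane_up: bool,
    get_reboot_cause: Callable[[], str],
) -> Optional[Tuple[str, str]]:
    """Return (cause, new_boot_id) when a real DPU reboot is detected.

    Returns None when the midplane is down, when the DPU has not yet
    published a boot_id, or when the boot_id matches the persisted one.
    """
    if not midplane_up or not reported_boot_id:
        # The cause is only captured while the midplane is online.
        return None
    if reported_boot_id == persisted_boot_id:
        # Same boot as last time: not a real reboot, nothing to record.
        return None
    # boot_id changed: query the cause once and hand back the new boot_id
    # so the caller can persist both together.
    return get_reboot_cause(), reported_boot_id
```

For example, calling it with `reported_boot_id="b"`, `persisted_boot_id="a"`, and `midplane_up=True` invokes `get_reboot_cause` once, while an unchanged boot_id or a down midplane leaves it untouched.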

How to verify it

| Repo | PR Title / Link | Status |
|---|---|---|
| sonic-platform-daemons | [chassisd] Capture DPU reboot cause via boot_id and add midplane-down reason for Smart Switch | PR State |
| sonic-host-services | [process-reboot-cause] Save boot_id and device fields for DPU reboot cause | PR State |
| sonic-platform-common | [ModuleBase] Add get_midplane_down_reason() API and MIDPLANE_DOWN_REASON_* constants | PR State |
| sonic-buildimage | [Mellanox] Implement get_midplane_down_reason for DPU module | PR State |

…n_reason

The commit updates 2 main parts:
- when to run get_reboot_cause
- get_midplane_down_reason

Signed-off-by: Charles Tsai <chartsai@nvidia.com>
@mssonicbld

Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

* Each DPU SONiC publishes a `boot_id` (a fresh UUID generated per boot, read from `/proc/sys/kernel/random/boot_id`) into its `DPU_STATE` entry in CHASSIS_STATE_DB. The NPU chassisd compares the reported `boot_id` against the last `boot_id` it persisted. If they differ, a real DPU boot occurred, so when the midplane transitions from down to up, chassisd calls `get_reboot_cause()`, records the cause to the DB and the JSON file, and updates the persisted `boot_id`.
* `get_reboot_cause()` returns the current reboot-cause of the module.
* For persistent storage of the DPU reboot-cause and reboot-cause-history files use the existing mechanism and host storage path under "/host/reboot-cause/module/dpux".
* For persistent storage of the DPU reboot-cause and reboot-cause-history files use the existing mechanism and host storage path under "/host/reboot-cause/module/dpux". The boot_id is also stored in this file.

Contributor

Do you mean storing a boot_id file under the same directory?

Author

The boot_id is stored in an additional field in the reboot-cause JSON.
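
To illustrate the reply above, here is a minimal sketch of keeping `boot_id` as one more field in a reboot-cause JSON file rather than in a separate file. Only the `boot_id` field name comes from this thread. The `cause` key, the helper names, and the file layout are assumptions for the example and may not match the real reboot-cause schema.

```python
import json
from typing import Optional


def save_reboot_cause(path: str, cause: str, boot_id: str) -> None:
    """Write a reboot-cause record with boot_id as an additional field."""
    # "cause" is a hypothetical key; "boot_id" is the field under discussion.
    record = {"cause": cause, "boot_id": boot_id}
    with open(path, "w") as f:
        json.dump(record, f)


def load_persisted_boot_id(path: str) -> Optional[str]:
    """Return the boot_id stored in the record, or None if unavailable."""
    try:
        with open(path) as f:
            return json.load(f).get("boot_id")
    except (FileNotFoundError, json.JSONDecodeError):
        # No record yet, or an unreadable one: treat as "no boot_id seen".
        return None
```

Storing the value in the same record means the cause and the boot_id it belongs to are written together, so there is no second file to keep in sync.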
