Skip to content

fix: Handle JBOD/passthrough disks behind RAID controllers gracefully#138

Merged
raharper merged 8 commits into
project-machine:mainfrom
andaaron:jbod
Apr 28, 2026
Merged

fix: Handle JBOD/passthrough disks behind RAID controllers gracefully#138
raharper merged 8 commits into
project-machine:mainfrom
andaaron:jbod

Conversation

@andaaron
Copy link
Copy Markdown
Contributor

@andaaron andaaron commented Apr 7, 2026

When a disk is visible through a RAID's sysfs tree but has no
corresponding virtual/logical drive entry (e.g. JBOD or passthrough
mode), GetDiskType previously returned a hard error that aborted disk
scanning entirely. This broke ScanAllDisks on systems with mixed RAID
and JBOD configurations.

Introduce the ErrNotVirtualDrive sentinel so RAID drivers can signal
"I queried successfully but this device isn't one of my VDs" vs a real
failure. In linux/system.go, consolidate the RAID-query and udev
classifier paths into resolveDiskType: on ErrNotVirtualDrive (wrapped
or not) it falls through to the generic sysfs-based detection instead
of failing; genuine controller errors still propagate. ScanDisk and
GetDiskType now share this single code path.

Changes across all three RAID drivers (megaraid, mpi3mr, smartpqi):

  • Return ErrNotVirtualDrive when the controller query succeeds but the
    device path matches no virtual/logical drive.
  • Implement IsSysPathRAID via the new sysfs package instead of
    returning false unconditionally.

Extract IsSysPathRAID and GetSysPaths into a new linux/sysfs package
so the drivers can use them directly without a circular import back
into the linux package, and so the logic becomes independently
mockable. linux/util.go keeps thin, deprecated wrappers for backwards
compatibility.

Tests:

  • linux/system_test.go: mock-based coverage of resolveDiskType for
    RAID match, JBOD/ErrNotVirtualDrive fallback to udev, wrapped
    sentinel detection, real errors, multi-controller ordering, and a
    virtio-disk fallback case.
  • megaraid/storcli_test.go: coverage for the new ErrNotVirtualDrive
    return path.
  • linux/sysfs/sysfs_test.go: hermetic tempdir-based tests for
    GetSysPaths (single/multi BDF, missing dir, broken symlinks,
    malformed glob pattern) and IsSysPathRAID (no-host gate, /sys
    auto-prepend, unresolvable path, no-match fall-through, and one
    positive-match test that synthesises a fake driver dir whose BDF
    symlink targets a real /sys device, which gives 100% coverage of
    the new package without requiring root or special hardware).

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 64.46886% with 97 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.45%. Comparing base (83a9a74) to head (40025f6).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
megaraid/storcli.go 65.51% 25 Missing and 5 partials ⚠️
smartpqi/arcconf.go 52.00% 23 Missing and 1 partial ⚠️
mpi3mr/storcli2.go 54.00% 22 Missing and 1 partial ⚠️
linux/sysfs/sysfs.go 52.17% 10 Missing and 1 partial ⚠️
linux/system.go 78.57% 3 Missing and 3 partials ⚠️
linux/util.go 0.00% 2 Missing ⚠️
linux/sysfs/scsi.go 95.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #138      +/-   ##
==========================================
+ Coverage   49.97%   52.45%   +2.48%     
==========================================
  Files          27       29       +2     
  Lines        3666     3845     +179     
==========================================
+ Hits         1832     2017     +185     
+ Misses       1621     1610      -11     
- Partials      213      218       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Collaborator

@raharper raharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, I'd like to see the storcli output of a JBOD box and integrate that into the _tests we currently have for correctly identifying disks on RAID, but not classify the disk as a RAID type.

@andaaron andaaron marked this pull request as ready for review April 21, 2026 07:24
Copy link
Copy Markdown
Collaborator

@raharper raharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass on the review; would like some changes and some more info in some of the places that got refactored.

Once code looks OK, I'll review the new tests.

Comment thread linux/system.go Outdated
Comment thread linux/system.go Outdated
Comment thread linux/system.go
Comment thread linux/system_test.go
Comment thread linux/util.go
Comment thread mpi3mr/storcli2.go Outdated
Comment thread mpi3mr/storcli2.go
Comment thread mpi3mr/storcli2.go Outdated
Comment thread smartpqi/arcconf.go Outdated
Comment thread linux/system.go Outdated
Copy link
Copy Markdown
Collaborator

@raharper raharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a closer look at this PR now I'd like to discuss how to approach it.

  1. a first commit to fix the exit-early-if-not-found bug when looking up paths on raid controller; with this adding a test that has output with disk on and off a virtual drive which can demonstrate the failure and ensure we exhaust all virtual drive items before returning a fallback value.

  2. I like the sysfs package refactor, in the review I responded to your information about the duplicate IsSysPathRAID portion of the RaidController interface; I think we can drop that from the interface completely; in system.go function where we call IsSysPathRAID, we only need to invoke RAID controller's DriverSysPath() function and invoke the sysfs.go IsSysPathRAID().

  3. I also left a comment about how currently when the PR does not find a disk path in on the RAID controller, we return disk Type HDD and the errNotVirtualDrive; then we apply the udevinfo disk type to the drive. Specifically this may not be accurate; the RAID controller may not expose that information to linux correctly so the only accurate info on JBOD drives is from the RAID controller itself. I suggest we extend the raid controller drivers to track disk that aren't in the raid (sometimes they may have their own JBOD list) but the physical disk list will indicate if disk is in JBOD mode and if it's HDD/SDD

Lastly, I suggest dropping a lot of the overly verbose comments in the test cases. I personally find a lot of that noisy. I also don't want a lot of the negative test cases around parsing sysfs and symlinks; those will always be there and work, or the kernel driver isn't working and disko can't do anything about that.

Comment thread linux/system_test.go
Comment thread linux/system_test.go Outdated
Comment thread linux/system_test.go Outdated
Comment thread linux/system_test.go Outdated
Comment thread linux/sysfs/sysfs.go Outdated
Comment thread mpi3mr/storcli2.go Outdated
Comment thread megaraid/storcli_test.go
Comment thread smartpqi/arcconf.go Outdated
Comment thread smartpqi/arcconf.go Outdated
Comment thread smartpqi/arcconf.go Outdated
When a disk is visible through a RAID's sysfs tree but has no
corresponding virtual/logical drive entry (e.g. JBOD or passthrough
mode), GetDiskType previously returned a hard error that aborted disk
scanning entirely. This broke ScanAllDisks on systems with mixed RAID
and JBOD configurations.

Core change:
- Introduce disko.ErrNotVirtualDrive so RAID drivers can distinguish
  "controller query succeeded but this path is not one of my VDs/LDs"
  from a genuine failure.
- In linux/system.go, collapse ScanDisk's and GetDiskType's RAID+udev
  classification into a single resolveDiskType helper. On
  ErrNotVirtualDrive (wrapped or not) it logs and falls through to the
  existing udev-based getDiskType; real controller errors still
  propagate. ScanDisk now sets Attachment=RAID only when the
  controller claimed the device.
- Replace the hard-coded call to IsSysPathRAID with an injectable
  linuxSystem.isSysPathRAID field (defaulting to the real helper) so
  resolveDiskType is unit-testable without touching /sys.

Driver changes (megaraid, mpi3mr, smartpqi):
- cachingStorCli / storCli2 / arcConf return ErrNotVirtualDrive once
  the controller has been queried successfully but the path matches no
  VD/LD. mpi3mr and smartpqi track this with an explicit isPhysical
  flag and accumulate query errors in queryErrs (mpi3mr also shadows
  the stdlib errors import cleanly, and fixes a latent bug where a
  non-SSD VD would fall through instead of returning HDD).
- Drop the stub IsSysPathRAID(string) bool method from the MegaRaid,
  Mpi3mr and SmartPqi interfaces and their implementations. All three
  had been returning false unconditionally; the membership check is now
  the responsibility of linuxSystem only, removing dead code from the
  driver layer.

linux package reorg:
- Move IsSysPathRAID and GetSysPaths out of linux/util.go into a new
  linux/sysfs.go (same package, no API break); util.go is trimmed
  accordingly.
- Remove IsSysPathRAID from the RAIDController interface.

Tests:
- linux/system_test.go (new): mock-based coverage of resolveDiskType
  for RAID VD match, ErrNotVirtualDrive fallback to udev, wrapped
  sentinel detection, genuine controller errors, multi-controller
  ordering, and a non-RAID/virtio disk falling straight through to
  getDiskType.
- linux/sysfs_test.go (new): hermetic tempdir-based tests for
  GetSysPaths (single/multi BDF, missing dir, broken symlinks, module
  symlink skipped, malformed glob pattern) and IsSysPathRAID (no-host
  gate, /sys auto-prepend, unresolvable path, no-match fall-through,
  and a positive-match case that synthesises a fake driver dir whose
  BDF symlink targets a real /sys device - exercising the full path
  without requiring root or specific hardware).
- megaraid/storcli_test.go: new coverage for the ErrNotVirtualDrive
  return path from cachingStorCli.GetDiskType.

Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
When a disk is owned by a RAID HBA but isn't part of a VD/LD (JBOD or
passthrough), we currently return ErrNotVirtualDrive and let scanning
fall through to generic udev detection - losing the controller's
authoritative media-type information. Now that the controller query has
already succeeded, correlate each JBOD with its matching controller
record and use that record's medium, while still falling back when no
unambiguous match exists.

Interface change: RAIDController.GetDiskType gains a disko.UdevInfo
parameter (breaking change, propagated to the MegaRaid, Mpi3mr and
SmartPqi driver interfaces and their in-tree implementations and mocks).
linux/system.go's resolveDiskType simply threads udInfo through.

Per-driver correlation:
- megaraid: parse the SCSI target T from /sys/block/<kname>/device and
  match it against Drive.DID; classify via Drive.MediaType.
- mpi3mr: same T, matched against PhysicalDrive.PID across all
  inspected controllers; classify via PhysicalDrive.Medium.
- smartpqi: SCSI T doesn't match controller IDs here, so match udev
  ID_SERIAL_SHORT (with ID_SERIAL as fallback) against
  PhysicalDevice.SerialNumber; classify via PhysicalDevice.Type.

Both SCSI-based drivers inject the helper via sysRoot + scsiTargetFn
fields so tests can stub sysfs without touching /sys. Unit tests cover
match, no-match, ambiguous-match, unresolved-target and missing-name
cases per driver.

Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
Move IsSysPathRAID, GetSysPaths and readSCSITarget into a new
linux/sysfs package (with readSCSITarget exported as ReadSCSITarget) so
driver packages and the top-level linux package can share one
implementation. The subpackage location avoids an import cycle since
linux already depends on megaraid/mpi3mr/smartpqi.

Files:
- linux/sysfs.go -> linux/sysfs/sysfs.go (+package doc)
- linux/sysfs_test.go -> linux/sysfs/sysfs_test.go
- megaraid/scsi.go -> linux/sysfs/scsi.go (renamed/exported, generic
  doc covering both megaraid DID and mpi3mr PID)
- megaraid/scsi_test.go -> linux/sysfs/scsi_test.go

Callers:
- linux/system.go: default isSysPathRAID is now sysfs.IsSysPathRAID.
- megaraid/storcli.go, mpi3mr/storcli2.go: default scsiTargetFn is now
  sysfs.ReadSCSITarget. Injection type is unchanged so existing stub
  tests keep compiling.
- mpi3mr/storcli2_test.go: the one test that referenced the production
  helper directly now uses sysfs.ReadSCSITarget.
- linux/util.go: add thin IsSysPathRAID and GetSysPaths wrappers that
  forward to the sysfs package, marked // Deprecated: so existing
  external callers of the top-level linux package keep working.

Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
Replace the SCSI-target/Drive.DID pairing used by cachingStorCli with
serial-number matching, aligning megaraid with the smartpqi driver.
Query issues a best-effort `storcli /cN/eall/sall show all` whose output
is parsed into (EID, Slot) -> SN and stamped onto the new
Drive.SerialNumber field. jbodDiskTypeFromSerial then matches udev
ID_SERIAL_SHORT/ID_SERIAL against it and maps Drive.MediaType
(HDD/SSD/new NVME) to disko.DiskType; sysRoot/scsiTargetFn and the
linux/sysfs import are dropped. Tests cover the new parser, SN
propagation through newController, and the serial-based GetDiskType
paths.

Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
Drivers previously returned disko.HDD as a placeholder on every error
path, which is indistinguishable from a real HDD classification. Add a
new DiskType, Unknown, reserved for error returns.

- disk.go: new Unknown constant, String()/StringToDiskType updated;
  default unknown-string fallback remains HDD (public contract intact).
- megaraid/storcli.go, mpi3mr/storcli2.go, smartpqi/arcconf.go: all
  GetDiskType error returns switch HDD -> Unknown. arcconf.go also
  fixes one %s -> %w in the top-level wrap so errors.Is works
  consistently across drivers.
- linux/system.go: resolveDiskType returns Unknown on error, and adds a
  defense-in-depth guard that rejects a (Unknown, nil) driver return
  with a hard error so Unknown never reaches disko.Disk.Type.
- Tests: update driver and system mocks to match the new contract and
  add a case for the new leak guard.

Unknown is paired with a non-nil error in every return path and is
never written to disko.Disk.Type, so no downstream consumer
(storage, bootmgr, spm, bootstrap, configmgr) needs any change.

Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
Copy link
Copy Markdown
Collaborator

@raharper raharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much improved. Thanks. Left some more inline comments/requests. Thank for working on this.

Comment thread linux/sysfs/scsi.go Outdated
Comment thread linux/sysfs/scsi.go Outdated
Comment thread linux/sysfs/scsi.go Outdated
Comment thread linux/sysfs/scsi.go Outdated
Comment thread linux/sysfs/scsi.go Outdated
Comment thread linux/system.go
Comment thread linux/system.go Outdated
Comment thread linux/system_test.go Outdated
Comment thread mpi3mr/storcli2.go
Comment thread smartpqi/arcconf.go Outdated
- Hoist collectUdevSerials into UdevInfo.CollectSerials, dropping the
  duplicate copies in megaraid and smartpqi.
- linux/sysfs.ReadSCSITarget: expand Host:Controller:Target:LUN doc with
  a symlink example, document the ID_SCSI=1 upstream-filter contract,
  rename internals, and return a real error on malformed link targets.
- mpi3mr.jbodDiskTypeFromSCSI: short-circuit on ID_SCSI != "1" so
  virtio-blk, NVMe and ATA/SATA never trigger a sysfs lookup.
- linux.resolveDiskType: trim the defense-in-depth comment and note the
  RAID-binding skip.
- Migrate the tests to testify/assert and require; use
  realistic DEVPATHs in linux/system_test.

Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
Copy link
Copy Markdown
Collaborator

@raharper raharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the comments. I left a couple more, but I don't think they are blockers.

Comment thread linux/sysfs/scsi.go Outdated
// error. sysRoot is injectable for tests; production passes "/sys".
func ReadSCSITarget(sysRoot, kname string) (target int, ok bool, err error) {
if kname == "" {
return 0, false, nil
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return 0, false, nil
return 0, false, fmt.Errorf("invalid empty 'kname' parameter")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we'd call this if kname is empty

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't. I made the change.

Comment thread mpi3mr/storcli2.go
Reject empty kname in linux/sysfs.ReadSCSITarget with an explicit
error; every production call site already filters on udInfo.Name, so
an empty kname is a caller bug.

In mpi3mr/storcli2.jbodDiskTypeFromSCSI, classify Medium="NVMe" as
disko.NVME for parity with megaraid's jbodDiskTypeFromSerial.

Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
Copy link
Copy Markdown
Collaborator

@raharper raharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through all of the changes!

@raharper raharper merged commit e79e163 into project-machine:main Apr 28, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants