fix: Handle JBOD/passthrough disks behind RAID controllers gracefully#138
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #138 +/- ##
==========================================
+ Coverage 49.97% 52.45% +2.48%
==========================================
Files 27 29 +2
Lines 3666 3845 +179
==========================================
+ Hits 1832 2017 +185
+ Misses 1621 1610 -11
- Partials 213 218 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
raharper
left a comment
There was a problem hiding this comment.
If possible, I'd like to see the storcli output of a JBOD box and integrate that into the _tests we currently have for correctly identifying disks on RAID, but not classify the disk as a RAID type.
raharper
left a comment
There was a problem hiding this comment.
First pass on the review; would like some changes and some more info in some of the places that got refactored.
Once code looks OK, I'll review the new tests.
raharper
left a comment
There was a problem hiding this comment.
With a closer look at this PR now I'd like to discuss how to approach it.
-
a first commit to fix the exit-early-if-not-found bug when looking up paths on raid controller; with this adding a test that has output with disk on and off a virtual drive which can demonstrate the failure and ensure we exhaust all virtual drive items before returning a fallback value.
-
I like the sysfs package refactor, in the review I responded to your information about the duplicate IsSysPathRAID portion of the RaidController interface; I think we can drop that from the interface completely; in system.go function where we call IsSysPathRAID, we only need to invoke RAID controller's DriverSysPath() function and invoke the sysfs.go IsSysPathRAID().
-
I also left a comment about how currently when the PR does not find a disk path in on the RAID controller, we return disk Type HDD and the errNotVirtualDrive; then we apply the udevinfo disk type to the drive. Specifically this may not be accurate; the RAID controller may not expose that information to linux correctly so the only accurate info on JBOD drives is from the RAID controller itself. I suggest we extend the raid controller drivers to track disk that aren't in the raid (sometimes they may have their own JBOD list) but the physical disk list will indicate if disk is in JBOD mode and if it's HDD/SDD
Lastly, I suggest dropping a lot of the overly verbose comments in the test cases. I personally find a lot of that noisy. I also don't want a lot of the negative test cases around parsing sysfs and symlinks; those will always be there and work, or the kernel driver isn't working and disko can't do anything about that.
When a disk is visible through a RAID's sysfs tree but has no corresponding virtual/logical drive entry (e.g. JBOD or passthrough mode), GetDiskType previously returned a hard error that aborted disk scanning entirely. This broke ScanAllDisks on systems with mixed RAID and JBOD configurations. Core change: - Introduce disko.ErrNotVirtualDrive so RAID drivers can distinguish "controller query succeeded but this path is not one of my VDs/LDs" from a genuine failure. - In linux/system.go, collapse ScanDisk's and GetDiskType's RAID+udev classification into a single resolveDiskType helper. On ErrNotVirtualDrive (wrapped or not) it logs and falls through to the existing udev-based getDiskType; real controller errors still propagate. ScanDisk now sets Attachment=RAID only when the controller claimed the device. - Replace the hard-coded call to IsSysPathRAID with an injectable linuxSystem.isSysPathRAID field (defaulting to the real helper) so resolveDiskType is unit-testable without touching /sys. Driver changes (megaraid, mpi3mr, smartpqi): - cachingStorCli / storCli2 / arcConf return ErrNotVirtualDrive once the controller has been queried successfully but the path matches no VD/LD. mpi3mr and smartpqi track this with an explicit isPhysical flag and accumulate query errors in queryErrs (mpi3mr also shadows the stdlib errors import cleanly, and fixes a latent bug where a non-SSD VD would fall through instead of returning HDD). - Drop the stub IsSysPathRAID(string) bool method from the MegaRaid, Mpi3mr and SmartPqi interfaces and their implementations. All three had been returning false unconditionally; the membership check is now the responsibility of linuxSystem only, removing dead code from the driver layer. linux package reorg: - Move IsSysPathRAID and GetSysPaths out of linux/util.go into a new linux/sysfs.go (same package, no API break); util.go is trimmed accordingly. - Remove IsSysPathRAID from the RAIDController interface. Tests: - linux/system_test.go (new): mock-based coverage of resolveDiskType for RAID VD match, ErrNotVirtualDrive fallback to udev, wrapped sentinel detection, genuine controller errors, multi-controller ordering, and a non-RAID/virtio disk falling straight through to getDiskType. - linux/sysfs_test.go (new): hermetic tempdir-based tests for GetSysPaths (single/multi BDF, missing dir, broken symlinks, module symlink skipped, malformed glob pattern) and IsSysPathRAID (no-host gate, /sys auto-prepend, unresolvable path, no-match fall-through, and a positive-match case that synthesises a fake driver dir whose BDF symlink targets a real /sys device - exercising the full path without requiring root or specific hardware). - megaraid/storcli_test.go: new coverage for the ErrNotVirtualDrive return path from cachingStorCli.GetDiskType. Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
When a disk is owned by a RAID HBA but isn't part of a VD/LD (JBOD or passthrough), we currently return ErrNotVirtualDrive and let scanning fall through to generic udev detection - losing the controller's authoritative media-type information. Now that the controller query has already succeeded, correlate each JBOD with its matching controller record and use that record's medium, while still falling back when no unambiguous match exists. Interface change: RAIDController.GetDiskType gains a disko.UdevInfo parameter (breaking change, propagated to the MegaRaid, Mpi3mr and SmartPqi driver interfaces and their in-tree implementations and mocks). linux/system.go's resolveDiskType simply threads udInfo through. Per-driver correlation: - megaraid: parse the SCSI target T from /sys/block/<kname>/device and match it against Drive.DID; classify via Drive.MediaType. - mpi3mr: same T, matched against PhysicalDrive.PID across all inspected controllers; classify via PhysicalDrive.Medium. - smartpqi: SCSI T doesn't match controller IDs here, so match udev ID_SERIAL_SHORT (with ID_SERIAL as fallback) against PhysicalDevice.SerialNumber; classify via PhysicalDevice.Type. Both SCSI-based drivers inject the helper via sysRoot + scsiTargetFn fields so tests can stub sysfs without touching /sys. Unit tests cover match, no-match, ambiguous-match, unresolved-target and missing-name cases per driver. Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
Move IsSysPathRAID, GetSysPaths and readSCSITarget into a new linux/sysfs package (with readSCSITarget exported as ReadSCSITarget) so driver packages and the top-level linux package can share one implementation. The subpackage location avoids an import cycle since linux already depends on megaraid/mpi3mr/smartpqi. Files: - linux/sysfs.go -> linux/sysfs/sysfs.go (+package doc) - linux/sysfs_test.go -> linux/sysfs/sysfs_test.go - megaraid/scsi.go -> linux/sysfs/scsi.go (renamed/exported, generic doc covering both megaraid DID and mpi3mr PID) - megaraid/scsi_test.go -> linux/sysfs/scsi_test.go Callers: - linux/system.go: default isSysPathRAID is now sysfs.IsSysPathRAID. - megaraid/storcli.go, mpi3mr/storcli2.go: default scsiTargetFn is now sysfs.ReadSCSITarget. Injection type is unchanged so existing stub tests keep compiling. - mpi3mr/storcli2_test.go: the one test that referenced the production helper directly now uses sysfs.ReadSCSITarget. - linux/util.go: add thin IsSysPathRAID and GetSysPaths wrappers that forward to the sysfs package, marked // Deprecated: so existing external callers of the top-level linux package keep working. Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
Replace the SCSI-target/Drive.DID pairing used by cachingStorCli with serial-number matching, aligning megaraid with the smartpqi driver. Query issues a best-effort `storcli /cN/eall/sall show all` whose output is parsed into (EID, Slot) -> SN and stamped onto the new Drive.SerialNumber field. jbodDiskTypeFromSerial then matches udev ID_SERIAL_SHORT/ID_SERIAL against it and maps Drive.MediaType (HDD/SSD/new NVME) to disko.DiskType; sysRoot/scsiTargetFn and the linux/sysfs import are dropped. Tests cover the new parser, SN propagation through newController, and the serial-based GetDiskType paths. Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
Drivers previously returned disko.HDD as a placeholder on every error path, which is indistinguishable from a real HDD classification. Add a new DiskType, Unknown, reserved for error returns. - disk.go: new Unknown constant, String()/StringToDiskType updated; default unknown-string fallback remains HDD (public contract intact). - megaraid/storcli.go, mpi3mr/storcli2.go, smartpqi/arcconf.go: all GetDiskType error returns switch HDD -> Unknown. arcconf.go also fixes one %s -> %w in the top-level wrap so errors.Is works consistently across drivers. - linux/system.go: resolveDiskType returns Unknown on error, and adds a defense-in-depth guard that rejects a (Unknown, nil) driver return with a hard error so Unknown never reaches disko.Disk.Type. - Tests: update driver and system mocks to match the new contract and add a case for the new leak guard. Unknown is paired with a non-nil error in every return path and is never written to disko.Disk.Type, so no downstream consumer (storage, bootmgr, spm, bootstrap, configmgr) needs any change. Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
raharper
left a comment
There was a problem hiding this comment.
This is much improved. Thanks. Left some more inline comments/requests. Thank for working on this.
- Hoist collectUdevSerials into UdevInfo.CollectSerials, dropping the duplicate copies in megaraid and smartpqi. - linux/sysfs.ReadSCSITarget: expand Host:Controller:Target:LUN doc with a symlink example, document the ID_SCSI=1 upstream-filter contract, rename internals, and return a real error on malformed link targets. - mpi3mr.jbodDiskTypeFromSCSI: short-circuit on ID_SCSI != "1" so virtio-blk, NVMe and ATA/SATA never trigger a sysfs lookup. - linux.resolveDiskType: trim the defense-in-depth comment and note the RAID-binding skip. - Migrate the tests to testify/assert and require; use realistic DEVPATHs in linux/system_test. Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
raharper
left a comment
There was a problem hiding this comment.
Thanks for addressing the comments. I left a couple more, but I don't think they are blockers.
| // error. sysRoot is injectable for tests; production passes "/sys". | ||
| func ReadSCSITarget(sysRoot, kname string) (target int, ok bool, err error) { | ||
| if kname == "" { | ||
| return 0, false, nil |
There was a problem hiding this comment.
| return 0, false, nil | |
| return 0, false, fmt.Errorf("invalid empty 'kname' parameter") |
There was a problem hiding this comment.
I don't understand why we'd call this if kname is empty
There was a problem hiding this comment.
We don't. I made the change.
Reject empty kname in linux/sysfs.ReadSCSITarget with an explicit error; every production call site already filters on udInfo.Name, so an empty kname is a caller bug. In mpi3mr/storcli2.jbodDiskTypeFromSCSI, classify Medium="NVMe" as disko.NVME for parity with megaraid's jbodDiskTypeFromSerial. Signed-off-by: Andrei Aaron <andreifdaaron@gmail.com>
raharper
left a comment
There was a problem hiding this comment.
Thanks for working through all of the changes!
When a disk is visible through a RAID's sysfs tree but has no
corresponding virtual/logical drive entry (e.g. JBOD or passthrough
mode), GetDiskType previously returned a hard error that aborted disk
scanning entirely. This broke ScanAllDisks on systems with mixed RAID
and JBOD configurations.
Introduce the ErrNotVirtualDrive sentinel so RAID drivers can signal
"I queried successfully but this device isn't one of my VDs" vs a real
failure. In linux/system.go, consolidate the RAID-query and udev
classifier paths into resolveDiskType: on ErrNotVirtualDrive (wrapped
or not) it falls through to the generic sysfs-based detection instead
of failing; genuine controller errors still propagate. ScanDisk and
GetDiskType now share this single code path.
Changes across all three RAID drivers (megaraid, mpi3mr, smartpqi):
device path matches no virtual/logical drive.
returning false unconditionally.
Extract IsSysPathRAID and GetSysPaths into a new linux/sysfs package
so the drivers can use them directly without a circular import back
into the linux package, and so the logic becomes independently
mockable. linux/util.go keeps thin, deprecated wrappers for backwards
compatibility.
Tests:
RAID match, JBOD/ErrNotVirtualDrive fallback to udev, wrapped
sentinel detection, real errors, multi-controller ordering, and a
virtio-disk fallback case.
return path.
GetSysPaths (single/multi BDF, missing dir, broken symlinks,
malformed glob pattern) and IsSysPathRAID (no-host gate, /sys
auto-prepend, unresolvable path, no-match fall-through, and one
positive-match test that synthesises a fake driver dir whose BDF
symlink targets a real /sys device, which gives 100% coverage of
the new package without requiring root or special hardware).