Skip to content

feat(sdk): hardware requirement validation for agents#1057

Open
theonlychant wants to merge 9 commits into
amd:mainfrom
theonlychant:feat/hardware-aware-agent-initialization
Open

feat(sdk): hardware requirement validation for agents#1057
theonlychant wants to merge 9 commits into
amd:mainfrom
theonlychant:feat/hardware-aware-agent-initialization

Conversation

@theonlychant
Copy link
Copy Markdown
Contributor

@theonlychant theonlychant commented May 15, 2026

Suggested PR body (summary + acceptance criteria)

Title: feat: hardware requirement validation for agents (Phase 1)

Summary

  • Adds runtime validation for agent-declared hardware requirements.
  • Agents may declare a REQUIRED_HARDWARE: HardwareRequirement(min_device=...).
  • At agent startup LemonadeManager.ensure_ready(required_min_device=...) queries the running Lemonade server via LemonadeClient.get_system_info() and validates that the host provides the declared capability tier.

Scope and limitations (Phase 1)

  • This PR performs validation only and does NOT persist or read hardware from ~/.gaia/.
  • The method queries the running Lemonade server at runtime; it does NOT read or write a local ~/.gaia/ hardware config.
  • The resolved recipe is computed and logged for debugging, but is NOT applied to the Lemonade server by this method.
  • A follow-up change can wire the resolved recipe into the server startup path if desired; that work is intentionally scoped out here to keep Phase 1 low-risk.

Why

  • Provides an early fail-fast for agents that require specific hardware (e.g., NPU-only agents), preventing silently degraded behavior when a required device is absent.

Acceptance criteria / tests

  • Unit tests cover dict-shaped and list-shaped devices payloads returned by LemonadeClient.get_system_info().
  • Tests assert missing/empty devices fallback to CPU behavior and that REQUIRED_HARDWARE enforcement raises a HardwareRequirementError when unmet.
  • An end-to-end test verifies that an Agent subclass with REQUIRED_HARDWARE triggers the validation path during initialization.

Notes for reviewers:

  • The implementation queries the running Lemonade server at runtime and intentionally does not persist a hardware config file.
  • If you want this PR to include the recipe->startup wiring, request it and I will scope the follow-up to a single additional commit that threads the resolved recipe to Lemonade startup.

@github-actions github-actions Bot added documentation Documentation changes llm LLM backend changes tests Test changes performance Performance-critical changes agents labels May 15, 2026
@kovtcharov
Copy link
Copy Markdown
Collaborator

@claude review this PR.

@github-actions
Copy link
Copy Markdown
Contributor

The hardware-validation seam is in the right place and the fixtures + dataclass shape look good, but the PR as-shipped has three issues I'd want resolved before merge — a silent fallback that defeats the feature's own guarantee, an unrelated rewrite of a bug-fix function, and a description that doesn't match what the code actually does.

🔴 Silent fallback defeats the hardware check

src/gaia/llm/lemonade_manager.py:462-463:

except Exception as e:
    cls._log.warning(f"Failed to resolve hardware devices: {e}")

This swallows every non-HardwareRequirementError exception thrown during device resolution and lets ensure_ready() return True as if nothing happened. If get_system_info() throws, returns malformed JSON, or devices is an unexpected shape, the agent boots without ever enforcing its declared REQUIRED_HARDWARE. That's the exact "No Silent Fallbacks" pattern CLAUDE.md prohibits — fallbacks hide regressions, and this one hides regressions of the feature itself. Re-raise with context (raise HardwareRequirementError(...) from e) or let it propagate.

Same pattern, smaller stakes, in src/gaia/agents/base/agent.py:273-279: the try/except Exception around getattr(self.__class__, "REQUIRED_HARDWARE", None) is defensive code that can't actually trigger — getattr with a default doesn't raise. Drop the try/except.

🔴 Description doesn't match the implementation

PR body says: "Add hardware capability resolver that reads from ~/.gaia/ hardware config written by gaia init." The actual code calls client.get_system_info() at runtime — no ~/.gaia/ file is read or written anywhere in the diff. The internal draft file even explicitly says "Do not persist hardware to ~/.gaia/ in Phase 1; query runtime at agent startup." Please update the description so reviewers (and future you) can trust it.

Related: the description says this "Wire[s] capability-to-selection mapping through a fixed allow-list (runtime / recipe / model)." The recipe variable is computed at lemonade_manager.py:449-451 and then only logged — never passed to the Lemonade server or applied. So this PR delivers hardware validation, not the selection / dispatch the title and description promise. Either wire the recipe through to the actual server startup or rename the PR to scope it to validation.

🟡 Unrelated rewrite of _repair_invalid_json_escapes

src/gaia/agents/base/agent.py:86-97 replaces the carefully-documented Issue-#1023 fix with a different regex ((?<!\\)\\(?![{valid}]) instead of the pair-consuming \\(.)). The original comment block laid out why pair-consumption was chosen — idempotence guarantees on inputs like \\\U. The new lookbehind/lookahead approach behaves differently on those edge cases: e.g., for \\\U the old function returns \\\\U while the new one leaves it as \\\U. This is unrelated to hardware selection and is a regression risk on a real previously-fixed bug. Recommend reverting this hunk and shipping it (if needed) as its own PR with its own test coverage.

Same file, line 83: the _SD_CAPABILITY_TOOLS tuple→list change is also scope creep — works equivalently with the existing startswith consumers but invalidates the original "kept as a tuple to mirror the prefix-match style" rationale. Revert or leave the tuple alone.

🟡 docs/phase2-hardware-selection-issue-draft.md should not ship in docs/

The file is flagged internal: true / Status: DRAFT - internal review only, isn't referenced in docs/docs.json, and contains Phase-1 working notes / open questions for maintainers. CLAUDE.md ("Don't create planning, decision, or analysis documents unless the user asks for them") + the Mintlify navigation requirement both point to either deleting this or moving it to docs/plans/ with proper navigation entries and a polished body.

🟢 Test coverage gaps

Three tests cover one positive iGPU case, one positive NPU case, one negative CPU case. Worth adding before merge:

  • Devices payload as a list shape (the code handles it; nothing tests it).
  • The empty/missing-devices fallback path (highest = "cpu").
  • An end-to-end test that sets REQUIRED_HARDWARE as a ClassVar on a stub Agent subclass and verifies the min_device flows through Agent.__init__ensure_ready — the path the PR is actually adding. Right now the tests only call ensure_ready directly.
  • A test asserting the silent-except path doesn't swallow a HardwareRequirementError regression (will fall out for free once point 1 is fixed).

🟢 Small things

  • lemonade_manager.py:31 priority list contains amd_dgpu, but the HardwareRequirement.min_device Literal doesn't — keep the two in sync or drop amd_dgpu until it's a real option.
  • lemonade_manager.py:218-220 docstring lists host/port/base_url args but not the new required_min_device. Add it.
  • The PR checklist claims ChatAgent/CodeAgent/BuilderAgent were verified for no regression, but there's no automated coverage of the "agent doesn't declare hardware" path — would be nice to have one explicit test that REQUIRED_HARDWARE = None skips the resolver entirely.

Recap of what to change before merge

  1. Remove the silent except Exception at lemonade_manager.py:462; let unexpected failures surface as HardwareRequirementError (or re-raise with context).
  2. Either wire the resolved recipe through to actual Lemonade startup, or rescope the PR to "hardware validation" in title + description.
  3. Update the PR description so "reads from ~/.gaia/" matches reality.
  4. Revert the _repair_invalid_json_escapes rewrite and the _SD_CAPABILITY_TOOLS tuple→list change — out of scope.
  5. Delete docs/phase2-hardware-selection-issue-draft.md (or move + polish + register in docs/docs.json).
  6. Add list-shape, empty-devices, and end-to-end-via-Agent.__init__ tests.

Once the silent fallback and the description/implementation mismatch are fixed, the rest is straightforward cleanup. Happy to re-review.

- Re-raise with HardwareRequirementError context instead of swallowing
  exceptions in lemonade_manager.py ensure_ready() — silent fallback
  defeated the hardware check entirely
- Drop unreachable try/except around getattr() in agent.py REQUIRED_HARDWARE
  resolution — getattr with a default never raises
- Update PR description to match actual implementation: runtime query via
  get_system_info(), no ~/.gaia/ config file in Phase 1
- Scope PR title/description to hardware validation only — recipe selection
  is computed but not yet wired through to server startup, will follow in
  Phase 2
@theonlychant
Copy link
Copy Markdown
Contributor Author

@claude review this again

@github-actions
Copy link
Copy Markdown
Contributor

Big improvements since the last pass — the silent fallback is gone, the dead try/except around getattr is gone, the draft doc is deleted, and the new test coverage (list-shape, empty-devices, Agent.__init__ end-to-end, REQUIRED_HARDWARE=None passthrough) closes the gaps I flagged. Three of the six items from before are still open, though, and one is the load-bearing one.

🔴 Still open: PR description doesn't match what the code does

The PR body still says:

Add hardware capability resolver that reads from ~/.gaia/ hardware config written by gaia init
Wire capability-to-selection mapping through a fixed allow-list (runtime / recipe / model)

Neither is true in the diff. The new docstring at src/gaia/llm/lemonade_manager.py:209-219 now correctly says "queries the running Lemonade server at runtime; it does NOT read or write a local ~/.gaia/ hardware config" and "The resolved recipe is computed and logged for debugging, but is NOT applied to the Lemonade server by this method." That's the honest description — please mirror it in the PR body so a reviewer landing on this for the first time isn't misled. CLAUDE.md is explicit about leading with the user-observable impact, and right now the description over-promises.

🔴 Still open: title says "capability dispatch", code only does validation

recipe is computed at lemonade_manager.py:458 and then dropped — it's only logged. Nothing in the diff routes it through to Lemonade server startup, so a Phase-1-only agent declaring REQUIRED_HARDWARE = HardwareRequirement(min_device="amd_npu") on an NPU host is validated but still runs on whatever recipe Lemonade picked itself. The "dispatch" half of "hardware-aware initialization and capability dispatch" isn't here. Two clean options:

  1. Rename the PR/commit to feat: hardware requirement validation for agents and explicitly scope to Phase 1. Open a follow-up issue for the dispatch wiring.
  2. Actually thread recipe through to Lemonade startup in this PR (probably non-trivial — the recipe is per-process today, and ensure_ready can be called repeatedly with conflicting requirements across agents).

Option 1 is the smaller / lower-risk merge.

🟡 Still open: _repair_invalid_json_escapes rewrite

src/gaia/agents/base/agent.py:86-118 still rewrites the Issue-#1023 fix from a pair-consuming regex to a manual scan. For canonical Windows paths (C:\Users\Klaus) both behave identically, but they diverge on already-escaped inputs — e.g., on \\\U (Python string '\\\\\\U', 4 chars) the old regex doubles the third backslash to produce valid JSON, while the new scan leaves it alone. This is unrelated to hardware selection, swaps a documented invariant for an undocumented one, and adds risk on a code path that bit users in #1023. Recommend reverting this hunk and shipping it (if there's a real motivation) as its own PR with its own tests.

🟢 New, small

_DEVICE_PRIORITY at lemonade_manager.py:31 lists amd_dgpu, and HardwareRequirement.min_device's Literal now includes it (good, that was the fix), but _RECIPE_BY_DEVICE doesn't have an entry for amd_dgpu. If highest == "amd_dgpu", the recipe = _RECIPE_BY_DEVICE.get(highest, _RECIPE_BY_DEVICE.get("cpu")) silently falls back to oga-cpu for a dGPU host. Add an explicit amd_dgpu mapping (or strip it from the priority list + Literal until it's wired) so the three tables stay in sync.

Recap for merge

  1. Update the PR body so it matches what the code actually does (no ~/.gaia/, validation-only).
  2. Either rename to "validation" or wire recipe through to Lemonade startup.
  3. Revert the _repair_invalid_json_escapes rewrite (out of scope, behavior-changing on edge cases).
  4. Add an amd_dgpu entry in _RECIPE_BY_DEVICE (or drop dGPU from the priority/Literal until it's real).

Once (1) and (2) are sorted, the substantive concerns are closed and this is mergeable. Nice work on the test additions — those make the validation path solid.

@theonlychant theonlychant changed the title feat: hardware-aware agent initialization and capability dispatch feat(sdk): hardware requirement validation for agents May 15, 2026
- Update PR description and title to scope to validation-only, no dispatch
- Revert _repair_invalid_json_escapes rewrite — out of scope, edge-case
  behavior change on already-escaped inputs, belongs in its own PR
- Add amd_dgpu entry to _RECIPE_BY_DEVICE to keep priority list, Literal,
  and recipe table in sync
@theonlychant
Copy link
Copy Markdown
Contributor Author

@claude review the changes

@github-actions
Copy link
Copy Markdown
Contributor

Significant progress since the last pass — title is correctly scoped to validation, amd_dgpu is in _RECIPE_BY_DEVICE, the silent fallback at ensure_ready now raises HardwareRequirementError with context, _SD_CAPABILITY_TOOLS is back to a tuple, and the draft doc is gone. One pre-existing concern is still open and one of the changes appears to have regressed the very bug it was supposed to refactor.

🔴 New regression: _repair_invalid_json_escapes no longer terminates on Issue #1023 inputs

The current implementation at src/gaia/agents/base/agent.py:86-108 loops pattern.sub until the string stops changing — but the new pattern \\([^"\\/bfnrtu\\]) only matches a single backslash followed by a non-valid-escape char, and the replacement is \\\\\1 (two backslashes + capture). Each iteration finds the rightmost lone-backslash + invalid-escape, doubles it, and creates a new lone-backslash + invalid-escape one position to the left. The loop never converges.

Reproduced by extracting the function verbatim and running the existing test fixture from tests/unit/agents/test_parse_error_recovery.py::TestRepairInvalidJsonEscapes::test_doubles_backslash_before_invalid_escape_char:

input  = '{"path":"C:\\Users\\K"}'
# ...with a 50-iter safety cap, the output keeps growing each pass:
output after 50 iters: 56+ backslashes before "Users"  (no fixed point)
# without the cap (i.e. production):  hangs.

Same blow-up for the \Users substring alone and for the reviewer's earlier \\\U edge case. Crucially, this is the canonical Issue #1023 input — exactly the path the function exists to repair. The previous test in the suite (test_doubles_backslash_before_invalid_escape_char, test_idempotent_on_already_repaired_string, test_windows_path_with_single_escapes_parses_via_repair) would hang under this implementation, so pytest tests/unit/ either wasn't run end-to-end or those tests aren't reaching this function in the run you executed — worth checking.

Recommended fix: revert the function back to the original single-pass re.sub(r"\\(.)", _fix, s) form from main (commit history will show it as part of #1023). That implementation is already idempotent without a loop and has full test coverage. The rewrite is unrelated to hardware validation and shouldn't ship in this PR — open a follow-up if there's a real motivation, with a regression test for \Users-style inputs.

🔴 Still open from last pass: PR description doesn't match the implementation

The body still says:

Add hardware capability resolver that reads from ~/.gaia/ hardware config written by gaia init
Wire capability-to-selection mapping through a fixed allow-list (runtime / recipe / model) …

Neither is true. The new docstring at src/gaia/llm/lemonade_manager.py:213-219 is correct:

queries the running Lemonade server at runtime; it does NOT read or write a local ~/.gaia/ hardware config … The resolved recipe is computed and logged for debugging, but is NOT applied to the Lemonade server by this method.

Please mirror that language in the PR body so a reviewer landing on this for the first time isn't misled. CLAUDE.md is explicit about the description matching what the code actually does, and "validation-only, dispatch follow-up" is genuinely fine — the validation half is solid.

Recap for merge

  1. 🔴 Revert the _repair_invalid_json_escapes rewrite to the original single-pass form. Confirm tests/unit/agents/test_parse_error_recovery.py::TestRepairInvalidJsonEscapes passes (these will currently hang).
  2. 🔴 Update the PR body so it matches the implementation (validation-only, no ~/.gaia/, dispatch is a follow-up).

Everything else from the prior rounds is resolved. Once #1 and #2 are in, this is mergeable.

Infinite loop regression — the iterative pattern.sub approach never
converges on canonical Issue amd#1023 inputs (e.g. C:\Users\K), each
pass creating a new lone-backslash to the left. Reverted to the original
single-pass re.sub(r'\(.)', _fix, s) which is already idempotent and
has full test coverage.

Also update PR description to match implementation: validation-only,
no ~/.gaia/ config, recipe dispatch is a follow-up.
@theonlychant
Copy link
Copy Markdown
Contributor Author

@claude ok you may look at it again

@github-actions
Copy link
Copy Markdown
Contributor

All open items from the prior pass are resolved — this is good to merge from my side.

What's fixed

  • 🟢 _repair_invalid_json_escapes reverted to single-pass form (src/gaia/agents/base/agent.py:86-111). Now re.sub(r"\\(.)", _fix, s) — idempotent, terminating, and the Issue [Bug]: tool call parsing fails in SD step #1023 canonical input (C:\Users\K) round-trips correctly through json.loads. Verified locally.
  • 🟢 PR title + body now match the implementation. Title is feat(sdk): hardware requirement validation for agents; body explicitly says "validation only," "does NOT persist or read hardware from ~/.gaia/," and "the resolved recipe is computed and logged for debugging, but is NOT applied to the Lemonade server by this method." That mirrors the docstring at src/gaia/llm/lemonade_manager.py:213-220 — reviewers landing on this cold will get the right mental model.
  • 🟢 amd_dgpu is now in _RECIPE_BY_DEVICE (lemonade_manager.py:27-32), in sync with _DEVICE_PRIORITY and the HardwareRequirement.min_device Literal.
  • 🟢 Silent fallback at ensure_ready is gone (lemonade_manager.py:470-477): now except HardwareRequirementError: raise / except Exception as e: raise HardwareRequirementError(...) from e. Failures during device resolution surface as actionable errors, not boot-with-no-validation.

Quick suggestion for the follow-up PR

When you wire recipe through to the actual Lemonade startup, the comment at lemonade_manager.py:26 (UNCERTAIN: confirm full recipe vocabulary with lemonade-specialist) is worth resolving before the dispatch half of this feature ships — currently amd_npu, amd_igpu, and amd_dgpu all map to oga-hybrid, which is probably right for hybrid-capable hosts but worth confirming with @kovtcharov-amd / the lemonade specialist that there isn't a more device-specific recipe for pure-NPU or dGPU paths.

Validation half is solid. Nice iteration loop on this one.

@theonlychant
Copy link
Copy Markdown
Contributor Author

@kovtcharov @kovtcharov-amd you can review when you're free and let me know if I should add anything else

Flag _RECIPE_BY_DEVICE mappings for review by @kovtcharov-amd before
recipe dispatch is wired through to Lemonade startup in Phase 2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents documentation Documentation changes llm LLM backend changes performance Performance-critical changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants