kul-optec · hakkelt · Apr 3, 2026 · Apr 21, 2026
diff --git a/.github/agents/julia.agent.md b/.github/agents/julia.agent.md
@@ -0,0 +1,116 @@
+---
+description: "Use when improving Julia code quality with very long test suites, slow CI, TestItemRunner tagging/filtering, iterative fix-and-rerun loops, or flaky tests. Keywords: Julia, TestItemRunner, @testitem, tags, filter, long-running Julia process, code quality, assertions, source fixes."
+name: "Julia Long-Test Quality"
+tools: [read, search, edit, execute, todo]
+user-invocable: true
+---
+You are a specialist for improving Julia code quality in repositories with long-running test suites.
+
+## Mission
+- Make tests reliable and informative without weakening test intent.
+- Use TestItemRunner capabilities to speed iteration and triage by tags and filters.
+- Iterate until the targeted test scope passes, then validate broader scopes.
+
+## Hard Constraints
+- Never remove assertions to make tests pass.
+- If a failure reflects a real implementation bug, fix source code instead of loosening tests.
+- Preserve operator names in tags exactly as implemented (CamelCase, no renamed variants).
+- Keep changes minimal and localized; avoid unrelated refactors.
+
+## Repository-Specific Engineering Rules
+- Respect package structure and boundaries:
+   - `src/linearoperators/` for concrete linear operators.
+   - `src/nonlinearoperators/` for nonlinear operators.
+   - `src/calculus/` for operator calculus/composition.
+   - `src/batching/` for batch operators.
+- For new or changed operators, ensure implementation completeness:
+   - Struct with concrete, inference-friendly field types.
+   - Constructors for dimension tuple and/or data-driven construction.
+   - Forward path `mul!(y, op, x)` and adjoint path dispatch via `AdjointOperator`.
+   - Trait and property behavior remains consistent (`is_linear`, `is_diagonal`, rank/invertibility traits).
+   - Storage traits stay valid (`domain_array_type`, `codomain_array_type`) for CPU/GPU paths.
+- Prefer `copy_operator(op; array_type=nothing, threaded=nothing)` behavior when changing copy semantics:
+   - Deep-copy mutable working buffers only.
+   - Share immutable and read-only references.
+- Keep test files standalone-capable and aligned with TestItems setup modules.
+- Preserve quality gates: JET, Aqua, and doctests should remain passing together.
+- Use Runic formatting checks when editing Julia source or tests.
+
+## Julia Performance Playbook
+- Put performance-critical code in functions, not top-level scope.
+- Avoid untyped globals in hot paths; use function arguments and `const` globals where appropriate.
+- Prefer concrete field/container types; avoid abstract fields like `Function`, `AbstractArray`, or `Integer` in performance-sensitive structs.
+- Maintain type stability:
+   - Avoid variable type changes within loops.
+   - Use `zero(x)`, `oneunit(T)`, and stable return types.
+   - Use function barriers for setup-vs-kernel separation.
+- Measure, do not guess:
+   - Use `BenchmarkTools` for benchmarks.
+   - Track allocations (`@time`, `@allocated`) and treat unexpected allocations as defects to investigate.
+   - Use `@code_warntype` and JET to diagnose inference issues.
+- Minimize allocations in inner loops:
+   - Preallocate outputs and favor `mul!`/in-place APIs.
+   - Use broadcast fusion (`@.` / dotted ops) when beneficial.
+   - Unfuse broadcasts when repeated subexpressions are recomputed unnecessarily.
+   - Use `@views` for slicing when copy cost dominates.
+- Iterate arrays in memory-friendly order (column-major access patterns).
+- For threaded Julia code that also calls BLAS, avoid oversubscription (often `OPENBLAS_NUM_THREADS=1` is best with multithreaded Julia; validate on workload).
+- Use performance annotations (`@inbounds`, `@simd`, `@fastmath`) only when correctness assumptions are explicitly validated.
+
+## Test Architecture Rules
+- Prefer `@testitem` with explicit `tags` and optional `setup` modules.
+- Use tags that encode both operator and test type.
+- Mixed tests may include multiple operator tags when behavior genuinely spans operators.
+- Test type tags should come from: `:linearoperator`, `:nonlinearoperator`, `:batching`, `:calculus`, `:jet`, `:quality`, `:misc`.
+- Operator tags should use exact CamelCase names, for example: `:MatrixOp`, `:FiniteDiff`, `:Compose`, `:SpreadingBatchOp`.
+- Use `@run_package_tests filter=ti->...` to run focused slices.
+- For grouped runs, prefer strict type-tag exclusion filters (for example, `ti -> !(:jet in ti.tags)`).
+
+## JET.jl Requirements
+- Treat JET coverage as mandatory for all public API.
+- Ensure JET test coverage includes all three modes:
+   - `JET.test_package(...)` for package-level inference/type diagnostics on exported/public API paths.
+   - `@test_opt ...` for representative public operations and constructors.
+   - `@test_call ...` for key public call signatures and runtime-like call paths.
+- Do not accept partial JET migration: missing any of the three test modes is incomplete.
+- When adding or changing public API, update JET tests in the same change.
+
+## Fast Iteration Workflow
+1. Start one long-running Julia REPL in the package test environment.
+2. Load TestItemRunner once.
+3. Run filtered test slices repeatedly (by operator/type tags).
+4. Fix failures immediately; rerun the same filtered slice until green.
+5. Expand to adjacent slices, then run full suite.
+6. Capture outputs from each run into `.temp/` files for traceability.
+
+Recommended REPL pattern:
+```julia
+using TestItemRunner
+TestItemRunner.run_tests("test"; filter = ti -> (:MatrixOp in ti.tags) && (:linearoperator in ti.tags))
+```
+
+Recommended shell pattern for captured logs:
+```sh
+mkdir -p .temp
+julia --project=test test/runtests.jl > .temp/test_runtests.log 2>&1
+julia --project=test test/jet/test_package.jl > .temp/test_jet_package.log 2>&1
+```
+
+## Failure Triage
+1. Read the exact failing assertion and stacktrace first.
+2. Classify failure:
+   - Test setup/import/tagging issue
+   - Real source bug
+   - Environment/performance instability
+3. For real bugs, patch the source and keep the tests asserting the expected behavior.
+4. For flaky perf tests, stabilize methodology (workload, sampling, thresholds) without dropping coverage.
+5. Re-run the smallest relevant filtered subset before broad reruns.
+
+## Output Requirements
+- Report what was changed and why.
+- List files touched.
+- Provide exact filtered test commands used.
+- State pass/fail counts for the final run.
+- Call out remaining risks or follow-up items.
+- IMPORTANT! Store all temporary run outputs only under `.temp/` inside the repository (no temporary scripts or logs anywhere else).
+- When performance work is included, report allocation deltas and the exact benchmark commands used.
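Illustrative sketch for the tag-filter rules in the agent file above (not part of the patch). It assumes only what the file itself uses, namely that a filter receives an item exposing `name` and `tags`. The mock items and the helper names `has_tags` and `without_tag` are hypothetical, so the predicates can be checked without loading TestItemRunner.

```julia
# Mock test items standing in for what a TestItemRunner filter receives.
# Only the `name` and `tags` fields are assumed.
items = [
    (name = "MatrixOp forward", tags = [:MatrixOp, :linearoperator]),
    (name = "MatrixOp JET", tags = [:MatrixOp, :jet]),
    (name = "Compose calculus", tags = [:Compose, :calculus]),
]

# Hypothetical helper: keep items that carry every required tag.
has_tags(required...) = ti -> all(t -> t in ti.tags, required)

# Hypothetical helper: strict type-tag exclusion, as in `ti -> !(:jet in ti.tags)`.
without_tag(tag) = ti -> !(tag in ti.tags)

focused = filter(has_tags(:MatrixOp, :linearoperator), items)
no_jet = filter(without_tag(:jet), items)

println([ti.name for ti in focused])
println([ti.name for ti in no_jet])
```

The same closures could be passed as the `filter` argument of a focused run, which keeps the operator tag and the test-type tag explicit at the call site.
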
diff --git a/.github/instructions/julia-operator-engineering.instructions.md b/.github/instructions/julia-operator-engineering.instructions.md
@@ -18,7 +18,12 @@ applyTo: "src/**/*.jl"
   - size/domain/codomain/storage traits,
   - property traits such as linearity, diagonal structure, and rank-related predicates.
 - `check` utility function must be called in all effective `mul!` paths to ensure consistent argument validation and error messages.
-- Preserve `domain_storage_type` and `codomain_storage_type` semantics and dispatch compatibility.
+- Preserve `domain_array_type` and `codomain_array_type` semantics and dispatch compatibility.
+- Constructors should expose an `array_type` keyword where storage backend selection is meaningful.
+- `domain_array_type`/`codomain_array_type` must remain consistent with constructor-selected storage.
+- When storage checks become stricter, fix operator traits and tests instead of relaxing `check`.
 - Prefer behavior-preserving refactors: extract helpers, separate setup from kernels, reduce method size, but do not weaken checks.
 - If modifying copy semantics, preserve the package convention that immutable/read-only arrays are shared while mutable working buffers are copied deliberately.
 - Keep source formatted with Runic-compatible Julia style.
+- GPU extensions live under `ext/GpuExt/` (triggered by `GPUArrays`). Override `mul!` there for any operator whose base implementation uses scalar indexing loops (`@nloops`, `@nref`, `@inbounds y[i] = b[j]`); replace with broadcast-over-view (`y .= view(b, idx...)`).
+- When overriding a threaded operator (e.g. `Variation{..., true}`) for GPU, delegate to the non-threaded variant (`Variation{..., false}`) — the threading strategy is CPU-only.
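Illustrative sketch of the broadcast-over-view rewrite described above (not part of the patch). It runs on plain CPU arrays, so it only shows that the two forms agree. The array `b` and the index tuple `idx` are made up for the example and no GPU backend is involved.

```julia
# Source array and a hypothetical index tuple selecting a sub-block of it.
b = reshape(collect(1.0:12.0), 3, 4)
idx = (2:3, 1:4)

# Scalar-indexing loop: the pattern the guidance asks to override for GPU arrays.
y_loop = zeros(2, 4)
for (jo, j) in enumerate(idx[2]), (io, i) in enumerate(idx[1])
    y_loop[io, jo] = b[i, j]
end

# Broadcast over a view: the replacement form named in the guidance.
y_bcast = zeros(2, 4)
y_bcast .= view(b, idx...)

println(y_loop == y_bcast)
```

On the CPU both forms fill `y` with the same values; the point of the rewrite is that the second form contains no per-element indexing in user code.
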
diff --git a/.github/instructions/julia-performance.instructions.md b/.github/instructions/julia-performance.instructions.md
@@ -24,3 +24,5 @@ applyTo: "src/**/*.jl,benchmark/**/*.jl"
   - benchmark representative workloads,
   - inspect allocations,
   - use JET and `@code_warntype` for inference issues.
+- For benchmark harnesses, derive element types robustly when operator type traits may return wrapped array types.
+- Keep benchmark setup deterministic (`Random.seed!(0)`) and validate key benchmark states with one smoke `mul!` path before full runs.
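Illustrative sketch of the two benchmark-harness rules above (not part of the patch). It assumes that a "wrapped" trait means an array type whose elements may themselves be arrays; the helper `scalar_eltype` is hypothetical, and a plain matrix stands in for an operator.

```julia
using LinearAlgebra, Random

# Hypothetical helper: reduce a (possibly nested) array type to its scalar element type.
scalar_eltype(::Type{T}) where {T<:Number} = T
scalar_eltype(::Type{A}) where {A<:AbstractArray} = scalar_eltype(eltype(A))

# Deterministic setup, as the guidance requires.
Random.seed!(0)

T = scalar_eltype(Vector{Vector{Float32}})  # nested array type reduces to Float32
A = randn(T, 4, 3)                          # stand-in for an operator
x = randn(T, 3)
y = zeros(T, 4)

# One smoke `mul!` call before any timing is attempted.
mul!(y, A, x)
println(T, " ", y ≈ A * x)
```

Running the smoke call first means a shape or element-type mismatch fails immediately instead of partway through a long benchmark run.
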
diff --git a/.github/instructions/julia-testing-and-jet.instructions.md b/.github/instructions/julia-testing-and-jet.instructions.md
@@ -19,3 +19,13 @@ applyTo: "test/**/*.jl,docs/**/*.md"
 - Keep Aqua and doctests passing alongside functional tests.
 - Never remove assertions to force green tests.
 - All temporary test and benchmark outputs must go under `.temp/` only.
+- If GPU tests are backend-specific, keep them in separate `@testitem`s and use the `:gpu` tag.
+- When `VERB` is enabled, print each running testitem name at test-runner filter time.
+- For local coverage, mirror CI with `julia --project=test --code-coverage=user test/runtests.jl`, then process `*.cov` / `*.info` artifacts into `lcov.info` if needed.
+- Subpackages (DSPOperators, FFTWOperators, NFFTOperators, WaveletOperators) have no standalone `test/` directory; they are tested and their coverage is gathered exclusively through the parent package's `test/` project. Do not attempt a separate subpackage coverage run.
+- Extension coverage should be gathered through the parent-package tests that load the relevant trigger packages; do not assume a separate extension-only coverage run exists.
+- JET `@test_opt` flags `array_type::Type` (unparameterized keyword) as a source of runtime dispatch. Use `array_type::Type{<:AbstractArray}` and avoid kwarg-to-kwarg forwarding; use a typed positional-arg helper (e.g., `_make_eye(T, dims, S)`) so JET can resolve dispatch statically.
+- When Aqua reports "Unexpected Pass" on a `@test_broken`/`broken=true` check, the underlying issue is now fixed — remove the workaround and use `Aqua.test_all(pkg)` unconditionally.
+- Agent sub-tasks frequently generate `Eye(T, dims, array_type)` (3 positional args) instead of `Eye(T, dims; array_type=...)` (keyword). Always verify agent output for this pattern.
+- Stochastic test assertions such as `op * randn(n) ≈ other_op * (op * randn(n))` are wrong because the two `randn` calls draw different vectors; capture the random vector in a variable first and reuse it on both sides.
+- When testing GPU storage-type propagation, add `@test domain_array_type(op) <: CUDA.CuArray` / `<: AMDGPU.ROCArray` assertions directly in the per-operator CUDA/AMDGPU `@testitem`.
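Illustrative sketch of the stochastic-assertion pitfall above (not part of the patch). The two diagonal matrices are stand-ins chosen so that `other_op` acts as the identity on the range of `op`; they are not operators from the package.

```julia
using LinearAlgebra, Random, Test

Random.seed!(0)
n = 5
op = Diagonal([1.0, 1.0, 0.0, 0.0, 1.0])        # stand-in projection
other_op = Diagonal([1.0, 1.0, 1.0, 0.0, 1.0])  # identity on the range of `op`

# Pitfall: each `randn(n)` call draws a fresh vector, so the two sides are
# compared on different inputs and the comparison is almost surely false.
bad = op * randn(n) ≈ other_op * (op * randn(n))

# Fix: draw the vector once and reuse it on both sides.
x = randn(n)
@test op * x ≈ other_op * (op * x)
```
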
diff --git a/.github/skills/julia-gpu-implementation/SKILL.md b/.github/skills/julia-gpu-implementation/SKILL.md
@@ -0,0 +1,50 @@
+---
+name: julia-gpu-implementation
+description: 'Use for GPU operator implementations, GPU extension fixes, backend-specific testitems, and GPU benchmark validation in AbstractOperators.jl.'
+argument-hint: 'Describe the operator, GPU backend, or benchmark you want to implement or validate'
+user-invocable: true
+---
+
+# Julia GPU Implementation
+
+## When To Use
+
+- Implementing or fixing GPU overrides under `ext/GpuExt/`.
+- Adding or updating CUDA/AMDGPU testitems.
+- Debugging backend-specific dispatch, storage traits, or array conversion issues.
+- Extending benchmark coverage for GPU behavior.
+- Checking whether a CPU operator should get a GPU path or stay CPU-only.
+
+## Implementation Rules
+
+- Julia package extensions can only `import` the parent package, trigger package(s), and stdlib; if extension code needs a parent dependency API, expose it from the parent module first.
+- For FFT plans, prefer `inv(plan)` (AbstractFFTs-generic) over backend-specific `FFTW.plan_inv(...)` to keep CUDA/AMDGPU compatibility.
+- With JLArrays/GPUArrays, avoid `copyto!(gpu, cpu_view)` where the source is a `SubArray`; materialize first, for example with `src[1:n]`, or copy from a plain array.
+- Preserve backend storage semantics and trait dispatch when adding GPU methods.
+- Keep CPU-only implementation details out of GPU overrides unless the backend truly supports them.
+- For GPU `GetIndex` overrides, keep boolean-mask and integer-vector fancy indexing in CPU paths unless the backend support is verified.
+- When overriding a threaded operator for GPU, delegate to the non-threaded variant; threading strategy is CPU-only.
+- Prefer direct `CuArray(arr)` / `CUDA.zeros(...)` / `AMDGPU.ROCArray(arr)` / `AMDGPU.zeros(...)` calls over intermediate conversion variables.
+- Benchmark setup code should normalize wrapped domain and codomain type traits to scalar element types before calling `randn` or `zeros`.
+
+## Testing Rules
+
+- For honest GPU coverage, keep JLArray checks separate from real device checks and add backend-specific tags such as `:cuda` and `:amdgpu` plus runtime skip guards.
+- In `test/runtests.jl`, filter backend-tagged testitems when the runtime is unavailable, but keep per-test safety checks too.
+- Add explicit tests for `domain_array_type` and `codomain_array_type`, and verify that `op * x` allocates on the active backend.
+- When adding CUDA/AMDGPU companion tests, prefer direct backend array construction instead of temporary conversion variables.
+- For GPU `GetIndex` tests, restrict indices to ranges, colons, and scalar integers; bool-mask and integer-vector `view` forms are not universally supported across GPU backends.
+- Migrate GPU-backend storage-type assertions from central quality files into each operator's own CUDA/AMDGPU `@testitem` so they run with the functional tests.
+- Use direct `import CUDA` / `import AMDGPU` plus `functional()` guards in testitems; avoid try/catch gating.
+
+## Benchmarking Rules
+
+- Benchmark scripts under `benchmark/` must prefer local workspace package paths over registry-installed copies; otherwise GPU fixes in sibling packages can be silently skipped.
+- Use representative large inputs for GPU crossover studies and keep the measurement setup deterministic.
+- Capture benchmark logs and generated reports under `.temp/`.
+
+## Tooling Reminders
+
+- Agent sub-tasks frequently generate `Eye(T, dims, array_type)` with three positional arguments instead of `Eye(T, dims; array_type=...)` with a keyword; verify this pattern.
+- JET `@test_opt` catches runtime dispatch from `array_type::Type` when it is unparameterized; use `array_type::Type{<:AbstractArray}` and avoid kwarg-to-kwarg forwarding by routing through an internal helper.
+- When fixing an "unexpected pass" Aqua error, remove the workaround and use `Aqua.test_all(pkg)` once the underlying issue is fixed.
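Illustrative sketch of the keyword-to-positional pattern named in the tooling reminders above (not part of the patch). `EyeSketch` is a hypothetical type invented for the example, and `_make_eye` only borrows the helper name mentioned in the instructions; neither is claimed to match the package's real `Eye` implementation, and the sketch does not run JET itself.

```julia
# Hypothetical operator type used only for illustration.
struct EyeSketch{T,N,S<:AbstractArray}
    dims::NTuple{N,Int}
end

# Typed positional helper: the storage type arrives as a positional type argument.
_make_eye(::Type{T}, dims::NTuple{N,Int}, ::Type{S}) where {T,N,S<:AbstractArray} =
    EyeSketch{T,N,S}(dims)

# Public constructor: the keyword is constrained to array types and passed
# positionally to the helper instead of being forwarded as another keyword.
function EyeSketch(
        ::Type{T}, dims::NTuple{N,Int};
        array_type::Type{<:AbstractArray} = Array{T},
    ) where {T,N}
    return _make_eye(T, dims, array_type)
end

op = EyeSketch(Float64, (4, 4))
op32 = EyeSketch(Float32, (3,); array_type = Vector{Float32})
println(typeof(op))
println(typeof(op32))
```

Callers keep the keyword form `EyeSketch(T, dims; array_type=...)`, which is also the form the reminder asks to verify in agent-generated code.
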
diff --git a/.github/skills/julia-long-test-workflow/SKILL.md b/.github/skills/julia-long-test-workflow/SKILL.md
@@ -29,6 +29,24 @@ user-invocable: true
 
 ## Common Commands
 
+Main package coverage:
+
+```sh
+julia --project=test --code-coverage=user test/runtests.jl
+```
+
+Subpackage coverage (DSPOperators, FFTWOperators, NFFTOperators, WaveletOperators have **no** standalone `test/` directory):
+
+> All subpackage code and their GPU extensions are exercised by the parent package's
+> `test/` project. Run the same coverage command above; the `.cov` files under each
+> subpackage's `src/` will be populated automatically.
+
+Process coverage after a local run:
+
+```sh
+julia -e 'using Coverage; Coverage.LCOV.writefile("lcov.info", Coverage.process_folder())'
+```
+
 Filtered test run:
 
 ```julia
@@ -37,9 +55,13 @@ TestItemRunner.run_tests(pwd(); filter = ti -> :MatrixOp in ti.tags) # example o
 TestItemRunner.run_tests(pwd(); filter = ti -> ti.name == "DCT") # example of filtering by test name instead of tags
 ```
 
-AirSpeedVelocity comparison:
+### Local benchmark comparison with AirspeedVelocity
+
+AirspeedVelocity works well for local branch-vs-master comparisons and is the
+recommended tool for interactive performance investigation:
 
 ```sh
+mkdir -p .temp/asv
 benchpkg \
   --path . \
   --rev master,dirty \
@@ -48,6 +70,20 @@ benchpkg \
   --exeflags="--threads=4"
 ```
 
+Filtered AirspeedVelocity comparison for a single benchmark family:
+
+```sh
+mkdir -p .temp/asv
+benchpkg \
+  --path . \
+  --rev master,dirty \
+  --script benchmark/benchmarks.jl \
+  --output-dir .temp/asv \
+  --exeflags="--threads=4" \
+  --add RecursiveArrayTools \
+  --filter MIMOFilt
+```
+
 Render a comparison table:
 
 ```sh
@@ -59,6 +95,44 @@ benchpkgtable \
   --mode time,memory
 ```
 
+> **Note:** Use AirspeedVelocity with an explicit `--script` path when comparing
+> against revisions that do not yet contain the benchmark file.
+
+### CI benchmark comparison (GitHub Actions)
+
+The GitHub Actions benchmark CI does **not** use the AirspeedVelocity action
+because the root-level Julia workspace (`[workspace]` in `Project.toml`) causes
+that action's revision-management to mis-resolve the monorepo subprojects.
+Instead, two workflows implement a fork-safe two-stage approach:
+
+- **`benchmark.yml`** – unprivileged `pull_request` job that checks out both
+  the base and head revisions, runs `benchmark/compare.jl` against explicit
+  worktree paths, and uploads `body.md`, `pr_number.txt`, and
+  `julia_version.txt` as an artifact.
+- **`post_benchmark_comment.yml`** – privileged `workflow_run` job that
+  downloads the artifact and creates or updates the PR comment.
+
+The comparison table mirrors AirspeedVelocity output with separate Time and
+Memory sections, base/head columns, a ratio column, and emoji indicators:
+- 🚀 significant speedup: `ratio - ratio_err > 1.2` (time) or `ratio < 0.5` (memory)
+- 🐢 significant slowdown: `ratio + ratio_err < 0.8` (time) or `ratio > 1.5` (memory)
+
+To run the comparison locally with the same script used by CI:
+
+```sh
+# Check out base separately, e.g. in a worktree:
+git worktree add .temp/base master
+
+julia --project=benchmark benchmark/compare.jl \
+  --base-dir  .temp/base \
+  --head-dir  . \
+  --output-dir .temp/bench-compare \
+  --pr        0 \
+  --julia-version "$(julia -e 'print(VERSION)')"
+
+cat .temp/bench-compare/body.md
+```
+
 ## Done Criteria
 
 - Targeted tests pass.

diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -36,6 +36,7 @@ jobs:
         run: |
           julia --project=test -e '
             using Pkg
+            Pkg.rm(["DSPOperators", "FFTWOperators", "NFFTOperators", "WaveletOperators"])
             Pkg.develop(path = pwd())
             for pkg in ("DSPOperators", "FFTWOperators", "NFFTOperators", "WaveletOperators")
                 Pkg.develop(path = joinpath(pwd(), pkg))

diff --git a/.gitignore b/.gitignore
@@ -18,3 +18,5 @@ docs/Manifest.toml
 Manifest.toml
 Manifest-v*.toml
 .temp/
+test/gpu_env/
+benchmark/gpu_env/
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,25 @@
+# AGENTS.md
+
+This repository uses layered guidance. Follow it in this order:
+
+1. Read this file first.
+2. Read any applicable files under `.github/instructions/` whose `applyTo` pattern matches the files you will edit.
+3. Read the matching skill under `.github/skills/` when the task clearly matches a skill's scope.
+4. Then inspect the target source files before editing.
+
+## How to choose guidance
+
+- Use `julia-operator-engineering.instructions.md` for changes under `src/**/*.jl`.
+- Use `julia-performance.instructions.md` for code under `src/**/*.jl` and `benchmark/**/*.jl` when performance is relevant.
+- Use `julia-testing-and-jet.instructions.md` for `test/**/*.jl` and docs-backed test guidance.
+- Use `.github/skills/julia-long-test-workflow/SKILL.md` for long Julia test runs, filtered `TestItemRunner` work, JET triage, and AirspeedVelocity comparisons.
+- Use `.github/skills/julia-gpu-implementation/SKILL.md` for GPU operator implementations, GPU extensions, GPU-specific tests, and GPU benchmark validation.
+
+## Working rules
+
+- Prefer the smallest skill and instruction set that fully covers the task.
+- Do not ignore a matching instruction file because a skill also exists; use both when they apply.
+- If multiple instruction files match, combine them rather than choosing only one.
+- If a task touches both implementation and tests, read both the source and test instruction files before editing.
+- Keep temporary artifacts under `.temp/`.
+- When in doubt, inspect the relevant files before making changes.