RFC: Add bounded local model review by anders-heimer · Pull Request #235 · sashiko-dev/sashiko

anders-heimer · 2026-06-02T11:12:26Z

This RFC adds a bounded review mode for OpenAI-compatible local models and
tightens the review/benchmark accounting needed to evaluate that mode.

The main goal is to keep the provider selection as a transport choice while
allowing local models to opt into Sashiko-controlled review behavior: bounded
tool exploration, schema-only finalization, token preflight checks, typed budget
failures, and minimal fallback output when the full staged review cannot fit.

The series also preserves Stage 8/Stage 9 accounting through finalization and
benchmark reporting, so retained concerns, dropped candidates, final findings,
and failed/skipped rows are easier to audit.

I am currently running the 1000-patch benchmark on an L40S:

Reviews:
Failed 274
Reviewed 545
Skipped 7
In Review 1

Patch statuses:
Reviewed 545
FailedBudget 271
Failed 3
Skipped 7
Pending/null 60

Benchmark configuration:

benchmark=benchmarks/benchmark.json
repo=third_party/linux
provider=openai-compatible
model=qwen3-coder-30b-a3b-128k-t0
bounded_local_model=true
max_input_tokens=90000
context_window_size=131072
max_tokens=6144
temperature=0.0
api_timeout_secs=1200
review_concurrency=1
review_timeout_seconds=7200
review_max_retries=0
enable_static_bug_seeds=true
enable_targeted_bug_pattern_prescan=false

Add OpenAI-compatible response schema support with provider-specific fallback handling. Keep schema-backed requests strict only when the provider supports the requested format, and account for downgraded schema payloads before posting.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Add an explicit OpenAI-compatible bounded local model policy flag. The provider name remains a transport selection only; the new flag defaults to false and must be enabled separately for local model review policy.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Use the bounded local policy flag to switch review stages into a Sashiko-controlled loop with local guide caps, bounded exploration, schema-only finalization, prompt preflight margins, typed budget failures, and minimal fallback. Default OpenAI-compatible behavior continues to use the native stage protocol unless bounded_local_model is enabled.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Preserve Stage 8 and Stage 9 accounting across finalization, fallback, and repair paths. Ensure retained concerns, dropped candidates, repaired findings, and source concern IDs stay consistent when Stage 9 compacts or repairs model output.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Classify skipped, failed, and reviewed benchmark rows with typed terminal statuses. Report reviewed-row denominators and Stage 8/Stage 9 accounting without applying fixture-specific scorer overrides.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>
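
The typed terminal statuses and the reviewed-row denominator described in this commit could be modeled roughly as below. This is a sketch only: `TerminalStatus` and `summarize` are hypothetical names, and the status labels are borrowed from the patch statuses listed in the PR description rather than from the actual code.

```python
from collections import Counter
from enum import Enum


class TerminalStatus(Enum):
    # Labels mirror the patch statuses shown in the PR description.
    REVIEWED = "Reviewed"
    FAILED_BUDGET = "FailedBudget"
    FAILED = "Failed"
    SKIPPED = "Skipped"


def summarize(statuses):
    """Count rows per status and report the reviewed-row denominator."""
    counts = Counter(statuses)
    return {
        "counts": {s.value: counts[s] for s in TerminalStatus},
        # Only reviewed rows form the denominator for quality metrics.
        "reviewed_denominator": counts[TerminalStatus.REVIEWED],
        # Every row with a terminal status, whatever the outcome.
        "terminal_rows": sum(counts.values()),
    }


# Example input using the in-progress counts from the description.
rows = (
    [TerminalStatus.REVIEWED] * 545
    + [TerminalStatus.FAILED_BUDGET] * 271
    + [TerminalStatus.FAILED] * 3
    + [TerminalStatus.SKIPPED] * 7
)
summary = summarize(rows)
```

Keeping budget failures as their own status, separate from generic failures, is what lets a report distinguish rows that did not fit from rows that broke for other reasons.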

Document bounded local OpenAI-compatible provider behavior and keep benchmark profile comparisons separate between neutral, generic-static, and regression modes. Expose the review toggles and bounded local model setting in the public configuration docs.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

rgushchin · 2026-06-09T16:52:29Z

Do you get any useful results out of it? Any benchmark results I can share?
I have mixed feelings about supporting various limited modes: on the one hand I can see how they are useful, but on the other they increase the testing surface and their long-term value is kind of unclear. To me the deciding factor is how useful they actually are.

anders-heimer · 2026-06-10T15:55:00Z

Hi Roman,

Yes, I get some useful results, but it is far from frontier-level quality.

The way I think about it is somewhat similar to RISC-V: we care about it, but not because every RISC-V board beats x86/Arm systems today.

On the 999-entry benchmark, the setup produced 33 exact detections and 143 strong partials. That is 176/999 with actionable, useful signal end-to-end, or 176/640 among valid reviewed reports, using Opus as the judge. The main limitation is still budget and model quality, with 339 budget failures and 7 protocol failures.
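
For reference, the percentages behind these fractions can be reproduced with the short calculation below. The variable names are illustrative only; the four input numbers are the ones quoted above.

```python
exact = 33             # exact detections
strong_partial = 143   # strong partial detections
total_entries = 999    # benchmark entries
valid_reviewed = 640   # valid reviewed reports

useful = exact + strong_partial
end_to_end_rate = useful / total_entries
reviewed_rate = useful / valid_reviewed

print(f"{useful} useful: {end_to_end_rate:.1%} end-to-end, "
      f"{reviewed_rate:.1%} of valid reviewed reports")
# prints "176 useful: 17.6% end-to-end, 27.5% of valid reviewed reports"
```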

Attached the result of the benchmark.

Thanks, Anders
review-comments.html

anders-heimer added 3 commits June 2, 2026 13:18

anders-heimer force-pushed the rfc/bounded-local-model-review branch from 5ce10c4 to 14f3884 Compare June 2, 2026 11:20

anders-heimer mentioned this pull request Jun 2, 2026

Required optimization on local review with claude cli #184

Open

anders-heimer force-pushed the rfc/bounded-local-model-review branch from 14f3884 to 5a60cd9 Compare June 2, 2026 13:21

anders-heimer added 3 commits June 2, 2026 15:34

anders-heimer force-pushed the rfc/bounded-local-model-review branch from 5a60cd9 to 5c1a3be Compare June 2, 2026 13:38

anders-heimer wants to merge 6 commits into sashiko-dev:main from anders-heimer:rfc/bounded-local-model-review
