RFC: Add bounded local model review#235

Draft
anders-heimer wants to merge 6 commits into sashiko-dev:main from anders-heimer:rfc/bounded-local-model-review

@anders-heimer

Contributor

This RFC adds a bounded review mode for OpenAI-compatible local models and
tightens the review/benchmark accounting needed to evaluate that mode.

The main goal is to keep the provider selection as a transport choice while
allowing local models to opt into Sashiko-controlled review behavior: bounded
tool exploration, schema-only finalization, token preflight checks, typed budget
failures, and minimal fallback output when the full staged review cannot fit.
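
A minimal sketch of how the preflight check, typed budget failure, and minimal fallback described above could fit together. Every name here (`BudgetExceeded`, `estimate_tokens`, `preflight`, `choose_review_plan`), the 4-characters-per-token estimate, and the 512-token margin are hypothetical illustrations, not Sashiko's actual API. The 90000-token default is the `max_input_tokens` value from the benchmark configuration in this description.

```python
class BudgetExceeded(Exception):
    """Typed budget failure: the prompt cannot fit within the input limit."""

    def __init__(self, stage: str, needed: int, limit: int):
        super().__init__(f"{stage}: needs {needed} tokens, limit is {limit}")
        self.stage = stage
        self.needed = needed
        self.limit = limit


def estimate_tokens(text: str) -> int:
    # Assumption: a crude ~4 characters per token heuristic. A real
    # implementation would use the model's own tokenizer.
    return (len(text) + 3) // 4


def preflight(stage: str, prompt: str, max_input_tokens: int, margin: int = 512) -> int:
    """Return the estimated token need, or raise before any request is sent."""
    needed = estimate_tokens(prompt) + margin
    if needed > max_input_tokens:
        raise BudgetExceeded(stage, needed, max_input_tokens)
    return needed


def choose_review_plan(full_prompt: str, minimal_prompt: str,
                       max_input_tokens: int = 90000) -> str:
    """Prefer the full staged review; fall back to minimal output if it cannot fit."""
    try:
        preflight("full", full_prompt, max_input_tokens)
        return "full"
    except BudgetExceeded:
        # The fallback prompt is checked too. If it also fails, the typed
        # error propagates so the row can be recorded as a budget failure.
        preflight("minimal", minimal_prompt, max_input_tokens)
        return "minimal"
```

The point of the typed exception is that a caller can distinguish "did not fit" from other failures without parsing error strings.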

The series also preserves Stage 8/Stage 9 accounting through finalization and
benchmark reporting, so retained concerns, dropped candidates, final findings,
and failed/skipped rows are easier to audit.
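
One way to make that accounting auditable is to check a few invariants after finalization. The sketch below is an assumption about what "consistent" could mean, not the checks this series implements: retained and dropped sets are disjoint, every candidate lands in one of them, and every final finding cites only retained concern IDs. `Finding` and `audit_accounting` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Finding:
    title: str
    source_concern_ids: List[str] = field(default_factory=list)


def audit_accounting(candidates: Iterable[str], retained: Iterable[str],
                     dropped: Iterable[str], findings: Iterable[Finding]) -> List[str]:
    """Return human-readable accounting problems (empty list if consistent)."""
    candidates, retained, dropped = set(candidates), set(retained), set(dropped)
    problems = []

    overlap = retained & dropped
    if overlap:
        problems.append(f"retained and dropped overlap: {sorted(overlap)}")

    unaccounted = candidates - retained - dropped
    if unaccounted:
        problems.append(f"unaccounted candidates: {sorted(unaccounted)}")

    for finding in findings:
        if not finding.source_concern_ids:
            problems.append(f"finding {finding.title!r} has no source concern ID")
            continue
        unknown = set(finding.source_concern_ids) - retained
        if unknown:
            problems.append(
                f"finding {finding.title!r} cites non-retained IDs: {sorted(unknown)}")
    return problems
```

Returning a list of problems rather than raising on the first one keeps a benchmark run going while still surfacing every inconsistency in the report.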

I am currently running the 1000-patch benchmark on an L40S:

Reviews:
Failed 274
Reviewed 545
Skipped 7
In Review 1

Patch statuses:
Reviewed 545
FailedBudget 271
Failed 3
Skipped 7
Pending/null 60
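
The two tables above can be reconciled mechanically: the 271 FailedBudget and 3 Failed patch rows add up to the 274 failed reviews. The sketch below shows one way to tally typed statuses from this snapshot. The enum and function names are hypothetical, and treating Reviewed, both failure kinds, and Skipped as the terminal statuses is an assumption.

```python
from enum import Enum


class PatchStatus(Enum):
    REVIEWED = "Reviewed"
    FAILED_BUDGET = "FailedBudget"
    FAILED = "Failed"
    SKIPPED = "Skipped"
    PENDING = "Pending/null"


# Assumption: these groupings are illustrative, not Sashiko's definitions.
FAILURES = {PatchStatus.FAILED_BUDGET, PatchStatus.FAILED}
TERMINAL = FAILURES | {PatchStatus.REVIEWED, PatchStatus.SKIPPED}

# Patch status counts copied from the snapshot above.
SNAPSHOT = {
    PatchStatus.REVIEWED: 545,
    PatchStatus.FAILED_BUDGET: 271,
    PatchStatus.FAILED: 3,
    PatchStatus.SKIPPED: 7,
    PatchStatus.PENDING: 60,
}


def summarize(counts: dict) -> dict:
    """Collapse typed patch statuses into the buckets used for reporting."""
    failed = sum(n for status, n in counts.items() if status in FAILURES)
    terminal = sum(n for status, n in counts.items() if status in TERMINAL)
    reviewed = counts.get(PatchStatus.REVIEWED, 0)
    return {
        "reviewed": reviewed,
        "failed": failed,
        "terminal": terminal,
        # Share of finished rows that produced a review.
        "reviewed_share_of_terminal": reviewed / terminal if terminal else 0.0,
    }
```

Keeping FailedBudget separate from Failed until the final roll-up makes it possible to report budget failures on their own.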

Benchmark configuration:

benchmark=benchmarks/benchmark.json
repo=third_party/linux
provider=openai-compatible
model=qwen3-coder-30b-a3b-128k-t0
bounded_local_model=true
max_input_tokens=90000
context_window_size=131072
max_tokens=6144
temperature=0.0
api_timeout_secs=1200
review_concurrency=1
review_timeout_seconds=7200
review_max_retries=0
enable_static_bug_seeds=true
enable_targeted_bug_pattern_prescan=false
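
As a quick sanity check on the three token settings above, the sketch below parses `key=value` lines and computes the headroom left in the context window. It assumes `max_input_tokens` caps the prompt and `max_tokens` caps the completion, which is an interpretation of the names and not confirmed by this description. `parse_kv` and `token_headroom` are hypothetical helpers.

```python
CONFIG_TEXT = """\
max_input_tokens=90000
context_window_size=131072
max_tokens=6144
"""


def parse_kv(text: str) -> dict:
    """Parse simple key=value lines, ignoring blank or malformed lines."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        result[key.strip()] = value.strip()
    return result


def token_headroom(cfg: dict) -> int:
    """Tokens left in the window after the prompt cap and the completion cap."""
    window = int(cfg["context_window_size"])
    prompt_cap = int(cfg["max_input_tokens"])
    completion_cap = int(cfg["max_tokens"])
    headroom = window - prompt_cap - completion_cap
    if headroom < 0:
        raise ValueError("prompt cap plus completion cap exceeds the context window")
    return headroom


print(token_headroom(parse_kv(CONFIG_TEXT)))  # 131072 - 90000 - 6144 = 34928
```

Under that interpretation, the benchmark profile leaves 34928 tokens of slack for tokenizer estimation error and chat template overhead.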

Add OpenAI-compatible response schema support with provider-specific
fallback handling.

Keep schema-backed requests strict only when the provider supports the
requested format, and account for downgraded schema payloads before
posting.
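
A rough sketch of what such a fallback could look like. The `response_format` shapes follow the OpenAI Chat Completions API (`json_schema` with `strict`, and `json_object`), but the capability labels, the function name, and the idea of moving the schema into the prompt on downgrade are assumptions for illustration, not the code in this commit.

```python
import json
from typing import Optional, Tuple


def build_response_format(schema: dict, name: str,
                          support: str) -> Tuple[Optional[dict], str]:
    """Return (response_format, extra_prompt_text) for a provider capability.

    `support` is a hypothetical label: "json_schema", "json_object", or "none".
    """
    if support not in ("json_schema", "json_object", "none"):
        raise ValueError(f"unknown capability: {support}")

    if support == "json_schema":
        # Strict only when the provider supports schema-backed output.
        fmt = {
            "type": "json_schema",
            "json_schema": {"name": name, "schema": schema, "strict": True},
        }
        return fmt, ""

    # Downgraded: the schema travels in the prompt instead, so its text must
    # be counted against the input budget before the request is posted.
    instruction = (
        "Respond with a single JSON object matching this schema:\n"
        + json.dumps(schema, sort_keys=True)
    )
    if support == "json_object":
        return {"type": "json_object"}, instruction
    return None, instruction
```

Returning the extra prompt text alongside the format lets the caller add its token cost to the preflight estimate instead of discovering the overflow after the request fails.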

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Add an explicit OpenAI-compatible bounded local model policy flag.

The provider name remains a transport selection only; the new flag
defaults to false and must be enabled separately for local model review
policy.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Use the bounded local policy flag to switch review stages into a
Sashiko-controlled loop with local guide caps, bounded exploration,
schema-only finalization, prompt preflight margins, typed budget
failures, and minimal fallback.

Default OpenAI-compatible behavior continues to use the native stage
protocol unless bounded_local_model is enabled.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Preserve Stage 8 and Stage 9 accounting across finalization, fallback,
and repair paths.

Ensure retained concerns, dropped candidates, repaired findings, and
source concern IDs stay consistent when Stage 9 compacts or repairs
model output.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Classify skipped, failed, and reviewed benchmark rows with typed
terminal statuses.

Report reviewed-row denominators and Stage 8/Stage 9 accounting without
applying fixture-specific scorer overrides.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Document bounded local OpenAI-compatible provider behavior and keep
benchmark profile comparisons separate between neutral, generic-static,
and regression modes.

Expose the review toggles and bounded local model setting in the public
configuration docs.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

anders-heimer force-pushed the rfc/bounded-local-model-review branch from 5a60cd9 to 5c1a3be (June 2, 2026 13:38)

@rgushchin

Member

Do you get any useful results out of it? Any benchmark results I can share?
I have mixed feelings about supporting various limited modes: on the one hand I can see how they are useful, but on the other they increase the testing surface and their long-term value is unclear. To me the deciding factor is how useful they actually are.

@anders-heimer

Contributor Author

Hi Roman,

Yes, I get some useful results, but it is far from frontier-level quality.

The way I think about it is somewhat similar to RISC-V: we care about it, but not because every RISC-V board beats x86/Arm systems today.

On the 999-entry benchmark, the setup produced 33 exact detections and 143 strong partials. That is 176/999 with useful, actionable signal end-to-end, or 176/640 among valid reviewed reports, using Opus as the judge. The main limitation is still budget and model quality, with 339 budget failures and 7 protocol failures.
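
For reference, the arithmetic behind those two ratios can be reproduced with a few lines. The function name is hypothetical; the counts are the ones quoted above.

```python
def signal_rates(exact: int, strong_partial: int, total: int, valid: int) -> dict:
    """Combine exact and strong-partial detections and express them as rates."""
    useful = exact + strong_partial
    return {
        "useful": useful,
        "end_to_end_pct": round(100 * useful / total, 1),
        "among_valid_pct": round(100 * useful / valid, 1),
    }


rates = signal_rates(exact=33, strong_partial=143, total=999, valid=640)
print(rates)  # {'useful': 176, 'end_to_end_pct': 17.6, 'among_valid_pct': 27.5}
```

So the end-to-end rate is about 17.6%, and the rate among valid reviewed reports is 27.5%.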

Attached the result of the benchmark.

Thanks, Anders
review-comments.html
