RFC: Add bounded local model review #235
Add OpenAI-compatible response schema support with provider-specific fallback handling. Keep schema-backed requests strict only when the provider supports the requested format, and account for downgraded schema payloads before posting.
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Add an explicit OpenAI-compatible bounded local model policy flag. The provider name remains a transport selection only; the new flag defaults to false and must be enabled separately for local model review policy.
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Use the bounded local policy flag to switch review stages into a Sashiko-controlled loop with local guide caps, bounded exploration, schema-only finalization, prompt preflight margins, typed budget failures, and minimal fallback. Default OpenAI-compatible behavior continues to use the native stage protocol unless bounded_local_model is enabled.
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>
5ce10c4 to 14f3884
14f3884 to 5a60cd9
Preserve Stage 8 and Stage 9 accounting across finalization, fallback, and repair paths. Ensure retained concerns, dropped candidates, repaired findings, and source concern IDs stay consistent when Stage 9 compacts or repairs model output.
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Classify skipped, failed, and reviewed benchmark rows with typed terminal statuses. Report reviewed-row denominators and Stage 8/Stage 9 accounting without applying fixture-specific scorer overrides.
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>

Document bounded local OpenAI-compatible provider behavior and keep benchmark profile comparisons separate between neutral, generic-static, and regression modes. Expose the review toggles and bounded local model setting in the public configuration docs.
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Anders Heimer <anders.heimer@est.tech>
5a60cd9 to 5c1a3be
Do you get any useful results out of it? Any benchmark results I can share?

Hi Roman,

Yes, I get some useful results, but the quality is far from frontier level. I think of it somewhat like RISC-V: we care about RISC-V, but not because every RISC-V board beats x86/Arm systems today.

On the 999-entry benchmark, the setup produced 33 exact detections and 143 strong partials, using Opus as the judge. That is 176/999 useful, actionable signal end to end, or 176/640 among valid reviewed reports. The main limitation is still budget and model quality, with 339 budget failures and 7 protocol failures.

I have attached the benchmark results.

Thanks,
Anders
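For readers comparing the two denominators, the rates quoted above can be reproduced with a few lines of arithmetic. This is only a sketch; the variable names are illustrative and the inputs are the figures from the comment.

```python
# Figures quoted in the comment above; variable names are illustrative only.
exact = 33
strong_partial = 143
total_entries = 999      # every row in the benchmark
valid_reviewed = 640     # rows that produced a valid reviewed report

useful = exact + strong_partial             # exact detections plus strong partials
end_to_end_rate = useful / total_entries    # share of all benchmark entries
reviewed_rate = useful / valid_reviewed     # share of valid reviewed reports

print(f"useful signal: {useful}")
print(f"end to end: {end_to_end_rate:.1%}")
print(f"among valid reviewed: {reviewed_rate:.1%}")
```

The gap between the two rates comes entirely from rows that never produced a valid report, which is why the choice of denominator matters when comparing runs.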
This RFC adds a bounded review mode for OpenAI-compatible local models and
tightens the review/benchmark accounting needed to evaluate that mode.
The main goal is to keep the provider selection as a transport choice while
allowing local models to opt into Sashiko-controlled review behavior: bounded
tool exploration, schema-only finalization, token preflight checks, typed budget
failures, and minimal fallback output when the full staged review cannot fit.
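As a rough illustration of how a token preflight check, a typed budget failure, and a minimal fallback could fit together, here is a minimal sketch. All names (`BudgetFailure`, `preflight`, `run_stage`), the margin value, and the token counts are hypothetical and are not taken from the Sashiko codebase.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical names throughout; this is not the Sashiko implementation.


@dataclass(frozen=True)
class BudgetFailure:
    """Typed result for a prompt that cannot fit the configured budget."""
    stage: str
    prompt_tokens: int
    limit: int


def preflight(stage: str, prompt_tokens: int, max_input_tokens: int,
              margin: int) -> Optional[BudgetFailure]:
    """Return None when the prompt fits under the limit, else a typed failure."""
    limit = max_input_tokens - margin  # keep a safety margin below the hard cap
    if prompt_tokens > limit:
        return BudgetFailure(stage, prompt_tokens, limit)
    return None


def run_stage(stage: str, prompt_tokens: int, fallback_tokens: int,
              max_input_tokens: int, margin: int) -> Union[str, BudgetFailure]:
    """Try the full prompt first, then the minimal fallback, then fail typed."""
    if preflight(stage, prompt_tokens, max_input_tokens, margin) is None:
        return "full"
    failure = preflight(stage, fallback_tokens, max_input_tokens, margin)
    if failure is None:
        return "minimal-fallback"
    return failure


# Assumed numbers: a 120k-token prompt does not fit a 90k cap, a 40k fallback does.
print(run_stage("stage-3", 120_000, 40_000, 90_000, 2_000))
```

Returning a typed value instead of raising a generic error is one way to let benchmark reporting distinguish budget failures from other failures.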
The series also preserves Stage 8/Stage 9 accounting through finalization and
benchmark reporting, so retained concerns, dropped candidates, final findings,
and failed/skipped rows are easier to audit.
I am currently running the 1000-patch benchmark on an L40S:
Reviews:
Failed 274
Reviewed 545
Skipped 7
In Review 1
Patch statuses:
Reviewed 545
FailedBudget 271
Failed 3
Skipped 7
Pending/null 60
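The two views above are consistent with each other: the 274 failed reviews equal the 271 FailedBudget plus 3 Failed patch statuses. A small sketch of that tally follows; the enum and the choice of which statuses count as terminal are assumptions made for illustration, not the project's actual types.

```python
from enum import Enum


class PatchStatus(Enum):
    # Labels copied from the status list above; the enum itself is hypothetical.
    REVIEWED = "Reviewed"
    FAILED_BUDGET = "FailedBudget"
    FAILED = "Failed"
    SKIPPED = "Skipped"
    PENDING = "Pending/null"


# Assumption: every status except Pending/null is terminal.
TERMINAL = {
    PatchStatus.REVIEWED,
    PatchStatus.FAILED_BUDGET,
    PatchStatus.FAILED,
    PatchStatus.SKIPPED,
}

counts = {
    PatchStatus.REVIEWED: 545,
    PatchStatus.FAILED_BUDGET: 271,
    PatchStatus.FAILED: 3,
    PatchStatus.SKIPPED: 7,
    PatchStatus.PENDING: 60,
}

terminal_rows = sum(n for status, n in counts.items() if status in TERMINAL)
failed_rows = counts[PatchStatus.FAILED_BUDGET] + counts[PatchStatus.FAILED]
reviewed_share = counts[PatchStatus.REVIEWED] / terminal_rows

print(f"terminal rows: {terminal_rows}")
print(f"failed rows: {failed_rows}")
print(f"reviewed share of terminal rows: {reviewed_share:.1%}")
```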
Benchmark configuration:
benchmark=benchmarks/benchmark.json
repo=third_party/linux
provider=openai-compatible
model=qwen3-coder-30b-a3b-128k-t0
bounded_local_model=true
max_input_tokens=90000
context_window_size=131072
max_tokens=6144
temperature=0.0
api_timeout_secs=1200
review_concurrency=1
review_timeout_seconds=7200
review_max_retries=0
enable_static_bug_seeds=true
enable_targeted_bug_pattern_prescan=false
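As a sanity check on the token settings above, the sketch below parses the relevant key=value lines and computes how much of the context window remains after the input and output caps. It assumes that max_tokens is the output cap and that input and output tokens share the same context window; neither assumption is confirmed by the configuration itself.

```python
# Subset of the configuration above, kept in the same key=value form.
CONFIG_TEXT = """\
max_input_tokens=90000
context_window_size=131072
max_tokens=6144
"""


def parse_config(text: str) -> dict:
    """Parse key=value lines into a dict, ignoring lines without '='."""
    pairs = (line.split("=", 1) for line in text.splitlines() if "=" in line)
    return {key.strip(): value.strip() for key, value in pairs}


cfg = parse_config(CONFIG_TEXT)
max_input = int(cfg["max_input_tokens"])
max_output = int(cfg["max_tokens"])        # assumed to be the output token cap
window = int(cfg["context_window_size"])

# Assumption: prompt and completion tokens both count against the window.
headroom = window - (max_input + max_output)

print(f"input + output budget: {max_input + max_output}")
print(f"unused context window: {headroom}")
```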