vllm-default-thinking-budget

A startup hook for vLLM that injects default values for thinking_token_budget and presence_penalty, two sampling parameters that --override-generation-config doesn't propagate.

The Problem

vLLM's --override-generation-config flag claims to set defaults for all sampling parameters, but in practice several fields silently don't propagate to SamplingParams, notably:

thinking_token_budget: needed to cap reasoning length on the Qwen3 family, which has a documented infinite-loop issue in thinking mode (truncation rate up to 27.5% on hard problems)
presence_penalty: recommended at 1.0–1.5 by the Qwen team to mitigate repetition

This means every client (Open WebUI, Hermes, Claude Code, custom apps) must send these parameters in every request, or risk runaway reasoning that consumes max_tokens (often 32K+ tokens of repeated phrases like "Wait, but actually...").

Cloud providers like Alibaba DashScope handle this server-side via their thinking_budget API parameter. Self-hosted vLLM users have had no equivalent — until now.

What This Does

Patches SamplingParams.from_optional (the factory used by both /v1/chat/completions and /v1/messages endpoints in vLLM) to inject default values from environment variables when the request doesn't specify them.

Per-request values still take precedence — this only fills gaps.

export VLLM_DEFAULT_THINKING_BUDGET=8192
export VLLM_DEFAULT_PRESENCE_PENALTY=1.0
vllm serve ...

Now every request without these parameters set will receive the defaults automatically — across OpenAI and Anthropic API endpoints.
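The gap-filling behaviour can be sketched roughly as follows. This is an illustration, not the project's loader: FakeSamplingParams is a stand-in class (vLLM's real from_optional takes many more arguments), read_defaults and install_defaults are hypothetical helper names, and the wrapper assumes callers pass these fields as keyword arguments.

```python
import functools
import os


class FakeSamplingParams:
    """Stand-in for vLLM's SamplingParams; the real signature differs."""

    def __init__(self, presence_penalty=0.0, thinking_token_budget=None):
        self.presence_penalty = presence_penalty
        self.thinking_token_budget = thinking_token_budget

    @staticmethod
    def from_optional(presence_penalty=None, thinking_token_budget=None):
        # None means "the request did not specify this field".
        return FakeSamplingParams(
            presence_penalty=0.0 if presence_penalty is None else presence_penalty,
            thinking_token_budget=thinking_token_budget,
        )


def read_defaults(env=os.environ):
    """Collect defaults from the two environment variables, if they are set."""
    defaults = {}
    if env.get("VLLM_DEFAULT_THINKING_BUDGET"):
        defaults["thinking_token_budget"] = int(env["VLLM_DEFAULT_THINKING_BUDGET"])
    if env.get("VLLM_DEFAULT_PRESENCE_PENALTY"):
        defaults["presence_penalty"] = float(env["VLLM_DEFAULT_PRESENCE_PENALTY"])
    return defaults


def install_defaults(cls, defaults):
    """Wrap cls.from_optional so keyword arguments left as None get a default."""
    original = cls.from_optional

    @functools.wraps(original)
    def patched(*args, **kwargs):
        for name, value in defaults.items():
            # Only fill gaps; an explicit per-request value (even 0.0) wins.
            if kwargs.get(name) is None:
                kwargs[name] = value
        return original(*args, **kwargs)

    cls.from_optional = staticmethod(patched)
```

Comparing against None rather than testing truthiness is what lets a client that deliberately sends presence_penalty=0.0 keep that value.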

Why It's Not a Normal vLLM Plugin

vLLM has an official plugin system via the vllm.general_plugins entry point, but plugins load after SamplingParams is imported, which is too late to patch it.

SamplingParams is a msgspec.Struct, which forbids subclassing with __init__ ("Struct types cannot define __init__") and uses a C-level init path that doesn't trigger Python __post_init__ overrides.

The from_optional static method is the only Python-level entry point that can be reliably patched, and .pth files are the only mechanism that runs early enough in Python's startup to install the patch before vLLM imports SamplingParams.
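To see why a .pth file runs so early, here is a small self-contained demonstration of the mechanism using only the standard library. The file and module names (demo_pth_hook.pth, demo_pth_loader) are hypothetical and are not the ones this project installs. Python's site module processes every .pth file in site-packages at interpreter startup and executes any line that begins with "import"; calling site.addsitedir by hand reproduces that step for a temporary directory.

```python
import pathlib
import site
import sys
import tempfile


def demo_pth_startup_hook():
    """Write a .pth file plus a loader module, then let `site` process them."""
    directory = pathlib.Path(tempfile.mkdtemp())

    # Hypothetical loader module; a real one would install its patch here.
    (directory / "demo_pth_loader.py").write_text("LOADED = True\n")

    # Lines in a .pth file that start with "import" are executed by `site`.
    (directory / "demo_pth_hook.pth").write_text("import demo_pth_loader\n")

    # For site-packages this happens automatically at interpreter startup.
    site.addsitedir(str(directory))

    # The loader was imported as a side effect of processing the .pth file.
    return sys.modules["demo_pth_loader"].LOADED
```

Because this happens before any application code runs, the loader gets a chance to act before vLLM itself is imported.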

Install

git clone https://github.com/palmfuture/vllm-default-thinking-budget
cd vllm-default-thinking-budget
./install.sh /path/to/your/vllm/venv

The script auto-detects your venv's site-packages and copies two files there:

vllm_default_thinking_budget.pth — Python startup trigger
vllm_default_thinking_budget_loader.py — the actual patch

Usage

After installing, set environment variables before running vLLM:

export VLLM_DEFAULT_THINKING_BUDGET=8192      # max reasoning tokens
export VLLM_DEFAULT_PRESENCE_PENALTY=1.0      # repetition suppression

vllm serve your-model \
  --override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}' \
  --reasoning-parser qwen3 \
  --reasoning-config '{"reasoning_start_str": "<think>", "reasoning_end_str": " I need to give the final answer now.</think>"}' \
  ...

You should see this line in vLLM startup logs:

[vllm_default_thinking_budget] patched from_optional: budget=8192, presence=1.0

Verify it works

After startup, send a request without specifying the parameters:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "messages": [{"role": "user", "content": "Hi"}]}'

Then check vLLM's request log — you should see your defaults applied:

SamplingParams(..., presence_penalty=1.0, ..., thinking_token_budget=8192, ...)
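If you want to script this check, a rough helper along the following lines can pull the two fields out of a logged SamplingParams(...) line. The function names are hypothetical, and the regular expression is deliberately crude: it only handles simple scalar values and assumes the log format shown above.

```python
import re


def sampling_fields(log_line):
    """Return the name=value pairs found in a SamplingParams(...) repr."""
    return dict(re.findall(r"(\w+)=([^,\s)]+)", log_line))


def defaults_applied(log_line, budget, presence):
    """Check that the logged values match the expected defaults."""
    fields = sampling_fields(log_line)
    try:
        return (
            int(fields["thinking_token_budget"]) == budget
            and float(fields["presence_penalty"]) == presence
        )
    except (KeyError, ValueError):
        # Field missing, or logged as something non-numeric such as None.
        return False
```

For example, defaults_applied(line, 8192, 1.0) is true for the log line shown above and false when thinking_token_budget is logged as None.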

Recommended Defaults for Qwen3.5/3.6 Family

Tested with Qwen3.6-35B-A3B-GPTQ-Int4 on vLLM 0.20.0:

Workload	Budget (thinking tokens)	Presence penalty	Notes
Agent / tool calls	2048–4096	1.0	Low reasoning need, fast response
General chat	8192	1.0	Balanced
Coding tasks	8192–16384	1.0	Includes refactor and debug
Math / hard reasoning	16384–32768	1.0	Approaching official 80K guidance

Lower budgets reduce loop risk and latency at the cost of accuracy on hard tasks. The reasoning_end_str injection (configured via vLLM's --reasoning-config) ensures the model produces a coherent final answer when the budget is exceeded.
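For launch scripts that serve different workloads, the table can be encoded as a small lookup that produces the two environment variables. This is only a convenience sketch: the workload keys and the env_for helper are hypothetical, and picking the low or high end of a range is left to the caller.

```python
# Ranges copied from the table above: (min budget, max budget, presence penalty).
RECOMMENDED = {
    "agent": (2048, 4096, 1.0),
    "chat": (8192, 8192, 1.0),
    "coding": (8192, 16384, 1.0),
    "math": (16384, 32768, 1.0),
}


def env_for(workload, prefer_high=False):
    """Return the environment variables to export for a given workload."""
    low, high, presence = RECOMMENDED[workload]
    budget = high if prefer_high else low
    return {
        "VLLM_DEFAULT_THINKING_BUDGET": str(budget),
        "VLLM_DEFAULT_PRESENCE_PENALTY": str(presence),
    }
```

The returned dictionary can be merged into os.environ, or passed as the env argument of subprocess.run, before starting vllm serve.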

Uninstall

./uninstall.sh /path/to/your/vllm/venv

Or manually:

rm /path/to/venv/lib/python*/site-packages/vllm_default_thinking_budget.pth
rm /path/to/venv/lib/python*/site-packages/vllm_default_thinking_budget_loader.py

Limitations

Patches only from_optional. Other code paths that bypass it won't be affected — but none have been observed in vLLM 0.20.0's OpenAI or Anthropic adapters.
vLLM API changes may break the patch — pin to vLLM versions you've tested. Verified on vLLM 0.20.0 with msgspec 0.18.x.
Does not handle frequency_penalty, repetition_penalty, or other sampling fields. If you need those, send them per-request.

Why I Built This

I quantize and publish Qwen3.x models on Hugging Face under palmfuture. Users kept reporting infinite loops in <think> blocks despite documented presence_penalty recommendations, because their clients didn't always send the right sampling parameters. After verifying that --override-generation-config silently drops these fields and that msgspec blocks normal monkey-patching, the .pth approach turned out to be the only reliable fix that doesn't require modifying vLLM source or adding a network proxy.

If you find this useful, leave a star or open an issue with feedback.

License

MIT
