A startup hook for vLLM that injects defaults for
thinking_token_budget (and presence_penalty), sampling parameters that --override-generation-config doesn't propagate.
vLLM's --override-generation-config flag claims to set defaults for all sampling parameters, but in practice several fields silently don't propagate to SamplingParams, notably:
- thinking_token_budget: needed to cap reasoning length on the Qwen3 family, which has a documented infinite-loop issue in thinking mode (truncation rate up to 27.5% on hard problems)
- presence_penalty: recommended at 1.0–1.5 by the Qwen team to mitigate repetition
This means every client (Open WebUI, Hermes, Claude Code, custom apps) must send these parameters in every request, or risk runaway reasoning that consumes max_tokens (often 32K+ tokens of repeated phrases like "Wait, but actually...").
Cloud providers like Alibaba DashScope handle this server-side via their thinking_budget API parameter. Self-hosted vLLM users have had no equivalent — until now.
The hook patches SamplingParams.from_optional (the factory used by both the /v1/chat/completions and /v1/messages endpoints in vLLM) to inject default values from environment variables when the request doesn't specify them.
Per-request values still take precedence — this only fills gaps.
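The gap-filling behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the shipped loader: FakeSamplingParams is a hypothetical stand-in for vLLM's SamplingParams, install_defaults is an invented helper name, and the sketch assumes callers pass these fields as keyword arguments.

```python
import functools
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class FakeSamplingParams:
    """Hypothetical stand-in for vLLM's SamplingParams (not the real class)."""

    presence_penalty: float = 0.0
    thinking_token_budget: Optional[int] = None

    @staticmethod
    def from_optional(presence_penalty=None, thinking_token_budget=None):
        # Mimics a factory that turns "not provided" (None) into real values.
        return FakeSamplingParams(
            presence_penalty=0.0 if presence_penalty is None else presence_penalty,
            thinking_token_budget=thinking_token_budget,
        )


def install_defaults(cls, environ: Mapping[str, str] = os.environ) -> None:
    """Wrap cls.from_optional so unset fields fall back to env-var defaults."""
    raw_budget = environ.get("VLLM_DEFAULT_THINKING_BUDGET")
    raw_presence = environ.get("VLLM_DEFAULT_PRESENCE_PENALTY")
    defaults = {}
    if raw_budget:
        defaults["thinking_token_budget"] = int(raw_budget)
    if raw_presence:
        defaults["presence_penalty"] = float(raw_presence)

    original = cls.from_optional  # class access unwraps the staticmethod

    @functools.wraps(original)
    def patched(*args, **kwargs):
        # Assumption: callers pass these fields by keyword, never positionally.
        for name, value in defaults.items():
            if kwargs.get(name) is None:  # only fill gaps; 0.0 counts as set
                kwargs[name] = value
        return original(*args, **kwargs)

    cls.from_optional = staticmethod(patched)


install_defaults(
    FakeSamplingParams,
    {"VLLM_DEFAULT_THINKING_BUDGET": "8192", "VLLM_DEFAULT_PRESENCE_PENALTY": "1.0"},
)
filled = FakeSamplingParams.from_optional()
explicit = FakeSamplingParams.from_optional(presence_penalty=1.5, thinking_token_budget=1024)
```

Here filled picks up both defaults, while explicit keeps the values the caller sent. Checking for None rather than truthiness matters: a client that deliberately sends presence_penalty=0.0 should not have it overwritten.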
export VLLM_DEFAULT_THINKING_BUDGET=8192
export VLLM_DEFAULT_PRESENCE_PENALTY=1.0
vllm serve ...
Now every request without these parameters set will receive the defaults automatically, across both the OpenAI and Anthropic API endpoints.
vLLM has an official plugin system via the vllm.general_plugins entry point, but plugins load after SamplingParams is imported, which is too late to patch it.
SamplingParams is a msgspec.Struct, which forbids subclassing with __init__ ("Struct types cannot define __init__") and uses a C-level init path that doesn't trigger Python __post_init__ overrides.
The from_optional static method is the only Python-level entry point that can be reliably patched, and .pth files are the only mechanism that runs early enough in Python's startup to install the patch before vLLM imports SamplingParams.
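To illustrate the .pth mechanism on its own: Python's site module scans each site directory for .pth files and executes any line that begins with "import". The sketch below reproduces that in a temporary directory using site.addsitedir. The file and module names (demo_hook.pth, demo_pth_loader) are made up for the demo, and it shows only the trigger, not how the real loader applies its patch.

```python
import os
import site
import sys
import tempfile

with tempfile.TemporaryDirectory() as site_dir:
    # A hypothetical loader module; the real one would install the patch.
    with open(os.path.join(site_dir, "demo_pth_loader.py"), "w") as f:
        f.write("INSTALLED = True\n")

    # A .pth line starting with "import" is executed by the site module.
    with open(os.path.join(site_dir, "demo_hook.pth"), "w") as f:
        f.write("import demo_pth_loader\n")

    # At interpreter startup this happens automatically for site-packages;
    # here we trigger the same processing by hand for the temp directory.
    site.addsitedir(site_dir)

    loaded = "demo_pth_loader" in sys.modules
    installed = getattr(sys.modules.get("demo_pth_loader"), "INSTALLED", False)

    # Clean up the path entry so the deleted directory does not linger.
    if site_dir in sys.path:
        sys.path.remove(site_dir)
```

In a real install the same processing happens for the venv's site-packages before any application code runs, which is why the loader gets a chance to act before vLLM is imported.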
git clone https://github.com/palmfuture/vllm-default-thinking-budget
cd vllm-default-thinking-budget
./install.sh /path/to/your/vllm/venv
The script auto-detects your venv's site-packages and copies two files there:
- vllm_default_thinking_budget.pth: Python startup trigger
- vllm_default_thinking_budget_loader.py: the actual patch
After installing, set environment variables before running vLLM:
export VLLM_DEFAULT_THINKING_BUDGET=8192 # max reasoning tokens
export VLLM_DEFAULT_PRESENCE_PENALTY=1.0 # repetition suppression
vllm serve your-model \
--override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}' \
--reasoning-parser qwen3 \
--reasoning-config '{"reasoning_start_str": "<think>", "reasoning_end_str": " I need to give the final answer now.</think>"}' \
...
You should see this line in vLLM startup logs:
[vllm_default_thinking_budget] patched from_optional: budget=8192, presence=1.0
After startup, send a request without specifying the parameters:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "your-model", "messages": [{"role": "user", "content": "Hi"}]}'
Then check vLLM's request log; you should see your defaults applied:
SamplingParams(..., presence_penalty=1.0, ..., thinking_token_budget=8192, ...)
Tested with Qwen3.6-35B-A3B-GPTQ-Int4 on vLLM 0.20.0:
| Workload | Budget | Presence | Notes |
|---|---|---|---|
| Agent / tool calls | 2048–4096 | 1.0 | Low reasoning need, fast response |
| General chat | 8192 | 1.0 | Balanced |
| Coding tasks | 8192–16384 | 1.0 | Includes refactor and debug |
| Math / hard reasoning | 16384–32768 | 1.0 | Approaching official 80K guidance |
Lower budgets reduce loop risk and latency at the cost of accuracy on hard tasks. The reasoning_end_str injection (set via vLLM's --reasoning-config, as in the serve command above) closes the <think> block so that the model produces a coherent final answer when the budget is exceeded.
./uninstall.sh /path/to/your/vllm/venv
Or manually:
rm /path/to/venv/lib/python*/site-packages/vllm_default_thinking_budget.pth
rm /path/to/venv/lib/python*/site-packages/vllm_default_thinking_budget_loader.py
- Patches only from_optional. Other code paths that bypass it won't be affected, but none have been observed in vLLM 0.20.0's OpenAI or Anthropic adapters.
- vLLM API changes may break the patch, so pin to vLLM versions you've tested. Verified on vLLM 0.20.0 with msgspec 0.18.x.
- Does not handle frequency_penalty, repetition_penalty, or other sampling fields. If you need those, send them per-request.
I quantize and publish Qwen3.x models on Hugging Face under palmfuture. Users kept reporting infinite loops in <think> blocks despite documented presence_penalty recommendations, because their clients didn't always send the right sampling parameters. After verifying that --override-generation-config silently drops these fields and that msgspec blocks normal monkey-patching, the .pth approach turned out to be the only reliable fix that doesn't require modifying vLLM source or adding a network proxy.
If you find this useful, leave a star or open an issue with feedback.
- vLLM issue #28070 (root cause): --override-generation-config doesn't work for reasoning fields
- Qwen3.6 issue #88: the infinite-loop problem
- Alibaba DashScope thinking_budget: the equivalent server-side parameter on Alibaba Cloud
MIT