vllm-default-thinking-budget

A startup hook for vLLM that injects default values for thinking_token_budget and presence_penalty, two sampling parameters that --override-generation-config doesn't propagate.

License: MIT

The Problem

vLLM's --override-generation-config flag claims to set defaults for all sampling parameters, but in practice several fields silently don't propagate to SamplingParams, notably:

  • thinking_token_budget: needed to cap reasoning length on the Qwen3 family, which has a documented infinite-loop issue in thinking mode (truncation rate up to 27.5% on hard problems)
  • presence_penalty: recommended at 1.0–1.5 by the Qwen team to mitigate repetition

This means every client (Open WebUI, Hermes, Claude Code, custom apps) must send these parameters in every request, or risk runaway reasoning that consumes max_tokens (often 32K+ tokens of repeated phrases like "Wait, but actually...").

Cloud providers like Alibaba DashScope handle this server-side via their thinking_budget API parameter. Self-hosted vLLM users have had no equivalent — until now.

What This Does

Patches SamplingParams.from_optional (the factory used by both /v1/chat/completions and /v1/messages endpoints in vLLM) to inject default values from environment variables when the request doesn't specify them.

Per-request values still take precedence — this only fills gaps.

export VLLM_DEFAULT_THINKING_BUDGET=8192
export VLLM_DEFAULT_PRESENCE_PENALTY=1.0
vllm serve ...

Now every request without these parameters set will receive the defaults automatically — across OpenAI and Anthropic API endpoints.
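The gap-filling behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the loader's actual code: FakeSamplingParams is a stand-in for vLLM's class, env_defaults and with_defaults are hypothetical helper names, and the sketch assumes that callers pass fields as keyword arguments and that an unset field arrives as None.

```python
import functools
import os


def env_defaults(environ=os.environ):
    """Read optional defaults from the environment; unset variables are skipped."""
    defaults = {}
    budget = environ.get("VLLM_DEFAULT_THINKING_BUDGET")
    if budget:
        defaults["thinking_token_budget"] = int(budget)
    penalty = environ.get("VLLM_DEFAULT_PRESENCE_PENALTY")
    if penalty:
        defaults["presence_penalty"] = float(penalty)
    return defaults


def with_defaults(factory, defaults):
    """Wrap a keyword-argument factory so that missing (None) fields get defaults."""
    @functools.wraps(factory)
    def wrapper(*args, **kwargs):
        for name, value in defaults.items():
            # Assumption: "not specified" means the key is absent or None.
            if kwargs.get(name) is None:
                kwargs[name] = value
        return factory(*args, **kwargs)
    return wrapper


class FakeSamplingParams:
    """Stand-in for the real class, used only to make the sketch runnable."""

    @staticmethod
    def from_optional(**kwargs):
        return kwargs


# Install the wrapper in place of the original static method.
FakeSamplingParams.from_optional = staticmethod(
    with_defaults(
        FakeSamplingParams.from_optional,
        env_defaults({
            "VLLM_DEFAULT_THINKING_BUDGET": "8192",
            "VLLM_DEFAULT_PRESENCE_PENALTY": "1.0",
        }),
    )
)
```

With the wrapper installed, FakeSamplingParams.from_optional(presence_penalty=0.5) keeps the caller's 0.5 and only the missing thinking_token_budget is filled in.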

Why It's Not a Normal vLLM Plugin

vLLM has an official plugin system via the vllm.general_plugins entry point, but plugins load after SamplingParams is imported, which is too late to patch it.

SamplingParams is a msgspec.Struct, which forbids subclassing with __init__ ("Struct types cannot define __init__") and uses a C-level init path that doesn't trigger Python __post_init__ overrides.

The from_optional static method is the only Python-level entry point that can be reliably patched, and .pth files are the only mechanism that runs early enough in Python's startup to install the patch before vLLM imports SamplingParams.
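For readers unfamiliar with .pth files: Python's site module executes any line in a .pth file that starts with "import" while it processes a site directory. The sketch below demonstrates only that mechanism in a throwaway directory; demo_pth_hook is a hypothetical module name, and the real loader's contents are not reproduced here.

```python
import os
import site
import sys
import tempfile


def run_pth_demo():
    """Create a temporary site directory and show that its .pth import line runs."""
    site_dir = tempfile.mkdtemp()

    # Hypothetical hook module standing in for the real loader.
    with open(os.path.join(site_dir, "demo_pth_hook.py"), "w") as f:
        f.write("INSTALLED = True\n")

    # A .pth line beginning with "import" is executed rather than added to sys.path.
    with open(os.path.join(site_dir, "demo_pth_hook.pth"), "w") as f:
        f.write("import demo_pth_hook\n")

    # addsitedir() processes .pth files the same way interpreter startup
    # does for the real site-packages directory.
    site.addsitedir(site_dir)
    return "demo_pth_hook" in sys.modules
```

Calling run_pth_demo() returns True once the hook module has been imported as a side effect of processing the .pth file, which is the same trigger the installed .pth file relies on at interpreter startup.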

Install

git clone https://github.com/palmfuture/vllm-default-thinking-budget
cd vllm-default-thinking-budget
./install.sh /path/to/your/vllm/venv

The script auto-detects your venv's site-packages and copies two files there:

  • vllm_default_thinking_budget.pth — Python startup trigger
  • vllm_default_thinking_budget_loader.py — the actual patch
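If you want to confirm that both files landed in the right place, a small check along these lines may help. It is a sketch that assumes it is run with the venv's own Python interpreter; missing_hook_files is a hypothetical helper, not part of this repository.

```python
import os
import sysconfig

# The two file names listed above.
EXPECTED_FILES = (
    "vllm_default_thinking_budget.pth",
    "vllm_default_thinking_budget_loader.py",
)


def missing_hook_files(site_packages=None):
    """Return the expected files that are not present in site-packages."""
    if site_packages is None:
        # "purelib" is the site-packages directory of the running interpreter.
        site_packages = sysconfig.get_paths()["purelib"]
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(site_packages, name))
    ]
```

An empty list from missing_hook_files() means both files are in the active interpreter's site-packages.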

Usage

After installing, set environment variables before running vLLM:

export VLLM_DEFAULT_THINKING_BUDGET=8192      # max reasoning tokens
export VLLM_DEFAULT_PRESENCE_PENALTY=1.0      # repetition suppression

vllm serve your-model \
  --override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}' \
  --reasoning-parser qwen3 \
  --reasoning-config '{"reasoning_start_str": "<think>", "reasoning_end_str": " I need to give the final answer now.</think>"}' \
  ...

You should see this line in vLLM startup logs:

[vllm_default_thinking_budget] patched from_optional: budget=8192, presence=1.0

Verify it works

After startup, send a request without specifying the parameters:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "messages": [{"role": "user", "content": "Hi"}]}'

Then check vLLM's request log — you should see your defaults applied:

SamplingParams(..., presence_penalty=1.0, ..., thinking_token_budget=8192, ...)
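To automate this check, the two fields can be pulled out of a logged SamplingParams line with a regular expression. This is a rough sketch that assumes the log line has the name=value layout shown above; applied_defaults is a hypothetical helper name.

```python
import re


def applied_defaults(log_line):
    """Extract presence_penalty and thinking_token_budget from a SamplingParams repr."""
    found = {}
    for name in ("presence_penalty", "thinking_token_budget"):
        # Capture everything after "name=" up to the next comma, paren, or space.
        match = re.search(rf"\b{name}=([^,)\s]+)", log_line)
        if match:
            found[name] = match.group(1)
    return found
```

Values are returned as strings exactly as logged, so compare them against the strings you exported, for example "8192" and "1.0".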

Recommended Defaults for Qwen3.5/3.6 Family

Tested with Qwen3.6-35B-A3B-GPTQ-Int4 on vLLM 0.20.0:

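| Workload              | Budget      | Presence | Notes                             |
|-----------------------|-------------|----------|-----------------------------------|
| Agent / tool calls    | 2048–4096   | 1.0      | Low reasoning need, fast response |
| General chat          | 8192        | 1.0      | Balanced                          |
| Coding tasks          | 8192–16384  | 1.0      | Includes refactor and debug       |
| Math / hard reasoning | 16384–32768 | 1.0      | Approaching official 80K guidance |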

Lower budgets reduce loop risk and latency at the cost of accuracy on hard tasks. The reasoning_end_str set via vLLM's --reasoning-config ensures the model produces a coherent final answer when the budget is exceeded.
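If you launch vLLM from a script, the recommendations above can be kept as presets. The sketch below is one possible arrangement: the workload keys, the launch_env helper, and the choice between the low and high end of each range are illustrative assumptions, not part of this project.

```python
# (low, high) thinking budgets per workload, copied from the table above.
# The dictionary keys are hypothetical short names for the table rows.
BUDGET_RANGES = {
    "agent": (2048, 4096),
    "chat": (8192, 8192),
    "coding": (8192, 16384),
    "math": (16384, 32768),
}


def launch_env(workload, prefer_high=False, presence_penalty=1.0):
    """Build the two environment variables for a given workload preset."""
    low, high = BUDGET_RANGES[workload]
    return {
        "VLLM_DEFAULT_THINKING_BUDGET": str(high if prefer_high else low),
        "VLLM_DEFAULT_PRESENCE_PENALTY": str(presence_penalty),
    }


# Example use (not executed here): merge into the current environment
# before starting the server, e.g.
#   subprocess.run(["vllm", "serve", "your-model"],
#                  env={**os.environ, **launch_env("coding")})
```

Defaulting to the low end of each range favours latency and loop avoidance; pass prefer_high=True for harder tasks.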

Uninstall

./uninstall.sh /path/to/your/vllm/venv

Or manually:

rm /path/to/venv/lib/python*/site-packages/vllm_default_thinking_budget.pth
rm /path/to/venv/lib/python*/site-packages/vllm_default_thinking_budget_loader.py

Limitations

  • Patches only from_optional. Other code paths that bypass it won't be affected — but none have been observed in vLLM 0.20.0's OpenAI or Anthropic adapters.
  • vLLM API changes may break the patch — pin to vLLM versions you've tested. Verified on vLLM 0.20.0 with msgspec 0.18.x.
  • Does not handle frequency_penalty, repetition_penalty, or other sampling fields. If you need those, send them per-request.

Why I Built This

I quantize and publish Qwen3.x models on Hugging Face under palmfuture. Users kept reporting infinite loops in <think> blocks despite documented presence_penalty recommendations, because their clients didn't always send the right sampling parameters. After verifying that --override-generation-config silently drops these fields and that msgspec blocks normal monkey-patching, the .pth approach turned out to be the only reliable fix that doesn't require modifying vLLM source or adding a network proxy.

If you find this useful, leave a star or open an issue with feedback.

License

MIT
