
feat: add Qwen3-8B benchmark scripts for ALFWorld and Search #56

Open
yuaofan0-oss wants to merge 24 commits into aiming-lab:main from yuaofan0-oss:claude/qwen3-skillrl-benchmarks-cQ7xV

Conversation

@yuaofan0-oss

Add GRPO training/eval scripts for Qwen3-8B on the ALFWorld and Search benchmarks with skill memory. Key differences from Qwen2.5-7B scripts:

  • Set MODEL_PATH default to Qwen/Qwen3-8B
  • Enable trust_remote_code for Qwen3 architecture
  • Disable thinking mode via override_config.enable_thinking=false to keep action outputs concise for interactive environments (see the tokenizer-level sketch below)

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
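
For context, a minimal sketch of what the thinking-mode switch corresponds to at the tokenizer level, assuming Qwen3's standard Hugging Face chat template (the prompt is illustrative, and this is not the verl code path the scripts actually configure):

```python
# Illustrative only: Qwen3's chat template exposes the same switch that
# override_config.enable_thinking=false toggles on the rollout side.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)

messages = [{"role": "user", "content": "go to shelf 1"}]  # hypothetical ALFWorld-style prompt
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # suppress the <think>...</think> block before the action
)
print(prompt)
```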

@richard-peng-xia
Contributor

Hi @yuaofan0-oss,

Thanks for your contribution.

Have you run these scripts? I'm a little concerned about potential conflicts between the VeRL version and other packages.

claude added 23 commits May 10, 2026 16:23
- examples/sft/qwen3_8b/: LLaMA-Factory full fine-tune configs for
  ALFWorld and Search benchmarks using SkillRL-SFT-Data; batch_size=1
  with gradient_accumulation=16 and gradient_checkpointing to fit 8×H100
- run_sft_{alfworld,search}.sh: torchrun launchers with correct env vars
  (ALFWORLD_DATA, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True)
- run_alfworld_qwen3_8b.sh: remove enable_thinking=false, bump
  max_response_length 512→1024 to accommodate <think> blocks
- run_search_qwen3_8b.sh: remove enable_thinking=false

Qwen3-8B SFT data already contains <think> blocks; disabling thinking
mode would corrupt token prediction targets.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
- lr: 1e-5 → 2e-5 (aligned with SkillRL paper full-finetune SFT)
- per_device_train_batch_size: 1→2, gradient_accumulation: 16→8
  (effective batch stays 128 = 8 GPUs × 2 × 8 grad_accum)
- deepspeed: ds_z3_config.json baked into YAML (no manual echo needed)
- save_steps: 500→100 for finer checkpoint recovery
- run_sft_*.sh: nohup + tee to timestamped log file, PID saved to
  logs/sft/*.pid, WANDB_MODE=offline to avoid network hang

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
SFT YAML:
- lr: 1e-4 (paper exact value, not 2e-5)
- effective batch: 16 = 8 GPUs × 2 per-device × 1 grad_accum
- ALFWorld: ~7500 examples / Search: ~2400 examples, 3 epochs each

RL scripts:
- max_prompt_length: 4096→6000 (alfworld), 5000→6000 (search)
- max_response_length: 700→1024 (search, matches paper Table 4)

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
- gpu_memory_utilization=0.5 for ALFWorld (was 0.6, caused OOM)
- gpu_memory_utilization=0.4 for Search (unchanged)
- Greedy decoding: temperature=0, do_sample=False
- ALFWorld seen: eval_in_distribution
- ALFWorld unseen: eval_out_of_distribution
- Search uses retrieval at http://127.0.0.1:8000/retrieve
- total_epochs=0 + val_before_train=True = pure eval mode

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
Trains a single unified checkpoint from both benchmarks combined,
matching the SkillLora mixed-training approach. Uses LLaMA-Factory
comma-separated dataset syntax to merge skillrl_alfworld_sft and
skillrl_search_sft automatically.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
Without shift, extra args like trainer.n_gpus_per_node=1 were
being consumed as the ENGINE value instead of being passed through to python.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
free_cache_engine=False uses vLLM's cumem sleep/wake, which conflicts
with FSDP all-gather and causes OOM. free_cache_engine=True releases the
KV cache between steps, which is safe for single-pass eval.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
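
For readers unfamiliar with the mechanism named here, a standalone sketch of vLLM's cumem sleep/wake cycle, assuming a vLLM build with sleep-mode support (plain vLLM usage, not the verl integration):

```python
# Illustrative only: the cumem-based sleep/wake cycle that free_cache_engine
# governs inside verl, shown as plain vLLM calls.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-8B", enable_sleep_mode=True)  # cumem allocator

llm.generate(["hello"])
llm.sleep(level=1)  # offload weights and discard the KV cache, freeing GPU memory
# ... an FSDP all-gather / training step would run here ...
llm.wake_up()       # restore weights and re-allocate the KV cache
```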
Aligned with SkillRL paper: lr=1e-6, 150 epochs, group_size=8,
max_steps=50, dynamic skill memory update enabled. Fixes the data path
to use the absolute inspire path and removes the prepare step that
requires internet access.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
verl asserts that CUDA graphs must be disabled when using
free_cache_engine=True. Set enforce_eager=True in all eval scripts.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
Both scripts start from qwen3-8b-sft-mixed checkpoint and use
paper-aligned hyperparameters (lr=1e-6, group_size=8/4, 150/100 epochs).
Data paths hardcoded to shared /inspire/ filesystem.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
Concurrent requests caused GPU contention (1600%+ CPU usage), making each
query effectively take 60+ seconds. A semaphore ensures one query at a
time, eliminating the contention; requests queue up server-side instead
of timing out client-side.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
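
A minimal sketch of the serialization described above, assuming an aiohttp-style async client (the function name and payload shape are illustrative, not the repo's actual tool code):

```python
# Illustrative sketch: serialize retrieval calls with an asyncio.Semaphore so
# env workers queue for the GPU retriever instead of contending for it.
import asyncio

import aiohttp

SEARCH_URL = "http://127.0.0.1:8000/retrieve"  # from the eval scripts above
_sem = asyncio.Semaphore(1)  # one in-flight query at a time

async def search(session: aiohttp.ClientSession, query: str) -> dict:
    async with _sem:  # every other worker waits here
        async with session.post(SEARCH_URL, json={"query": query}) as resp:
            resp.raise_for_status()
            return await resp.json()
```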
…he_engine=True

Switch train dataset to train_small.parquet (256 rows), reduce val_batch to 64,
group_size to 2, total_epochs to 20 (20 total steps). Enable enforce_eager and
free_cache_engine to avoid cumem OOM. Enable param_offload to reduce VRAM pressure.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
…he_engine

Dynamic skill update requires Azure OpenAI credentials which are not available.
Disable it for RL training (static skills from JSON are still used).
Also enable enforce_eager=True + free_cache_engine=True to prevent cumem OOM.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
…h requests

With 256 batch x 2 group_size = 512 concurrent env workers all calling FAISS
simultaneously → CPU thrash at 1301% → all timeout. Reduce to 16x1 = 16
concurrent workers so semaphore-serialized searches stay manageable.
Use full train.parquet (batch=16 gives ~10600 steps/epoch, set epochs=1).

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
…arch

The training client sends {"queries": [...]} (a plural list) but the server
expected {"query": "..."} (a singular string), causing 422 errors that were
silently treated as timeouts. Fix by accepting queries: List[str] and using
retriever.batch_search(), which encodes all queries in one GPU pass and
does a single batched FAISS search. This is both correct AND fast:
16 queries are searched in roughly the same time as 1.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
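
A sketch of the fix, assuming a FastAPI server in front of the retriever (the DummyRetriever stub and the batch_search signature are placeholders for the repo's actual FAISS-backed retriever):

```python
# Illustrative sketch: accept the plural payload the training client sends and
# answer it with one batched encode plus one batched FAISS lookup.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

class DummyRetriever:
    """Stand-in for the repo's FAISS-backed retriever."""
    def batch_search(self, queries: List[str], k: int = 3) -> List[List[str]]:
        return [[f"doc {i} for {q}" for i in range(k)] for q in queries]

retriever = DummyRetriever()
app = FastAPI()

class RetrieveRequest(BaseModel):
    queries: List[str]  # matches the client's {"queries": [...]}
    topk: int = 3       # hypothetical default

@app.post("/retrieve")
def retrieve(req: RetrieveRequest):
    # One GPU pass over all queries, one batched FAISS search.
    return {"result": retriever.batch_search(req.queries, k=req.topk)}
```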
…) formats

skyrl_gym/tools/search.py sends {"query": str} while verl/tools/search_tool.py
sends {"queries": List[str]}. Accept both to avoid 422 errors from skyrl tool.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
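
Extending the request model from the previous sketch, one way to accept both shapes with a single pydantic v2 model (the normalization step is illustrative):

```python
# Illustrative sketch: normalize {"query": str} and {"queries": List[str]}
# into one list so both clients get a 200 instead of a 422.
from typing import List, Optional

from pydantic import BaseModel, model_validator

class RetrieveRequest(BaseModel):
    query: Optional[str] = None          # skyrl_gym/tools/search.py shape
    queries: Optional[List[str]] = None  # verl/tools/search_tool.py shape
    topk: int = 3

    @model_validator(mode="after")
    def normalize(self):
        if self.queries is None:
            if self.query is None:
                raise ValueError("need 'query' or 'queries'")
            self.queries = [self.query]
        return self
```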
With train_batch_size=16 and n_gpus=4:
size_divisor = log_prob_micro_batch_size_per_gpu * 4 = 32*4 = 128 > 16
→ adjust_batch tries to copy 112 samples from 16 → ValueError.
Fix: set log_prob_micro_batch_size_per_gpu=4 → size_divisor=16=batch_size.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
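
The arithmetic spelled out (numbers from the commit message; the divisibility condition paraphrases verl's batch-adjustment check):

```python
# Numbers from the commit above: why adjust_batch failed and why the fix works.
train_batch_size = 16
n_gpus = 4

size_divisor = 32 * n_gpus  # log_prob_micro_batch_size_per_gpu=32 -> 128
print(size_divisor - train_batch_size)  # 112 samples would have to be copied -> ValueError

size_divisor = 4 * n_gpus   # log_prob_micro_batch_size_per_gpu=4 -> 16
assert train_batch_size % size_divisor == 0  # divides evenly, no adjustment needed
```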
- Add env.search.timeout=300 (was default 60s, too short for serialized FAISS)
- Increase semaphore from 1 to 4: with 16 env workers x 14s each,
  semaphore=1 takes 224s (all timeout), semaphore=4 takes ~56s (within 300s)

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
index_cpu_to_all_gpus tries to pre-allocate 32GB of contiguous temp memory,
which fails even on an 80GB H100. Use index_cpu_to_gpu with setTempMemory(0)
to let CUDA manage allocations directly.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
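
A sketch of the workaround, assuming a prebuilt CPU index on disk (the index path is illustrative; the faiss-gpu calls are standard API):

```python
# Illustrative sketch: move the index to a single GPU with scratch memory
# disabled, instead of index_cpu_to_all_gpus and its large contiguous
# temp-memory pre-allocation.
import faiss

cpu_index = faiss.read_index("e5_flat.index")  # hypothetical index path

res = faiss.StandardGpuResources()
res.setTempMemory(0)  # 0 bytes of pre-allocated scratch; CUDA allocates on demand

gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # device 0
```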