
feat: add Qwen3-8B benchmark scripts for ALFWorld and Search #56

Open
yuaofan0-oss wants to merge 24 commits into aiming-lab:main from yuaofan0-oss:claude/qwen3-skillrl-benchmarks-cQ7xV

Conversation

@yuaofan0-oss

Add GRPO training/eval scripts for Qwen3-8B on the ALFWorld and Search benchmarks with skill memory. Key differences from Qwen2.5-7B scripts:

  • Set MODEL_PATH default to Qwen/Qwen3-8B
  • Enable trust_remote_code for Qwen3 architecture
  • Disable thinking mode via override_config.enable_thinking=false to keep action outputs concise for interactive environments (see the tokenizer-level sketch below)

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
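
For context, a minimal sketch of what the thinking-mode switch corresponds to at the tokenizer level, assuming Qwen3's standard Hugging Face chat template (the prompt is illustrative, and this is not the verl code path the scripts actually configure):

```python
# Illustrative only: Qwen3's chat template exposes the same switch that
# override_config.enable_thinking=false toggles on the rollout side.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)

messages = [{"role": "user", "content": "go to shelf 1"}]  # hypothetical ALFWorld-style prompt
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # suppress the <think>...</think> block before the action
)
print(prompt)
```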

@richard-peng-xia
Contributor

Hi @yuaofan0-oss,

Thanks for your contribution.

Have you run these scripts? I'm a little concerned about potential conflicts between the VeRL version and other packages.

claude added 23 commits May 10, 2026 16:23
- examples/sft/qwen3_8b/: LLaMA-Factory full fine-tune configs for
  ALFWorld and Search benchmarks using SkillRL-SFT-Data; batch_size=1
  with gradient_accumulation=16 and gradient_checkpointing to fit 8×H100
- run_sft_{alfworld,search}.sh: torchrun launchers with correct env vars
  (ALFWORLD_DATA, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True)
- run_alfworld_qwen3_8b.sh: remove enable_thinking=false, bump
  max_response_length 512→1024 to accommodate <think> blocks
- run_search_qwen3_8b.sh: remove enable_thinking=false

Qwen3-8B SFT data already contains <think> blocks; disabling thinking
mode would corrupt token prediction targets.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
- lr: 1e-5 → 2e-5 (aligned with SkillRL paper full-finetune SFT)
- per_device_train_batch_size: 1→2, gradient_accumulation: 16→8
  (effective batch stays 128 = 8 GPUs × 2 × 8 grad_accum)
- deepspeed: ds_z3_config.json baked into YAML (no manual echo needed)
- save_steps: 500→100 for finer checkpoint recovery
- run_sft_*.sh: nohup + tee to timestamped log file, PID saved to
  logs/sft/*.pid, WANDB_MODE=offline to avoid network hang

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
SFT YAML:
- lr: 1e-4 (paper exact value, not 2e-5)
- effective batch: 16 = 8 GPUs × 2 per-device × 1 grad_accum
- ALFWorld: ~7500 examples / Search: ~2400 examples, 3 epochs each

RL scripts:
- max_prompt_length: 4096→6000 (alfworld), 5000→6000 (search)
- max_response_length: 700→1024 (search, matches paper Table 4)

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
- gpu_memory_utilization=0.5 for ALFWorld (was 0.6, caused OOM)
- gpu_memory_utilization=0.4 for Search (unchanged)
- Greedy decoding: temperature=0, do_sample=False
- ALFWorld seen: eval_in_distribution
- ALFWorld unseen: eval_out_of_distribution
- Search uses retrieval at http://127.0.0.1:8000/retrieve
- total_epochs=0 + val_before_train=True = pure eval mode

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
Trains a single unified checkpoint from both benchmarks combined,
matching the SkillLora mixed-training approach. Uses LLaMA-Factory
comma-separated dataset syntax to merge skillrl_alfworld_sft and
skillrl_search_sft automatically.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
Without shift, extra args like trainer.n_gpus_per_node=1 were
being consumed as the ENGINE value instead of being passed through to python.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
free_cache_engine=False uses vLLM's cumem sleep/wake, which conflicts
with FSDP all-gather and causes OOM. free_cache_engine=True releases the
KV cache between steps, which is safe for single-pass eval.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
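
For readers unfamiliar with the mechanism named here, a standalone sketch of vLLM's cumem sleep/wake cycle, assuming a vLLM build with sleep-mode support (plain vLLM usage, not the verl integration):

```python
# Illustrative only: the cumem-based sleep/wake cycle that free_cache_engine
# governs inside verl, shown as plain vLLM calls.
from vllm import LLM

llm = LLM(model="Qwen/Qwen3-8B", enable_sleep_mode=True)  # cumem allocator

llm.generate(["hello"])
llm.sleep(level=1)  # offload weights and discard the KV cache, freeing GPU memory
# ... an FSDP all-gather / training step would run here ...
llm.wake_up()       # restore weights and re-allocate the KV cache
```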
Aligned with SkillRL paper: lr=1e-6, 150 epochs, group_size=8,
max_steps=50, dynamic skill memory update enabled. Fixes the data path
to use the absolute inspire path and removes the prepare step that
requires internet access.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
verl asserts that CUDA graphs must be disabled when using
free_cache_engine=True. Set enforce_eager=True in all eval scripts.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
Both scripts start from qwen3-8b-sft-mixed checkpoint and use
paper-aligned hyperparameters (lr=1e-6, group_size=8/4, 150/100 epochs).
Data paths hardcoded to shared /inspire/ filesystem.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
Concurrent requests caused GPU contention (1600%+ CPU usage), making each
query effectively take 60+ seconds. A semaphore ensures one query at a
time, eliminating the contention; requests queue up server-side instead
of timing out client-side.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
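
A minimal sketch of the serialization described above, assuming an aiohttp-style async client (the function name and payload shape are illustrative, not the repo's actual tool code):

```python
# Illustrative sketch: serialize retrieval calls with an asyncio.Semaphore so
# env workers queue for the GPU retriever instead of contending for it.
import asyncio

import aiohttp

SEARCH_URL = "http://127.0.0.1:8000/retrieve"  # from the eval scripts above
_sem = asyncio.Semaphore(1)  # one in-flight query at a time

async def search(session: aiohttp.ClientSession, query: str) -> dict:
    async with _sem:  # every other worker waits here
        async with session.post(SEARCH_URL, json={"query": query}) as resp:
            resp.raise_for_status()
            return await resp.json()
```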
…he_engine=True

Switch train dataset to train_small.parquet (256 rows), reduce val_batch to 64,
group_size to 2, total_epochs to 20 (20 total steps). Enable enforce_eager and
free_cache_engine to avoid cumem OOM. Enable param_offload to reduce VRAM pressure.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
…he_engine

Dynamic skill update requires Azure OpenAI credentials which are not available.
Disable it for RL training (static skills from JSON are still used).
Also enable enforce_eager=True + free_cache_engine=True to prevent cumem OOM.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
…h requests

With 256 batch x 2 group_size = 512 concurrent env workers all calling FAISS
simultaneously → CPU thrash at 1301% → all timeout. Reduce to 16x1 = 16
concurrent workers so semaphore-serialized searches stay manageable.
Use full train.parquet (batch=16 gives ~10600 steps/epoch, set epochs=1).

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
…arch

The training client sends {"queries": [...]} (a plural list) but the server
expected {"query": "..."} (a singular string), causing 422 errors that were
silently treated as timeouts. Fix by accepting queries: List[str] and using
retriever.batch_search(), which encodes all queries in one GPU pass and
does a single batched FAISS search. This is both correct AND fast:
16 queries are searched in roughly the same time as 1.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
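
A sketch of the fix, assuming a FastAPI server in front of the retriever (the DummyRetriever stub and the batch_search signature are placeholders for the repo's actual FAISS-backed retriever):

```python
# Illustrative sketch: accept the plural payload the training client sends and
# answer it with one batched encode plus one batched FAISS lookup.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

class DummyRetriever:
    """Stand-in for the repo's FAISS-backed retriever."""
    def batch_search(self, queries: List[str], k: int = 3) -> List[List[str]]:
        return [[f"doc {i} for {q}" for i in range(k)] for q in queries]

retriever = DummyRetriever()
app = FastAPI()

class RetrieveRequest(BaseModel):
    queries: List[str]  # matches the client's {"queries": [...]}
    topk: int = 3       # hypothetical default

@app.post("/retrieve")
def retrieve(req: RetrieveRequest):
    # One GPU pass over all queries, one batched FAISS search.
    return {"result": retriever.batch_search(req.queries, k=req.topk)}
```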
…) formats

skyrl_gym/tools/search.py sends {"query": str} while verl/tools/search_tool.py
sends {"queries": List[str]}. Accept both to avoid 422 errors from skyrl tool.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
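
Extending the request model from the previous sketch, one way to accept both shapes with a single pydantic v2 model (the normalization step is illustrative):

```python
# Illustrative sketch: normalize {"query": str} and {"queries": List[str]}
# into one list so both clients get a 200 instead of a 422.
from typing import List, Optional

from pydantic import BaseModel, model_validator

class RetrieveRequest(BaseModel):
    query: Optional[str] = None          # skyrl_gym/tools/search.py shape
    queries: Optional[List[str]] = None  # verl/tools/search_tool.py shape
    topk: int = 3

    @model_validator(mode="after")
    def normalize(self):
        if self.queries is None:
            if self.query is None:
                raise ValueError("need 'query' or 'queries'")
            self.queries = [self.query]
        return self
```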
With train_batch_size=16 and n_gpus=4:
size_divisor = log_prob_micro_batch_size_per_gpu * 4 = 32*4 = 128 > 16
→ adjust_batch tries to copy 112 samples from 16 → ValueError.
Fix: set log_prob_micro_batch_size_per_gpu=4 → size_divisor=16=batch_size.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
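
The arithmetic spelled out (numbers from the commit message; the divisibility condition paraphrases verl's batch-adjustment check):

```python
# Numbers from the commit above: why adjust_batch failed and why the fix works.
train_batch_size = 16
n_gpus = 4

size_divisor = 32 * n_gpus  # log_prob_micro_batch_size_per_gpu=32 -> 128
print(size_divisor - train_batch_size)  # 112 samples would have to be copied -> ValueError

size_divisor = 4 * n_gpus   # log_prob_micro_batch_size_per_gpu=4 -> 16
assert train_batch_size % size_divisor == 0  # divides evenly, no adjustment needed
```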
- Add env.search.timeout=300 (was default 60s, too short for serialized FAISS)
- Increase semaphore from 1 to 4: with 16 env workers x 14s each,
  semaphore=1 takes 224s (all timeout), semaphore=4 takes ~56s (within 300s)

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
index_cpu_to_all_gpus tries to pre-allocate 32GB of contiguous temp memory,
which fails even on an 80GB H100. Use index_cpu_to_gpu with setTempMemory(0)
to let CUDA manage allocations directly.

https://claude.ai/code/session_01T8mX6Wn2MDJXXkxhZxS1yX
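
A sketch of the workaround, assuming a prebuilt CPU index on disk (the index path is illustrative; the faiss-gpu calls are standard API):

```python
# Illustrative sketch: move the index to a single GPU with scratch memory
# disabled, instead of index_cpu_to_all_gpus and its large contiguous
# temp-memory pre-allocation.
import faiss

cpu_index = faiss.read_index("e5_flat.index")  # hypothetical index path

res = faiss.StandardGpuResources()
res.setTempMemory(0)  # 0 bytes of pre-allocated scratch; CUDA allocates on demand

gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # device 0
```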