CycliST: A Benchmark for Visual Reasoning on Cyclical State Transitions

This is the official GitHub repository for the paper CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions, Simon Kohaut, Daniel Ochs, Shun Zhang, Benedict Flade, Julian Eggert, Kristian Kersting, and Devendra Singh Dhami, DMLR 2026.

CycliST is a synthetic video benchmark for evaluating vision-language models (VLMs) on cyclical state transitions. It generates 3D-rendered scenes with cyclic motion patterns (orbits, linear motion, rotations, attribute changes), produces structured question-answer pairs about those cycles, and evaluates VLMs on these.

The pipeline has four stages:

Scene Rendering  →  Question Generation  →  VLM Inference  →  LLM Judging
 (Blender)           (question_engine)       (eval_vqa)       (judge_vqa)

Installation

Prerequisites

Python 3.10
CUDA-capable GPU for local model inference
Blender 4.0 for scene rendering (not needed for inference/judging only)

Python environment

Create and activate the conda environment (sets up Python 3.10):

conda env create -f environment.yml
conda activate cycle_env

Install all Python packages — core deps, PyTorch with CUDA 12.4, LLaVA-NeXT, and the full inference stack:

bash install_packages.sh

If the machine injects system-level packages that conflict with the conda env (run once, then reactivate):

conda env config vars set PYTHONNOUSERSITE=1 -n cycle_env
conda activate cycle_env

Blender for scene generation

Install Blender 4.0 and register the cyclist package with Blender's bundled Python:

sudo bash install_blender.sh

This downloads Blender 4.0.1 to /opt/blender, installs its system dependencies, adds it to PATH, and writes a .pth file so Blender's Python can import cyclist. Run it from the repo root so the path registration points to the right directory.

LLM judge server (SGLang)

The judge runs a local SGLang server and requires a separate conda environment to avoid dependency conflicts with the eval stack:

conda create -n cycle_env_sglang python=3.10 -y
conda install -n cycle_env_sglang -c conda-forge gcc=12 gxx=12 -y   # C++20 needed for JIT kernels
conda activate cycle_env_sglang
PYTHONNOUSERSITE=1 bash install_sg_lang.sh

API keys

Create a .env file in the repo root with the keys you need:

HF_HOME=/path/to/huggingface/cache
HUGGINGFACE_HUB_CACHE=/path/to/huggingface/cache
HF_TOKEN=your_huggingface_token          # required to download gated models (Llama judge)
GOOGLE_API_KEY=your_google_api_key       # required for Gemini models
TOKENIZERS_PARALLELISM=false

Usage

1 — Scene Rendering

Render scenes using the provided shell scripts. Each script generates train/test/val splits (300/150/150 videos).

# Single-cycle scenes (one object with one cycle type)
bash scripts/scene/unicycle.sh

# Two-cycle scenes
bash scripts/scene/bicycle.sh

# Three-cycle scenes
bash scripts/scene/tricycle.sh

# Single-cycle scenes with cluttered background (4–9 extra objects)
bash scripts/scene/unicycle_cluttered.sh

Output is written to output/scenes/ (scene JSONs) and output/videos/ (rendered MP4s).

Requires Blender 4.0. The code was run on linux and installation might change for other OS.

2 — Question Generation

Generate question-answer pairs from the rendered scenes:

bash scripts/questions/question_gen.sh

This runs all 12 question templates across the scene output. Output is written to output/questions/.

To inspect generated questions interactively, use cyclist/questions/question_inspection.ipynb.

3 — VLM Inference

Run VLM inference on the generated questions. Models are selected by number (1–9):

#	Model
1	`lmms-lab/LLaVA-Video-7B-Qwen2`
2	`llava-hf/llava-onevision-qwen2-7b-ov-chat-hf`
3	`OpenGVLab/InternVideo2_5_Chat_8B`
4	`gemini-2.0-flash`
5	`gemini-2.5-flash`
6	`lmms-lab/LLaVA-Video-72B-Qwen2`
7	`llava-hf/llava-onevision-qwen2-72b-ov-chat-hf`
8	`OpenGVLab/InternVL3-8B`
9	`OpenGVLab/InternVL3-78B`

# VQA inference (model 1 = LLaVA-Video-7B, dataset = unicycle, fps = 8)
bash scripts/eval/eval_vqa.sh --model 1 --dataset unicycle --fps 8

# Scene understanding inference
bash scripts/eval/eval_scene_understanding.sh --model 1 --dataset unicycle --fps 8

For Gemini models, upload videos first:

bash scripts/eval/upload_gemini.sh --dataset unicycle

Answers are written to output/eval/answers/.

4 — LLM Judging

Judging uses a local SGLang server running meta-llama/Meta-Llama-3-70B-Instruct.

Start the SGLang server in a separate terminal before running the judge (loads meta-llama/Meta-Llama-3-70B-Instruct on port 30000):

bash start_sg_lang.sh

Initialise the request offset tracker once per run:

echo '{"offset": 0}' > offset.json

Run the judge from the repo root (requires cycle_env):

# Judge VQA answers
PYTHONNOUSERSITE=1 python -m cyclist.judge.judge_vqa

# Judge scene understanding captions
PYTHONNOUSERSITE=1 python -m cyclist.judge.judge_scene_understanding

Answers are read from output/eval/answers/ and metrics are written as *_metrics.json files alongside each answer CSV. Use the companion notebooks in cyclist/judge/ to generate results tables and LaTeX output:

vqa_results_counting.ipynb
vqa_results_descriptive.ipynb
vqa_results_representative.ipynb
vqa_results_scene_understanding.ipynb

Citation

@article{kohaut2026cyclist,
      title={CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions},
      author={Simon Kohaut and Daniel Ochs and Shun Zhang and Benedict Flade and Julian Eggert and Kristian Kersting and Devendra Singh Dhami},
      journal={Journal of Data-centric Machine Learning Research},
      year={2026},
      url={https://openreview.net/forum?id=l03g53HUL2},
}

Acknowledgments

CycliST builds on ideas from CLEVR.

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
assets		assets
cyclist		cyclist
docs		docs
output		output
scripts		scripts
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
environment.yml		environment.yml
install_blender.sh		install_blender.sh
install_packages.sh		install_packages.sh
install_sg_lang.sh		install_sg_lang.sh
pyproject.toml		pyproject.toml
start_sg_lang.sh		start_sg_lang.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CycliST: A Benchmark for Visual Reasoning on Cyclical State Transitions

Installation

Prerequisites

Python environment

Blender for scene generation

LLM judge server (SGLang)

API keys

Usage

1 — Scene Rendering

2 — Question Generation

3 — VLM Inference

4 — LLM Judging

Citation

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CycliST: A Benchmark for Visual Reasoning on Cyclical State Transitions

Installation

Prerequisites

Python environment

Blender for scene generation

LLM judge server (SGLang)

API keys

Usage

1 — Scene Rendering

2 — Question Generation

3 — VLM Inference

4 — LLM Judging

Citation

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages