Sohwi Lim · Lee Hyoseok · Jungjoon Park · Tae-Hyun Oh
This repository is the official implementation of CLAY, which has been accepted to CVPR 2026.
CLAY is an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. (CLAY-EVAL will be released soon.)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install transformers

Please download the stanford40 dataset from IC|TC. The data should then be organized as follows:
stanford40
├── action
│ ├── `applauding_001.jpg`
│ └── ...
├── location
│ ├── educational_institution
│ │ ├── `looking_through_a_microscope_003.jpg`
│ │ └── ...
│ ├── natural_environment
│ └── ...
├── mood
│ ├── adventurous
│ │ ├── `climbing_091.jpg`
│ │ └── ...
│ ├── focused
│ └── ...
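
As a quick check that the download matches the layout above, the following minimal sketch lists the images and labels for one condition folder. It is not part of this repository: the function name `index_condition`, the accepted image suffixes, and the rule that flat folders such as `action` take their label from the file name without its trailing number are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Assumed set of image extensions; the tree above only shows .jpg files.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def index_condition(root, condition):
    """Return sorted (relative_path, label) pairs for one condition folder."""
    base = Path(root) / condition
    samples = []
    for path in sorted(base.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        if path.parent == base:
            # Flat layout (e.g. action/): assume the label is the file stem
            # without its trailing index, applauding_001 -> applauding.
            label = re.sub(r"_\d+$", "", path.stem)
        else:
            # Nested layout (e.g. mood/adventurous/): use the first subfolder.
            label = path.relative_to(base).parts[0]
        samples.append((path.relative_to(base).as_posix(), label))
    return samples
```

For example, `index_condition("stanford40", "mood")` would pair `adventurous/climbing_091.jpg` with the label `adventurous` under these assumptions.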
python main.py --dataset <DATASET> --condition <CONDITION> --model_name <MODEL>

# run all benchmarks and models
bash run.sh
# or specify your own dataset, condition, model
python main.py --dataset stanford40_action --condition action --model_name clip-base

Results are reported as mAP (mean Average Precision).
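
For readers unfamiliar with the metric, here is a minimal sketch of how retrieval mAP can be computed from a pairwise similarity matrix and per-image labels. The exact protocol used by `main.py` is not described here, so this assumes a leave-one-out setup in which every image queries all other images and a result counts as relevant when it shares the query's label. The function names are hypothetical.

```python
def average_precision(ranked_relevance):
    """AP for one query, given True/False relevance flags in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(similarity, labels):
    """mAP over all queries; each item retrieves every other item (assumed)."""
    ap_values = []
    for q, row in enumerate(similarity):
        candidates = [i for i in range(len(labels)) if i != q]
        # Rank the remaining items by descending similarity to the query.
        candidates.sort(key=lambda i: row[i], reverse=True)
        relevance = [labels[i] == labels[q] for i in candidates]
        if any(relevance):  # skip queries with no positive among the others
            ap_values.append(average_precision(relevance))
    return sum(ap_values) / len(ap_values) if ap_values else 0.0
```

Here `similarity` is a square list of lists (or any row-indexable matrix) where `similarity[q][i]` scores image `i` against query `q`, and `labels` holds the class of each image under the chosen condition.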
@inproceedings{lim2026clay,
title={CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space},
author={Lim, Sohwi and Hyoseok, Lee and Park, Jungjoon and Oh, Tae-Hyun},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2026}
}