CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

Sohwi Lim · Lee Hyoseok · Jungjoon Park · Tae-Hyun Oh

Overview

Paper | Project Page

This repository is the official implementation of CLAY, which has been accepted to CVPR 2026.

CLAY is an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. (CLAY-EVAL will be released soon.)

Installation

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install transformers
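
After installing, a quick check can confirm that the environment matches the pins above. The snippet below is a minimal sketch and not part of the repository: `EXPECTED` and `check_environment` are hypothetical names, and it only reads package metadata through the standard library, so it runs even when a package is missing.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above; transformers is unpinned there.
EXPECTED = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "torchaudio": "2.5.1",
    "transformers": None,
}


def check_environment(expected=EXPECTED):
    """Return a {package: status} report for the expected packages."""
    report = {}
    for name, pin in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        # CUDA wheels carry a local suffix such as "+cu121"; compare the base version.
        if pin is None or found.split("+")[0] == pin:
            report[name] = "ok (" + found + ")"
        else:
            report[name] = "mismatch (found " + found + ", expected " + pin + ")"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(name + ": " + status)
```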

Dataset Preparation

Stanford40

Please download the Stanford40 dataset from IC|TC. Once downloaded, the data should be organized as follows:

stanford40
├── action
│   ├── applauding_001.jpg
│   └── ...
├── location
│   ├── educational_institution
│   │   ├── looking_through_a_microscope_003.jpg
│   │   └── ...
│   ├── natural_environment
│   └── ...
├── mood
│   ├── adventurous
│   │   ├── climbing_091.jpg
│   │   └── ...
│   ├── focused
│   └── ...
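
To sanity-check a download against the layout above, the directory can be walked and each image paired with a label. This is a rough sketch under assumptions, not the repository's loader: `index_condition` is a hypothetical helper, and it assumes that `action` images take their label from the filename prefix while `location` and `mood` images take it from the parent folder, as the tree suggests.

```python
from pathlib import Path


def index_condition(root, condition):
    """Collect (image_path, label) pairs for one condition folder.

    Assumed convention, inferred from the tree above:
      - images directly under the condition folder (e.g. action/) are
        labeled by filename prefix: "applauding_001.jpg" -> "applauding"
      - images in a subfolder (e.g. mood/adventurous/) are labeled by
        the subfolder name
    """
    cond_dir = Path(root) / condition
    samples = []
    for path in sorted(cond_dir.rglob("*.jpg")):
        if path.parent == cond_dir:
            # Drop the trailing "_<index>" part of the filename stem.
            label = path.stem.rsplit("_", 1)[0]
        else:
            label = path.parent.name
        samples.append((path, label))
    return samples
```

For example, `index_condition("stanford40", "mood")` would pair `climbing_091.jpg` with the label `adventurous` under these assumptions.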

Inference

python main.py --dataset <DATASET> --condition <CONDITION> --model_name <MODEL>

Example

# run all benchmarks and models
bash run.sh
# or specify your own dataset, condition, model
python main.py --dataset stanford40_action --condition action --model_name clip-base

Evaluation

Results are reported as mAP (mean Average Precision).
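
For readers unfamiliar with the metric, the sketch below shows one common way to compute mAP for retrieval. It is an illustration under assumptions rather than the repository's evaluation code: the function names are hypothetical, and it assumes that each query yields a ranked list of gallery items flagged 1 when the item shares the query's label under the given condition and 0 otherwise.

```python
def average_precision(ranked_relevance):
    """AP for one query, given 0/1 relevance flags in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_relevance):
    """Mean of per-query AP values over all queries."""
    aps = [average_precision(flags) for flags in all_ranked_relevance]
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    # Two toy queries: relevant items at ranks 1 and 3, then at rank 2 only.
    print(mean_average_precision([[1, 0, 1], [0, 1]]))
```

In the toy case, the first query has AP = (1/1 + 2/3) / 2 and the second has AP = 1/2, so the mean is about 0.667.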

Citation

@inproceedings{lim2026clay,
  title={CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space},
  author={Lim, Sohwi and Hyoseok, Lee and Park, Jungjoon and Oh, Tae-Hyun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
