PanoSAMic is a semantic segmentation model for panoramic images that integrates the pre-trained Segment Anything Model (SAM) encoder with multi-modal fusion capabilities. Existing image foundation models are not optimized for spherical images, having been trained primarily on perspective images. PanoSAMic addresses this by modifying the SAM encoder to output multi-stage features and introducing a novel spatio-modal fusion module that allows the model to select relevant modalities and features for different areas of the input.
Our semantic decoder uses spherical attention and dual view fusion to overcome the distortions and edge discontinuity often associated with panoramic images. PanoSAMic achieves state-of-the-art results on:
- Stanford2D3DS: RGB, RGB-D, and RGB-D-N modalities
- Matterport3D: RGB and RGB-D modalities
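The edge discontinuity mentioned above arises because the left and right borders of an equirectangular panorama are neighbors on the sphere but far apart in the image. The sketch below illustrates one generic way a second view can help: shift the panorama circularly by half its width so the seam moves to the image center, run the same predictor on both views, shift the second prediction back, and average. It only illustrates the general idea. The names `dual_view_predict` and `predict_fn` are hypothetical, and PanoSAMic's decoder may fuse its two views differently.

```python
import numpy as np

def dual_view_predict(pano, predict_fn):
    """Average predictions from a panorama and its half-width-shifted copy.

    pano: array of shape (H, W, C) in equirectangular layout.
    predict_fn: hypothetical callable mapping (H, W, C) -> (H, W, K) scores.
    """
    shift = pano.shape[1] // 2
    # Second view: roll along the width axis so the original seam sits mid-image.
    shifted = np.roll(pano, shift, axis=1)
    scores_a = predict_fn(pano)
    # Roll the second prediction back so both score maps are pixel-aligned.
    scores_b = np.roll(predict_fn(shifted), -shift, axis=1)
    return 0.5 * (scores_a + scores_b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pano = rng.random((4, 8, 3))
    # Toy stand-in predictor: per-pixel scores that ignore neighboring pixels.
    toy = lambda x: np.stack([x.sum(-1), -x.sum(-1)], axis=-1)
    print(dual_view_predict(pano, toy).shape)  # (4, 8, 2)
```

For a predictor that looks only at single pixels, as the toy one does, both views give identical results. The benefit appears with predictors that use spatial context, since content near the seam gets full context in the shifted view.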
GPU requirements: ≥16 GB VRAM for ViT-H inference · ≥24 GB for training · Apple Silicon (MPS) device support is included but has not been verified on physical hardware
- Clone the repository and install dependencies:
  git clone git@github.com:dfki-av/PanoSAMic.git
  cd PanoSAMic
  uv sync
- SAM backbone weights (choose one option):
  - Auto-download (recommended): pass --sam_weights_path to any script and the weights are fetched from Meta's servers on first use and cached under ~/.cache/panosamic/sam/.
  - Manual download: grab the weights from the SAM repository and place or symlink them in sam_weights/:
    ln -s /path/to/sam/weights/* sam_weights/
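As a rough illustration of the two weight locations described above, the following sketch looks for a checkpoint first in a local sam_weights/ directory and then in the cache directory. The helper `find_sam_weights` is hypothetical and not part of the repository. It assumes checkpoint files end in `.pth` and carry the variant name in the file name, as the official SAM checkpoints do, and it does not download anything.

```python
from pathlib import Path

# Cache location quoted in the text above.
DEFAULT_CACHE = Path.home() / ".cache" / "panosamic" / "sam"

def find_sam_weights(vit_model, search_dirs=(Path("sam_weights"), DEFAULT_CACHE)):
    """Return the first checkpoint whose name mentions the variant, else None.

    Assumes files end in .pth and contain the variant name (e.g. "vit_h").
    """
    for directory in search_dirs:
        directory = Path(directory)
        if not directory.is_dir():
            continue
        for candidate in sorted(directory.glob("*.pth")):
            if vit_model in candidate.name:
                # resolve() follows symlinks such as those created with `ln -s`.
                return candidate.resolve()
    return None
```

Calling `find_sam_weights("vit_h")` returns a resolved path if a matching file exists in either location and `None` otherwise, which a caller could treat as the signal to download.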
Train PanoSAMic on a dataset using the training script:
python panosamic/evaluation/train.py \
--dataset_path /path/to/processed/dataset \
--config_path config/config_stanford2d3ds_dv.json \
--experiments_path ./experiments \
--sam_weights_path ./sam_weights \
--dataset stanford2d3ds \
--fold 1 \
--batch_size 1 \
--epochs 50 \
--vit_model vit_h \
--modalities image,depth,normals \
--num_gpus 1 \
--workers_per_gpu 2
Key Parameters:
- --dataset: Choose from stanford2d3ds, matterport3d, or structured3d
- --vit_model: SAM encoder variant (vit_h, vit_l, or vit_b)
- --modalities: Comma-separated modalities (image, depth, normals)
- --fold: Dataset fold number for cross-validation
- --resume: Continue training from last or best checkpoint
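To make the accepted values concrete, here is a minimal argparse sketch covering the flags listed above. It is not the repository's actual parser: the defaults, the `required` setting, and the helper names `parse_modalities` and `build_parser` are assumptions made for illustration. Only the flag names and allowed values come from the list.

```python
import argparse

MODALITIES = ("image", "depth", "normals")

def parse_modalities(value):
    """Split a comma-separated string and reject unknown modality names."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in MODALITIES]
    if not names or unknown:
        raise argparse.ArgumentTypeError(f"invalid modalities: {value!r}")
    return names

def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative subset of the training flags")
    parser.add_argument("--dataset", required=True,
                        choices=["stanford2d3ds", "matterport3d", "structured3d"])
    parser.add_argument("--vit_model", choices=["vit_h", "vit_l", "vit_b"], default="vit_h")
    parser.add_argument("--modalities", type=parse_modalities, default=["image"])
    parser.add_argument("--fold", type=int, default=1)  # default is an assumption
    parser.add_argument("--resume", choices=["last", "best"], default=None)
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--dataset", "stanford2d3ds", "--modalities", "image,depth,normals"]
    )
    print(args.modalities)  # ['image', 'depth', 'normals']
```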
Evaluate a local training run (full checkpoint from ./experiments):
python panosamic/evaluation/evaluate.py \
--dataset_path /path/to/processed/dataset \
--config_path config/config_stanford2d3ds_dv.json \
--experiments_path ./experiments \
--dataset stanford2d3ds \
--fold 1 \
--vit_model vit_h \
--modalities image,depth,normals \
--num_gpus 1 \
--workers_per_gpu 2
Reproduce paper results directly from the Hub (no local training run needed). The frozen SAM backbone is fetched automatically if --sam_weights_path is omitted:
python panosamic/evaluation/evaluate.py \
--dataset_path /path/to/processed/dataset \
--config_path config/config_stanford2d3ds_dv.json \
--checkpoint dfki-av/PanoSAMic \
--subfolder stanford2d3ds-vith-rgbdn-fold1 \
--sam_weights_path ./sam_weights \
--dataset stanford2d3ds \
--fold 1 \
--vit_model vit_h \
--modalities image,depth,normals \
--num_gpus 1
--checkpoint also accepts a local path to a model.safetensors file or a directory containing one (e.g. exported via scripts/export_checkpoint_for_hub.py).
See MODEL_CARD.md for the full checkpoint table and the
numbers each checkpoint reproduces.
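The three forms that --checkpoint accepts (a Hub repository ID, a .safetensors file, or a directory containing model.safetensors) can be told apart with a few path checks. The sketch below is a hypothetical helper, `classify_checkpoint`, written only to illustrate that distinction. The order of the checks (local paths before Hub IDs) is an assumption, and the evaluation script may resolve its argument differently.

```python
from pathlib import Path

def classify_checkpoint(checkpoint):
    """Return ("file", path), ("dir", path) or ("hub", repo_id).

    Hypothetical helper: local paths are checked before falling back to a Hub ID.
    """
    path = Path(checkpoint)
    if path.is_file() and path.name.endswith(".safetensors"):
        return "file", path
    if path.is_dir() and (path / "model.safetensors").is_file():
        return "dir", path / "model.safetensors"
    # Anything else is treated as a Hub repository ID such as "dfki-av/PanoSAMic".
    return "hub", checkpoint
```

With this helper, `classify_checkpoint("dfki-av/PanoSAMic")` yields `("hub", "dfki-av/PanoSAMic")` as long as no local path of that name exists.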
Configuration files in the config/ directory control model architecture and training parameters. Available configs:
- config_stanford2d3ds_dv.json - Stanford2D3DS dual-view configuration
- config_stanford2d3ds_sv.json - Stanford2D3DS single-view configuration
- config_matterport3d_dv.json - Matterport3D dual-view configuration
- config_baseline.json - Baseline configuration
For comparison with SAM3 baselines, install the optional SAM3 dependency:
uv sync --extra sam3
Run SAM3 evaluation scripts:
# Stanford2D3DS evaluation
DATASET_PATH=/path/to/processed/dataset ./scripts/run_sam3_eval_stanford2d3ds.sh
# Matterport3D evaluation
DATASET_PATH=/path/to/processed/dataset ./scripts/run_sam3_eval_matterport3d.sh
The SAM3 model (facebook/sam3) is loaded via HuggingFace Transformers and downloaded automatically to your cache on first run.
# Full CPU test suite (no GPU required)
uv run pytest tests/
# Skip CUDA tests explicitly (e.g. when GPU is in use)
uv run pytest tests/ --ignore=tests/model/smoke/test_cuda.py \
--ignore=tests/sam3/smoke/test_cuda.py \
--ignore=tests/sam3/outputs/test_cuda.py
# Hub integration tests (downloads ~750 MB–1.5 GB from dfki-av/PanoSAMic)
PANOSAMIC_HUB_TESTS=1 uv run pytest tests/model/test_hub.py -v
Hub tests are skipped by default to avoid network I/O in regular runs. Set PANOSAMIC_HUB_TESTS=1 to verify that released checkpoints still load correctly and contain no SAM backbone weights.
The checkpoint size reflects the trainable weights only (no SAM backbone):
~367 M parameters, ~740 MB in bfloat16 or ~1.5 GB in float32.
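The quoted sizes follow from the parameter count and the bytes per parameter (2 for bfloat16, 4 for float32). A small worked check, using the rounded count of 367 million parameters and ignoring file metadata:

```python
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}

def checkpoint_size_mb(num_params, dtype):
    """Approximate size in megabytes (10^6 bytes), ignoring file metadata."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

if __name__ == "__main__":
    params = 367_000_000  # rounded trainable parameter count from the text
    for dtype in BYTES_PER_PARAM:
        print(dtype, round(checkpoint_size_mb(params, dtype)), "MB")
    # bfloat16 734 MB
    # float32 1468 MB
```

These values are consistent with the approximate figures of ~740 MB and ~1.5 GB given above.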
uv run ruff check --fix # lint with auto-fix
uv run ruff format # format
uv run ty check # type check
Pre-commit runs all three automatically on every commit.
Download the datasets from their respective sources:
- Stanford-2D-3D-S: https://github.com/alexsax/2D-3D-Semantics
- Matterport-3D (pre-processed 360FV-Matterport): https://github.com/InSAI-Lab/360BEV
- Structured-3D: https://github.com/bertjiazheng/Structured3D
After downloading, use the scripts in panosamic/data_preparation/ to convert each dataset into the processed folder structure. The original and processed layouts are listed below for each dataset.
Stanford-2D-3D-S

Original folder structure:

    area_1/
        pano/
            depth/
                [sample_name].png
            normal/
                [sample_name].png
            rgb/
                [sample_name].png
            semantic/
                [sample_name].png
    area_2/
    area_3/
    area_4/
    area_5a/
    area_5b/
    area_6/
    assets/

Processed folder structure:

    area_1/
        [sample_name]/
            depth.png
            depth_mask.webp
            instances.webp
            normals.webp
            rgb.webp
    area_2/
    area_3/
    area_4/
    area_5a/
    area_5b/
    area_6/
    assets/
    [cache_files]

Matterport-3D

Original folder structure:

    [scene_name]/
        depth/
            [sample_name].png
        rgb/
            [sample_name].jpg
        semantic/
            [sample_name].png
    ...
    [scene_name]/

Processed folder structure:

    [scene_name]/
        [sample_name]/
            depth.png
            depth_mask.webp
            rgb.webp
            semantics.png
    ...
    [scene_name]/
    [cache_files]

Structured-3D

Original folder structure:

    [scene_name]/
        2D_rendering/
            [sample_name]/
                panorama/
                    full/
                        albedo.png
                        depth.png
                        instance.png
                        normal.png
                        rgb_coldlight.png
                        rgb_rawlight.png
                        rgb_warmlight.png
                        semantic.png
    ...
    [scene_name]/
    assets/

Processed folder structure:

    [scene_name]/
        [sample_name]/
            depth_mask.webp
            depth.png
            normals.webp
            rgb.webp
            semantics.png
    ...
    [scene_name]/
    assets/
    [cache_files]

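
Before training, it can be useful to confirm that a processed dataset matches the layouts above. The following sketch walks a processed root and reports samples with missing files. `find_incomplete_samples` is a hypothetical helper, not one of the repository's scripts. It assumes the three layouts correspond, in order, to Stanford-2D-3D-S, Matterport-3D, and Structured-3D, that samples sit two levels below the root, and that cache files are plain files rather than directories.

```python
from pathlib import Path

# Expected per-sample files, copied from the processed layouts listed above.
EXPECTED_FILES = {
    "stanford2d3ds": {"depth.png", "depth_mask.webp", "instances.webp",
                      "normals.webp", "rgb.webp"},
    "matterport3d": {"depth.png", "depth_mask.webp", "rgb.webp", "semantics.png"},
    "structured3d": {"depth.png", "depth_mask.webp", "normals.webp",
                     "rgb.webp", "semantics.png"},
}

def find_incomplete_samples(root, dataset):
    """Return {relative_sample_dir: sorted_missing_files} for incomplete samples.

    Skips the top-level assets/ directory and any top-level plain files.
    """
    root = Path(root)
    expected = EXPECTED_FILES[dataset]
    problems = {}
    for group in sorted(root.iterdir()):
        if not group.is_dir() or group.name == "assets":
            continue
        for sample in sorted(group.iterdir()):
            if not sample.is_dir():
                continue
            missing = expected - {entry.name for entry in sample.iterdir()}
            if missing:
                problems[str(sample.relative_to(root))] = sorted(missing)
    return problems
```

An empty dictionary means every sample directory contains the expected files. Extra files are ignored.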
@article{chamseddine2026panosamic,
title = {PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion},
author = {Chamseddine, Mahdi and Stricker, Didier and Rambach, Jason},
journal = {arXiv preprint arXiv:2601.07447},
year = {2026}
}
This research was funded by the European Union as part of the projects: HumanTech (Grant Agreement 101058236) and ShieldBOT (Grant Agreement 101235093).
This project modifies parts of the Segment Anything Model (SAM).
- Original SAM Code: Licensed under Apache 2.0 by Meta AI.
- Modified and Additional Components: The modified encoder code in this repository is licensed under CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike).
This code is designed to use the official pretrained SAM weights from Meta AI. The weights remain under their original Apache 2.0 license.