Issue replicating Vlaser-8B on EmbodiedBench EB-ALFRED

Dear Authors,

I hope you are doing well.

I am currently trying to reproduce the reported results for Vlaser-8B. So far, my EB-Habitat result is relatively close to the reported number, but my EB-ALFRED performance is much lower than expected.

For reference, the reported and reproduced results are:

**Reported**

* EB-ALFRED average success rate: **0.50**
* EB-Habitat success rate: **0.40**

**Reproduced (mine)**

* EB-ALFRED average success rate: **0.10**
* EB-Habitat success rate: **0.42**

My setup is as follows:

* Cluster: SLURM + Singularity (H100)
* Model: `OpenGVLab/Vlaser-8B`
* Code: official repository, with only launcher/runtime adaptations for SLURM/Singularity (mainly display, environment, and path handling)

For EB-ALFRED, the detailed `task_success` results are:

* base: **0.16**
* common_sense: **0.14**
* complex_instruction: **0.12**
* visual_appearance: **0.10**
* spatial: **0.08**
* long_horizon: **0.00**
* mean over the 6 subsets: **0.10**
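
To make the aggregation explicit, here is a minimal sketch of how I compute the average above. It assumes an unweighted mean over the six subsets; if the published number weights subsets differently, the comparison would change. The variable and function names are only illustrative and are not taken from the official code.

```python
# Per-subset EB-ALFRED task_success values from my runs (copied from the list above).
task_success = {
    "base": 0.16,
    "common_sense": 0.14,
    "complex_instruction": 0.12,
    "visual_appearance": 0.10,
    "spatial": 0.08,
    "long_horizon": 0.00,
}

# Reported EB-ALFRED average success rate that I am comparing against.
REPORTED_AVG = 0.50


def unweighted_mean(scores):
    """Plain average over subsets (assumption: every subset counts equally)."""
    return sum(scores.values()) / len(scores)


mean_success = unweighted_mean(task_success)
gap = REPORTED_AVG - mean_success

print(f"mean task_success over {len(task_success)} subsets: {mean_success:.2f}")
print(f"gap to reported average: {gap:.2f}")
```

With these numbers, the reproduced mean is 0.10, which is 0.40 below the reported average.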

Other aggregate statistics (mean over 6 subsets) are:

* task_progress: **0.1678**
* num_invalid_actions: **8.68**
* planner_output_error: **0.46**

Could you share the evaluation setup used for the published Vlaser-8B EB-ALFRED result?

Thank you!

