In Reconstruction/gsfixer/cogvideo/inference.py, lines 105–118, control_pixel_values is normalized to [-1, 1] via (frames - 0.5) * 2. However, first_image and last_image are then converted with (image * 255).astype(np.uint8) while still in the [-1, 1] range, without first mapping back to [0, 1]. Values in [-1, 0) are negative before the uint8 cast and cannot be represented, and values in (0, 1] are scaled as if the range were [0, 1], so the VGGT and DINO inputs appear to receive incorrect pixel values.
Here is the relevant code snippet:
control_pixel_values = (frames - 0.5) * 2
control_pixel_values = control_pixel_values.permute(1, 0, 2, 3).unsqueeze(0)
ref_first_last_pixel_values = torch.cat([control_pixel_values[:, :, 0, :, :].unsqueeze(2), control_pixel_values[:, :, -1, :, :].unsqueeze(2)], dim=2)
ref_first_last_image_path = [] # for vggt
first_image = control_pixel_values[:, :, 0, :, :].squeeze(0).cpu().clone().permute(1, 2, 0).numpy()
first_image = (first_image * 255).astype(np.uint8)
first_image = Image.fromarray(first_image)
ref_first_last_image_path.append(first_image)
last_image = control_pixel_values[:, :, -1, :, :].squeeze(0).cpu().clone().permute(1, 2, 0).numpy()
last_image = (last_image * 255).astype(np.uint8)
last_image = Image.fromarray(last_image)
ref_first_last_image_path.append(last_image)
vggt_images = load_and_preprocess_images_(ref_first_last_image_path).to(self.opts.device)
dino_latents = self.image_encoder(vggt_images).last_hidden_state[:, 5:, :].to(self.opts.weight_dtype)
output_list, patch_start_idx = self.vggt.aggregator.forward(vggt_images.unsqueeze(0))
vggt_latents = output_list[-1][:, :, patch_start_idx:, :].squeeze(0).to(self.opts.weight_dtype)
The correct inverse transform should be ((image + 1) / 2 * 255) or equivalent. Could you please take a look and help clarify whether this is indeed a bug? Thanks!
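For reference, here is a minimal sketch of the conversion I would expect, assuming the frames are in [-1, 1] with H x W x C layout at that point. The helper name to_uint8_image is hypothetical and not part of the repository, and the clipping and rounding are my own additions rather than something the current code does:

```python
import numpy as np


def to_uint8_image(image: np.ndarray) -> np.ndarray:
    """Map an H x W x C float array from [-1, 1] to uint8 in [0, 255].

    Hypothetical helper for illustration; not part of the repository.
    """
    # Undo the (frames - 0.5) * 2 normalization: [-1, 1] -> [0, 1].
    image = (image + 1.0) / 2.0
    # Guard against small numerical overshoot before the integer cast.
    image = np.clip(image, 0.0, 1.0)
    # Scale to [0, 255] and round to the nearest integer.
    return np.rint(image * 255.0).astype(np.uint8)


# Tiny example: one black pixel (-1) and one white pixel (+1).
normalized = np.array(
    [[[-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]]], dtype=np.float32
)
restored = to_uint8_image(normalized)
```

With this, first_image and last_image could be passed through to_uint8_image before Image.fromarray, in place of the current (image * 255).astype(np.uint8) lines.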