Clarification on differences between released VITRA-1M hand annotations and current hand reconstruction pipeline #40

@Bobyue0118

Hi,

First of all, thank you for releasing VITRA, VITRA-1M, the pretrained model, and the data/tooling. The project is very useful for studying how in-the-wild egocentric human activity videos can be transformed into robot-aligned VLA (image, instruction, action) data. I also noticed from data/data.md that the released metadata contains reconstructed camera parameters and 3D hand information, and that the current metadata quality is expected to be around 90% by manual inspection.

I am opening this issue to ask for clarification about the relationship between the released VITRA-1M hand annotations and the current hand reconstruction pipeline in the repo.

While visualizing some VITRA-1M training samples, I observed that the first-frame hand mesh rendered from the released annotation sometimes differs noticeably from the visible hand in the image. However, when I rerun the current hand reconstruction pipeline on the same raw frame, using the repo’s inference/visualization path based on MoGe + HaWoR + MANO, the reconstructed hand mesh actually aligns better with the image.

To make sure this is not simply a frame-index mismatch, I checked the following for several samples:

  • episode["video_decode_frame"][frame_id] matches the raw video frame index used to extract the image.
  • The extracted .jpg matches the decoded raw video frame, apart from small JPEG re-encoding differences.
  • The annotation-derived demo state was recomputed from the same episode frame_id and matches the saved state to within a maximum numerical error of about 1e-8.

So it looks like the image frame and annotation frame are correctly paired, but the released annotation state and the current hand reconstruction result can still differ.
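
For reference, the pairing checks above can be sketched roughly as follows. This is a minimal illustration, not the repo's code: the helper names are hypothetical, the toy episode dictionary is made up, and it assumes the trailing `_f<digits>` suffix of a sample name encodes the raw video frame index (as the two examples below suggest).

```python
import re


def raw_frame_from_sample_name(sample_name):
    """Parse the trailing `_f<digits>` suffix of a sample name.

    Assumption: the suffix is the raw video frame index.
    """
    match = re.search(r"_f(\d+)$", sample_name)
    if match is None:
        raise ValueError(f"no frame suffix in {sample_name!r}")
    return int(match.group(1))


def frame_is_paired(episode, frame_id, raw_frame_index):
    """True if the episode's decode table maps frame_id to the raw frame index."""
    return int(episode["video_decode_frame"][frame_id]) == raw_frame_index


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two flat sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return max((abs(float(x) - float(y)) for x, y in zip(a, b)), default=0.0)


# Toy data only: a made-up decode table in which episode frame 14 is raw frame 2537.
toy_episode = {"video_decode_frame": list(range(2523, 2560))}
name = "training100_23_egoexo4d_EgoExo4D_cmu_bike01_7_ep_000000_f002537"

raw_index = raw_frame_from_sample_name(name)
paired = frame_is_paired(toy_episode, 14, raw_index)

# Recomputed vs. saved state, flattened to plain lists of floats (toy values).
state_error = max_abs_diff([0.1, 0.2, 0.3], [0.1, 0.2 + 1e-8, 0.3])
```

The same `max_abs_diff` helper could also be applied to flattened pixel values of the extracted .jpg and the decoded raw frame, with a looser tolerance to absorb JPEG re-encoding noise.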

Some examples I inspected:

| sample | episode frame | raw video frame | observation |
| --- | --- | --- | --- |
| training100_23_egoexo4d_EgoExo4D_cmu_bike01_7_ep_000000_f002537 | 14 | 2537 | annotation/reconstruction hand translations differ noticeably |
| training100_63_ego4d_cooking_and_cleaning_Ego4D_015bc651-e0fe-440e-a10d-68406c548c5a_ep_000000_f016868 | 10 | 16868 | large difference between annotation and fresh reconstruction |
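
One way to put a number on the differences listed above is to compare the hand translation and the mesh vertices from the two sources directly. The sketch below is only an illustration with hypothetical helper names and toy values. It assumes both results are expressed in the same camera frame and units, and that both meshes share the same vertex ordering (which holds when both come from the same MANO topology).

```python
import math


def translation_offset(annotation_t, reconstruction_t):
    """Euclidean distance between two 3D hand translations."""
    if len(annotation_t) != 3 or len(reconstruction_t) != 3:
        raise ValueError("translations must be 3D")
    return math.dist(annotation_t, reconstruction_t)


def mean_vertex_distance(verts_a, verts_b):
    """Mean Euclidean distance between corresponding mesh vertices."""
    if len(verts_a) != len(verts_b) or len(verts_a) == 0:
        raise ValueError("meshes must be non-empty and have the same vertex count")
    total = sum(math.dist(p, q) for p, q in zip(verts_a, verts_b))
    return total / len(verts_a)


# Toy values only, not taken from VITRA-1M.
offset = translation_offset([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
mesh_gap = mean_vertex_distance(
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]],
)
```

Reporting these two numbers per sample would make it easier to tell a global translation shift from a pose or shape difference.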

My questions are:

  1. Should the released VITRA-1M annotations be expected to match the current repo hand reconstruction pipeline exactly, or were they generated with a different/offline version of the pipeline, checkpoints, tracking settings, preprocessing, or quality filters?

  2. For visualizing training samples or using VITRA for inference on training images, which hand state source do you recommend?

    • Use the released VITRA-1M annotation states when visualizing training data?
    • Or rerun the current hand reconstruction pipeline on the image to obtain the current state?

Thanks again for the release and for any guidance you can provide.
