Clarification on differences between released VITRA-1M hand annotations and current hand reconstruction pipeline

Hi,

First of all, thank you for releasing VITRA, VITRA-1M, the pretrained model, and the data/tooling. The project is very useful for studying how in-the-wild egocentric human activity videos can be transformed into robot-aligned VLA `(image, instruction, action)` data. I also noticed from `data/data.md` that the released metadata contains reconstructed camera parameters and 3D hand information, and that manual inspection puts the current metadata quality at around 90%.

I am opening this issue to ask for clarification about the relationship between the released VITRA-1M hand annotations and the current hand reconstruction pipeline in the repo.

While visualizing some VITRA-1M training samples, I observed that the hand mesh rendered from the released annotation for the first frame sometimes differs noticeably from the visible hand in the image. However, when I rerun the current hand reconstruction pipeline on the same raw frame (the repo’s inference/visualization path based on MoGe + HaWoR + MANO), the reconstructed hand mesh aligns better with the image.

To make sure this is not simply a frame-index mismatch, I checked the following for several samples:

- `episode["video_decode_frame"][frame_id]` matches the raw video frame index used to extract the image.
- The extracted `.jpg` matches the decoded raw video frame, apart from small JPEG re-encoding differences.
- The annotation-derived demo state, recomputed from the same episode `frame_id`, matches the saved state with a maximum absolute error of about `1e-8`.

So it looks like the image frame and annotation frame are correctly paired, but the released annotation state and the current hand reconstruction result can still differ.
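
The checks above can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the repo's code: all helper names are hypothetical, and it assumes that the trailing `_f<digits>` in a sample name is the zero-padded raw video frame index (consistent with the samples listed below, but not confirmed by the repo). The toy episode at the end is made-up data, used only to exercise the helpers.

```python
import re
from typing import Mapping, Sequence


def raw_frame_from_sample_name(name: str) -> int:
    """Parse the trailing `_f<digits>` suffix (ASSUMED to be the raw frame index)."""
    match = re.search(r"_f(\d+)$", name)
    if match is None:
        raise ValueError(f"no frame suffix found in {name!r}")
    return int(match.group(1))


def index_matches(episode: Mapping, frame_id: int, sample_name: str) -> bool:
    """Check that the episode's decode index agrees with the sample-name suffix."""
    decoded_index = int(episode["video_decode_frame"][frame_id])
    return decoded_index == raw_frame_from_sample_name(sample_name)


def mean_abs_pixel_diff(a: bytes, b: bytes) -> float:
    """Mean absolute difference between two equally sized raw pixel buffers."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("buffers must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def max_abs_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flat state vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("states must be non-empty and of equal length")
    return max(abs(x - y) for x, y in zip(a, b))


# Toy episode (made-up values): episode frame 14 maps to raw frame 2537.
toy_episode = {"video_decode_frame": list(range(2523, 2538))}
toy_name = "training100_23_egoexo4d_EgoExo4D_cmu_bike01_7_ep_000000_f002537"
print(index_matches(toy_episode, 14, toy_name))
```

A small mean pixel difference (rather than exact equality) is the appropriate test for the extracted `.jpg`, since JPEG re-encoding changes individual pixel values slightly; the acceptable threshold is a judgment call.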

Some examples I inspected:

| sample | episode frame | raw video frame | observation |
|---|---:|---:|---|
| `training100_23_egoexo4d_EgoExo4D_cmu_bike01_7_ep_000000_f002537` | 14 | 2537 | annotation/reconstruction hand translations differ noticeably |
| `training100_63_ego4d_cooking_and_cleaning_Ego4D_015bc651-e0fe-440e-a10d-68406c548c5a_ep_000000_f016868` | 10 | 16868 | large difference between annotation and fresh reconstruction |

<img width="1440" height="360" alt="Image" src="https://github.com/user-attachments/assets/936c10d6-7a81-42db-97d3-3eb98c7a49b5" />
<img width="1080" height="360" alt="Image" src="https://github.com/user-attachments/assets/f53df3c2-d792-44d2-a3be-2b08188d0da8" />
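
If numbers are more useful than screenshots, the discrepancy could be quantified along these lines. This is only a sketch under stated assumptions: both hand states are expressed in the same camera frame and the same units, both vertex lists use the same vertex ordering (as two MANO meshes would), and the function names are hypothetical rather than part of the repo.

```python
import math
from typing import Sequence

Vec3 = Sequence[float]


def translation_offset(t_annotation: Vec3, t_reconstruction: Vec3) -> float:
    """Euclidean distance between the two hand translations."""
    return math.dist(t_annotation, t_reconstruction)


def mean_vertex_distance(verts_a: Sequence[Vec3], verts_b: Sequence[Vec3]) -> float:
    """Mean per-vertex Euclidean distance between two meshes with matching vertex order."""
    if len(verts_a) != len(verts_b) or len(verts_a) == 0:
        raise ValueError("meshes must be non-empty and have the same vertex count")
    total = sum(math.dist(a, b) for a, b in zip(verts_a, verts_b))
    return total / len(verts_a)
```

The translation offset isolates a global shift of the hand, while the mean vertex distance also picks up differences in rotation and articulation.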

My questions are:

1. Should the released VITRA-1M annotations be expected to match the current repo hand reconstruction pipeline exactly, or were they generated with a different/offline version of the pipeline, checkpoints, tracking settings, preprocessing, or quality filters?

2. For visualizing training samples or running VITRA inference on training images, which hand state source do you recommend?
   - The released VITRA-1M annotation states?
   - Or a fresh state obtained by rerunning the current hand reconstruction pipeline on the image?

Thanks again for the release and for any guidance you can provide.
