Hi,
First of all, thank you for releasing VITRA, VITRA-1M, the pretrained model, and the data/tooling. The project is very useful for studying how in-the-wild egocentric human activity videos can be transformed into robot-aligned VLA `(image, instruction, action)` data. I also noticed from `data/data.md` that the released metadata contains reconstructed camera parameters and 3D hand information, and that the current metadata quality is expected to be around 90% by manual inspection.
I am opening this issue to ask for clarification about the relationship between the released VITRA-1M hand annotations and the current hand reconstruction pipeline in the repo.
While visualizing some VITRA-1M training samples, I observed that the first-frame hand mesh rendered from the released annotation sometimes differs noticeably from the visible hand in the image. However, when I rerun the current hand reconstruction pipeline on the same raw frame, using the repo’s inference/visualization path based on MoGe + HaWoR + MANO, the reconstructed hand mesh actually aligns better with the image.
To make sure this is not simply a frame-index mismatch, I checked the following for several samples:
- `episode["video_decode_frame"][frame_id]` matches the raw video frame index used to extract the image.
- The extracted `.jpg` matches the decoded raw video frame, apart from small JPEG re-encoding differences.
- The annotation-derived demo state was recomputed from the same episode `frame_id`, and matches the saved state with max numerical error around `1e-8`.
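For reference, here is a minimal sketch of the three checks above. It assumes the frames and states are already loaded as NumPy arrays. The helper names are hypothetical and not part of the VITRA repo; only the `episode["video_decode_frame"][frame_id]` lookup comes from the released metadata.

```python
import numpy as np


def check_frame_pairing(episode, frame_id, raw_frame_index):
    """Return True if the episode's decode index for frame_id equals the raw video frame index."""
    return int(episode["video_decode_frame"][frame_id]) == int(raw_frame_index)


def mean_abs_pixel_diff(extracted_jpg, decoded_frame):
    """Mean absolute per-pixel difference between two images of identical shape."""
    # Cast to float first so that uint8 subtraction cannot wrap around.
    a = np.asarray(extracted_jpg, dtype=np.float64)
    b = np.asarray(decoded_frame, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return float(np.abs(a - b).mean())


def max_abs_state_error(recomputed_state, saved_state):
    """Largest element-wise absolute difference between two state vectors."""
    a = np.asarray(recomputed_state, dtype=np.float64)
    b = np.asarray(saved_state, dtype=np.float64)
    return float(np.max(np.abs(a - b)))


if __name__ == "__main__":
    # Synthetic stand-ins, not real VITRA-1M values.
    toy_episode = {"video_decode_frame": [100, 101, 102]}
    print(check_frame_pairing(toy_episode, 1, 101))

    toy_image = np.zeros((4, 4, 3), dtype=np.uint8)
    print(mean_abs_pixel_diff(toy_image, toy_image + 2))

    print(max_abs_state_error([1.0, 2.0], [1.0, 2.0 + 1e-8]))
```

A small but nonzero pixel difference is expected from JPEG re-encoding, so the second check is a matter of judging magnitude rather than testing for equality.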
So it looks like the image frame and annotation frame are correctly paired, but the released annotation state and the current hand reconstruction result can still differ.
Some examples I inspected:
| sample | episode frame | raw video frame | observation |
| --- | --- | --- | --- |
| training100_23_egoexo4d_EgoExo4D_cmu_bike01_7_ep_000000_f002537 | 14 | 2537 | annotation/reconstruction hand translations differ noticeably |
| training100_63_ego4d_cooking_and_cleaning_Ego4D_015bc651-e0fe-440e-a10d-68406c548c5a_ep_000000_f016868 | 10 | 16868 | large difference between annotation and fresh reconstruction |
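To put numbers on "differ noticeably", one option is to compare the two hand states directly. The sketch below assumes both states are expressed in the same camera frame and in the same units, and that both meshes use the same vertex ordering (MANO meshes have 778 vertices). The function names are hypothetical and not from the repo.

```python
import numpy as np


def translation_gap(annotation_transl, reconstruction_transl):
    """Euclidean distance between two 3D hand translations."""
    a = np.asarray(annotation_transl, dtype=np.float64).reshape(3)
    b = np.asarray(reconstruction_transl, dtype=np.float64).reshape(3)
    return float(np.linalg.norm(a - b))


def mean_vertex_distance(annotation_verts, reconstruction_verts):
    """Mean per-vertex Euclidean distance between two (N, 3) meshes with matching vertex order."""
    a = np.asarray(annotation_verts, dtype=np.float64)
    b = np.asarray(reconstruction_verts, dtype=np.float64)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError(f"expected two (N, 3) arrays, got {a.shape} and {b.shape}")
    return float(np.linalg.norm(a - b, axis=1).mean())


if __name__ == "__main__":
    # Synthetic stand-ins, not real VITRA-1M values.
    print(translation_gap([0.0, 0.0, 0.0], [3.0, 4.0, 0.0]))

    toy_verts = np.zeros((778, 3))
    shifted_verts = toy_verts + np.array([0.0, 0.0, 0.1])
    print(mean_vertex_distance(toy_verts, shifted_verts))
```

I can report these values for the samples above if that would help narrow down the cause.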
My questions are:
- Should the released VITRA-1M annotations be expected to match the current repo hand reconstruction pipeline exactly, or were they generated with a different/offline version of the pipeline, checkpoints, tracking settings, preprocessing, or quality filters?
- For visualizing training samples or using VITRA for inference on training images, which hand state source do you recommend?
  - Use the released VITRA-1M annotation states when visualizing training data?
  - Or rerun the current hand reconstruction pipeline on the image to obtain the current state?
Thanks again for the release and for any guidance you can provide.