Airfrans datapipe not broken #1475

Open
coreyjadams wants to merge 9 commits into NVIDIA:main from coreyjadams:airfrans-datapipe-not-broken

Conversation

@coreyjadams (Collaborator) commented Mar 6, 2026

PhysicsNeMo Pull Request

This pull request adds an additional datapipe to the GLOBE model for the AirFRANS training example in 2D. Actually, it's two datapipes! I'll summarize the what and why here.

Why bother doing this?

The datapipe in the example is not broken or anything. It is a bespoke datapipe that reads the VTK files from the AirFRANS dataset, transforms them specifically for GLOBE, caches the results as PyTorch files, and restores them at training time.

The new datapipe is built on physicsnemo's composable infrastructure concepts of readers, transforms, etc. So the new dataset supports both the VTK dataset AND the Hugging Face Arrow dataset with just a configuration change. All of the preprocessing is done through online transformations with physicsnemo mesh (yay!) and the output is equivalent to the original outputs. There is no caching here, since that's not supported today in physicsnemo datapipes; I'm looking at adding it this release.
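The reader-plus-transforms composition described above can be sketched in a toy form like this (the class names `ListReader` and `ComposedDatapipe` and the toy transforms are illustrative stand-ins, not the actual physicsnemo API):

```python
# Toy sketch of the composable reader/transform idea described above.
# Names here are illustrative and are NOT the actual physicsnemo API.
from typing import Any, Callable, Sequence


class ListReader:
    """Stands in for a VTK or Arrow reader: yields raw samples by index."""

    def __init__(self, samples: Sequence[Any]):
        self._samples = list(samples)

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, idx: int) -> Any:
        return self._samples[idx]


class ComposedDatapipe:
    """Applies a chain of online transforms to each sample a reader yields."""

    def __init__(self, reader, transforms: Sequence[Callable[[Any], Any]]):
        self.reader = reader
        self.transforms = list(transforms)

    def __len__(self) -> int:
        return len(self.reader)

    def __getitem__(self, idx: int) -> Any:
        sample = self.reader[idx]
        for transform in self.transforms:
            sample = transform(sample)
        return sample


# Swapping the reader (VTK vs. Arrow) is a one-line configuration change;
# the transform chain stays identical.
pipe = ComposedDatapipe(
    ListReader([{"p": 1.0}, {"p": 3.0}]),
    transforms=[
        lambda s: {**s, "p2": s["p"] * 2},   # e.g. a normalization step
        lambda s: {**s, "keys": sorted(s)},  # e.g. a restructuring step
    ],
)
```

The point of the pattern is that the same transform chain runs unchanged regardless of which reader produced the sample.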

Does it run as fast as caching?

Not quite. The new datapipe runs on the GPU, so it uses a little memory (you have to take a little away from the model) and a little time (about 25% longer processing time on an A100).

Is it correct?

I wrote a validation script that runs both the old and new datapipes and cross-checks their outputs. I've included it here, but we can purge it before the final merge. It's slightly out of date now, actually; I'd have to update it.

So why do this?

The goal here is to demonstrate the extensibility of the datapipes. I think the new transforms are clear and easy to follow, and the overall datapipe should be straightforward to extend to new datasets. It's meant to live side by side with the existing dataset.

How to use it?

It's enabled by swapping the physicsnemo_dataset imports into train.py in place of the dataset imports. There is also a benchmark for running standalone tests, but that isn't strictly necessary; we could drop it if you like.
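As a rough sketch, the swap in train.py would look like the following (the module names come from the PR description, but the exact symbol imported is an assumption about train.py's contents):

```python
# Illustrative only: the module names (dataset, physicsnemo_dataset) follow
# the PR description, but the imported class name is an assumption.

# Original bespoke datapipe (with caching):
# from dataset import AirFRANSDataSet

# New composable datapipe, exposing the same interface:
# from physicsnemo_dataset import AirFRANSDataSet
```

Because the wrapper classes mirror the original AirFRANSDataSet interface, the rest of train.py should not need to change.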

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@coreyjadams coreyjadams marked this pull request as ready for review March 9, 2026 15:56
@greptile-apps (Contributor) bot commented Mar 9, 2026

Greptile Summary

This PR introduces a new physicsnemo datapipe infrastructure for the AirFRANS example, adding modular Arrow and VTK readers, composable transform classes, Hydra configuration, and wrapper classes that mirror the original AirFRANSDataSet interface.

Key issues found:

  • validate_new_datapipe.py (line 233): AirFRANSVTKReader is instantiated with data_dir=data_dir, but the constructor parameter is dataset_path. This raises TypeError at runtime.
  • dataset.py, AirFRANSDataSetDatapipe (lines 697–702 and 793–795): Two related bugs in the AirFRANSDataSetDatapipe class prevent it from working as a drop-in replacement: (1) _collate_single_datapipe returns the raw TensorDict from the datapipe instead of converting it to AirFRANSSample, and (2) preprocess() with sample_path=None also returns a raw TensorDict instead of AirFRANSSample. The parallel implementation in physicsnemo_dataset.py correctly handles both cases by calling a conversion function.
  • conf/config.yaml (lines 23–24): The default dataset_path is a hardcoded NVIDIA-internal Lustre path that will not resolve for external users and should be replaced with a placeholder.

Last reviewed commit: 1e899ac

)
n = min(args.n_samples, len(sample_paths))

reader = AirFRANSVTKReader(data_dir=data_dir, task=args.task, split=args.split)

Wrong keyword argument to AirFRANSVTKReader

AirFRANSVTKReader.__init__ takes dataset_path as its first positional (or keyword) argument, not data_dir. This will raise a TypeError: __init__() got an unexpected keyword argument 'data_dir' at runtime.

Suggested change
reader = AirFRANSVTKReader(data_dir=data_dir, task=args.task, split=args.split)
reader = AirFRANSVTKReader(dataset_path=data_dir, task=args.task, split=args.split)

Comment on lines +697 to +702
def _collate_single_datapipe(
samples: Sequence[tuple[TensorDict, dict[str, Any]]],
) -> TensorDict:
"""Collate for batch_size=1: return the AirFRANSSample (first element of tuple)."""
data, _ = samples[0]
return data

_collate_single_datapipe returns TensorDict instead of AirFRANSSample

This function is intended to be a drop-in collate for the datapipe-backed dataloader. The class docstring promises the same interface as AirFRANSDataSet, meaning consumers expect each item to be an AirFRANSSample with attributes like .interior_mesh, .boundary_meshes, etc.

The datapipe's final transform (ToAirFRANSSampleStructure) emits a structured TensorDict, not an AirFRANSSample. This function returns it directly, breaking the drop-in replacement contract.

Compare with the correct implementation in physicsnemo_dataset.py (lines 153–158), which calls _structured_tensordict_to_airfrans_sample(data) before returning. Apply the same conversion here:

Suggested change
def _collate_single_datapipe(
samples: Sequence[tuple[TensorDict, dict[str, Any]]],
) -> TensorDict:
"""Collate for batch_size=1: return the AirFRANSSample (first element of tuple)."""
data, _ = samples[0]
return data
def _collate_single_datapipe(
samples: Sequence[tuple[TensorDict, dict[str, Any]]],
) -> AirFRANSSample:
"""Collate for batch_size=1: convert structured TensorDict to AirFRANSSample."""
data, _ = samples[0]
# Import the conversion helper from physicsnemo_dataset module
from physicsnemo_dataset import _structured_tensordict_to_airfrans_sample
return _structured_tensordict_to_airfrans_sample(data)

Comment on lines +793 to +795
datapipe: PhysicsnemoDataset = hydra.utils.instantiate(cfg.dataset)
data, _ = datapipe[index]
return data

preprocess returns raw TensorDict instead of AirFRANSSample when sample_path is None

The return type annotation promises AirFRANSSample, but when sample_path is None, the method loads from the datapipe and returns data directly — a TensorDict from ToAirFRANSSampleStructure. Any caller using this code path will receive the wrong type and encounter AttributeError when trying to access .interior_mesh or call .to(device).

The equivalent method in physicsnemo_dataset.py (line 247) correctly converts via _structured_tensordict_to_airfrans_sample(data). Apply the same fix here:

Suggested change
datapipe: PhysicsnemoDataset = hydra.utils.instantiate(cfg.dataset)
data, _ = datapipe[index]
return data
datapipe: PhysicsnemoDataset = hydra.utils.instantiate(cfg.dataset)
data, _ = datapipe[index]
from physicsnemo_dataset import _structured_tensordict_to_airfrans_sample
return _structured_tensordict_to_airfrans_sample(data)

Comment on lines +23 to +24
dataset_path: /lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae/datasets/airfrans/huggingface/ # used by reader=arrow
# dataset_path: /lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae/datasets/airfrans/Dataset/ # used by reader=vtk

Hardcoded internal NVIDIA cluster path in public config

The dataset_path default is set to a NVIDIA-internal Lustre path (/lustre/fsw/portfolios/coreai/...). This path will not resolve for anyone outside the internal cluster and should be replaced with a placeholder, requiring users to provide their own path via CLI override.

Suggested change
dataset_path: /lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae/datasets/airfrans/huggingface/ # used by reader=arrow
# dataset_path: /lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae/datasets/airfrans/Dataset/ # used by reader=vtk
dataset_path: /path/to/airfrans/dataset # set via CLI: dataset_path=/your/path
