Airfrans datapipe not broken #1475

Open
coreyjadams wants to merge 9 commits into NVIDIA:main from coreyjadams:airfrans-datapipe-not-broken

Conversation

@coreyjadams (Collaborator) commented Mar 6, 2026

PhysicsNeMo Pull Request

This pull request adds an additional datapipe to the GLOBE model for the AirFRANS training example in 2D. Actually, it's two datapipes! I'll summarize the what and why here.

Why bother doing this?

The datapipe in the example is not broken or anything. It is a bespoke datapipe that reads the VTK files from the AirFRANS dataset, transforms them specifically for GLOBE, caches the results as PyTorch files, and restores them at training time.

The new datapipe is built on physicsnemo's composable infrastructure concepts of readers, transforms, etc. So the new dataset supports both the VTK dataset AND the Hugging Face Arrow dataset with just a configuration change. All of the preprocessing is done through online transformations with physicsnemo mesh (yay!) and the output is equivalent to the original outputs. There is no caching here, since that's not supported today in physicsnemo datapipes; I'm looking at adding it this release.
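The reader-plus-transforms composition described above can be sketched in a toy form like this (the class names `ListReader` and `ComposedDatapipe` and the toy transforms are illustrative stand-ins, not the actual physicsnemo API):

```python
# Toy sketch of the composable reader/transform idea described above.
# Names here are illustrative and are NOT the actual physicsnemo API.
from typing import Any, Callable, Sequence


class ListReader:
    """Stands in for a VTK or Arrow reader: yields raw samples by index."""

    def __init__(self, samples: Sequence[Any]):
        self._samples = list(samples)

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, idx: int) -> Any:
        return self._samples[idx]


class ComposedDatapipe:
    """Applies a chain of online transforms to each sample a reader yields."""

    def __init__(self, reader, transforms: Sequence[Callable[[Any], Any]]):
        self.reader = reader
        self.transforms = list(transforms)

    def __len__(self) -> int:
        return len(self.reader)

    def __getitem__(self, idx: int) -> Any:
        sample = self.reader[idx]
        for transform in self.transforms:
            sample = transform(sample)
        return sample


# Swapping the reader (VTK vs. Arrow) is a one-line configuration change;
# the transform chain stays identical.
pipe = ComposedDatapipe(
    ListReader([{"p": 1.0}, {"p": 3.0}]),
    transforms=[
        lambda s: {**s, "p2": s["p"] * 2},   # e.g. a normalization step
        lambda s: {**s, "keys": sorted(s)},  # e.g. a restructuring step
    ],
)
```

The point of the pattern is that the same transform chain runs unchanged regardless of which reader produced the sample.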

Does it run as fast as caching?

Not quite. The new datapipe runs on the GPU, so it uses a little memory (you have to take a little away from the model) and a little time (about 25% longer processing time on an A100).

Is it correct?

I wrote a validation script that runs both the old and new datapipes and cross-checks their outputs. I've included it here, but we can purge it before the final merge. It's slightly out of date now, actually; I'd have to update it.

So why do this?

The goal here is to demonstrate the extensibility of the datapipes. I think the new transforms are clear and easy to follow, and the overall datapipe should be straightforward to extend to new datasets. It's meant to live side by side with the existing dataset.

How to use it?

It's enabled by swapping the physicsnemo_dataset imports into train.py in place of the dataset imports. There is also a benchmark for running standalone tests, but that isn't strictly necessary; we could drop it if you like.
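As a rough sketch, the swap in train.py would look like the following (the module names come from the PR description, but the exact symbol imported is an assumption about train.py's contents):

```python
# Illustrative only: the module names (dataset, physicsnemo_dataset) follow
# the PR description, but the imported class name is an assumption.

# Original bespoke datapipe (with caching):
# from dataset import AirFRANSDataSet

# New composable datapipe, exposing the same interface:
# from physicsnemo_dataset import AirFRANSDataSet
```

Because the wrapper classes mirror the original AirFRANSDataSet interface, the rest of train.py should not need to change.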

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@coreyjadams coreyjadams marked this pull request as ready for review March 9, 2026 15:56
@greptile-apps (Contributor) bot commented Mar 9, 2026

Greptile Summary

This PR introduces a new physicsnemo datapipe infrastructure for the AirFRANS example, adding modular Arrow and VTK readers, composable transform classes, Hydra configuration, and wrapper classes that mirror the original AirFRANSDataSet interface.

Key issues found:

  • validate_new_datapipe.py (line 233): AirFRANSVTKReader is instantiated with data_dir=data_dir, but the constructor parameter is dataset_path. This raises TypeError at runtime.
  • dataset.py, AirFRANSDataSetDatapipe (lines 697–702 and 793–795): Two related bugs in the AirFRANSDataSetDatapipe class prevent it from working as a drop-in replacement: (1) _collate_single_datapipe returns the raw TensorDict from the datapipe instead of converting it to AirFRANSSample, and (2) preprocess() with sample_path=None also returns a raw TensorDict instead of AirFRANSSample. The parallel implementation in physicsnemo_dataset.py correctly handles both cases by calling a conversion function.
  • conf/config.yaml (lines 23–24): The default dataset_path is a hardcoded NVIDIA-internal Lustre path that will not resolve for external users and should be replaced with a placeholder.

Last reviewed commit: 1e899ac

)
n = min(args.n_samples, len(sample_paths))

reader = AirFRANSVTKReader(data_dir=data_dir, task=args.task, split=args.split)

Wrong keyword argument to AirFRANSVTKReader

AirFRANSVTKReader.__init__ takes dataset_path as its first positional (or keyword) argument, not data_dir. This will raise a TypeError: __init__() got an unexpected keyword argument 'data_dir' at runtime.

Suggested change
reader = AirFRANSVTKReader(data_dir=data_dir, task=args.task, split=args.split)
reader = AirFRANSVTKReader(dataset_path=data_dir, task=args.task, split=args.split)

Comment on lines +697 to +702
def _collate_single_datapipe(
samples: Sequence[tuple[TensorDict, dict[str, Any]]],
) -> TensorDict:
"""Collate for batch_size=1: return the AirFRANSSample (first element of tuple)."""
data, _ = samples[0]
return data

_collate_single_datapipe returns TensorDict instead of AirFRANSSample

This function is intended to be a drop-in collate for the datapipe-backed dataloader. The class docstring promises the same interface as AirFRANSDataSet, meaning consumers expect each item to be an AirFRANSSample with attributes like .interior_mesh, .boundary_meshes, etc.

The datapipe's final transform (ToAirFRANSSampleStructure) emits a structured TensorDict, not an AirFRANSSample. This function returns it directly, breaking the drop-in replacement contract.

Compare with the correct implementation in physicsnemo_dataset.py (lines 153–158), which calls _structured_tensordict_to_airfrans_sample(data) before returning. Apply the same conversion here:

Suggested change
def _collate_single_datapipe(
samples: Sequence[tuple[TensorDict, dict[str, Any]]],
) -> TensorDict:
"""Collate for batch_size=1: return the AirFRANSSample (first element of tuple)."""
data, _ = samples[0]
return data
def _collate_single_datapipe(
samples: Sequence[tuple[TensorDict, dict[str, Any]]],
) -> AirFRANSSample:
"""Collate for batch_size=1: convert structured TensorDict to AirFRANSSample."""
data, _ = samples[0]
# Import the conversion helper from physicsnemo_dataset module
from physicsnemo_dataset import _structured_tensordict_to_airfrans_sample
return _structured_tensordict_to_airfrans_sample(data)

Comment on lines +793 to +795
datapipe: PhysicsnemoDataset = hydra.utils.instantiate(cfg.dataset)
data, _ = datapipe[index]
return data

preprocess returns raw TensorDict instead of AirFRANSSample when sample_path is None

The return type annotation promises AirFRANSSample, but when sample_path is None, the method loads from the datapipe and returns data directly — a TensorDict from ToAirFRANSSampleStructure. Any caller using this code path will receive the wrong type and encounter AttributeError when trying to access .interior_mesh or call .to(device).

The equivalent method in physicsnemo_dataset.py (line 247) correctly converts via _structured_tensordict_to_airfrans_sample(data). Apply the same fix here:

Suggested change
datapipe: PhysicsnemoDataset = hydra.utils.instantiate(cfg.dataset)
data, _ = datapipe[index]
return data
datapipe: PhysicsnemoDataset = hydra.utils.instantiate(cfg.dataset)
data, _ = datapipe[index]
from physicsnemo_dataset import _structured_tensordict_to_airfrans_sample
return _structured_tensordict_to_airfrans_sample(data)

Comment on lines +23 to +24
dataset_path: /lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae/datasets/airfrans/huggingface/ # used by reader=arrow
# dataset_path: /lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae/datasets/airfrans/Dataset/ # used by reader=vtk

Hardcoded internal NVIDIA cluster path in public config

The dataset_path default is set to a NVIDIA-internal Lustre path (/lustre/fsw/portfolios/coreai/...). This path will not resolve for anyone outside the internal cluster and should be replaced with a placeholder, requiring users to provide their own path via CLI override.

Suggested change
dataset_path: /lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae/datasets/airfrans/huggingface/ # used by reader=arrow
# dataset_path: /lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae/datasets/airfrans/Dataset/ # used by reader=vtk
dataset_path: /path/to/airfrans/dataset # set via CLI: dataset_path=/your/path
