Greptile Summary

This PR introduces a new physicsnemo datapipe infrastructure for the AirFRANS example, adding modular Arrow and VTK readers, composable transform classes, Hydra configuration, and wrapper classes that mirror the original `AirFRANSDataSet` interface.

Key issues found:
Last reviewed commit: 1e899ac
```python
)
n = min(args.n_samples, len(sample_paths))
```

```python
reader = AirFRANSVTKReader(data_dir=data_dir, task=args.task, split=args.split)
```
**Wrong keyword argument to `AirFRANSVTKReader`**

`AirFRANSVTKReader.__init__` takes `dataset_path` as its first positional (or keyword) argument, not `data_dir`. This will raise `TypeError: __init__() got an unexpected keyword argument 'data_dir'` at runtime.
```suggestion
reader = AirFRANSVTKReader(dataset_path=data_dir, task=args.task, split=args.split)
```
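The failure mode is standard Python keyword-argument behavior, which a stand-in class (hypothetical, not the real reader) reproduces:

```python
# A stand-in class illustrating the failure mode: passing `data_dir` to an
# __init__ that only accepts `dataset_path` raises TypeError at call time.
class StandInReader:
    def __init__(self, dataset_path, task=None, split=None):
        self.dataset_path = dataset_path

try:
    StandInReader(data_dir="/data/airfrans")  # wrong keyword, as in the flagged line
    message = ""
except TypeError as exc:
    message = str(exc)

print(message)
```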
```python
def _collate_single_datapipe(
    samples: Sequence[tuple[TensorDict, dict[str, Any]]],
) -> TensorDict:
    """Collate for batch_size=1: return the AirFRANSSample (first element of tuple)."""
    data, _ = samples[0]
    return data
```
**`_collate_single_datapipe` returns `TensorDict` instead of `AirFRANSSample`**

This function is intended to be a drop-in collate for the datapipe-backed dataloader. The class docstring promises the same interface as `AirFRANSDataSet`, meaning consumers expect each item to be an `AirFRANSSample` with attributes like `.interior_mesh`, `.boundary_meshes`, etc.

The datapipe's final transform (`ToAirFRANSSampleStructure`) emits a structured `TensorDict`, not an `AirFRANSSample`. This function returns it directly, breaking the drop-in replacement contract.

Compare with the correct implementation in `physicsnemo_dataset.py` (lines 153–158), which calls `_structured_tensordict_to_airfrans_sample(data)` before returning. Apply the same conversion here:
```suggestion
def _collate_single_datapipe(
    samples: Sequence[tuple[TensorDict, dict[str, Any]]],
) -> AirFRANSSample:
    """Collate for batch_size=1: convert structured TensorDict to AirFRANSSample."""
    data, _ = samples[0]
    # Import the conversion helper from physicsnemo_dataset module
    from physicsnemo_dataset import _structured_tensordict_to_airfrans_sample
    return _structured_tensordict_to_airfrans_sample(data)
```
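Beyond the suggested fix, the contract at stake can be illustrated with a minimal stand-in (all names here are hypothetical substitutes for the real classes, which require physicsnemo):

```python
# Minimal sketch of the batch_size=1 collate contract: the collate must
# return the sample *object*, not the raw structured dict it was built from.
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass
class Sample:  # stand-in for AirFRANSSample
    interior_mesh: Any
    boundary_meshes: Any


def structured_to_sample(data: dict) -> Sample:
    # stand-in for _structured_tensordict_to_airfrans_sample
    return Sample(
        interior_mesh=data["interior_mesh"],
        boundary_meshes=data["boundary_meshes"],
    )


def collate_single(samples: Sequence[tuple[dict, dict]]) -> Sample:
    data, _ = samples[0]               # batch_size=1: take the only element
    return structured_to_sample(data)  # convert before returning


sample = collate_single([({"interior_mesh": "mesh0", "boundary_meshes": []}, {"meta": 1})])
assert sample.interior_mesh == "mesh0"  # consumers rely on attribute access
```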
```python
datapipe: PhysicsnemoDataset = hydra.utils.instantiate(cfg.dataset)
data, _ = datapipe[index]
return data
```
**`preprocess` returns raw `TensorDict` instead of `AirFRANSSample` when `sample_path` is `None`**

The return type annotation promises `AirFRANSSample`, but when `sample_path` is `None`, the method loads from the datapipe and returns `data` directly — a `TensorDict` from `ToAirFRANSSampleStructure`. Any caller using this code path will receive the wrong type and encounter `AttributeError` when trying to access `.interior_mesh` or call `.to(device)`.

The equivalent method in `physicsnemo_dataset.py` (line 247) correctly converts via `_structured_tensordict_to_airfrans_sample(data)`. Apply the same fix here:
```suggestion
datapipe: PhysicsnemoDataset = hydra.utils.instantiate(cfg.dataset)
data, _ = datapipe[index]
from physicsnemo_dataset import _structured_tensordict_to_airfrans_sample
return _structured_tensordict_to_airfrans_sample(data)
```
```yaml
dataset_path: /lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae/datasets/airfrans/huggingface/ # used by reader=arrow
# dataset_path: /lustre/fsw/portfolios/coreai/projects/coreai_modulus_cae/datasets/airfrans/Dataset/ # used by reader=vtk
```
**Hardcoded internal NVIDIA cluster path in public config**

The `dataset_path` default is set to an NVIDIA-internal Lustre path (`/lustre/fsw/portfolios/coreai/...`). This path will not resolve for anyone outside the internal cluster and should be replaced with a placeholder, requiring users to provide their own path via CLI override.
```suggestion
dataset_path: /path/to/airfrans/dataset # set via CLI: dataset_path=/your/path
```
PhysicsNeMo Pull Request
This pull request adds an additional datapipe to the GLOBE model's 2D AirFRANS training example. Actually, it's two datapipes! I'll summarize the what and why here.
Why bother doing this?
The existing datapipe in the example is not broken. It is a bespoke datapipe that reads the VTK files from the AirFRANS dataset, transforms them specifically for GLOBE, caches them as PyTorch tensors, and restores them at training time.
The new datapipe is built on physicsnemo's composable infrastructure concepts of readers, transforms, etc., so the new dataset supports both the VTK dataset AND the Hugging Face Arrow dataset with just a configuration change. All of the preprocessing is done via online transformations with physicsnemo mesh (yay!), and the output is equivalent to the original outputs. There is no caching here, since that isn't supported today in physicsnemo datapipes; I'm looking at adding that this release.
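The configuration-driven reader swap could look roughly like the following Hydra config fragment. This is an illustrative sketch only — the `_target_` paths, key names, and file layout are assumptions, not the exact config shipped in this PR:

```yaml
# conf/dataset/airfrans.yaml (illustrative sketch)
dataset:
  _target_: physicsnemo_dataset.PhysicsnemoDataset  # hypothetical target path
  reader:
    # swap between the two backends with a single override, e.g.
    #   python train.py dataset.reader._target_=readers.AirFRANSArrowReader
    _target_: readers.AirFRANSVTKReader             # or readers.AirFRANSArrowReader
    dataset_path: /path/to/airfrans/dataset
    task: full
    split: train
```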
Does it run as fast as caching?
Not quite. The new datapipe runs on the GPU, so it uses some GPU memory (you have to take a little away from the model) and some extra time (about 25% longer processing time on an A100).
Is it correct?
I wrote a validation script that runs the old and new datapipes and cross-checks their outputs. I included it here, but we can purge that script before the final merge. It's slightly outdated now, actually; I would have to update it.
So why do this?
The goal here is to demonstrate the extensibility of the datapipes. I think the new transforms are clear and easy to follow, and the overall datapipe should be straightforward to extend to new datasets. It's meant to live side by side with the existing dataset.
How to use it?
It's enabled by swapping in the `physicsnemo_dataset` imports in `train.py` instead of the `dataset` imports. There is also a benchmark for running standalone tests, but that isn't strictly necessary; we could drop it if you like.

Description
Checklist
Dependencies
Review Process
All PRs are reviewed by the PhysicsNeMo team before merging.
Depending on which files are changed, GitHub may automatically assign a maintainer for review.
We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.
AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.