
Conversation

@valerie-cal

Summary

Implemented the function from_webdataset() in dataset.py, which converts a WebDataset .tar file into a Rivulet Dataset instance; created a series of unit tests for validation.

Rationale

from_webdataset() will allow Rivulet datasets to hold WebDataset data alongside other data formats such as JSON and CSV.

Changes

  • Implemented from_webdataset() in deltacat/storage/rivulet/dataset.py, which reads a given WebDataset and creates a Rivulet dataset from its JSON records and image metadata (see the usage sketch below)
  • Added batch reading so that the WebDataset is read in user-provided batch sizes
  • Handles WebDatasets containing assorted and nested datatypes
  • Uses merge keys to extract image data and store it as binary, easing integration with HuggingFace datasets and models for classification
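
A minimal usage sketch, based only on the call signature exercised in the tests below; the dataset name, tar path, and metadata path are illustrative placeholders:

from deltacat.storage.rivulet import Dataset

dataset = Dataset.from_webdataset(
    name="example_webdataset",
    file_uri="path/to/example.tar",   # WebDataset .tar archive to import
    metadata_uri="path/to/metadata",  # directory where Rivulet metadata is written
    merge_keys="filename",            # JSON field naming each sample's media file
)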

Testing

The test suite deltacat/tests/storage/rivulet/schema/test_wds.py contains unit tests covering the following functionality of from_webdataset(). All test cases pass, and from_webdataset() does not break existing Rivulet or deltacat functions.

  1. Test that Schema correctly stores Field objects with their types
  2. Test that from_webdataset correctly identifies all fields in the tar file
  3. Test that data values are correctly extracted from the tar file
  4. Test that merge keys are correctly identified and set in the schema
  5. Test that specifying a non-existent field as merge key raises an error
  6. Test that field datatypes are correctly inferred from the data
  7. Test that metadata directory is properly initialized
  8. Test that fields in the dataset are proper Field objects
  9. Test that from_webdataset correctly identifies all fields in the tar file if the JSON files are inconsistent
  10. Test that image_binary is an added column after the Dataset is created from the WebDataset

valerie-cal and others added 30 commits March 10, 2025 15:15
Member

@pdames pdames left a comment


Hmm... it seems like this PR includes a lot of superfluous/untouched files in the diff that are creating unnecessary conflicts. Maybe the changes can be squashed and rebased on top of the latest from https://github.com/ray-project/deltacat/tree/2.0 then resubmitted to clean this up?

from transformers import AutoImageProcessor, AutoModelForImageClassification


#tar_path = "deltacat/tests/test_utils/resources/imagenet1k-train-0000.tar"
Collaborator


Remove any unused code

Comment on lines 83 to 90
# Create a list of dictionaries combining filename and predicted species
rows_to_write = [
    {
        "filename": fname,
        "bird_species": bird_labels[idx]
    }
    for idx, fname in enumerate(filenames)
]
Collaborator


This is a weird way to do this; zip()-ing the two would be cleaner.
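
For reference, the zip()-based version being suggested would look roughly like this, assuming filenames and bird_labels are parallel lists as in the snippet above:

# Pair each filename directly with its predicted species, no index bookkeeping.
rows_to_write = [
    {"filename": fname, "bird_species": species}
    for fname, species in zip(filenames, bird_labels)
]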

Comment on lines 529 to 588
with tarfile.open(file_uri, "r") as tar:
    tar_members = tar.getmembers()
    current_batch = None
    reading_frame_size = batch_size  # TODO: Use batch size 1 for now.
    total_batches = math.ceil(len(tar_members) / reading_frame_size)

    for i in range(total_batches):
        reading_frame_start = i * reading_frame_size
        reading_frame_end = reading_frame_start + reading_frame_size
        for member in tar_members[reading_frame_start:reading_frame_end]:
            # Ignore hidden files if the imported tar isn't cleaned.
            if member.name.startswith("._"):
                continue
            if member.isfile() and member.name.endswith(".json"):
                f = tar.extractfile(member)
                if f:
                    try:
                        merge_key = merge_keys

                        pyarrow_table = pyarrow.json.read_json(f)
                        image_filename = pyarrow_table[merge_key][0].as_py()

                        # truncated_filename = normalize_filename(image_filename[image_filename.index('/') + 1:])
                        truncated_filename = normalize_filename(os.path.basename(image_filename))
                        if truncated_filename in [normalize_filename(t.name) for t in tar_members]:
                            image_member = next((t for t in tar_members if t.name == truncated_filename), None)
                            if image_member:
                                fi = tar.extractfile(image_member)
                                if fi:
                                    media_binary = fi.read()
                                    media_binaries.extend([media_binary])

                        if current_batch is None:
                            current_batch = pyarrow_table
                        else:
                            current_batch = pa.concat_tables([current_batch, pyarrow_table])
                    except Exception as e:
                        print(f"Error with {member.name}:", e)

    if current_batch is not None:
        try:
            dataset_schema.merge(Schema.from_pyarrow(current_batch.schema, merge_keys=merge_keys))
        except Exception as e:
            print(f"Error merging schema: {e}")

    if current_batch is not None and media_binaries:
        if len(media_binaries) == current_batch.num_rows:
            try:
                image_column = pyarrow.array(media_binaries, type=pyarrow.binary())
                current_batch = current_batch.add_column(
                    len(current_batch.schema),
                    'media_binary',
                    image_column
                )
                # Edit dataset_schema to have media_binaries as a field object
                dataset_schema.add_field(Field('media_binary', Datatype.binary(image_filename[image_filename.index('.') + 1:].lower())))
            except Exception as e:
                print(f"Mismatch between media binaries and batch rows: {e}")


Collaborator


This is quite heavily nested. Not worth it now, but I'd extract this into a WebDatasetReader sort of class that manages all this in a non-nested way if we do more development.
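
As a rough illustration of the suggested refactor (the class and method names here are hypothetical, and the helper normalize_filename() is assumed to exist as in the PR), the reading logic might be flattened into something like:

import os
import tarfile

import pyarrow as pa
import pyarrow.json  # makes pa.json available


class WebDatasetReader:
    """Sketch of a non-nested reader; yields one (table, media_binaries) pair per batch."""

    def __init__(self, file_uri: str, merge_key: str, batch_size: int = 1):
        self.file_uri = file_uri
        self.merge_key = merge_key
        self.batch_size = batch_size

    def read_batches(self):
        with tarfile.open(self.file_uri, "r") as tar:
            members = [m for m in tar.getmembers() if not m.name.startswith("._")]
            json_members = [m for m in members if m.isfile() and m.name.endswith(".json")]
            for start in range(0, len(json_members), self.batch_size):
                batch = json_members[start:start + self.batch_size]
                yield self._read_batch(tar, members, batch)

    def _read_batch(self, tar, members, batch):
        tables, binaries = [], []
        for member in batch:
            table = pa.json.read_json(tar.extractfile(member))
            tables.append(table)
            binaries.append(self._read_media(tar, members, table))
        return pa.concat_tables(tables), binaries

    def _read_media(self, tar, members, table):
        # Look up the media member named by the merge-key column of the JSON record.
        media_name = normalize_filename(os.path.basename(table[self.merge_key][0].as_py()))
        media = next((m for m in members if normalize_filename(m.name) == media_name), None)
        return tar.extractfile(media).read() if media else None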

pyarrow_table = pyarrow.json.read_json(f)
image_filename = pyarrow_table[merge_key][0].as_py()

# truncated_filename = normalize_filename(image_filename[image_filename.index('/') + 1:])
Collaborator


This blows up if you have multiple merge keys. Need to fix.


We have currently handled the following cases:

  1. If there are multiple merge keys, raise a ValueError
  2. If the merge key input is a list with one merge key, proceed with the one merge key.

We can adapt this to handle multiple merge keys instead of raising an error if wanted.
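
A minimal sketch of that check, assuming it lives inside from_webdataset() (the helper name here is illustrative):

def resolve_merge_key(merge_keys):
    # Accept a single key, or a list/tuple containing exactly one key.
    if isinstance(merge_keys, str):
        return merge_keys
    keys = list(merge_keys)
    if len(keys) != 1:
        raise ValueError(f"from_webdataset currently supports exactly one merge key, got {keys!r}")
    return keys[0]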

Comment on lines 1 to 13
@dataclass(frozen=True)
class Field:
    name: str
    datatype: Datatype
    is_merge_key: bool = False

class Schema(MutableMapping[str, Field]):
    def __init__(
        self,
        fields: Iterable[Tuple[str, Datatype] | Field] = None,
        merge_keys: Optional[Iterable[str]] = None,
    ):
        self._fields: Dict[str, Field] = {}
Collaborator


Not understanding why a class called Field is in a file called schema_test?

Contributor

@025rhu 025rhu Aug 18, 2025


This file was just for our own purposes of understanding the classes and files, not for the PR, so we have removed it completely.

The Field and Schema classes were already defined in schema.py; we had copied them into this file purely for our own understanding.

We've moved the process_tar() function into test_wds.py for now, since it could be a helpful test util, but we are not sure if that is the best place to put it (or whether we even want to keep it).

Comment on lines 22 to 25
"""Test that from_webdataset correctly identifies all fields in the tar file."""
tar_path = "../../../test_utils/resources/test_wds.tar"
dataset = Dataset.from_webdataset(
name="test_webdataset",
Collaborator


Generally don't like static, pre-generated test objects like this. The class should generate the wds file prior to the test using a standard wds function, then test on the created file, then delete the file at the end. This ensures that if the wds library/standard changes in a way that impacts serialization, the tests actually fail properly. Right now, there's no confidence the .tar is actually a real wds file, and no clear understanding of what's in that file or how it was generated.

Author


We looked into dynamically generating a webdataset, but it's a bit tricky in this case since typical webdatasets include media files like .jpg, which are not straightforward to create dynamically in a lightweight way. We could use .txt files, but that wouldn't fully test the expected use case.

Also, we noticed that other parts of the codebase (like CSV and Parquet) have some static test files, so we planned to follow that pattern by adding a minimal .tar file for testing. Happy to revisit this if there's a preferred way to generate valid webdataset test data inline.

Comment on lines 95 to 105
def test_metadata_directory_creation(tmp_path):
    """Test that metadata directory is properly initialized."""
    tar_path = "../../../test_utils/resources/test_wds.tar"
    dataset = Dataset.from_webdataset(
        name="test_meta",
        file_uri=tar_path,
        metadata_uri=tmp_path,
        merge_keys="filename"
    )
    assert hasattr(dataset, "_metadata_path")
    assert dataset._metadata_path is not None
Collaborator


Don't test internal attributes like this. Remove this test and instead test whatever the _metadata_path is attempting to create (i.e. fetch some useful metadata that would require the metadata path dir to exist).


We created a test, test_dataset_persistence_and_reloading, which successfully creates, saves, and scans the dataset. We take this to mean that the metadata was successfully written and that the path exists, but we can also explore more comprehensive tests.

foo.py Outdated
Comment on lines 1 to 10
import csv
import pyarrow as pa
import pyarrow.compute as pc

animal = pa.array(["sheep", "cows", "horses", "foxes", "sheep"], type=pa.string())
count = pa.array([12, 5, 2, 1, 10], type=pa.int8())
year = pa.array([2022, 2022, 2022, 2022, 2021], type=pa.int16())

# Creating a table from arrays
table = pa.Table.from_arrays([animal, count, year], names=['animal', 'count', 'year'])
Collaborator


File doesn't belong in commit?


We've removed this file.

foowds.py Outdated
Comment on lines 1 to 16
import os
import json
import tarfile
import io
import numpy as np
from PIL import Image

# Create mock data directory
if not os.path.exists('mock_data'):
    os.makedirs('mock_data')

# Sample IDs for medical papers (similar to the example)
sample_ids = [
    "",
    "PMC4129566_00003",
    "PMC4872614_00002",
Collaborator


File doesn't belong in commit? Or it needs to move to tests/utils or some equivalent so it can be used by the wds test class.


We've removed this file.

Comment on lines 1 to 11
import itertools
import pytest
import pyarrow as pa
import json
import tarfile
from deltacat.storage.rivulet import Dataset, Schema, Field, Datatype


def test_schema_field_types():
    """Test that Schema correctly stores Field objects with their types."""
    fields = [
Collaborator


Seems to be missing an actual data read/write; all the tests are about the schema.

Contributor


All our tests now use Pytest fixtures to create tar files at the beginning of each test. They each run from_webdataset(), so that should include verifying that reading from the tar file worked. Then each test checks the fields, types, and some values in the Dataset produced by from_webdataset(), which I believe should handle checking that writing worked.

We put all test cases in a class called TestFromWebDataset to ensure proper setup and teardown, leaving no directories lying around. This meant we only used one temp directory from Pytest's tmp_path, so we named datasets, tar files, and JSON files uniquely to their respective entities instead.

The one thing left over is testing with image files (instead of .txt files), for which we left a TODO and commented-out code, and which we can address once we discuss the specifics further!
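
For illustration, a fixture along these lines (names and contents are hypothetical; the real fixtures live in test_wds.py) can build a small WebDataset-style tar from JSON records and .txt payloads:

import json
import tarfile

import pytest


@pytest.fixture
def wds_tar(tmp_path):
    """Build a tiny WebDataset-style .tar with one JSON record and one .txt
    payload per sample; illustrative only."""
    tar_path = tmp_path / "test_wds.tar"
    with tarfile.open(tar_path, "w") as tar:
        for i in range(3):
            txt_path = tmp_path / f"sample_{i}.txt"
            txt_path.write_text(f"payload {i}")
            json_path = tmp_path / f"sample_{i}.json"
            json_path.write_text(json.dumps({"filename": txt_path.name, "label": i}))
            tar.add(json_path, arcname=json_path.name)
            tar.add(txt_path, arcname=txt_path.name)
    return tar_path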

@025rhu
Contributor

025rhu commented Aug 17, 2025

To your comment @pdames (Hmm... it seems like this PR includes a lot of superfluous/untouched files in the diff that are creating unnecessary conflicts. Maybe the changes can be squashed and rebased on top of the latest from https://github.com/ray-project/deltacat/tree/2.0 then resubmitted to clean this up?):

It looks like we accidentally changed the executable mode of ~400 files in one commit (most likely by running chmod +x at some point).

We have merged the latest 2.0 into the PR (which unfortunately did not resolve the executable mode issue), and changed the executable modes back for all relevant files.

@025rhu
Contributor

025rhu commented Aug 18, 2025

Overall changes:

  • merged in the latest 2.0
  • resolved issue with the extra ~400 files modified in this PR
  • error out if there are multiple merge keys (for now)
  • new, working test suite for from_webdataset() in test_wds.py; deleted old test suite
  • ran and passed linter
