diff --git a/_data/contributors.yml b/_data/contributors.yml
index 8cefa3151222..42c0e5a1a6cd 100644
--- a/_data/contributors.yml
+++ b/_data/contributors.yml
@@ -76,4 +76,7 @@
 - name: Andrew Lamb
   apacheId: alamb
   githubId: alamb
+- name: Matthew Kim
+  apacheId: friendlymatthew # Not a real apacheId
+  githubId: friendlymatthew
 # End contributors.yml
diff --git a/_posts/2025-11-25-variant.md b/_posts/2025-11-25-variant.md
new file mode 100644
index 000000000000..cfdbb1570840
--- /dev/null
+++ b/_posts/2025-11-25-variant.md
@@ -0,0 +1,123 @@
+---
+layout: post
+title: "Variant: tbd"
+author: friendlymatthew
+date: "2025-11-25 00:00:00"
+categories: [release]
+---
+
+
+Variant is a data type designed to solve JSON's performance problems in OLAP systems, and its initial implementation has been released in Arrow 57.0.0.
+
+This article explains the limitations of JSON, introduces the Variant type that solves these problems, and shows why you should be excited about it by analyzing performance characteristics on a real-world dataset.
+
+## What's the problem with JSON?
+
+Many real-world datasets are messy and lack a rigid schema. As a result, it's common to store such data as a JSON column.
+
+The access pattern of this model looks like:
+
+- **read**: for every row you scan, decode the entire JSON document into memory and evaluate it
+- **write**: for every record you write, encode it as its own JSON value
+
+There are several problems with this approach.
+
+_Read performance degrades quickly._ Evaluating any predicate requires scanning every row and deserializing the entire JSON payload into memory, no matter its size. Even multi-megabyte documents must be completely decoded just to inspect a single field.
+
+_Write efficiency is also poor._ Because each row stores an independently encoded JSON object, common object fields are redundantly serialized over and over. This increases storage size and adds unnecessary encoding overhead.
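To make both costs concrete, here's a minimal Python sketch (the rows and field names are illustrative, not taken from any real dataset):

```python
import json

# A "JSON column": each row is an independently encoded document.
rows = [
    json.dumps({"user_id": i, "event": "click", "payload": "x" * 100})
    for i in range(1000)
]

# Read: evaluating one predicate forces a full parse of every row,
# even though only the "user_id" field is needed.
matches = sum(1 for r in rows if json.loads(r)["user_id"] == 123)

# Write: the field names are re-serialized in every row (plus two quote
# characters each), even though they never change across rows.
key_bytes_per_row = sum(len(k) + 2 for k in ("user_id", "event", "payload"))

print(matches, key_bytes_per_row)  # 1 25
```

Those 25 bytes of repeated key text per row buy nothing: every row spells out the same schema.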
+
+These problems aren't new, and solutions exist that improve the access patterns of a JSON column. One common solution is to extract commonly occurring object keys into a dedicated column, also known as object shredding. The challenge is that without a standardized specification, each query engine would shred JSON differently, using incompatible column naming schemes, type mappings, or encoding strategies. This would force downstream systems to either re-shred the data (wasting compute and storage) or fall back to full deserialization (losing the performance benefits).
+
+## How Variant solves these problems
+
+Variant is a data type with an efficient binary encoding. It's designed to store JSON data in a way that is more performant for OLAP query engines.
+
+### Variant has richer data types
+
+JSON is limited to just six data types: strings, numbers, booleans, nulls, objects, and arrays. This simplicity comes at a cost: specialized data types like timestamps and UUIDs have to be coerced into strings at write time, and then parsed back into their original types at read time.
+
+Variant removes this overhead by supporting a much broader range of native types. It still has the composite data types (arrays and objects), but extends the primitive category to include 20 specialized types. Dates, timestamps, UUIDs, binary data, and integers and floats of various widths all have their own native representations. Values are encoded in type-specific binary formats, optimized for their native type rather than a stringified representation. This reduces both storage overhead and query-time parsing costs.
+
+For example, UUIDs shrink by more than 2x when stored as a 16-byte `u128` instead of a 36-character hex string.
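The size gap can be checked with a quick sketch using Python's standard library (not the actual Variant encoder):

```python
import uuid

u = uuid.uuid4()

# JSON has no UUID type, so the value travels as a 36-character hex string.
as_json_string = str(u)

# Variant encodes UUIDs natively as a 16-byte (128-bit) value.
as_variant_bytes = u.bytes

print(len(as_json_string), len(as_variant_bytes))  # 36 16
```

On top of the size savings, the binary form needs no string parsing at read time.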
+
+_[Figure: diagram of `{ id: "some uuid", timestamp: ... }` encoded in both JSON and Variant, logically and physically]_
+
+### Variant can be compactly encoded through a two-column design
+
+JSON columns serialize every row independently, which leads to unnecessary encoding overhead. For example, when storing `{"user_id": 123, "timestamp": "2025-04-21"}` thousands of times, the field names `user_id` and `timestamp` are written out in full for each row, even though they're identical across rows.
+
+Variant avoids this overhead by splitting data across two columns: `metadata` and `value`.
+
+The `metadata` column stores a dictionary of unique field names. The `value` column stores the actual data. When encoding an object, field names aren't written inline. Instead, each field references its name by its position in the metadata dictionary.
+
+_[Figure: diagram of a primitive variant and a variant object being written to a metadata and value column]_
+
+At the file level, this design enables a powerful optimization. In Parquet, for example, data is written in row groups, and each column within a row group forms a column chunk: a contiguous set of values. You can build a single metadata dictionary per column chunk, containing the union of all unique field names across every row in that chunk. Rows with different schemas all reference the same shared dictionary. This way, each unique field name appears exactly once in the dictionary, even if it's used in thousands of rows.
+
+_[Figure: diagram of multiple variants pointing to the same metadata dictionary]_
+
+### Variant guarantees faster search performance
+
+When objects are encoded, field entries must be written in lexicographic order by field name. This ordering constraint enables efficient field lookups: finding a value by field name takes `O(log n)` time via binary search, where `n` is the number of fields in the object. Without this guarantee, you'd need to sequentially scan through every field.
+For deeply nested objects with dozens of fields, this difference compounds quickly.
+
+The metadata dictionary can similarly be optimized for faster search performance by keeping its list of field names unique and sorted.
+
+### Variant can leverage file format capabilities
+
+Even with fast field lookups, Variant still requires full deserialization to access any field: the entire value must be decoded just to read a single field. This wastes CPU and memory on data that you don't need.
+
+Variant solves this problem by standardizing a shredding specification. The specification defines how to extract frequently accessed fields from a Variant value and store them as separate typed columns in the file format. Because the specification is shared, when one query engine shreds Variant data into separate columns, any other Variant-compatible engine can understand and query those shredded columns directly. This interoperability is crucial in modern systems where data written by Spark might be read by DataFusion, for example.
+
+In the following example, we'll showcase the benefits of reading shredded Variants from Parquet files.
+
+Let's say we have a column of Variant data, and we notice that the timestamp field `start_timestamp` and a 64-bit integer field `user_id` are frequently queried:
+
+```sql
+SELECT count(*) FROM events
+WHERE typed_value.start_timestamp > '2025-01-01' AND typed_value.user_id = 12345;
+```
+
+By shredding the `start_timestamp` and `user_id` fields into dedicated timestamp and integer columns, we gain the full benefits of Parquet-native columns.
+
+todo matthew: work on this section and flesh this out
+
+Pair the shredding with the benefits that a columnar file format like Parquet offers, and the execution looks like this:
+
+- **zone maps** check each row group's `start_timestamp` range. Row groups with `max(start_timestamp) <= '2025-01-01'` are skipped entirely.
+- **bloom filters** check remaining row groups for `user_id = 12345`.
+Row groups without this user are skipped.
+
+More importantly, you only deserialize the shredded columns you need, not the entire Variant value.
+
+Shredding makes the trade-off explicit: extract frequently queried fields into optimized columns at write time, and keep everything else flexible. You get columnar performance for common access patterns and schema flexibility for everything else.
+
+## Why you should be excited about Variant
+
+In this section, we'll explore the performance characteristics of JSON and Variant data. We'll use ClickHouse's JSON Bench [dataset](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events), a billion-record collection of Bluesky events designed to represent real-world production data.
+
+Our benchmarks focus on three key metrics: write performance when serializing data to Parquet files, storage efficiency comparing both compressed and uncompressed file sizes, and query performance across common access patterns.
+
+For query execution, we use DataFusion, a popular query engine. To work with Variant in DataFusion, we use [datafusion-variant](https://github.com/datafusion-contrib/datafusion-variant), a library that implements native Variant type support.
+
+Full benchmark implementation code is available [here]().
+
+### Write performance
+
+### Read performance