-
Notifications
You must be signed in to change notification settings - Fork 40
feat(r/sedonadb): Translate R expressions to DataFusion logical Expr #468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements expression translation functionality for the SedonaDB R package, enabling R expressions to be converted into DataFusion logical expressions. This is a foundational feature for supporting R-style query syntax in SedonaDB.
Key changes:
- Added expression infrastructure including
SedonaDBExprandSedonaDBExprFactorytypes - Implemented R expression evaluation system with translation support for common operations
- Added literal conversion system with support for basic R types and geometry types
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| r/sedonadb/src/rust/src/expression.rs | Implements core Rust expression types and factory methods for creating DataFusion expressions |
| r/sedonadb/src/rust/src/ffi.rs | Adds FFI helper functions for importing Arrow arrays and fields from R |
| r/sedonadb/R/expression.R | Implements R-side expression factory, translation system, and operator mappings |
| r/sedonadb/R/literal.R | Provides S3 generic system for converting R objects to SedonaDB literal expressions |
| r/sedonadb/tests/testthat/test-expression.R | Test coverage for expression translation functionality |
| r/sedonadb/tests/testthat/test-literal.R | Test coverage for literal conversion |
| r/sedonadb/tools/savvy-update.sh | Build utility script for updating savvy-generated code |
| r/sedonadb/src/rust/api.h | Generated C API declarations for new expression types |
| r/sedonadb/src/init.c | Generated C initialization code for new expression functions |
| r/sedonadb/R/000-wrappers.R | Generated R wrapper functions for Rust expression types |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <[email protected]>
| # Add license header, put includes on their own lines, and fix a typo in init.c | ||
| echo "${LICENSE_C}" > "${init_c}" | ||
| sed 's/#include/\n#include/g' "${init_c}.tmp" | \ | ||
| sed '1s/^\n//' | \ | ||
| sed 's/initialzation/initialization/g' >> "${init_c}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yutannihilation I will open an issue and/or try to fix this in savvy when I have a moment, but while I'm thinking about it:
- Includes have to go on their own lines, or else clang-format will reorder them in a way such that they won't compile
- There's a typo in one of the comments that causes our pre-commit to fail ("initialzation")
...the other stuff (license files, clang-format) is specific to our setup and I'm not sure that's savvy's issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding license files, I found bindgen has option to insert arbitrary strings at the top of the generated file. Will it be useful if I implement this in savvy-cli?
bindgen input.h -o bindings.rs \
--raw-line "/*" \
--raw-line " * Copyright (c) 2026 Your Name" \
--raw-line " * Licensed under Apache-2.0" \
--raw-line " */"| sd_expr_column <- function(column_name, qualifier = NULL, factory = sd_expr_factory()) { | ||
| factory$column(column_name, qualifier) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@e-kotov I tried to mirror the DuckDB expression constructors you linked! The names are slightly different because I used DataFusion naming conventions.
| // This seems to require $.ptr from the list() input (can't just | ||
| // use list of R SedonaDBExpr objects) | ||
| let expr_wrapper: &SedonaDBExpr = item.try_into()?; | ||
| Ok(expr_wrapper.inner.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yutannihilation Is there a trick here that I should be using to get a list of a #[savvy] struct XXX back into Rust land?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to remember if there's anything, but I think there's no trick...
Regarding $.ptr, it's posible to return a bare EXTPTRSXP without a wrapper, but it requres you to add several wrapper R functions (e.g. add S3 class, and print it nicely). So, I'm not sure which is easier.
cf. https://yutannihilation.github.io/savvy/guide/struct.html#external-pointer
This PR adds constructors for DataFusion/SedonaDB (logical)
Exprand a framework for constructing the using an R expression. I also added a few function translations to jog my memory on the corner cases with respect to constructing them. This is to support #474 (i.e., a stepping stone to help implement dplyr verbs).This implementation comes from the work I did in arrow ( https://github.com/apache/arrow/blob/main/r/R/dplyr-funcs.R ) and substrait-r ( https://github.com/voltrondata/substrait-r/blob/e3045d21b1765c15d2b774409f4ca2b1ebb01796/R/compiler.R#L448-L466 ) and the suggestions in #474 with respect to how this is organized in duckdb-r/duckplyr.
In arrow and substrait-r we used a slightly more straightforward approach than this PR, which was basically to load an environment with "override" functions and field references. When those packages evaluate an expression, we arrange the environments such that column reference expressions and our "translated" functions (that accept expressions and return expressions) are resolved first. This mostly works but:
/, which in R is never floor division, we need to check if both arguments are integers (and if so, cast one of them to a float).For this reason, I introduced a
sd_expr_ctx()(expression context) here that is faithfully passed down when evaluating translated expressions. This gives each translation access to things like the input schema (so that the output type of the expression can be computed), session options when we have them, and information like grouping expressions for aggregate functions.This means that evaluation is a bit more "manual", but I think the extra control is worth it (although we can revisit as we implement more translations and I remember why various things are hard). We could accomplish some of this with a global variable (like the global
substrait_compiler()in substrait-r) and less complicated evaluation as well.There is a lot to do after this PR, like add support for more functions, more types of literals, and wire up actual dplyr verbs like
filter().