UDF support: CREATE FUNCTION DDL with pipeline SQL integration#198

Open
ryannedolan wants to merge 16 commits into main from udfs
Conversation


@ryannedolan ryannedolan commented Mar 19, 2026

Summary

End-to-end UDF support: from CREATE FUNCTION DDL through SqlJob CRD to Flink data plane execution.

  • CREATE FUNCTION DDL with clean syntax supporting Java classes, Python modules, inline code ($$...$$ dollar-quoting), file paths, URLs, and JAR archives via the AS/IN/LANGUAGE clauses
  • SqlJob as intermediate CRD: pipelines route through SqlJob instead of FlinkSessionJob directly, carrying both SQL statements and UDF files for dynamic delivery
  • FlinkRunner fetches from K8s API: the data plane receives a --sqljob=namespace/name reference and pulls SQL + files directly from the SqlJob CR at runtime (matching the production Proteus pattern)
  • SqlJob controller revived: hoptimator-flink-adapter module re-added to the build, reconciler rewritten to current K8sContext API, creates FlinkSessionJob from SqlJob
  • Demo UDFs: Java (Greet, StringLength) and Python (reverse_string) implementations baked into the Flink runner image, with Dockerfile updated for PyFlink support
  • USING JAR emission: functions with JAR archives emit Flink's USING JAR clause for dynamic classloading
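
The data-plane handoff in the bullets above can be sketched roughly as follows (hypothetical helper names; the real FlinkRunner is Java and fetches the CR via DynamicKubernetesApi, so this Python sketch only illustrates the reference parsing and file delivery):

```python
import os

def parse_sqljob_ref(arg):
    """Split a --sqljob=namespace/name argument into (namespace, name)."""
    value = arg.split("=", 1)[1]           # "namespace/name"
    namespace, name = value.split("/", 1)
    return namespace, name

def write_udf_files(files, udf_dir):
    """Write the SqlJob's spec.files map (filename -> code) into the UDF
    directory before the SQL statements are executed."""
    os.makedirs(udf_dir, exist_ok=True)
    for filename, code in files.items():
        with open(os.path.join(udf_dir, filename), "w") as f:
            f.write(code)
```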

DDL examples

  -- Java class (on classpath)
  CREATE FUNCTION greet AS 'com.linkedin.hoptimator.flink.runner.functions.Greet';

  -- Java class from JAR
  CREATE FUNCTION greet AS 'com.example.Greet' IN 'https://artifactory/.../udfs.jar';

  -- Python module reference
  CREATE FUNCTION foo LANGUAGE PYTHON AS 'func' IN 'module';

  -- Python inline code with dollar-quoting
  CREATE FUNCTION foo RETURNS VARCHAR LANGUAGE PYTHON AS 'foo' IN $$
  from pyflink.table.udf import udf
  from pyflink.table import DataTypes

  @udf(result_type=DataTypes.STRING())
  def foo(s):
      return s[::-1] if s else None
  $$;

  -- Python from file
  CREATE FUNCTION foo LANGUAGE PYTHON AS 'foo' IN '/path/to/my_udfs.py';

Test plan

  • Unit tests for Java UDFs (GreetTest, StringLengthTest)
  • Unit tests for FlinkRunner (FlinkRunnerTest)
  • DDL parsing tests for new syntax including dollar-quoting and LANGUAGE-before-AS (create-function-ddl.id)
  • Integration test for SqlJob generation with UDF files (k8s-ddl-udf-files.id)
  • Integration test for demo UDFs with real class names (k8s-ddl-udf-demo.id)
  • All existing integration tests updated for SqlJob output format
  • Full build passes (checkstyle, spotbugs, all tests)
  • End-to-end on cluster: SqlJob → reconciler → FlinkSessionJob → FlinkRunner fetches SqlJob → executes SQL with working UDFs

🤖 Generated with Claude Code

ryannedolan and others added 16 commits March 19, 2026 16:24
Add support for user-defined functions (UDFs) that can be registered via
CREATE FUNCTION and referenced in SQL queries. Registered functions are
included in pipeline SQL so Flink can execute them at runtime.

DDL syntax:
  CREATE FUNCTION name [RETURNS type] AS 'class' [LANGUAGE lang] [WITH (...)]
  DROP FUNCTION name

Phase 1 - JDBC driver + pipeline SQL:
- UserFunction API model (Deployable)
- OpaqueFunction: permissive ScalarFunction for Calcite validation with
  configurable return type (RETURNS clause) and ANY-typed parameters
- Session-scoped function registry on HoptimatorConnection
- CREATE/DROP FUNCTION handling in HoptimatorDdlExecutor
- FunctionImplementor in ScriptImplementor generates CREATE FUNCTION DDL
- PipelineRel.Implementor tracks functions and emits DDL before connectors
- Parser extended with RETURNS and LANGUAGE clauses

Phase 2 - Python code delivery:
- Job API gains files field for inline code (e.g., Python UDF sources)
- SqlJob CRD spec gains files field
- FlinkStreamingSqlJob and reconciler pass files through
- K8sJobDeployer exports files to template environment

Tests:
- ScriptImplementorTest: FunctionImplementor DDL generation
- Quidem unit test (create-function-ddl.id): DDL parsing, type validation
- Quidem integration test (k8s-ddl-udf.id): pipeline SQL with !specify

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n functions

Calcite normalizes identifiers to uppercase, and all session-registered
functions are emitted in pipeline SQL (not just the one used in the query).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Java and Python UDF implementations baked into the Flink runner
image so CREATE FUNCTION DDL resolves real functions at runtime:

- Greet: scalar VARCHAR UDF (Java)
- StringLength: scalar INTEGER UDF (Java)
- reverse_string: scalar VARCHAR UDF (Python/PyFlink)

Update Dockerfile to install Python/PyFlink and copy Python UDFs.
Configure Flink session cluster with Python executable paths.
Add k8s-ddl-udf-demo.id integration test using real UDF class names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change flink-template.yaml to generate SqlJob instead of FlinkSessionJob
directly, so that UDF files (Python code) are bundled into the CRD and
can be dynamically delivered to the data plane.

- flink-template.yaml now generates SqlJob with sql + files fields
- FlinkStreamingSqlJob.yaml.template changed to FlinkSessionJob (session mode)
- FlinkStreamingSqlJob encodes files as --file: directives in sql args
- FlinkRunner parses --file: directives, writes to /opt/python-udfs/
- FlinkControllerProvider registers FlinkSessionJob API
- All integration test expected output updated from FlinkSessionJob to SqlJob

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflict in venice-ddl-insert-partial.id: take main's updated
SQL with multiple key fields (;-delimiter fix from #199) in SqlJob format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The flink-adapter module was orphaned from the build (not in
settings.gradle). Revive it so the SqlJob -> FlinkSessionJob
reconciler is compiled, packaged, and deployed with the operator.

- Add hoptimator-flink-adapter to settings.gradle
- Add as runtimeOnly dependency in hoptimator-operator-integration
  (discovered via SPI ControllerProvider)
- Rewrite FlinkControllerProvider and FlinkStreamingSqlJobReconciler
  to use current K8sContext/K8sApi pattern (was using old Operator API)
- Fix build.gradle dependency aliases (libs.kubernetes.client)
- Add hoptimator-util dependency for Api interface

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hardcoded absolute path with System.getProperty fallback
to satisfy SpotBugs DMI_HARDCODED_ABSOLUTE_FILENAME check.
Configurable via -Dhoptimator.udf.dir, defaults to /opt/python-udfs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The template engine renders an empty map as blank string, not {}.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SnakeYAML dumps an empty map as "{}\n", which the template engine
renders as an indented {} on a separate line after "files:".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SnakeYAML's dump() appends a trailing newline to its output (e.g.,
"{}\n" for an empty map). The template engine's multiline expansion
converts this into a spurious whitespace-only line. Trimming the
output fixes the rendering of {{files}} and other map variables.
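
A toy reproduction of the rendering bug (hypothetical `expand` helper; the actual fix is in the Java template engine operating on SnakeYAML output):

```python
def expand(template, variables):
    """Toy multiline template expansion: each line of a substituted value
    becomes its own output line, mimicking the engine's behavior."""
    out = []
    for line in template.splitlines():
        stripped = line.strip()
        if stripped.startswith("{{") and stripped.endswith("}}"):
            out.extend(variables[stripped[2:-2]].split("\n"))
        else:
            out.append(line)
    return "\n".join(out)

# SnakeYAML dumps an empty map as "{}\n"; without trimming, the trailing
# newline expands into a spurious whitespace-only line.
untrimmed = expand("files:\n{{files}}", {"files": "{}\n"})
trimmed = expand("files:\n{{files}}", {"files": "{}\n".strip()})
```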

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the --file: encoding mechanism with the production pattern:
FlinkRunner receives --sqljob=namespace/name and fetches the SqlJob
CR directly from the K8s API to get SQL statements and UDF files.

- FlinkRunner uses DynamicKubernetesApi to fetch SqlJob CR
- Extracts spec.sql (statements) and spec.files (UDF code)
- Writes files to UDF directory, then executes SQL
- Falls back to SQL-from-args for backward compatibility
- Reconciler simplified: just passes SqlJob reference to template
- FlinkStreamingSqlJob reduced to namespace+name export
- RBAC added for Flink SA to read SqlJob CRs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests that CREATE FUNCTION with CODE option produces a SqlJob
containing the files map. The inline Python code is extracted
from the CODE option and mapped to a filename derived from the
AS clause (e.g., 'my_udf.transform' -> 'my_udf.py').
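
The filename mapping described above might look like this (hypothetical helper, sketched in Python; the actual logic lives in the Java DDL executor):

```python
def derive_filename(callable_ref):
    """Derive a Python source filename from the AS clause's
    'module.function' reference, e.g. 'my_udf.transform' -> 'my_udf.py'."""
    module = callable_ref.rsplit(".", 1)[0]
    return module + ".py"
```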

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New syntax: CREATE FUNCTION foo [RETURNS type] [LANGUAGE lang] AS 'callable' [IN <source>]

- LANGUAGE moves before AS for natural reading
- AS names the callable entry point (class for Java, function for Python)
- IN provides the source: module name or $$-delimited inline code
- $$...$$ dollar-quoting avoids escaping issues for code blocks

Implementation:
- DollarQuoting preprocessor converts $$...$$ to single-quoted strings
  before Calcite parses the SQL (hooked into PARSER_FACTORY)
- Grammar updated in parserImpls.ftl and generated parser
- DDL executor: when LANGUAGE + IN present, auto-detects inline code
  (whitespace) vs module reference, derives module.function path

Examples:
  -- Java class reference (unchanged)
  CREATE FUNCTION greet AS 'com.example.Greet';
  -- Python module reference
  CREATE FUNCTION foo LANGUAGE PYTHON AS 'func' IN 'module';
  -- Python inline code with dollar-quoting
  CREATE FUNCTION foo LANGUAGE PYTHON AS 'foo' IN $$
  def foo(s):
      return s[::-1]
  $$;
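
The preprocessing step can be sketched as follows (the real DollarQuoting preprocessor is Java, hooked into Calcite's PARSER_FACTORY; this Python regex version only illustrates the transformation):

```python
import re

def preprocess_dollar_quotes(sql):
    """Rewrite $$...$$ blocks as standard single-quoted SQL strings,
    doubling any embedded single quotes, before the parser sees the SQL."""
    def to_quoted(match):
        body = match.group(1).replace("'", "''")
        return "'" + body + "'"
    return re.sub(r"\$\$(.*?)\$\$", to_quoted, sql, flags=re.DOTALL)
```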

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The IN clause now accepts three forms:
  - Inline code:     IN $$def foo(s): ...$$
  - File path:       IN '/path/to/udf.py'
  - URL:             IN 'file:///path/to/udf.py'

File paths and URLs are detected by the presence of '/' or ':\',
read at DDL execution time, and stored as CODE in the function
options. The module.function reference is derived from the function
name, same as inline code.

Examples:
  CREATE FUNCTION foo LANGUAGE PYTHON AS 'foo' IN '/tmp/my_udfs.py';
  CREATE FUNCTION foo LANGUAGE PYTHON AS 'foo' IN 'file:///tmp/my_udfs.py';
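
One plausible reading of the detection logic (hypothetical helper name, sketched in Python: inline code is detected by whitespace, paths and URLs by '/' or a drive-style ':\'):

```python
def classify_in_source(source):
    """Classify the IN clause argument: inline code contains whitespace,
    file paths and URLs contain '/' (or a Windows-style ':\\'), and
    anything else is treated as a module reference."""
    if any(c.isspace() for c in source):
        return "inline"
    if "/" in source or ":\\" in source:
        return "path_or_url"
    return "module"
```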

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a function's IN clause points to a JAR archive (detected by .jar
extension), the JAR location is stored in the function options and
emitted as USING JAR in the generated Flink SQL:

  CREATE FUNCTION greet AS 'com.example.Greet' USING JAR '/path/to/udfs.jar';

This enables Flink to dynamically load UDF classes from external JARs.
JAR references are also included in the SqlJob files map so the data
plane can fetch them.

Example DDL:
  CREATE FUNCTION greet AS 'com.example.Greet' IN 'https://artifactory/udfs-1.0.jar';
  CREATE FUNCTION greet AS 'com.example.Greet' IN '/opt/libs/udfs.jar';
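
The emission amounts to roughly this (hypothetical helper, sketched in Python; the real emitter is part of the Java implementor code):

```python
def emit_create_function(name, class_ref, archive=None):
    """Render Flink CREATE FUNCTION DDL, appending a USING JAR clause when
    the function's IN source ends in .jar."""
    ddl = "CREATE FUNCTION " + name + " AS '" + class_ref + "'"
    if archive is not None and archive.endswith(".jar"):
        ddl += " USING JAR '" + archive + "'"
    return ddl + ";"
```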

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ryannedolan ryannedolan marked this pull request as ready for review March 26, 2026 17:33