UDF support: CREATE FUNCTION DDL with pipeline SQL integration#198

Open
ryannedolan wants to merge 16 commits into main from udfs
Conversation


@ryannedolan ryannedolan commented Mar 19, 2026

Summary

End-to-end UDF support: from CREATE FUNCTION DDL through SqlJob CRD to Flink data plane execution.

  • CREATE FUNCTION DDL with clean syntax supporting Java classes, Python modules, inline code ($$...$$ dollar-quoting), file paths, URLs, and JAR archives via the AS/IN/LANGUAGE clauses
  • SqlJob as intermediate CRD: pipelines route through SqlJob instead of FlinkSessionJob directly, carrying both SQL statements and UDF files for dynamic delivery
  • FlinkRunner fetches from K8s API: the data plane receives a --sqljob=namespace/name reference and pulls SQL + files directly from the SqlJob CR at runtime (matching the production Proteus pattern)
  • SqlJob controller revived: hoptimator-flink-adapter module re-added to the build, reconciler rewritten to current K8sContext API, creates FlinkSessionJob from SqlJob
  • Demo UDFs: Java (Greet, StringLength) and Python (reverse_string) implementations baked into the Flink runner image, with Dockerfile updated for PyFlink support
  • USING JAR emission: functions with JAR archives emit Flink's USING JAR clause for dynamic classloading
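
The data-plane handoff in the bullets above can be sketched roughly as follows (hypothetical helper names; the real FlinkRunner is Java and fetches the CR via DynamicKubernetesApi, so this Python sketch only illustrates the reference parsing and file delivery):

```python
import os

def parse_sqljob_ref(arg):
    """Split a --sqljob=namespace/name argument into (namespace, name)."""
    value = arg.split("=", 1)[1]           # "namespace/name"
    namespace, name = value.split("/", 1)
    return namespace, name

def write_udf_files(files, udf_dir):
    """Write the SqlJob's spec.files map (filename -> code) into the UDF
    directory before the SQL statements are executed."""
    os.makedirs(udf_dir, exist_ok=True)
    for filename, code in files.items():
        with open(os.path.join(udf_dir, filename), "w") as f:
            f.write(code)
```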

DDL examples

  -- Java class (on classpath)
  CREATE FUNCTION greet AS 'com.linkedin.hoptimator.flink.runner.functions.Greet';

  -- Java class from JAR
  CREATE FUNCTION greet AS 'com.example.Greet' IN 'https://artifactory/.../udfs.jar';

  -- Python module reference
  CREATE FUNCTION foo LANGUAGE PYTHON AS 'func' IN 'module';

  -- Python inline code with dollar-quoting
  CREATE FUNCTION foo RETURNS VARCHAR LANGUAGE PYTHON AS 'foo' IN $$
  from pyflink.table.udf import udf
  from pyflink.table import DataTypes

  @udf(result_type=DataTypes.STRING())
  def foo(s):
      return s[::-1] if s else None
  $$;

  -- Python from file
  CREATE FUNCTION foo LANGUAGE PYTHON AS 'foo' IN '/path/to/my_udfs.py';

Test plan

  • Unit tests for Java UDFs (GreetTest, StringLengthTest)
  • Unit tests for FlinkRunner (FlinkRunnerTest)
  • DDL parsing tests for new syntax including dollar-quoting and LANGUAGE-before-AS (create-function-ddl.id)
  • Integration test for SqlJob generation with UDF files (k8s-ddl-udf-files.id)
  • Integration test for demo UDFs with real class names (k8s-ddl-udf-demo.id)
  • All existing integration tests updated for SqlJob output format
  • Full build passes (checkstyle, spotbugs, all tests)
  • End-to-end on cluster: SqlJob → reconciler → FlinkSessionJob → FlinkRunner fetches SqlJob → executes SQL with working UDFs

🤖 Generated with Claude Code

ryannedolan and others added 16 commits March 19, 2026 16:24
Add support for user-defined functions (UDFs) that can be registered via
CREATE FUNCTION and referenced in SQL queries. Registered functions are
included in pipeline SQL so Flink can execute them at runtime.

DDL syntax:
  CREATE FUNCTION name [RETURNS type] AS 'class' [LANGUAGE lang] [WITH (...)]
  DROP FUNCTION name

Phase 1 - JDBC driver + pipeline SQL:
- UserFunction API model (Deployable)
- OpaqueFunction: permissive ScalarFunction for Calcite validation with
  configurable return type (RETURNS clause) and ANY-typed parameters
- Session-scoped function registry on HoptimatorConnection
- CREATE/DROP FUNCTION handling in HoptimatorDdlExecutor
- FunctionImplementor in ScriptImplementor generates CREATE FUNCTION DDL
- PipelineRel.Implementor tracks functions and emits DDL before connectors
- Parser extended with RETURNS and LANGUAGE clauses

Phase 2 - Python code delivery:
- Job API gains files field for inline code (e.g., Python UDF sources)
- SqlJob CRD spec gains files field
- FlinkStreamingSqlJob and reconciler pass files through
- K8sJobDeployer exports files to template environment

Tests:
- ScriptImplementorTest: FunctionImplementor DDL generation
- Quidem unit test (create-function-ddl.id): DDL parsing, type validation
- Quidem integration test (k8s-ddl-udf.id): pipeline SQL with !specify

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n functions

Calcite normalizes identifiers to uppercase, and all session-registered
functions are emitted in pipeline SQL (not just the one used in the query).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Java and Python UDF implementations baked into the Flink runner
image so CREATE FUNCTION DDL resolves real functions at runtime:

- Greet: scalar VARCHAR UDF (Java)
- StringLength: scalar INTEGER UDF (Java)
- reverse_string: scalar VARCHAR UDF (Python/PyFlink)

Update Dockerfile to install Python/PyFlink and copy Python UDFs.
Configure Flink session cluster with Python executable paths.
Add k8s-ddl-udf-demo.id integration test using real UDF class names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change flink-template.yaml to generate SqlJob instead of FlinkSessionJob
directly, so that UDF files (Python code) are bundled into the CRD and
can be dynamically delivered to the data plane.

- flink-template.yaml now generates SqlJob with sql + files fields
- FlinkStreamingSqlJob.yaml.template changed to FlinkSessionJob (session mode)
- FlinkStreamingSqlJob encodes files as --file: directives in sql args
- FlinkRunner parses --file: directives, writes to /opt/python-udfs/
- FlinkControllerProvider registers FlinkSessionJob API
- All integration test expected output updated from FlinkSessionJob to SqlJob

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflict in venice-ddl-insert-partial.id: take main's updated
SQL with multiple key fields (;-delimiter fix from #199) in SqlJob format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The flink-adapter module was orphaned from the build (not in
settings.gradle). Revive it so the SqlJob -> FlinkSessionJob
reconciler is compiled, packaged, and deployed with the operator.

- Add hoptimator-flink-adapter to settings.gradle
- Add as runtimeOnly dependency in hoptimator-operator-integration
  (discovered via SPI ControllerProvider)
- Rewrite FlinkControllerProvider and FlinkStreamingSqlJobReconciler
  to use current K8sContext/K8sApi pattern (was using old Operator API)
- Fix build.gradle dependency aliases (libs.kubernetes.client)
- Add hoptimator-util dependency for Api interface

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hardcoded absolute path with System.getProperty fallback
to satisfy SpotBugs DMI_HARDCODED_ABSOLUTE_FILENAME check.
Configurable via -Dhoptimator.udf.dir, defaults to /opt/python-udfs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The template engine renders an empty map as blank string, not {}.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SnakeYAML dumps an empty map as "{}\n", which the template engine
renders as an indented {} on a separate line after "files:".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SnakeYAML's dump() appends a trailing newline to its output (e.g.,
"{}\n" for an empty map). The template engine's multiline expansion
converts this into a spurious whitespace-only line. Trimming the
output fixes the rendering of {{files}} and other map variables.
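
A toy reproduction of the rendering bug (hypothetical `expand` helper; the actual fix is in the Java template engine operating on SnakeYAML output):

```python
def expand(template, variables):
    """Toy multiline template expansion: each line of a substituted value
    becomes its own output line, mimicking the engine's behavior."""
    out = []
    for line in template.splitlines():
        stripped = line.strip()
        if stripped.startswith("{{") and stripped.endswith("}}"):
            out.extend(variables[stripped[2:-2]].split("\n"))
        else:
            out.append(line)
    return "\n".join(out)

# SnakeYAML dumps an empty map as "{}\n"; without trimming, the trailing
# newline expands into a spurious whitespace-only line.
untrimmed = expand("files:\n{{files}}", {"files": "{}\n"})
trimmed = expand("files:\n{{files}}", {"files": "{}\n".strip()})
```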

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the --file: encoding mechanism with the production pattern:
FlinkRunner receives --sqljob=namespace/name and fetches the SqlJob
CR directly from the K8s API to get SQL statements and UDF files.

- FlinkRunner uses DynamicKubernetesApi to fetch SqlJob CR
- Extracts spec.sql (statements) and spec.files (UDF code)
- Writes files to UDF directory, then executes SQL
- Falls back to SQL-from-args for backward compatibility
- Reconciler simplified: just passes SqlJob reference to template
- FlinkStreamingSqlJob reduced to namespace+name export
- RBAC added for Flink SA to read SqlJob CRs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests that CREATE FUNCTION with CODE option produces a SqlJob
containing the files map. The inline Python code is extracted
from the CODE option and mapped to a filename derived from the
AS clause (e.g., 'my_udf.transform' -> 'my_udf.py').
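
The filename mapping described above might look like this (hypothetical helper, sketched in Python; the actual logic lives in the Java DDL executor):

```python
def derive_filename(callable_ref):
    """Derive a Python source filename from the AS clause's
    'module.function' reference, e.g. 'my_udf.transform' -> 'my_udf.py'."""
    module = callable_ref.rsplit(".", 1)[0]
    return module + ".py"
```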

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New syntax: CREATE FUNCTION foo [RETURNS type] [LANGUAGE lang] AS 'callable' [IN <source>]

- LANGUAGE moves before AS for natural reading
- AS names the callable entry point (class for Java, function for Python)
- IN provides the source: module name or $$-delimited inline code
- $$...$$ dollar-quoting avoids escaping issues for code blocks

Implementation:
- DollarQuoting preprocessor converts $$...$$ to single-quoted strings
  before Calcite parses the SQL (hooked into PARSER_FACTORY)
- Grammar updated in parserImpls.ftl and generated parser
- DDL executor: when LANGUAGE + IN present, auto-detects inline code
  (whitespace) vs module reference, derives module.function path

Examples:
  -- Java class reference (unchanged)
  CREATE FUNCTION greet AS 'com.example.Greet';
  -- Python module reference
  CREATE FUNCTION foo LANGUAGE PYTHON AS 'func' IN 'module';
  -- Python inline code with dollar-quoting
  CREATE FUNCTION foo LANGUAGE PYTHON AS 'foo' IN $$
  def foo(s):
      return s[::-1]
  $$;
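
The preprocessing step can be sketched as follows (the real DollarQuoting preprocessor is Java, hooked into Calcite's PARSER_FACTORY; this Python regex version only illustrates the transformation):

```python
import re

def preprocess_dollar_quotes(sql):
    """Rewrite $$...$$ blocks as standard single-quoted SQL strings,
    doubling any embedded single quotes, before the parser sees the SQL."""
    def to_quoted(match):
        body = match.group(1).replace("'", "''")
        return "'" + body + "'"
    return re.sub(r"\$\$(.*?)\$\$", to_quoted, sql, flags=re.DOTALL)
```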

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The IN clause now accepts three forms:
  - Inline code:     IN $$def foo(s): ...$$
  - File path:       IN '/path/to/udf.py'
  - URL:             IN 'file:///path/to/udf.py'

File paths and URLs are detected by the presence of '/' or ':\',
read at DDL execution time, and stored as CODE in the function
options. The module.function reference is derived from the function
name, same as inline code.

Examples:
  CREATE FUNCTION foo LANGUAGE PYTHON AS 'foo' IN '/tmp/my_udfs.py';
  CREATE FUNCTION foo LANGUAGE PYTHON AS 'foo' IN 'file:///tmp/my_udfs.py';
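
One plausible reading of the detection logic (hypothetical helper name, sketched in Python: inline code is detected by whitespace, paths and URLs by '/' or a drive-style ':\'):

```python
def classify_in_source(source):
    """Classify the IN clause argument: inline code contains whitespace,
    file paths and URLs contain '/' (or a Windows-style ':\\'), and
    anything else is treated as a module reference."""
    if any(c.isspace() for c in source):
        return "inline"
    if "/" in source or ":\\" in source:
        return "path_or_url"
    return "module"
```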

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a function's IN clause points to a JAR archive (detected by .jar
extension), the JAR location is stored in the function options and
emitted as USING JAR in the generated Flink SQL:

  CREATE FUNCTION greet AS 'com.example.Greet' USING JAR '/path/to/udfs.jar';

This enables Flink to dynamically load UDF classes from external JARs.
JAR references are also included in the SqlJob files map so the data
plane can fetch them.

Example DDL:
  CREATE FUNCTION greet AS 'com.example.Greet' IN 'https://artifactory/udfs-1.0.jar';
  CREATE FUNCTION greet AS 'com.example.Greet' IN '/opt/libs/udfs.jar';
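
The emission amounts to roughly this (hypothetical helper, sketched in Python; the real emitter is part of the Java implementor code):

```python
def emit_create_function(name, class_ref, archive=None):
    """Render Flink CREATE FUNCTION DDL, appending a USING JAR clause when
    the function's IN source ends in .jar."""
    ddl = "CREATE FUNCTION " + name + " AS '" + class_ref + "'"
    if archive is not None and archive.endswith(".jar"):
        ddl += " USING JAR '" + archive + "'"
    return ddl + ";"
```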

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ryannedolan ryannedolan marked this pull request as ready for review March 26, 2026 17:33