[DSIP-103][Migration] Support zero-downtime cross-major-version migration using Flink-CDC #17835

@victorsheng

Search before asking

  • I have searched the existing DSIPs and found no similar proposal.

Motivation

Currently, upgrading Apache DolphinScheduler between major versions (e.g., from 1.3.x to 3.x.x) relies on the official upgrade-schema.sh script. This approach has several limitations for large-scale production environments:

  • Downtime Requirement: The master/worker nodes and the metadata database must be offline during the schema upgrade, which is unacceptable for deployments with 24/7 SLAs.
  • All-or-Nothing Risk: It is impossible to migrate only a subset of projects. If an upgrade fails, rolling back a massive database is time-consuming and risky.
  • Schema Complexity: Major versions (especially the jump from 1.x to 2.x/3.x) introduced significant schema changes, such as the decoupling of task and process definitions, so metadata must be restructured rather than copied table by table.

Using Flink-CDC as a migration engine allows for real-time metadata synchronization, gradual "canary" migrations of specific workflows, and zero downtime for the source system.

Design Detail

The migration tool will be implemented as a Flink application that captures changes from the source metadata database and sinks them into the target database after applying version-specific transformations.

  1. Architecture:
  • Source: MySQL/PostgreSQL (Source DS Database) using Flink CDC Connectors.
  • Transformation Layer: A custom MapFunction or ProcessFunction that handles the schema mapping logic. For example:
    • Converting the process_definition_json in 1.3.x into the decoupled task_definition and task_relation in 3.x.x.
    • Generating new snowflake IDs (Global IDs) for the target version.
  • Sink: JDBC Sink (Target DS Database).
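As a rough illustration of the transformation step, the sketch below splits one 1.3.x-style definition JSON into task-definition rows and task-relation rows. It is plain Python rather than Flink code, and several details are assumptions, not confirmed by this proposal: the source layout (a "tasks" array whose entries carry "name", "type", "params", and a "preTasks" list of upstream task names), the target column names, the use of 0 as the upstream code for root tasks, and the hypothetical helper split_process_definition. A simple counter stands in for the snowflake ID generator.

```python
import json
from itertools import count


def split_process_definition(process_json, next_code):
    """Split a 1.3.x-style definition into task rows and relation rows.

    next_code is any zero-argument callable returning a fresh unique code;
    a real job would plug in a snowflake-style generator here.
    """
    definition = json.loads(process_json)
    tasks = definition.get("tasks", [])

    # Assign every task a new code first, keyed by task name, because the
    # assumed source layout references upstream tasks by name in "preTasks".
    code_by_name = {task["name"]: next_code() for task in tasks}

    task_definitions = [
        {
            "code": code_by_name[task["name"]],
            "name": task["name"],
            "task_type": task["type"],
            "task_params": json.dumps(task.get("params", {}), sort_keys=True),
        }
        for task in tasks
    ]

    task_relations = []
    for task in tasks:
        post_code = code_by_name[task["name"]]
        # Assumption: a task without upstream tasks gets a single relation
        # row whose pre_task_code is 0.
        pre_names = task.get("preTasks") or [None]
        for pre_name in pre_names:
            pre_code = code_by_name[pre_name] if pre_name is not None else 0
            task_relations.append(
                {"pre_task_code": pre_code, "post_task_code": post_code}
            )
    return task_definitions, task_relations


# Usage with a two-task workflow: "load" runs after "extract".
sample = json.dumps({
    "tasks": [
        {"name": "extract", "type": "SHELL",
         "params": {"rawScript": "echo 1"}, "preTasks": []},
        {"name": "load", "type": "SHELL",
         "params": {"rawScript": "echo 2"}, "preTasks": ["extract"]},
    ]
})
counter = count(1000)
definitions, relations = split_process_definition(sample, lambda: next(counter))
```

Assigning all codes before building relations keeps the function independent of the order in which tasks appear in the JSON. In the proposed architecture this logic would sit inside the MapFunction or ProcessFunction, with the resulting rows handed to the JDBC sink.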
  2. Key Components:
  • Granular Filter: A configuration parameter (e.g., migration.project.codes) to allow users to select specific projects for migration.
  • Stateful Mapping: Flink state that maintains the mapping between old IDs and new IDs, so that the same source record resolves to the same target ID across multiple tables.
  • Data Conversion Engine: A dedicated module to parse 1.x JSON strings and reconstruct them into the target version's relational model.
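The granular filter and the ID mapping can be sketched together in plain Python. This is a minimal illustration under stated assumptions: parse_project_codes, should_migrate, and IdMapper are hypothetical names, the "project_code" field on each change record is assumed, an empty selection is taken to mean "migrate everything", and an in-memory dict stands in for Flink state. In an actual Flink job the mapping would more likely live in keyed state (for example a MapState) so that it survives checkpoints and restarts.

```python
from itertools import count


def parse_project_codes(raw):
    """Parse a comma-separated setting such as migration.project.codes."""
    return {int(part) for part in raw.split(",") if part.strip()}


def should_migrate(row, selected_project_codes):
    # Assumption: an empty selection means every project is migrated.
    if not selected_project_codes:
        return True
    return row["project_code"] in selected_project_codes


class IdMapper:
    """Maps (table, old_id) to a target ID, reusing earlier assignments."""

    def __init__(self, generate_id):
        self._generate_id = generate_id
        self._mapping = {}  # stand-in for Flink state

    def map(self, table, old_id):
        key = (table, old_id)
        if key not in self._mapping:
            self._mapping[key] = self._generate_id()
        return self._mapping[key]


# Usage: the third record is a later update to the same source row as the
# first, so it must resolve to the same target ID.
counter = count(5000)
mapper = IdMapper(lambda: next(counter))
selected = parse_project_codes("101, 102")
rows = [
    {"table": "t_ds_process_definition", "id": 7, "project_code": 101},
    {"table": "t_ds_process_definition", "id": 8, "project_code": 999},
    {"table": "t_ds_process_definition", "id": 7, "project_code": 101},
]
migrated = [
    (row["id"], mapper.map(row["table"], row["id"]))
    for row in rows
    if should_migrate(row, selected)
]
```

Keying the mapping by table and old ID together avoids collisions between tables that reuse the same auto-increment values.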

Compatibility, Deprecation, and Migration Plan

  • Compatibility: This feature is an alternative migration path and does not replace the existing upgrade-schema.sh.
    • It supports "Source-Live" mode, where the source system remains read-write while the target system is being populated.
  • Deprecation: None.
  • Migration Plan:
    • Deploy the Target DolphinScheduler version (fresh install).
    • Configure and start the Flink-CDC migration job.
    • Perform verification on the Target environment (e.g., dry-run workflows).
    • Gradually switch the scheduling traffic from Source to Target by project.
    • Stop the CDC job once all projects are migrated.

Test Plan

  • Unit Tests:
    • Validate the JSON transformation logic from 1.3.x to 3.x.x.
    • Test the ID generator and mapping state.
  • Integration Tests:
    • End-to-end migration from a standard DS 1.3.5 database to a DS 3.2.2 database.
    • Verify workflow execution on the target side after migration.
  • Consistency Tests:
    • Compare MD5 checksums of process definitions between source and target, computed over a normalized representation because the two schemas differ.
    • Validate record counts across all core tables (t_ds_project, t_ds_process_definition, etc.).
  • Performance Tests:
    • Benchmark the migration speed for environments with >10,000 workflow definitions.
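The consistency tests above could be sketched roughly as follows. The sketch assumes both sides have already been loaded and normalized into a common shape (a dict from workflow name to a JSON-serializable definition), since the raw rows differ between versions; definition_digest and compare_definitions are hypothetical helper names, and the sample data is invented for illustration.

```python
import hashlib
import json


def definition_digest(definition):
    """MD5 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def compare_definitions(source, target):
    """Report count differences, missing workflows, and digest mismatches."""
    missing = sorted(set(source) - set(target))
    shared = set(source) & set(target)
    mismatched = sorted(
        name for name in shared
        if definition_digest(source[name]) != definition_digest(target[name])
    )
    return {
        "source_count": len(source),
        "target_count": len(target),
        "missing": missing,
        "mismatched": mismatched,
    }


# Usage with invented sample data: "etl_daily" matches despite different
# key order, "report" differs in content, and "cleanup" was never migrated.
source = {
    "etl_daily": {"tasks": ["extract", "load"], "timeout": 0},
    "report": {"tasks": ["render"], "timeout": 30},
    "cleanup": {"tasks": ["purge"], "timeout": 0},
}
target = {
    "etl_daily": {"timeout": 0, "tasks": ["extract", "load"]},
    "report": {"tasks": ["render"], "timeout": 60},
}
report = compare_definitions(source, target)
```

Hashing a canonical form rather than the stored text keeps the check from flagging differences that come only from serialization, such as key order or whitespace.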
