
Release 1.19 lyft#83

Draft
maheepm-lyft wants to merge 2355 commits into release-1.17-lyft from release-1.19-lyft

Conversation

@maheepm-lyft

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

(for example:)

  • The TaskInfo is stored in the blob store on job creation time as a persistent artifact
  • Deployments RPC transmits only the blob storage reference
  • TaskManagers retrieve the TaskInfo from the blob cache

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4-node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

XComp and others added 30 commits February 19, 2024 16:43
…MemoryError in JDK21

This test started to fail quite regularly on JDK 21. The problem was that the low heap size could cause an OutOfMemoryError while compiling the dummy classes. An OOM in the compilation phase results in a different error message being printed to stdout, which wasn't captured by the test.

The solution is to pre-compile the classes upfront (with the normal heap size). The test's main method then only loads the classes; no compilation is necessary.
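A minimal sketch of that approach, with purely illustrative names (LowHeapTestMain and DummyClass are not from the commit): the low-heap test JVM only loads classes that were compiled beforehand with the normal heap size.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

// Sketch only: LowHeapTestMain and DummyClass are illustrative names.
public class LowHeapTestMain {

    public static void main(String[] args) throws Exception {
        // The classes were compiled beforehand with a normal heap size and placed
        // into this directory; the low-heap JVM only loads them.
        Path preCompiledClassesDir = Path.of(args[0]);

        try (URLClassLoader loader =
                new URLClassLoader(new URL[] {preCompiledClassesDir.toUri().toURL()})) {
            // Loading a class is cheap compared to running the compiler, so the
            // small heap no longer triggers an OutOfMemoryError here.
            Class<?> dummy = loader.loadClass("DummyClass");
            System.out.println("Loaded " + dummy.getName());
        }
    }
}
```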
…ated tests of DefaultSlotStatusSyncerTest

Also deduplicate the code of these tests.
suspend and cancel reset the ExecutionGraph in a similar way. I move the common logic into its own method to make this more prominent in the code.
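A minimal sketch, with assumed names, of the refactoring described above: suspend() and cancel() delegate the shared ExecutionGraph reset to a single private method (this is not the actual Flink scheduler code).

```java
// Sketch only: illustrates the refactoring, not the actual Flink scheduler.
class SchedulerSketch {

    enum JobStatus { SUSPENDED, CANCELED }

    void suspend(Throwable cause) {
        resetExecutionGraph(JobStatus.SUSPENDED, cause);
    }

    void cancel() {
        resetExecutionGraph(JobStatus.CANCELED, null);
    }

    // The logic shared by suspend and cancel lives in one prominent place
    // instead of being duplicated in both callers.
    private void resetExecutionGraph(JobStatus targetStatus, Throwable cause) {
        // transition the job to targetStatus, release resources, notify listeners, ...
    }
}
```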
…iptorGroup out of the RPC main thread]"

This reverts commit d18a4bf.

(cherry picked from commit 7a709bf)
…used by multiple writes to the same sink table and shared staging directory

This closes apache#24492

* Fix unstable TableSourceITCase#testTableHintWithLogicalTableScanReuse
* Moves the staging dir configuration into builder for easier testing

---------

Co-authored-by: Matthias Pohl <matthias.pohl@aiven.io>
(cherry picked from commit 7d0111d)
Jiabao-Sun and others added 30 commits April 18, 2025 15:21
So far, we used a special value for the final checkpoint on endInput. However, as shown in the description of this ticket, final doesn't mean final. Hence, multiple committables with EOI could be created at different times.

With this commit, we stop using a special value for such committables and instead try to guess the checkpoint id of the next checkpoint. There are various factors that influence the checkpoint id but we can mostly ignore them all because we just need to pick a checkpoint id that is
- higher than all checkpoint ids of the previous, successful checkpoints of this attempt
- higher than the checkpoint id of the restored checkpoint
- lower than any future checkpoint id.

Hence, we just remember the last observed checkpoint id (initialized with max(0, restored id)) and use the last id + 1 for endInput. Naturally, multiple endInput calls happening across restarts will result in unique checkpoint ids. Note that aborted checkpoints before endInput may result in diverged checkpoint ids across subtasks. However, each of these ids satisfies the above requirements, and any id of endInput1 will be smaller than any id of endInput2. Thus, diverged checkpoint ids will not impact correctness at all.

(cherry picked from commit 9302545)
Co-authored-by: Ferenc Csaky <fcsaky@apache.org>
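A minimal sketch of the id-guessing scheme described above, using assumed class and method names rather than the actual Flink ones: track the last observed checkpoint id, initialized with max(0, restored id), and hand out last id + 1 for endInput.

```java
// Sketch only: assumed names, not the actual Flink sink/committer classes.
class EndInputCheckpointIdTracker {

    private long lastObservedCheckpointId;

    EndInputCheckpointIdTracker(long restoredCheckpointId) {
        // Initialized with max(0, restored id).
        this.lastObservedCheckpointId = Math.max(0L, restoredCheckpointId);
    }

    /** Called for every successful checkpoint of this attempt. */
    void onCheckpoint(long checkpointId) {
        lastObservedCheckpointId = Math.max(lastObservedCheckpointId, checkpointId);
    }

    /**
     * Id used for committables emitted on endInput: higher than every previous
     * checkpoint of this attempt and than the restored checkpoint, and lower
     * than any future checkpoint id.
     */
    long endInputCheckpointId() {
        return lastObservedCheckpointId + 1;
    }
}
```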
…in adaptive scheduler

Also enable this strategy by default via the introduced config option
Co-authored-by: Matthias Pohl <github@mapohl.com>
…Y type by evaluating Unsafe.arrayBaseOffset(byte[].class) in TM rather than in JM (apache#26592)

Fix HashPartitioner codegen for BINARY/VARBINARY type by evaluating BYTE_ARRAY_BASE_OFFSET in TM instead of JM.

The issue is that if JM memory is set above 32 GB while TM memory is set below 32 GB, the JVM treats the Java process with the larger heap as a large-heap JVM. This can change Unsafe behavior: for example, UNSAFE.arrayBaseOffset(byte[].class) returns 24 on a large-heap JVM but 16 on others.

Because of this, tasks running on a TM whose heap configuration differs from the JM's (TM < 32 GB while JM > 32 GB, or vice versa) read the wrong memory locations when reading the byte[] for MurmurHash.

Signed-off-by: Jiangjie (Becket) Qin <becket.qin@gmail.com>
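An illustrative sketch (not the generated code from this change) of why the offset must be computed in the JVM that reads the bytes: UNSAFE.arrayBaseOffset(byte[].class) depends on that JVM's heap configuration, so a value captured on the JM can be wrong on the TM.

```java
import sun.misc.Unsafe;

import java.lang.reflect.Field;

// Sketch only: demonstrates evaluating the byte[] base offset locally.
final class ByteArrayOffset {

    private static final Unsafe UNSAFE = getUnsafe();

    // Evaluated in the JVM that actually reads the byte[] (the TaskManager):
    // 16 with compressed oops (heap <= ~32 GB), 24 on a large-heap JVM.
    static final long BYTE_ARRAY_BASE_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);

    static long readLong(byte[] bytes, int index) {
        // Using a locally computed offset keeps the read aligned with this
        // JVM's actual object layout.
        return UNSAFE.getLong(bytes, BYTE_ARRAY_BASE_OFFSET + index);
    }

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```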
…es in batch mode (apache#27016)

In apache#26433, we removed the EOI marker in the form of Long.MAX_VALUE as the checkpoint id. Since
streaming pipelines can continue to checkpoint even after their respective operators have been shut
down, it is not safe to use a constant as this can lead to duplicate commits.

However, in batch pipelines we only have one commit on job shutdown. Using any checkpoint id should
suffice in this scenario. Any pending committables should be processed by the CommitterOperator when
the operator shuts down. No further checkpoints will take place.

There are various connectors which rely on this behavior. I don't see any drawbacks from keeping
this behavior for batch pipelines.
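A minimal sketch, with assumed names, of the batch behavior described above: all pending committables are committed exactly once at shutdown under a single, arbitrary checkpoint id, because no further checkpoints follow in batch execution.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: assumed names, not the actual Flink committer classes.
class BatchCommitterSketch {

    interface Committable {
        void commit(long checkpointId);
    }

    private final List<Committable> pending = new ArrayList<>();

    void addCommittable(Committable committable) {
        pending.add(committable);
    }

    /** Called exactly once when the batch job shuts down. */
    void endOfInput(long anyCheckpointId) {
        // In batch execution there is exactly one commit and no later
        // checkpoints, so any checkpoint id is safe: duplicates cannot occur.
        for (Committable committable : pending) {
            committable.commit(anyCheckpointId);
        }
        pending.clear();
    }
}
```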
If a resource is lazily created in open(), we can only close it after checking for null. Otherwise, a failure during initialization will trigger secondary failures.
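A minimal sketch of the guarded close described above (the field and resource type are illustrative): if open() fails before the resource is assigned, close() must not fail on the null field.

```java
import java.io.FileInputStream;
import java.io.IOException;

// Sketch only: the FileInputStream stands in for any lazily created resource.
class LazyResourceHolder implements AutoCloseable {

    private FileInputStream in; // stays null if open() fails before the assignment

    void open(String path) throws IOException {
        in = new FileInputStream(path); // may throw, leaving the field null
    }

    @Override
    public void close() throws IOException {
        // Without the null check, a failure during initialization would trigger a
        // secondary NullPointerException when close() runs during cleanup.
        if (in != null) {
            in.close();
        }
    }
}
```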