Kafka Connect: Tolerate CommitFailedException and InvalidProducerEpochException during rebalance by kumarpritam863 · Pull Request #16366 · apache/iceberg

kumarpritam863 · 2026-05-16T14:32:17Z

Summary

When a Kafka consumer group re-balance happens between the time the iceberg-kafka-connect
sink task prepares a transactional offset commit and the time the broker processes it, the
connector currently dies with an unrecoverable ConnectException:

org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
    ...
Caused by: org.apache.kafka.clients.consumer.CommitFailedException: Transaction offset Commit failed due to consumer group metadata mismatch: Specified group generation id is not valid.
    at org.apache.kafka.clients.producer.internals.TransactionManager$TxnOffsetCommitHandler.handleResponse(...)

This is a transient, expected failure mode under re-balance: the consumer-group
generation id passed to producer.sendOffsetsToTransaction(...) is already stale by the
time the broker validates it. The same applies to InvalidProducerEpochException when the
producer epoch is bumped mid-flight. Both should be recoverable, not fatal.

Root cause

In Channel.send() the flow is:

producer.beginTransaction();
recordList.forEach(producer::send);
producer.sendOffsetsToTransaction(offsetsToCommit, KafkaUtils.consumerGroupMetadata(context));
producer.commitTransaction();   // <- raises CommitFailedException on stale generation id

The previous catch block aborted the transaction and rethrew the exception unchanged, so it
propagated as a non-retriable failure into WorkerSinkTask.deliverMessages and killed the
task.

Fix

Channel.send() now distinguishes recoverable re-balance failures from fatal ones:

Recoverable (CommitFailedException, InvalidProducerEpochException, including when
wrapped in another KafkaException): abort the transaction and translate to
org.apache.kafka.connect.errors.RetriableException. Connect's framework then pauses the
consumer, retains the message batch, and re-delivers it after the re-balance settles. The
aborted transaction never advanced source offsets, so once the partitions are reassigned
the new owner resumes from the last broker-committed offset and no data is lost. Any data
files flushed before the abort become orphans, which Iceberg's orphan-file cleanup can
remove.

Fatal (any other exception, such as a KafkaException unrelated to re-balance): rethrown
as-is, so the task still fails as it did before this change.

CommitterImpl.processControlEvents adds a RetriableException log line at info level so
the re-balance recovery is visible in task logs but doesn't surface as an error.

KafkaUtils.seekToLastCommittedOffsets(SinkTaskContext) is added and called from
IcebergSinkTask.close(partitions) in addition to open(partitions). This is required
under incremental cooperative re-balance, where Connect can invoke close() on a
revoked partition without a paired open(), so the framework's own rewind in
onPartitionsAssigned never runs for that partition. Seeking the main consumer in close()
guarantees that records read past the broker-committed offset (and never committed
transactionally) are re-fetched on the next poll.
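
The recoverable-versus-fatal split that Channel.send() makes in this section can be sketched as follows. This is an illustration, not the code in this PR: the nested exception classes are stand-ins for the real Kafka and Connect types so the block compiles on its own, and the names RebalanceFailureSketch, isRebalanceFailure and translate are hypothetical.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class RebalanceFailureSketch {
  private RebalanceFailureSketch() {}

  // Stand-ins for the real Kafka and Connect exception types (assumption:
  // only their class identity and cause chain matter for this sketch).
  static class KafkaException extends RuntimeException {
    KafkaException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  static class CommitFailedException extends KafkaException {
    CommitFailedException(String message) {
      super(message, null);
    }
  }

  static class InvalidProducerEpochException extends KafkaException {
    InvalidProducerEpochException(String message) {
      super(message, null);
    }
  }

  static class RetriableException extends RuntimeException {
    RetriableException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Walks the cause chain so a re-balance failure wrapped in another
  // exception is still detected. The identity set guards against cycles.
  static boolean isRebalanceFailure(Throwable error) {
    Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
    for (Throwable t = error; t != null && seen.add(t); t = t.getCause()) {
      if (t instanceof CommitFailedException
          || t instanceof InvalidProducerEpochException) {
        return true;
      }
    }
    return false;
  }

  // Re-balance failures become retriable with the original error kept as
  // the cause; everything else is returned unchanged for rethrowing.
  static RuntimeException translate(RuntimeException error) {
    if (isRebalanceFailure(error)) {
      return new RetriableException("Transient failure during re-balance", error);
    }
    return error;
  }
}
```

A caller would abort the transaction first and then `throw translate(error)`, which keeps the original exception reachable through getCause() for debugging.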

Guarantees preserved

Source offsets are still committed atomically with the control-topic DataWritten events
via the producer transaction; this path is unchanged.
No data loss: aborted transactions never advance source offsets, so re-delivered or
re-fetched records flow through the next successful commit.
No double-commit to Iceberg: the Coordinator's per-snapshot offset properties
(kafka.connect.offsets..) still gate replays on Coordinator restart.
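
As a rough illustration of how per-snapshot offsets can gate replays, the sketch below drops buffered control events that fall below the offset recorded with the last committed snapshot. It is not the Coordinator's actual code: ReplayGateSketch, ControlEvent and eventsToCommit are hypothetical names, and it assumes the stored value per control-topic partition is the next offset still to be committed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ReplayGateSketch {
  private ReplayGateSketch() {}

  // Hypothetical minimal view of a control-topic event: where it was read from.
  static final class ControlEvent {
    final int partition;
    final long offset;

    ControlEvent(int partition, long offset) {
      this.partition = partition;
      this.offset = offset;
    }
  }

  // Keeps only events at or beyond the offset recorded for their partition.
  // Partitions with no recorded offset are treated as never committed.
  static List<ControlEvent> eventsToCommit(
      List<ControlEvent> buffered, Map<Integer, Long> committedOffsets) {
    List<ControlEvent> result = new ArrayList<>();
    for (ControlEvent event : buffered) {
      Long minOffset = committedOffsets.get(event.partition);
      if (minOffset == null || event.offset >= minOffset) {
        result.add(event);
      }
    }
    return result;
  }
}
```

Under that assumption, an event replayed after a restart is filtered out if an earlier snapshot already covered its offset, so the same data files are not committed twice.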

Test plan

New ChannelTest covers:

First-attempt success — no retry, no abort
Empty source offsets path — sendOffsetsToTransaction is skipped
CommitFailedException → translated to RetriableException with cause preserved
InvalidProducerEpochException → translated to RetriableException
sendOffsetsToTransaction failure (not commitTransaction) is also translated
CommitFailedException wrapped in another KafkaException is detected via cause chain
Non-re-balance KafkaException is rethrown as-is
beginTransaction failure is rethrown without spurious abortTransaction
abortTransaction failure is swallowed and does not mask the RetriableException
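
The last two cases depend on where the try block starts and on how an abort failure is handled. The sketch below shows one way to get that behaviour, with a scripted fake in place of a mocked producer. All names here (ChannelSendSketch, TxProducer, ScriptedProducer, RebalanceException) are hypothetical, and sendAndCommit() collapses the send, sendOffsetsToTransaction and commitTransaction steps into one call for brevity.

```java
final class ChannelSendSketch {
  private ChannelSendSketch() {}

  // Stand-in for CommitFailedException / InvalidProducerEpochException.
  static class RebalanceException extends RuntimeException {
    RebalanceException(String message) {
      super(message);
    }
  }

  // Stand-in for Connect's RetriableException.
  static class RetriableException extends RuntimeException {
    RetriableException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Hypothetical narrow view of a transactional producer.
  interface TxProducer {
    void beginTransaction();

    void sendAndCommit();

    void abortTransaction();
  }

  static void send(TxProducer producer) {
    // Outside the try block: if this fails, no transaction is open,
    // so there is nothing to abort.
    producer.beginTransaction();
    try {
      producer.sendAndCommit();
    } catch (RuntimeException error) {
      try {
        producer.abortTransaction();
      } catch (RuntimeException abortError) {
        // Swallowed on purpose so it cannot mask the original failure.
      }
      if (error instanceof RebalanceException) {
        throw new RetriableException("Transient failure during re-balance", error);
      }
      throw error;
    }
  }

  // Scripted fake: each step throws the configured error, if any.
  static final class ScriptedProducer implements TxProducer {
    RuntimeException beginError;
    RuntimeException commitError;
    RuntimeException abortError;
    int abortCalls;

    @Override
    public void beginTransaction() {
      if (beginError != null) {
        throw beginError;
      }
    }

    @Override
    public void sendAndCommit() {
      if (commitError != null) {
        throw commitError;
      }
    }

    @Override
    public void abortTransaction() {
      abortCalls++;
      if (abortError != null) {
        throw abortError;
      }
    }
  }
}
```

A test can set commitError and abortError together and check that a RetriableException with the commit error as its cause is what reaches the caller, or set beginError and check that abortCalls stays at zero.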

Existing WorkerTest, CoordinatorTest, CommitterImplTest, IcebergSinkTaskTest and the
rest of the kafka-connect suite pass unchanged.


kumarpritam863 · 2026-05-16T14:53:38Z

@danielcweeks can you please review this.

Pritam Kumar Mishra and others added 26 commits August 9, 2025 11:27

added metadat and data path in case of dynamic routing

c0a2665

spotless

67619ec

Revert "spotless"

6b15ae4

This reverts commit 67619ec.

Revert "added metadat and data path in case of dynamic routing"

8398e4c

This reverts commit c0a2665.

Merge branch 'apache:main' into main

fbf52a9

Merge branch 'apache:main' into main

c92ec66

Merge branch 'apache:main' into main

9392a6d

Merge branch 'apache:main' into main

ecd8b55

Merge branch 'apache:main' into main

5e76e04

Merge branch 'apache:main' into main

a1ec7e6

Merge branch 'apache:main' into main

4eaf70b

Merge branch 'apache:main' into main

1508513

Merge branch 'apache:main' into main

e5908c8

Merge branch 'apache:main' into main

cbefe9a

Merge branch 'apache:main' into main

57d4667

Merge branch 'apache:main' into main

ee658ea

Merge branch 'apache:main' into main

7c5976d

Merge branch 'apache:main' into main

8a9654f

Merge branch 'apache:main' into main

888e659

Merge branch 'apache:main' into main

1daa4dd

Merge branch 'apache:main' into main

8b7ec63

Merge branch 'apache:main' into main

92c7e89

Merge branch 'apache:main' into main

fe9e7f0

Merge branch 'apache:main' into main

aaa8f89

Merge branch 'apache:main' into main

9d64ec9

added provision to tolearte transient commit failed exception during …

292909c

…rebalance which requires manual intervention to restart the failed tasks

github-actions Bot added the KAFKACONNECT label May 16, 2026

kumarpritam863 changed the title ~~Feature/tolerate commit failed and producer epoch exp during rebalance~~ Kafka Connect: Tolerate CommitFailedException and InvalidProducerEpochException during rebalance May 16, 2026

Pritam Kumar Mishra added 2 commits May 16, 2026 20:18

added list of exception on which to retry configurable

fec8aaa

added list of exception on which to retry configurable

f7e62c8

kumarpritam863 wants to merge 28 commits into apache:main from kumarpritam863:feature/tolerate_commit_failed_and_producer_epoch_exp_during_rebalance
