[FLINK-36701][cdc-runtime] Add steps to get and emit schemaManager's latest evolvedSchema when SinkDataWriterOperator handles FlushEvent #3802

Jzjsnow · 2024-12-13T11:48:10Z

Currently, if the first events the pipeline handles right after a failover are a schema change event (e.g. AddColumnEvent) followed by a DataChangeEvent, the job may fail again because the sink applies the same schema change more than once.

The issue is revealed as follows:

I added steps to fetch and emit the Schema Manager's latest evolvedSchema when SinkDataWriterOperator handles a FlushEvent. If the operator has no locally cached schema when it handles the FlushEvent, it requests the latest evolved schema from the Schema Manager. At this point the evolved schema is still identical to the sink table's schema, because the master has not yet applied the schema change event to the evolved schema.
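
As a rough illustration of the behavior described above (not the PR's actual code), the flush handling can be modeled as below. Every name here (`FlushSchemaSketch`, `SchemaLookup`, `SinkWriterModel`, the string-encoded events) is hypothetical and only stands in for the real operator, schema registry client, and event classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Simplified, hypothetical model of the flush handling described above. */
public class FlushSchemaSketch {

    /** Stand-in for a schema registry lookup; not the real Flink CDC client API. */
    interface SchemaLookup {
        Optional<String> latestEvolvedSchema(String tableId);
    }

    /** Stand-in for the sink writer operator; events are modeled as plain strings. */
    static class SinkWriterModel {
        private final SchemaLookup lookup;
        private final Map<String, String> cachedSchemas = new HashMap<>();
        final List<String> emitted = new ArrayList<>();

        SinkWriterModel(SchemaLookup lookup) {
            this.lookup = lookup;
        }

        void handleFlushEvent(String tableId) {
            // After a failover the local cache is empty, so ask the registry
            // for the latest evolved schema and emit it before flushing.
            if (!cachedSchemas.containsKey(tableId)) {
                lookup.latestEvolvedSchema(tableId).ifPresent(schema -> {
                    cachedSchemas.put(tableId, schema);
                    emitted.add("EmitSchema(" + tableId + ": " + schema + ")");
                });
            }
            emitted.add("Flush(" + tableId + ")");
        }
    }

    public static void main(String[] args) {
        SinkWriterModel writer =
                new SinkWriterModel(tableId -> Optional.of("id INT, name STRING"));
        writer.handleFlushEvent("db.t1"); // cache miss: schema is fetched and emitted
        writer.handleFlushEvent("db.t1"); // cache hit: only the flush is recorded
        System.out.println(writer.emitted);
    }
}
```

In this model the schema is emitted only once per table, on the first flush after the cache was lost, so a later schema change event is applied on top of a schema the sink already knows about.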

Now the new process is changed as follows:

...untime/src/main/java/org/apache/flink/cdc/runtime/operators/sink/DataSinkWriterOperator.java

yuxiqian · 2024-12-17T02:31:49Z

Thanks for @Jzjsnow's clear and detailed diagram!

Just noticed that we still lack failover tests for pipeline jobs. Could we add some recovering tests like MySqlSourceITCase#testMySqlParallelSource and trigger TM/JM Failures in different phases? flink-cdc-composer unit tests might be a place for it.

Jzjsnow · 2025-01-02T05:54:59Z

@yuxiqian Thanks for the suggestion. I have added tests to the flink-cdc-runtime unit tests that cover DataSinkOperator's handling of schema change events, which was previously untested. Among them, DataSinkOperatorWithSchemaEvolveTest#testSchemaChangeEventAfterFailover exercises the underlying schema change process after failure recovery, which applies to restarts after JM/TM failures.

Jzjsnow · 2025-01-02T07:01:38Z

It looks like we have once again hit an OceanBaseMySQLModeITCase test failure similar to the one in #3712. Any ideas on how to fix this?

yuxiqian · 2025-01-02T08:26:14Z

It's worrying to see similar OceanBase test case failures here, but they seem unrelated to this PR. I will investigate this.

yuxiqian · 2025-01-03T06:15:20Z

@Jzjsnow Should be fixed now... please rebase to master branch.

Jzjsnow · 2025-01-03T07:13:03Z

@yuxiqian Thanks for the quick fix, now we've rebased to the master branch.

Jzjsnow · 2025-01-03T09:23:49Z

The Source E2E Tests#OceanBaseE2eITCase test did not pass in this CI round, but the failure is not related to this PR. The 1-minute timeout passed to checkResultWithTimeout (timeout=60000L) seems to be too short: the row with id=111, which should have been deleted from the sink table, had not been synchronized yet when the check ran.

Here is the error log:

2025-01-03T07:38:53.8553280Z 1254259 [main] ERROR org.apache.flink.cdc.connectors.tests.OceanBaseE2eITCase -
2025-01-03T07:38:53.8553917Z --------------------------------------------------------------------------------
2025-01-03T07:38:53.8554680Z Test testOceanBaseCDCflinkVersion: 1.19.1 failed with:
2025-01-03T07:38:53.8555684Z array lengths differed, expected.length=10 actual.length=11; arrays first differed at element [10]; expected: but was:<111,scooter,Big 2-wheel scooter ,5.18,null,null>
2025-01-03T07:38:53.8556858Z at org.junit.internal.ComparisonCriteria.arrayEquals(ComparisonCriteria.java:89)
2025-01-03T07:38:53.8557558Z at org.junit.internal.ComparisonCriteria.arrayEquals(ComparisonCriteria.java:28)
2025-01-03T07:38:53.8558017Z at org.junit.Assert.internalArrayEquals(Assert.java:534)
2025-01-03T07:38:53.8558492Z at org.junit.Assert.assertArrayEquals(Assert.java:285)
2025-01-03T07:38:53.8558895Z at org.junit.Assert.assertArrayEquals(Assert.java:300)
2025-01-03T07:38:53.8559341Z at org.apache.flink.cdc.common.test.utils.JdbcProxy.checkResult(JdbcProxy.java:70)
2025-01-03T07:38:53.8559996Z at org.apache.flink.cdc.common.test.utils.JdbcProxy.checkResultWithTimeout(JdbcProxy.java:93)
2025-01-03T07:38:53.8560819Z at org.apache.flink.cdc.connectors.tests.OceanBaseE2eITCase.testOceanBaseCDC(OceanBaseE2eITCase.java:179)
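
To illustrate why the timeout value matters for this kind of check, here is a minimal sketch of a poll-until-deadline helper. It is not the real JdbcProxy#checkResultWithTimeout; `PollUntilSketch`, `awaitEquals`, and the simulated row-count query are assumptions made for the example.

```java
import java.util.function.Supplier;

/** Hypothetical polling helper; not the actual JdbcProxy implementation. */
public class PollUntilSketch {

    /**
     * Re-evaluates {@code actual} until it equals {@code expected} or the
     * timeout elapses. Returns true on a match, false on timeout or interrupt.
     */
    static <T> boolean awaitEquals(
            T expected, Supplier<T> actual, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (expected.equals(actual.get())) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // the sink never caught up within the budget
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Simulated sink: the delete becomes visible on the third query,
        // so the row count drops from 11 to 10 only after some delay.
        int[] calls = {0};
        Supplier<Integer> rowCount = () -> ++calls[0] < 3 ? 11 : 10;

        boolean synced = awaitEquals(10, rowCount, 1_000L, 10L);
        System.out.println("synced=" + synced + " after " + calls[0] + " queries");
    }
}
```

With a helper like this, a check that fails only because replication is slow passes once the timeout budget exceeds the replication lag, which is the reasoning behind extending the limit for the OceanBase case.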

yuxiqian

Thanks for @Jzjsnow's great work, left some minor comments.

yuxiqian · 2025-01-07T07:54:09Z

flink-cdc-common/src/main/java/org/apache/flink/cdc/common/event/FlushEvent.java


    /** Which subTask ID this FlushEvent was initiated from. */
    private final int sourceSubTaskId;

+    /** Flag indicating whether the FlushEvent is sent before a create table event. */
+    private final Boolean isForCreateTableEvent;


I'm wondering whether we need to track FlushEvent more specifically, i.e. which type of schema change event caused it.

If so, we could store a SchemaChangeEventType enum value for extensibility.

If that turns out to be unnecessary, we can at least use the primitive boolean to avoid [un]boxing.
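
A minimal sketch of the enum-based alternative suggested here, under stated assumptions: `FlushEventSketch`, `FlushEventModel`, and the `CauseType` enum are hypothetical stand-ins (the project has its own SchemaChangeEventType, whose exact values are not reproduced here), and only the `sourceSubTaskId` field comes from the diff above.

```java
import java.util.Objects;

/** Hypothetical model of a flush event that records what caused it. */
public class FlushEventSketch {

    /** Illustrative subset of causes; an assumption, not the project's enum. */
    enum CauseType {
        CREATE_TABLE,
        ADD_COLUMN,
        DROP_COLUMN
    }

    static final class FlushEventModel {
        /** Which subTask ID this event was initiated from. */
        private final int sourceSubTaskId;

        /** Type of schema change event that triggered this flush. */
        private final CauseType cause;

        FlushEventModel(int sourceSubTaskId, CauseType cause) {
            this.sourceSubTaskId = sourceSubTaskId;
            this.cause = Objects.requireNonNull(cause, "cause");
        }

        int getSourceSubTaskId() {
            return sourceSubTaskId;
        }

        CauseType getCause() {
            return cause;
        }

        /** Derived primitive flag, so callers never box or unbox a Boolean. */
        boolean isForCreateTableEvent() {
            return cause == CauseType.CREATE_TABLE;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof FlushEventModel)) {
                return false;
            }
            FlushEventModel that = (FlushEventModel) o;
            return sourceSubTaskId == that.sourceSubTaskId && cause == that.cause;
        }

        @Override
        public int hashCode() {
            return Objects.hash(sourceSubTaskId, cause);
        }
    }

    public static void main(String[] args) {
        FlushEventModel event = new FlushEventModel(0, CauseType.ADD_COLUMN);
        System.out.println(event.getCause() + " " + event.isForCreateTableEvent());
    }
}
```

Storing the cause keeps the create-table check available as a derived primitive while leaving room to branch on other schema change types later.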

yuxiqian · 2025-01-07T07:59:25Z

...source-e2e-tests/src/test/java/org/apache/flink/cdc/connectors/tests/OceanBaseE2eITCase.java

Maybe this change can be split into an individual commit to keep commit history accurate.

github-actions bot added common runtime labels Dec 13, 2024

yuxiqian reviewed Dec 16, 2024

github-actions bot added the docs, mongodb-cdc-connector, base, postgres-cdc-connector, build, mysql-cdc-connector, oracle-cdc-connector, dist, and kafka-pipeline-connector labels Dec 16, 2024

Jzjsnow force-pushed the master-Add_steps_to_get_and_emit_schemaManager's_latest_evolvedSchema_when_SinkDataWriterOperator_handles_FlushEvent branch from 20a00d4 to 705d763 Compare December 16, 2024 12:49

github-actions bot removed the docs, mongodb-cdc-connector, build, mysql-cdc-connector, base, oracle-cdc-connector, postgres-cdc-connector, dist, and kafka-pipeline-connector labels Dec 16, 2024

Jzjsnow force-pushed the master-Add_steps_to_get_and_emit_schemaManager's_latest_evolvedSchema_when_SinkDataWriterOperator_handles_FlushEvent branch 2 times, most recently from 596fa99 to 468feec Compare January 2, 2025 01:41

yuxiqian mentioned this pull request Jan 3, 2025

[hotfix][tests] Fix unstable OceanBaseMySQLModelITCase #3831

Merged

[FLINK-36701][cdc-runtime] Add steps to get and emit schemaManager's latest evolvedSchema when SinkDataWriterOperator handles FlushEvent.

84bf4d0

Jzjsnow force-pushed the master-Add_steps_to_get_and_emit_schemaManager's_latest_evolvedSchema_when_SinkDataWriterOperator_handles_FlushEvent branch from 468feec to 84bf4d0 Compare January 3, 2025 07:02

github-actions bot added the e2e-tests label Jan 7, 2025

yuxiqian reviewed Jan 7, 2025

[hotfix][tests] Extend the timeout limit for checking results in OceanBaseE2eITCase.

c7de55b

Jzjsnow force-pushed the master-Add_steps_to_get_and_emit_schemaManager's_latest_evolvedSchema_when_SinkDataWriterOperator_handles_FlushEvent branch from e0a3e30 to c7de55b Compare January 7, 2025 11:55

jzjsnow added 2 commits January 7, 2025 20:05

fixup! [FLINK-36701][cdc-runtime] Add steps to get and emit schemaManager's latest evolvedSchema when SinkDataWriterOperator handles FlushEvent.

c428f4b

fixup! [FLINK-36701][cdc-runtime] Add steps to get and emit schemaManager's latest evolvedSchema when SinkDataWriterOperator handles FlushEvent.

443708b

github-actions bot added the paimon-pipeline-connector label Jan 7, 2025
