destinations v2: snowflake: single-threaded T+D per stream by edgao · Pull Request #29878 · airbytehq/airbyte

Edward Gao (edgao) · 2023-08-25T22:36:05Z

Add locks to both T+D and raw table commits for snowflake/bigquery:

- any number of `copyIntoTableFromStage` calls can run concurrently
- `typeAndDedupe` can only run one at a time

Also re-enables mid-sync T+D execution for both snowflake and bigquery GCS.
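For readers who want a concrete picture of the rule above, here is a minimal sketch. It assumes a `ReadWriteLock` is used, with raw table commits sharing the read lock and T+D taking the write lock; the PR text does not say which lock type the connector actually uses. The class name `RawTableLockSketch` and the helper `canStartTypeAndDedupeNow` are hypothetical.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only: names and structure are assumptions,
// not the connector's real implementation.
final class RawTableLockSketch {

  // Raw table commits share the read lock; T+D takes the exclusive write lock.
  private final ReadWriteLock rawTableLock = new ReentrantReadWriteLock();

  /** Any number of threads may be inside this method at the same time. */
  void copyIntoTableFromStage(Runnable commit) {
    rawTableLock.readLock().lock();
    try {
      commit.run();
    } finally {
      rawTableLock.readLock().unlock();
    }
  }

  /** Waits for in-flight raw commits to finish and blocks new ones while it runs. */
  void typeAndDedupe(Runnable typingAndDeduping) {
    rawTableLock.writeLock().lock();
    try {
      typingAndDeduping.run();
    } finally {
      rawTableLock.writeLock().unlock();
    }
  }

  /** Hypothetical probe: true if no raw commit (or other T+D run) holds the lock right now. */
  boolean canStartTypeAndDedupeNow() {
    if (!rawTableLock.writeLock().tryLock()) {
      return false;
    }
    rawTableLock.writeLock().unlock();
    return true;
  }
}
```

While a commit is inside `copyIntoTableFromStage`, the probe returns false; once the commit finishes it returns true again. That is the "raw table writes pause, then T+D runs" behavior in miniature.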

Async folks - PTAL at GeneralStagingFunctions and let me know if it looks problematic for destination-bigquery.

Example logs:

1. One thread tries to start T+D (the `Waiting for raw table writes to pause` log message at the top).
2. In the meantime, another thread is flushing to the raw tables, so nothing happens.
3. The flush finishes, so the first thread is able to start running T+D (`Attempting typing and deduping`; note the timestamp is a few seconds after the first message).
4. The second thread also tries to trigger T+D, but skips it because T+D is already running (`Another thread is already trying to run typing and deduping`).
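The sequence above can be reproduced with a small sketch. It assumes two locks per stream: a plain `ReentrantLock` that decides who runs T+D, and a `ReadWriteLock` that pauses raw table writes. The class name, the `flushToRawTable` and `tryTypeAndDedupe` methods, and the in-memory `log` list are hypothetical. The log strings are the fragments quoted above, not necessarily the connector's full messages.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only; not the connector's real code.
final class MidSyncTypeAndDedupeSketch {

  private final ReadWriteLock rawTableLock = new ReentrantReadWriteLock();
  private final Lock tdLock = new ReentrantLock();
  private final List<String> log = Collections.synchronizedList(new ArrayList<>());

  /** Raw table flushes share the read lock, so they do not block each other. */
  void flushToRawTable(Runnable flush) {
    rawTableLock.readLock().lock();
    try {
      flush.run();
    } finally {
      rawTableLock.readLock().unlock();
    }
  }

  /** Returns false (and does nothing) if another thread already owns the T+D run. */
  boolean tryTypeAndDedupe(Runnable typingAndDeduping) {
    if (!tdLock.tryLock()) {
      log.add("Another thread is already trying to run typing and deduping");
      return false;
    }
    try {
      log.add("Waiting for raw table writes to pause");
      rawTableLock.writeLock().lock(); // blocks until in-flight flushes finish
      try {
        log.add("Attempting typing and deduping");
        typingAndDeduping.run();
      } finally {
        rawTableLock.writeLock().unlock();
      }
      return true;
    } finally {
      tdLock.unlock();
    }
  }

  /** Starts a second T+D attempt while the first is still running; returns the log. */
  static List<String> demo() {
    MidSyncTypeAndDedupeSketch sketch = new MidSyncTypeAndDedupeSketch();
    sketch.tryTypeAndDedupe(() -> {
      Thread second = new Thread(() -> sketch.tryTypeAndDedupe(() -> {}));
      second.start();
      try {
        second.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    return new ArrayList<>(sketch.log);
  }
}
```

`demo()` yields the three messages in the order described: waiting, attempting, then the skip from the second thread.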

Evan Tahler (evantahler) · 2023-08-28T15:55:42Z

-      // we should skip it.
-      LOGGER.warn("Skipping typing and deduping for {}.{} because we could not set up the tables for this stream.", originalNamespace, originalName);
-      return;
+    synchronized(tdLocks.get(streamConfig.id())) {


Java question about synchronized(): will all threads eventually run this block after waiting their turn, or will any thread which can't acquire the lock give up and move on? As we will T&D once more at the end of every sync, if there is a race condition, it seems like we should opt for the faster choice, and the later thread should just skip this step.

Edward Gao (edgao) · 2023-08-28

good point. `synchronized` means they'll all wait their turn. probably need to use a stronger concurrency object here.

we'll still need a way to force T+D to run at the end of a sync (i.e. using the wait-my-turn behavior). There's probably something fancy to do there, e.g. track the last time we started a T+D run vs the last time we committed raw data and skip if there's no new data... but that can be a future improvement
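A rough sketch of that possible future improvement, assuming a simple logical clock instead of wall-clock timestamps so the comparison stays deterministic. The class and all method names are hypothetical and are not part of this PR.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: tracks "last raw commit" vs "last T+D start"
// with a logical clock. Not the connector's real code.
final class NewDataTrackerSketch {

  private final AtomicLong clock = new AtomicLong();
  private final AtomicLong lastRawCommit = new AtomicLong(0); // 0 means "never"
  private final AtomicLong lastTdStart = new AtomicLong(0);

  /** Call after raw records have been committed. */
  void onRawCommit() {
    // max() keeps the value monotonic even if two commits race here.
    lastRawCommit.accumulateAndGet(clock.incrementAndGet(), Math::max);
  }

  /** Call right before a T+D run starts. */
  void onTypeAndDedupeStart() {
    lastTdStart.accumulateAndGet(clock.incrementAndGet(), Math::max);
  }

  /** True only if raw data was committed after the most recent T+D run started. */
  boolean hasNewDataSinceLastTypeAndDedupe() {
    return lastRawCommit.get() > lastTdStart.get();
  }
}
```

A caller would check `hasNewDataSinceLastTypeAndDedupe()` before running T+D and skip the run when it returns false.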

Evan Tahler (evantahler) · 2023-08-28

> we'll still need a way to force T+D to run at the end of a sync

for some reason I thought we hooked into the shutdown behavior for that 🤷

Edward Gao (edgao) · 2023-08-28

> shutdown behavior

we do. I think I was just mistaken about this - was worried about a case where we have an in-flight T+D run while the shutdown hook is running, which I don't think is possible.

Edward Gao (edgao) · 2023-08-28

added an explicit mustRun param anyway, because I was having a hard time writing a comment explaining why it isn't necessary 🤷. And if anyone ever gets clever and runs T+D in a separate thread, then we probably need it anyway.
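To make the two behaviors discussed in this thread concrete, here is a minimal sketch of what a `mustRun` flag could look like: wait-my-turn when the run is mandatory, skip when it is not. It assumes a single `ReentrantLock` per stream. The class name, method signature, and `demo` helper are hypothetical; only the `mustRun` name comes from the comment above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only; not the connector's real code.
final class MustRunSketch {

  private final Lock tdLock = new ReentrantLock();

  /** Returns true if T+D ran, false if it was skipped because another run was in flight. */
  boolean typeAndDedupe(boolean mustRun, Runnable typingAndDeduping) {
    if (mustRun) {
      tdLock.lock(); // end-of-sync: wait for any in-flight run, then run
    } else if (!tdLock.tryLock()) {
      return false; // mid-sync: someone else is already running, skip
    }
    try {
      typingAndDeduping.run();
      return true;
    } finally {
      tdLock.unlock();
    }
  }

  /** Returns {result of an optional run during contention, result of a mandatory run}. */
  static boolean[] demo() {
    MustRunSketch sketch = new MustRunSketch();
    CountDownLatch held = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    boolean[] results = new boolean[2];

    // A long-running T+D that holds the lock until we release it.
    Thread holder = new Thread(() -> sketch.typeAndDedupe(true, () -> {
      held.countDown();
      awaitQuietly(release);
    }));
    holder.start();
    awaitQuietly(held);

    results[0] = sketch.typeAndDedupe(false, () -> {}); // skipped
    Thread finalRun = new Thread(() -> results[1] = sketch.typeAndDedupe(true, () -> {}));
    finalRun.start();
    release.countDown();
    joinQuietly(holder);
    joinQuietly(finalRun);
    return results;
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  private static void joinQuietly(Thread thread) {
    try {
      thread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

With plain `synchronized`, every caller would behave like the `mustRun = true` branch. The `tryLock()` branch is what lets a mid-sync caller give up and move on.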

Joe Bell (jbfbell)

Edward Gao (@edgao) do you have a sense of the performance impact for this vs. only T&D'ing at the end?

Edward Gao (edgao) · 2023-09-01T20:47:18Z

destination-snowflake will slow down, and bigquery will remain equally slow. This PR makes T+D completely block raw table writes, whereas previously you could write new raw records while T+D runs (which is wrong, but faster 😛 ). Async snowflake was making use of that, but bigquery at least was already single-threaded anyway.

I'm inclined to close this PR, unless we want to keep the early-sync T+D run.

Evan Tahler (evantahler) · 2023-09-01T21:58:17Z

> I'm inclined to close this PR, unless we want to keep the early-sync T+D run.

I would like to keep the early T&D run... maybe we do it only for the first batch... but I don't want to write off "see your data mid-sync" just yet

Edward Gao (edgao) · 2023-09-07T00:06:34Z

oh, git massively screwed up that merge 🤦. `typeAndDedupeTask` needs to have the lock logic, and `typeAndDedupe(String, String)` needs to call `typeAndDedupeTask`.

github-actions · 2023-09-12T16:37:50Z

destination-bigquery test report (commit `0d23374753`) - ✅

⏲️ Total pipeline duration: 04mn10s

Step	Result
Java Connector Unit Tests	✅
Build connector tar	✅
Build destination-bigquery docker image for platform linux/x86_64	✅
Java Connector Integration Tests	✅
Validate airbyte-integrations/connectors/destination-bigquery/metadata.yaml	✅
Connector version semver check	✅
Connector version increment check	✅
QA checks	✅

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=destination-bigquery test

github-actions · 2023-09-12T16:52:51Z

destination-snowflake test report (commit `0d23374753`) - ✅

⏲️ Total pipeline duration: 14mn50s

Step	Result
Java Connector Unit Tests	✅
Build connector tar	✅
Build destination-snowflake docker image for platform linux/x86_64	✅
Java Connector Integration Tests	✅
Validate airbyte-integrations/connectors/destination-snowflake/metadata.yaml	✅
Connector version semver check	✅
Connector version increment check	✅
QA checks	✅

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=destination-snowflake test

github-actions · 2023-09-12T17:14:24Z

destination-snowflake test report (commit `c359ce6bbe`) - ✅

⏲️ Total pipeline duration: 15mn16s

Step	Result
Java Connector Unit Tests	✅
Build connector tar	✅
Build destination-snowflake docker image for platform linux/x86_64	✅
Java Connector Integration Tests	✅
Validate airbyte-integrations/connectors/destination-snowflake/metadata.yaml	✅
Connector version semver check	✅
Connector version increment check	✅
QA checks	✅

github-actions · 2023-09-12T17:28:43Z

destination-bigquery test report (commit `c359ce6bbe`) - ✅

⏲️ Total pipeline duration: 14mn08s

Step	Result
Java Connector Unit Tests	✅
Build connector tar	✅
Build destination-bigquery docker image for platform linux/x86_64	✅
Java Connector Integration Tests	✅
Validate airbyte-integrations/connectors/destination-bigquery/metadata.yaml	✅
Connector version semver check	✅
Connector version increment check	✅
QA checks	✅

Edward Gao (edgao) and others added 24 commits August 23, 2023 16:23

copy spec changes

d9aa4d1

logistics

4d10768

remove normalization from build

5e0f08f

remove unnecessary change

e160ad6

This code isn't actually used by destination-bigquery. It's only used by destinations that write actual files (e.g. s3/gcs).

inject param to stagingcsvgenerator+stagingconsumerfactory

dd0a8db

disable dv2 for redshift; enable it for snowflake. bigquery isn't on this code path because bigquery doesn't use CSV.

inject to GcsUtils

0e1f518

hardcode snowflake

b5ae4d3

hardcode in bigquery

0c8248d

Merge branch 'master' into edgao/dv2/release

1dbd194

derp, fix default behavior

5a978ed

derp

050c1d9

derp

46e1e67

maybe make bigquery tests pass

0f14339

fix snowflake tests?

7c4cd0c

Merge branch 'master' into edgao/dv2/release

2402243

fix snowflake unit tests

dc619e0

more snowflake test fix

a7f92d1

Automated Commit - Format and Process Resources Changes

86f4ffc

disable legacy DATs on snowflake + bigquery

b244993

Update upgrade copy

9c3bcac

Merge branch 'master' into edgao/dv2/release

cb5e32b

also disable these tests 🤷

be9b921

one more

155e391

prevent concurrent T+D

05b58a3

Octavia Squidington III (octavia-squidington-iii) added area/connectors Connector related issues area/documentation Improvements or additions to documentation labels Aug 25, 2023

Edward Gao (edgao) changed the base branch from master to edgao/dv2/release August 25, 2023 22:36

Edward Gao (edgao) marked this pull request as ready for review August 25, 2023 22:38

Evan Tahler (evantahler) reviewed Aug 28, 2023

add better locks

c31f1ff

Joe Bell (jbfbell) reviewed Sep 1, 2023

Merge branch 'master' into edgao/dv2/snowflake/locks

7d0f922

botched merge

e214fec

Joe Bell (jbfbell) reviewed Sep 7, 2023

Comment thread .../main/java/io/airbyte/integrations/base/destination/typing_deduping/DefaultTyperDeduper.java

Comment thread ...c/main/java/io/airbyte/integrations/destination/bigquery/BigQueryStagingConsumerFactory.java Outdated

Joe Bell (jbfbell) suggested changes Sep 7, 2023

Comment thread ...c/main/java/io/airbyte/integrations/destination/bigquery/BigQueryStagingConsumerFactory.java Outdated

Edward Gao (edgao) added 2 commits September 7, 2023 16:44

refactor

10b8356

prevent incremental T+D for bq staging

85fc426

Joe Bell (jbfbell) approved these changes Sep 8, 2023

Merge branch 'master' into edgao/dv2/snowflake/locks

a032c35

Edward Gao (edgao) enabled auto-merge (squash) September 12, 2023 15:28

spotbugs

cf229de

why no autoformat?

0d23374

Octavia Squidington III (octavia-squidington-iii) added the connectors/source/scaffold-java-jdbc label Sep 12, 2023

Edward Gao (edgao) and others added 2 commits September 12, 2023 16:56

Automated Commit - Format and Process Resources Changes

3fb9769

Merge branch 'master' into edgao/dv2/snowflake/locks

c359ce6

Edward Gao (edgao) merged commit e0ce2ac into master Sep 12, 2023

Edward Gao (edgao) deleted the edgao/dv2/snowflake/locks branch September 12, 2023 17:28

Edward Gao (edgao) merged 48 commits into master from edgao/dv2/snowflake/locks
