destinations v2: snowflake: single-threaded T+D per stream #29878
This code isn't actually used by destination-bigquery. It's only used by destinations that write actual files (e.g. s3/gcs).
disable dv2 for redshift; enable it for snowflake
bigquery isn't on this code path because bigquery doesn't use CSV
// we should skip it.
LOGGER.warn("Skipping typing and deduping for {}.{} because we could not set up the tables for this stream.", originalNamespace, originalName);
return;
synchronized(tdLocks.get(streamConfig.id())) {
Java question about synchronized(): Will all threads eventually run this block, but wait their turn, or, will any thread which can't acquire the lock give up and move on? As we will T&D once more at the end of every sync, if there is a race condition, it seems like we should opt for the faster choice, and the later thread should just skip this step.
good point. synchronized means they'll all wait their turn. probably need to use a stronger concurrency object here.
we'll still need a way to force T+D to run at the end of a sync (i.e. using the wait-my-turn behavior). There's probably something fancy to do there, e.g. track the last time we started a T+D run vs the last time we committed raw data and skip if there's no new data... but that can be a future improvement
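A minimal sketch of the "skip if there's no new data" idea described above. Everything here is hypothetical and not part of this PR: the class name TdFreshnessTracker and its methods are made up for illustration, and it compares a commit sequence number rather than wall-clock timestamps, so that two events in the same millisecond can't be confused.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper (not in the PR): tracks whether raw data was committed
// since the last T+D run started.
class TdFreshnessTracker {
  // Bumped once for every raw table commit.
  private final AtomicLong rawCommitSeq = new AtomicLong(0);
  // Value of rawCommitSeq observed when the most recent T+D run started.
  private final AtomicLong lastTdStartSeq = new AtomicLong(0);

  void recordRawCommit() {
    rawCommitSeq.incrementAndGet();
  }

  // Returns true (and marks a run as started) only if there is raw data
  // that no earlier T+D run has seen. Safe to call from several threads.
  boolean tryStartTd() {
    while (true) {
      long seen = lastTdStartSeq.get();
      long current = rawCommitSeq.get();
      if (current == seen) {
        return false; // nothing new since the last run started
      }
      if (lastTdStartSeq.compareAndSet(seen, current)) {
        return true;
      }
      // Another thread updated lastTdStartSeq concurrently; re-check.
    }
  }
}
```

A commit that lands while a T+D run is in flight bumps the sequence past the recorded value, so the next call to tryStartTd() returns true and that data still gets processed.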
> we'll still need a way to force T+D to run at the end of a sync
for some reason I thought we hooked into the shutdown behavior for that 🤷
> shutdown behavior
we do. I think I was just mistaken about this - was worried about a case where we have an in-flight T+D run while the shutdown hook is running, which I don't think is possible.
added an explicit mustRun param anyway, because I was having a hard time writing a comment explaining why it isn't necessary 🤷 and if anyone ever gets clever and runs T+D in a separate thread, then we probably need it anyway
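For readers following the synchronized vs. skip discussion, here is a rough sketch of how a mustRun flag could pick between the two behaviors. This is an illustration under assumptions, not the code in this PR: PerStreamTyperDeduper, the String stream id, and the Runnable task are all hypothetical stand-ins. Only the mustRun and tdLocks names come from the thread above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: one lock per stream, with an opt-in "wait my turn" mode.
class PerStreamTyperDeduper {
  private final ConcurrentHashMap<String, Lock> tdLocks = new ConcurrentHashMap<>();

  // Returns true if the task ran, false if it was skipped because another
  // thread was already running T+D for the same stream.
  boolean typeAndDedupe(String streamId, boolean mustRun, Runnable task) {
    Lock lock = tdLocks.computeIfAbsent(streamId, id -> new ReentrantLock());
    if (mustRun) {
      // End-of-sync behavior: block until the lock is free, like synchronized.
      lock.lock();
    } else if (!lock.tryLock()) {
      // Mid-sync behavior: give up immediately instead of queueing.
      return false;
    }
    try {
      task.run();
      return true;
    } finally {
      lock.unlock();
    }
  }
}
```

A mid-sync caller would pass mustRun = false and move on when the call returns false; the final run at the end of the sync would pass mustRun = true.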
Joe Bell (jbfbell)
left a comment
Edward Gao (@edgao) do you have a sense of the performance impact for this vs. only T&D'ing at the end?
destination-snowflake will slow down, and bigquery will remain equally slow. This PR makes T+D completely block raw table writes, whereas previously you could write new raw records while T+D runs (which is wrong, but faster 😛 ). Async snowflake was making use of that, but bigquery at least was already single-threaded anyway. I'm inclined to close this PR, unless we want to keep the early-sync T+D run.
I would like to keep the early T&D run... maybe we do it only the first batch... but I don't want to write off "see your data mid-sync" just yet
oh, git massively screwed up that merge 🤦 typeAndDedupeTask needs to have the lock logic, and typeAndDedupe(String, String) needs to call typeAndDedupeTask |
| Step | Result |
|---|---|
| Java Connector Unit Tests | ✅ |
| Build connector tar | ✅ |
| Build destination-bigquery docker image for platform linux/x86_64 | ✅ |
| Java Connector Integration Tests | ✅ |
| Validate airbyte-integrations/connectors/destination-bigquery/metadata.yaml | ✅ |
| Connector version semver check | ✅ |
| Connector version increment check | ✅ |
| QA checks | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=destination-bigquery test
| Step | Result |
|---|---|
| Java Connector Unit Tests | ✅ |
| Build connector tar | ✅ |
| Build destination-snowflake docker image for platform linux/x86_64 | ✅ |
| Java Connector Integration Tests | ✅ |
| Validate airbyte-integrations/connectors/destination-snowflake/metadata.yaml | ✅ |
| Connector version semver check | ✅ |
| Connector version increment check | ✅ |
| QA checks | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=destination-snowflake test
Closes #30048
Add locks to both T+D and raw table commits for snowflake/bigquery:
- copyIntoTableFromStage calls can run concurrently
- typeAndDedupe can only run one at a time

And re-enables mid-sync T+D execution for both snowflake and bigquery GCS.
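One way to get "raw commits run concurrently, T+D runs alone and pauses raw writes" is a read/write lock. The sketch below is an assumption about the shape of such a scheme, not a copy of GeneralStagingFunctions: RawTableGate and its method names are hypothetical, and the Runnable arguments stand in for the real copyIntoTableFromStage and typeAndDedupe calls.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: raw table commits share a read lock, T+D takes the write lock.
class RawTableGate {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  // Any number of raw table commits may hold the read lock at the same time.
  void commitRaw(Runnable copyIntoTable) {
    lock.readLock().lock();
    try {
      copyIntoTable.run();
    } finally {
      lock.readLock().unlock();
    }
  }

  // T+D waits for in-flight raw commits to finish, then blocks new ones
  // until it is done. Only one T+D run can hold the write lock.
  void typeAndDedupe(Runnable task) {
    lock.writeLock().lock();
    try {
      task.run();
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

In this sketch a second T+D caller queues behind the first. Skipping instead of queueing would need a tryLock() on the write lock or a separate per-stream lock.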
Async folks - PTAL at GeneralStagingFunctions and let me know if it looks problematic for destination-bigquery.
Example logs:
- "Waiting for raw table writes to pause" log message at the top
- "Attempting typing and deduping" (note the timestamp is a few seconds after the first message)
- "Another thread is already trying to run typing and deduping"