
Resubmission of openark#4 from downstream.
This PR introduces `--checksum-data`, an opt-in checksum verification that runs throughout the migration.
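For illustration, a hypothetical invocation could look like the following; the host, credentials, schema, table, and DDL are placeholders, and all flags except the new `--checksum-data` are existing `gh-ost` options:

```shell
gh-ost \
  --host=replica.example.com \
  --user=gh-ost \
  --password=... \
  --database=mydb \
  --table=mytable \
  --alter="ENGINE=InnoDB" \
  --checksum-data \
  --execute
```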
With `--checksum-data` enabled, each rowcopy (a range of rows copied from the original table to the ghost table) is followed by a checksum on the two tables for that range. Checksums are executed concurrently with rowcopy and are the exception to `gh-ost`'s single-thread model.

A checksum may well fail while the migration is running: since `gh-ost` works in an async design, where binlog entries are applied at some point in time after they're generated, it's quite possible that ongoing traffic will make some checksums fail. A failed range's checksum is retried and retried until successful.
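To make the mechanism concrete, here is a minimal sketch of per-range verification under stated assumptions: a MySQL connection, an integer primary key column `id`, and data columns `col1` and `col2`. The query shape (an order-independent `BIT_XOR` over per-row `MD5` digests, in the style of pt-table-checksum) and all function names are illustrative, not gh-ost's actual implementation:

```go
package checksum

import (
	"database/sql"
	"fmt"
	"time"
)

// rangeChecksum computes an order-independent checksum over rows with
// id in [minID, maxID]. Column names are assumptions for this sketch.
func rangeChecksum(db *sql.DB, table string, minID, maxID int64) (string, error) {
	// BIT_XOR over per-row MD5 digests: insensitive to row order, cheap to
	// compare between the original and ghost tables. The table name is
	// trusted input here; gh-ost builds queries from validated identifiers.
	query := fmt.Sprintf(`
		SELECT COUNT(*),
		       COALESCE(CONV(BIT_XOR(CAST(CONV(
		         SUBSTRING(MD5(CONCAT_WS('#', id, col1, col2)), 1, 16),
		       16, 10) AS UNSIGNED)), 10, 16), '0')
		FROM %s WHERE id BETWEEN ? AND ?`, table)
	var count int64
	var sum string
	if err := db.QueryRow(query, minID, maxID).Scan(&count, &sum); err != nil {
		return "", err
	}
	return fmt.Sprintf("%d:%s", count, sum), nil
}

// verifyRange compares the two tables for one copied range, retrying on
// mismatch: with binlog events applied asynchronously, a failure may be
// transient, so the range is rechecked until the checksums agree.
func verifyRange(db *sql.DB, origTable, ghostTable string, minID, maxID int64) error {
	for {
		orig, err := rangeChecksum(db, origTable, minID, maxID)
		if err != nil {
			return err
		}
		ghost, err := rangeChecksum(db, ghostTable, minID, maxID)
		if err != nil {
			return err
		}
		if orig == ghost {
			return nil // range verified
		}
		time.Sleep(time.Second) // back off, then retry this range
	}
}
```

The infinite retry mirrors the behavior described above: a mismatch is treated as transient until proven otherwise, and only ranges whose checksums eventually agree count as verified.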
When `--checksum-data` is enabled, cut-over does not complete if failed checksums are found. While the tables are locked in preparation for cut-over, a grace period is given so that the checksum evaluation can run to completion.

This is experimental.
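As a rough sketch of that gate, assuming a hypothetical `pendingFailures` counter fed by the concurrent checksum workers (nothing here is gh-ost's actual API):

```go
package cutover

import (
	"fmt"
	"time"
)

// waitForChecksums sketches the cut-over gate described above: with the
// tables already locked for writes, outstanding checksum failures are given
// a grace period to resolve before cut-over is allowed to complete.
func waitForChecksums(pendingFailures func() int, gracePeriod time.Duration) error {
	deadline := time.Now().Add(gracePeriod)
	for {
		n := pendingFailures()
		if n == 0 {
			return nil // every copied range verified; cut-over may proceed
		}
		if time.Now().After(deadline) {
			// Failed checksums remain: do not complete the cut-over.
			return fmt.Errorf("cut-over aborted: %d range checksums still failing", n)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```

Returning an error corresponds to "cut-over does not complete"; whether and how cut-over is then retried is left to the surrounding migration logic.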
Risk assessment: risky!
With the flag disabled (the default), behavior does not change and risk is low. With the flag enabled, the following happen (or can happen):
- More reads directly on the master server: these are the checksum queries, and they run on both the original table and the ghost table. It's worth noting that the row-copy operation runs a full scan on the original table anyhow, so the extra reads do not (should not) bring into memory data pages not already brought in by row-copy.
- Slower migration time due to the extra reads.
- Risk at time of cut-over. At this time I have no access to a busy production server, so I have not verified this, but the following scenario is possible: `gh-ost` begins cut-over and locks the table for writes; checksum evaluation must then run to completion within the grace period, extending the time the tables stay locked. To clarify, I haven't seen this happen, but I predict it might show up in production.
I'm presenting this PR upstream for visibility. It's an important change that further validates (or invalidates!) the correctness of migrated data, so it may be of interest. I'd suggest massive experimentation.