Convert column to destination charset in DML applications #1003
base: master
Conversation
Force-pushed from b649015 to 9a3eb5f
```diff
@@ -7,7 +7,7 @@ create table gh_ost_test (
   primary key(id)
 ) auto_increment=1;
 
-insert into gh_ost_test values (null, 'átesting');
+insert into gh_ost_test values (null, 'átesting', '', '');
```
Fixes `Testing: convert-utf8mb4..ERROR 1136 (21S01) at line 10: Column count doesn't match value count at row 1`.
The mismatch didn't seem intentional; I can revert if needed.
```sql
  primary key(id)
) auto_increment=1 charset latin1 collate latin1_swedish_ci;

insert into gh_ost_test values (null, char(189));
```
Bytes 189 and 190 were chosen specifically because they can be replaced with a utf8 character equivalent (the same byte prefixed with \xC2). Thus the checksum at the end of the test will pass.
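As a quick illustration (not part of the PR itself), a short Python check shows why these two bytes round-trip cleanly from latin1 to utf8:

```python
# Illustration: latin1 bytes 0xBD (189) and 0xBE (190) map to two-byte
# utf8 sequences that are simply the original byte prefixed with 0xC2,
# so the decoded text (and its checksum) survives a latin1 -> utf8mb4
# migration unchanged.
for b in (189, 190):
    ch = bytes([b]).decode("latin1")   # interpret the byte as latin1
    utf8 = ch.encode("utf-8")          # re-encode as utf8
    print(b, hex(b), utf8.hex())
# 189 0xbd c2bd
# 190 0xbe c2be
```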
|
@jbielick I'm curious to see how the tests run. Would you care to push the same branch (and create a similar PR) in https://github.com/openark/gh-ost ? (If not, I can do the same on your behalf; but you're the author, so it's best if you do it.) Notwithstanding #290, which I completely missed as I was phasing out, I'm surprised this issue only shows up now.
|
@shlomi-noach me, too. I got them running locally with mysql 5.7.25 and a couple of changes to the Dockerfile, and they all passed. Looking forward to a CI run. Rebased and opened at:
|
Thank you!
This test assumes a latin1-encoded table whose content contains bytes in the \x80-\xFF range, which are invalid single-byte characters in utf8 and cannot be inserted into the altered table when the column containing these characters is changed to utf8(mb4). Since these characters cannot be inserted, gh-ost fails.

Addresses github#290

Note: there is currently no issue backfilling the ghost table when the character set changes, likely because it's an insert-into-select-from and it all occurs within mysql. However, when applying DML events (UPDATE, DELETE, etc.) the values are sprintf'd into a prepared statement, and due to the possibility of migrating text column data containing characters that are invalid in the destination charset, a conversion step is often necessary.

For example, when migrating a table/column from latin1 to utf8mb4, the latin1 column may contain characters that are invalid single-byte utf8 characters. Characters in the \x80-\xFF range are most common. When written to a utf8mb4 column without conversion, they fail as they do not exist in the utf8 codepage. Converting these texts/characters to the destination charset using convert(? using {charset}) will convert appropriately and the update/replace will succeed.

I only point out the "Note:" above because there are two tests added for this: latin1text-to-utf8mb4 and latin1text-to-ut8mb4-insert. The former is a test that fails prior to this commit; the latter is a test that succeeds prior to this commit. Both are affected by the code in this commit.

convert text to original charset, then destination

Converting text first to the original charset and then to the destination charset produces the most consistent results, as inserting the binary into a utf8-charset column may encounter an error if there is no prior context of latin1 encoding:

```
mysql> select hex(convert(char(189) using utf8mb4));
+---------------------------------------+
| hex(convert(char(189) using utf8mb4)) |
+---------------------------------------+
|                                       |
+---------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> select hex(convert(convert(char(189) using latin1) using utf8mb4));
+-------------------------------------------------------------+
| hex(convert(convert(char(189) using latin1) using utf8mb4)) |
+-------------------------------------------------------------+
| C2BD                                                        |
+-------------------------------------------------------------+
1 row in set (0.00 sec)
```

As seen in this failure on 5.5.62:

```
Error 1300: Invalid utf8mb4 character string: 'BD'; query=
replace /* gh-ost `test`.`_gh_ost_test_gho` */ into `test`.`_gh_ost_test_gho` (`id`, `t`) values (?, convert(? using utf8mb4))
```
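The same failure mode can be reproduced outside of MySQL. A minimal Python sketch (not gh-ost code) showing why the byte needs latin1 context before it can become valid utf8:

```python
# Sketch: why the double convert is needed. A lone 0xBD byte is not a
# valid utf8 sequence, so interpreting it directly as utf8 fails; the
# byte must first be interpreted in its original charset (latin1).
raw = bytes([189])  # char(189) as stored in the latin1 column

try:
    raw.decode("utf-8")          # analogous to convert(? using utf8mb4)
except UnicodeDecodeError as e:
    print("direct utf8 interpretation fails:", e.reason)

# Analogous to convert(convert(? using latin1) using utf8mb4):
text = raw.decode("latin1")        # establish latin1 context first
print(text.encode("utf-8").hex())  # c2bd
```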
|
@shlomi-noach cherry-picked commits here.
|
@timvaillancourt |
@shlomi-noach |
|
Anything go awry?
|
@timvaillancourt @shlomi-noach Hey y'all. I'm wondering if there is anything I can do to help here. Does this still seem like an appropriate change to merge? To add some more context, we're in need of this change to successfully convert large tables from `latin1` to `utf8mb4`. If I were to build a binary from this branch to test in our prod DB, should I rebase and do so from the openark repo?
@jbielick / @shlomi-noach: And just calling out that while the automated tests will prove gh-ost still works in general, they won't trigger the charset logic from this PR, as the production test runs an `engine=innodb` alter. Lastly, I deleted the
|
Thanks for the update.
May I ask what you mean in this statement? I was expecting this to affect InnoDB tables in test and production in the intended way, but perhaps I'm missing or misunderstanding something.


Related issue: #290
Description
Avoid DML apply errors by converting character data when charset changes for a column.

Background

When applying DML events (UPDATE, DELETE, etc.) the values are sprintf'd into a prepared statement with the row snapshot (values). Due to the possibility of migrating text column data containing characters in the source table that are invalid in the destination table (due to charset), a conversion step is often necessary. This conversion does not occur when applying DML events, so an error occurs when writing invalid byte sequences to the ghost table.

For example, when migrating a table/column from `latin1` to `utf8mb4`, the `latin1` column may contain characters that are invalid single-byte `utf8` characters. Characters in the `\x80-\xFF` range are most common. When written to a `utf8mb4` column without conversion, they fail as they do not exist in the `utf8` codepage.

Converting these texts/characters to the destination charset using `convert(? using {charset})` will convert appropriately and the update/replace will succeed.

Note: there is currently no issue backfilling the ghost table when the character set changes, likely because it's an insert-into-select-from and it all occurs within mysql. I only point this out because there are two tests added for this: `latin1text-to-utf8mb4` and `latin1text-to-ut8mb4-insert`. The former is a test that fails prior to this commit; the latter is a test that succeeds prior to this commit. Both are affected by the code in this commit. Please let me know if you would like these renamed or consolidated into `convert-utf8mb4`; they were helpful to do TDD.

`script/cibuild` returns with no formatting errors, build errors, or unit test errors.
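To make the mechanism concrete, here is a hypothetical sketch (illustrative names only, not gh-ost's actual Go internals) of how a DML statement's placeholder list can wrap converted columns in `convert(? using ...)`:

```python
# Hypothetical helper: given the migrated columns and a map of
# column name -> destination charset (only for columns whose charset
# changes), emit the placeholder list for a replace/update statement.
def build_replace_placeholders(columns, charset_conversions):
    placeholders = []
    for col in columns:
        charset = charset_conversions.get(col)
        if charset:
            # charset changed: let mysql convert the bound value
            placeholders.append(f"convert(? using {charset})")
        else:
            placeholders.append("?")
    return ", ".join(placeholders)

sql = "replace into `test`.`_gh_ost_test_gho` (`id`, `t`) values ({})".format(
    build_replace_placeholders(["id", "t"], {"t": "utf8mb4"})
)
print(sql)
# replace into `test`.`_gh_ost_test_gho` (`id`, `t`) values (?, convert(? using utf8mb4))
```

This matches the shape of the query shown in the 5.5.62 failure above; columns whose charset is unchanged keep a bare `?`.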