The Wayback Machine - https://web.archive.org/web/20220709170213/https://github.com/github/gh-ost/pull/1003
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert column to destination charset in DML applications #1003

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

jbielick
Copy link

@jbielick jbielick commented Jul 9, 2021

Related issue: #290

Description

Avoid DML apply errors by converting character data when charset changes for a column.

Background

When applying DML events (UPDATE, DELETE, etc) the values are sprintf'd into a prepared statement with the row snapshot (values). Due to the possibility of migrating text column data containing characters in the source table that are invalid in the destination table (due to charset), a conversion step is often necessary. This conversion does not occur when applying DML events and an error occurs writing invalid byte sequences to the ghost table.

For example, when migrating a table/column from latin1 to utf8mb4, the latin1 column may contain characters that are invalid single-byte utf8 characters. Characters in the \x80-\xFF range are most common. When written to utf8mb4 column without conversion, they fail as they do not exist in the utf8 codepage.

Converting these texts/characters to the destination charset using convert(? using {charset}) will convert appropriately and the update/replace will succeed.

Note: there is currently no issue backfilling the ghost table when the characterset changes, likely because it's a insert-into-select-from and it all occurs within mysql. I only point this out because there are two tests added for this: latin1text-to-utf8mb4 and latin1text-to-ut8mb4-insert—the former is a test that fails prior to this commit. The latter is a test that succeeds prior to this comment. Both are affected by the code in this commit. Please let me know if you would like these renamed or consolidated into convert-utf8mb4. They were helpful to do TDD.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.
@jbielick jbielick force-pushed the latin1-utf8mb4 branch 2 times, most recently from b649015 to 9a3eb5f Compare Jul 9, 2021
@@ -7,7 +7,7 @@ create table gh_ost_test (
primary key(id)
) auto_increment=1;

insert into gh_ost_test values (null, 'átesting');
insert into gh_ost_test values (null, 'átesting', '', '');
Copy link
Author

@jbielick jbielick Jul 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixes Testing: convert-utf8mb4..ERROR 1136 (21S01) at line 10: Column count doesn't match value count at row 1

Mismatch didn't seem intentional; I can revert if needed.

primary key(id)
) auto_increment=1 charset latin1 collate latin1_swedish_ci;

insert into gh_ost_test values (null, char(189));
Copy link
Author

@jbielick jbielick Jul 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

189 and 190 chosen specifically because they can be replaced with a utf8 character equivalent (prefixed with \xC2). Thus the checksum at the end of the test will pass.

@shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Jul 11, 2021

@jbielick I'm curious to see how the tests run. Would you care to push same branch (and create similar PR) in https://github.com/openark/gh-ost ? (If not, I can to the same on your behalf; but you're the author so it's best if you do it)

Notwithstanding #290, which I completely missed as I was phasing out, I'm surprised this issue only shows up now.

@jbielick
Copy link
Author

@jbielick jbielick commented Jul 12, 2021

@shlomi-noach me, too. I got them running locally with mysql 5.7.25 and a couple changes to the Dockerfile and they all passed. Looking forward to a CI run.

Rebased and opened at:

openark#27

@shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Jul 12, 2021

Thank you!

jbielick added 2 commits Jul 14, 2021
this test assumes a latin1-encoded table with content containing bytes
in the \x80-\xFF, which are invalid single-byte characters in utf8 and
cannot be inserted in the altered table when the column containing these
characters is changed to utf8(mb4).

since these characters cannot be inserted, gh-ost fails.
addresses github#290

Note: there is currently no issue backfilling the ghost table when the
characterset changes, likely because it's a insert-into-select-from and
it all occurs within mysql.

However, when applying DML events (UPDATE, DELETE, etc) the values are
sprintf'd into a prepared statement and due to the possibility of
migrating text column data containing invalid characters in the
destination charset, a conversion step is often necessary.

For example, when migrating a table/column from latin1 to utf8mb4, the
latin1 column may contain characters that are invalid single-byte utf8
characters. Characters in the \x80-\xFF range are most common. When
written to utf8mb4 column without conversion, they fail as they do not
exist in the utf8 codepage.

Converting these texts/characters to the destination charset using
convert(? using {charset}) will convert appropriately and the
update/replace will succeed.

I only point out the "Note:" above because there are two tests added
for this: latin1text-to-utf8mb4 and latin1text-to-ut8mb4-insert

The former is a test that fails prior to this commit. The latter is a
test that succeeds prior to this comment. Both are affected by the code
in this commit.

convert text to original charset, then destination

converting text first to the original charset and then to the
destination charset produces the most consistent results, as inserting
the binary into a utf8-charset column may encounter an error if there is
no prior context of latin1 encoding.

mysql> select hex(convert(char(189) using utf8mb4));
+---------------------------------------+
| hex(convert(char(189) using utf8mb4)) |
+---------------------------------------+
|                                       |
+---------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> select hex(convert(convert(char(189) using latin1) using utf8mb4));
+-------------------------------------------------------------+
| hex(convert(convert(char(189) using latin1) using utf8mb4)) |
+-------------------------------------------------------------+
| C2BD                                                        |
+-------------------------------------------------------------+
1 row in set (0.00 sec)

as seen in this failure on 5.5.62

 Error 1300: Invalid utf8mb4 character string: 'BD'; query=
			replace /* gh-ost `test`.`_gh_ost_test_gho` */ into
				`test`.`_gh_ost_test_gho`
					(`id`, `t`)
				values
					(?, convert(? using utf8mb4))
@jbielick
Copy link
Author

@jbielick jbielick commented Jul 14, 2021

@shlomi-noach cherry-picked commits here.

@shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Jul 14, 2021

@timvaillancourt 👋 this looks legit to me. However, it works on data transfer (conversion of character sets) and I'd like to see how this works under production tests. Would you be able to deploy this change?

@timvaillancourt
Copy link
Collaborator

@timvaillancourt timvaillancourt commented Jul 17, 2021

@timvaillancourt 👋 this looks legit to me. However, it works on data transfer (conversion of character sets) and I'd like to see how this works under production tests. Would you be able to deploy this change?

@shlomi-noach 👋 absolutely! I can get this on some testing hosts next week 👍

@timvaillancourt timvaillancourt added this to the v1.1.3 milestone Jul 18, 2021
@timvaillancourt timvaillancourt self-requested a review Jul 18, 2021
@timvaillancourt timvaillancourt temporarily deployed to production/mysql_role=ghost_testing Jul 19, 2021 Inactive
@jbielick
Copy link
Author

@jbielick jbielick commented Jul 26, 2021

Anything go awry?

@jbielick
Copy link
Author

@jbielick jbielick commented Sep 9, 2021

@timvaillancourt @shlomi-noach Hey y'all. 👋

I'm wondering if there is anything I can do to help here. Does this still seem like an appropriate change to merge?

To add some more context, we're in need of this change to successfully convert large tables from latin1 to utf8mb4. Backfilling goes just fine, but updates that are applied encounter errors (as they they do SQL writes/update differently than the backfill) and the whole things dies. Any issues or edge cases I can look into? This is understandably high-risk.

If I were to build a binary from this branch to test in our prod DB, should I rebase and do so from the openark repo?

@timvaillancourt
Copy link
Collaborator

@timvaillancourt timvaillancourt commented Jun 29, 2022

I'm wondering if there is anything I can do to help here. Does this still seem like an appropriate change to merge?

@jbielick / @shlomi-noach: 👋 the production tests of this PR ran into some unrelated test failures that are still under investigation. I think tests of this PR should be ran again to make sure this didn't break anything. cc'ing @dm-2 / @rashiq for help on tests

And just calling out that while the automated tests will prove gh-ost still works in general, it won't trigger the charset logic from this PR as the production test runs ENGINE=InnoDB only. localtests/ might be our best bet to test this logic. If there is any more test scenarios that can be added, that would make this safer @jbielick (no specific ideas right now)

Lastly, I deleted the ignore_versions file for the 5.5 version because we dropped support/tests for 5.5 so no need to ignore it

@jbielick
Copy link
Author

@jbielick jbielick commented Jun 30, 2022

Thanks for the update.

it won't trigger the charset logic from this PR as the production test runs ENGINE=InnoDB only

May I ask what you mean in this statement? I was expecting this to affect InnoDB tables in test and production in the intended way, but perhaps I'm missing or misunderstanding something.

@dm-2 dm-2 removed this from the v1.1.5 milestone Jul 6, 2022
@dm-2 dm-2 added this to the v1.1.6 milestone Jul 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
4 participants