Character encoding and decoding in Python with MySQL

Question

For the query:

SHOW VARIABLES LIKE 'char%';

MySQL Database returns:

character_set_client    latin1
character_set_connection    latin1
character_set_database  latin1
character_set_filesystem    binary
character_set_results   latin1
character_set_server    latin1
character_set_system    utf8
character_sets_dir  /usr/local/mysql-5.7.27-macos10.14-x86_64/share/charsets/

In my Python script:

import pyodbc

# get_database_connection() is a helper (not shown) returning a pyodbc connection
conn = get_database_connection()
conn.setdecoding(pyodbc.SQL_CHAR, encoding='latin1')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='latin1')

One of the columns has the following value:

N’a pas

Python returns:

N?a pas

Between N and a there is a star-shaped question mark. How do I read the value as is? What's the best way to handle it? I have been reading about converting my database to UTF-8, but that seems like a long shot with a good chance of breaking other things. Is there a more efficient way to do it?

In some places in the code, I have done:

value = value.encode('utf-8', 'ignore').decode('utf-8')

to handle UTF-8 data such as accented characters, but the apostrophe was not handled by this, and I ended up with ? instead of '.

(1) The "fancy" apostrophe ’ (right single quotation mark, U+2019) is not part of Latin-1. Upgrading to UTF-8 is definitely the best option. It's 2020 now, UTF-8 is everywhere. (2) There are very rare cases where value.encode('utf8', 'ignore').decode('utf8') has an effect. Typographic quotes are not among them. 99.9% of the time, this expression returns the original value unchanged.
– lenz, Commented Apr 10, 2020 at 5:09
@lenz - UTF-8 would be better. However, the comment is incorrect. Hex 92 is the latin1 encoding of 'RIGHT SINGLE QUOTATION MARK'.
– Rick James, Commented Apr 11, 2020 at 4:11
@RickJames It depends on how you define "Latin-1". Code point 0x92 is a control character in standard Latin-1 (ISO-8859-1). It is a quotation mark in Windows codepage 1252 (among others), which is a modification of the former and is colloquially referred to as "Windows Latin 1". I don't know how MySQL defines "Latin-1"; I wouldn't be surprised if it's the latter.
– lenz, Commented Apr 11, 2020 at 14:18
@lenz - I think that MySQL's latin1 does nothing to validate the bytes it receives. Utf8, on the other hand, squawks at virtually any latin1 string with an 8-bit character, including the 92 in question.
– Rick James, Commented Apr 11, 2020 at 22:39
@RickJames This is also true. There's no good way to validate any 8-bit encoding unless you know what the interpreted string should look like.
– lenz, Commented Apr 12, 2020 at 11:27
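
The two points in lenz's first comment can be checked with the standard library alone. The following is a minimal sketch, not taken from the thread: the sample strings and variable names are made up for illustration, and the "replace" error handler is only one assumed way a ? can end up where the apostrophe was.

```python
# Sample text with U+2019 (right single quotation mark) and an accented letter.
text = "N\u2019a pas, caf\u00e9"

# Point 2: for ordinary text the round trip changes nothing.
round_tripped = text.encode("utf-8", "ignore").decode("utf-8")

# One of the rare cases where it has an effect: a lone surrogate cannot be
# encoded as UTF-8, so the "ignore" handler silently drops it.
with_surrogate = "bad\ud800"
cleaned = with_surrogate.encode("utf-8", "ignore").decode("utf-8")

# Point 1: U+2019 has no code point in ISO-8859-1, so strict encoding fails...
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# ...and a lossy conversion substitutes a plain question mark.
lossy = text.encode("latin-1", "replace")

print(round_tripped == text)
print(repr(cleaned))
print(latin1_ok)
print(lossy)
```

The round trip leaves the quote and the accented letter alone, which is why it cannot repair a value that already arrived as ?.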

Joni · Accepted Answer · 2020-04-10 14:16:56Z

Converting the database to UTF-8 is better in the long run, but risky because, as you say, you may break other things. What you can do instead is change the database connection encoding to UTF-8. That way you get UTF-8 encoded strings out of the database without changing how the data is actually stored.

conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf8')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf8')

If that seems too risky, you could consider having two separate database connections, the original one and one in UTF-8, and migrate the app to UTF-8 little by little, as you have time to test.

If even that seems too risky, try a character encoding that is closer to MySQL's version of latin1. MySQL's "latin1" is actually an extended version of the cp1252 encoding, which is itself a Microsoft extension of the "standard latin1" that Python (among others) uses.

conn.setdecoding(pyodbc.SQL_CHAR, encoding='cp1252')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='cp1252')
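
To see what the choice between latin1 and cp1252 changes, the decoding can be tried on raw bytes without a database. This is a rough sketch under assumptions: the byte string is an assumed example of what a MySQL latin1 column might hold for the value in the question, and decode_mysql_latin1 is a hypothetical helper, not part of pyodbc or MySQL. Python's cp1252 codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so the helper falls back to the Latin-1 control character for those.

```python
# Assumed raw bytes for "N’a pas" as stored in a MySQL latin1 column.
raw = b"N\x92a pas"

# Python's "latin-1" is ISO-8859-1: 0x92 becomes the control character U+0092.
as_latin1 = raw.decode("latin-1")

# cp1252 maps 0x92 to U+2019, the right single quotation mark.
as_cp1252 = raw.decode("cp1252")


def decode_mysql_latin1(data: bytes) -> str:
    """Hypothetical helper: decode as cp1252, falling back to the Latin-1
    code point for the few bytes that Python's cp1252 codec rejects."""
    chars = []
    for byte in data:
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(byte))
    return "".join(chars)


print(repr(as_latin1))
print(as_cp1252)
print(repr(decode_mysql_latin1(b"N\x92a\x81")))
```

With pyodbc itself, passing encoding='cp1252' to setdecoding as shown above is enough; the fallback only matters if one of the five undefined bytes is actually present in the data.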

Rick James · Accepted Answer · 2020-04-11 04:14:34Z


Don't use any form of encoding/decoding; it only complicates your code and hides more errors. In fact, you may be trying to "make two wrongs make a right".

Go with utf8 (or utf8mb4).

Notes on "question mark": Trouble with UTF-8 characters; what I see is not what I stored
Notes on Python: http://mysql.rjweb.org/doc.php/charcoll#python
