I'm working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?
- Ollie Jones has a much better answer (check the bottom). – Jonathan Arkell, Nov 12, 2012 at 17:45
- @JonathanArkell Not on the bottom anymore :) – Brilliand, May 22, 2014 at 22:02
- Correction.. check the middle! ;) – Jonathan Arkell, May 23, 2014 at 15:26
- This is the answer @Jonathan is talking about: stackoverflow.com/a/11741314/792066 – Braiam, Oct 8, 2018 at 19:53
9 Answers
MySQL provides comprehensive character set management that can help with this kind of problem.
SELECT whatever
FROM tableName
WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)
The CONVERT(col USING charset) function turns the unconvertible characters into replacement characters. The converted and unconverted text will then be unequal.
See the MySQL documentation for more discussion: https://dev.mysql.com/doc/refman/8.0/en/charset-repertoire.html
You can use any character set name you wish in place of ASCII. For example, if you want to find out which characters won't render correctly in code page 1257 (Lithuanian, Latvian, Estonian) use CONVERT(columnToCheck USING cp1257)
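To see why the comparison works, the same round trip can be imitated outside the database. This is only a Python sketch of the idea, not MySQL itself: the helper name `differs_after_convert` and the sample rows are made up for illustration, and Python's `errors="replace"` stands in for the replacement behaviour of CONVERT.

```python
def differs_after_convert(text: str, charset: str = "ascii") -> bool:
    """Client-side imitation of `col <> CONVERT(col USING charset)`."""
    # Characters the target charset cannot represent are replaced with '?',
    # so the round-tripped string no longer equals the original.
    round_tripped = text.encode(charset, errors="replace").decode(charset)
    return round_tripped != text


# Hypothetical sample rows: plain ASCII, an em dash, an accented letter.
rows = ["plain text", "em dash \u2014 here", "caf\u00e9"]

flagged = [row for row in rows if differs_after_convert(row)]
latin1_flagged = [row for row in rows if differs_after_convert(row, "latin-1")]
```

With `ascii` both the em dash row and the accented row are flagged; with `latin-1` only the em dash row is, which mirrors swapping the character set name in the query.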
5 Comments
You can define ASCII as all characters that have a decimal value of 0 - 127 (0x00 - 0x7F) and find columns with non-ASCII characters using the following query:
SELECT * FROM TABLE WHERE NOT HEX(COLUMN) REGEXP '^([0-7][0-9A-F])*$';
This was the most comprehensive query I could come up with.
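The hex pattern above can be tried outside MySQL. The following Python sketch applies the same regular expression to the uppercase hex dump of a byte string; the function name `has_non_ascii_bytes` is hypothetical, and the sketch assumes the column's stored bytes are what HEX() would print.

```python
import re

# Same pattern as in the query: every byte is two hex digits, and the
# first digit must be 0-7 for the byte to fall in 0x00-0x7F.
ASCII_HEX = re.compile(r"([0-7][0-9A-F])*")


def has_non_ascii_bytes(data: bytes) -> bool:
    """Return True if any byte in `data` is outside 0x00-0x7F."""
    hex_string = data.hex().upper()
    return ASCII_HEX.fullmatch(hex_string) is None
```

Note that carriage returns and line feeds are themselves ASCII (0x0D and 0x0A), so this test alone does not flag them.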
3 Comments
SELECT * FROM table WHERE LENGTH( column ) != CHAR_LENGTH( column )
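Whether the byte length and the character length differ depends on the column's encoding, not only on the data. A rough Python illustration follows; `length_test_flags` is a made-up helper, and `utf-16-be` is used here as an assumed stand-in for a fixed two-byte encoding such as ucs2.

```python
def length_test_flags(text: str, encoding: str) -> bool:
    """Imitate LENGTH(col) != CHAR_LENGTH(col) for a given encoding."""
    byte_length = len(text.encode(encoding))   # what LENGTH() counts
    char_length = len(text)                    # what CHAR_LENGTH() counts
    return byte_length != char_length


utf8_result = length_test_flags("caf\u00e9", "utf-8")       # multi-byte in UTF-8
latin1_result = length_test_flags("caf\u00e9", "latin-1")   # one byte per character
fixed_width_result = length_test_flags("abc", "utf-16-be")  # two bytes per character
```

In UTF-8 the accented string is flagged, in a single-byte encoding it is not, and in a fixed two-byte encoding even pure ASCII text is flagged.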
'ā' (encoded by the byte sequence 0x0101) - it would be deemed "ASCII" using this test: a false negative; indeed, some character sets do not encode ASCII characters within 0x00 to 0x7f, whereupon this solution would yield a false positive. DO NOT RELY UPON THIS ANSWER!
LENGTH(column) will be a constant multiple of CHAR_LENGTH(column) irrespective of the value.
It depends exactly what you're defining as "ASCII", but I would suggest trying a variant of a query like this:
SELECT * FROM tableName WHERE columnToCheck REGEXP '[^A-Za-z0-9]';
That query will return all rows where columnToCheck contains any non-alphanumeric characters. If you have other characters that are acceptable, add them to the character class in the regular expression. For example, if periods, commas, and hyphens are OK, change the query to:
SELECT * FROM tableName WHERE columnToCheck REGEXP '[^A-Za-z0-9.,-]';
The most relevant page of the MySQL documentation is probably 12.5.2 Regular Expressions.
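The idea of flagging any character outside an allowed set can be tested quickly in Python before running it against the table. This is a sketch only: `outside_allowed` and its `extra_allowed` parameter are hypothetical names, and Python's `re` module is not the engine MySQL uses, so results on multi-byte data may differ.

```python
import re


def outside_allowed(text: str, extra_allowed: str = "") -> bool:
    """Return True if `text` contains a character outside the allowed set.

    The allowed set is ASCII letters and digits plus `extra_allowed`.
    """
    # Negated character class: matches any single character NOT allowed.
    pattern = "[^A-Za-z0-9" + re.escape(extra_allowed) + "]"
    return re.search(pattern, text) is not None
```

For example, `outside_allowed("a.b")` is true, while `outside_allowed("a.b", ".,-")` is false because the period has been added to the allowed set.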
9 Comments
SELECT * FROM tbl WHERE colname NOT REGEXP '^[A-Za-z0-9\.,@&\(\) \-]*$';
This is probably what you're looking for:
select * from TABLE where COLUMN regexp '[^ -~]';
It should return all rows where COLUMN contains non-ASCII characters (or non-printable ASCII characters such as newline).
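The range from space to tilde covers the printable ASCII characters, so anything else (control characters, em dashes, accented letters) matches the negated class. A small Python sketch, with the hypothetical helper `offending_characters`, shows which characters such a pattern would catch and where they sit:

```python
import re
from typing import List, Tuple

# Anything outside ' ' (0x20) through '~' (0x7E).
NON_PRINTABLE_ASCII = re.compile(r"[^ -~]")


def offending_characters(text: str) -> List[Tuple[int, str]]:
    """List (position, code point) for each character outside ' '..'~'."""
    return [
        (match.start(), f"U+{ord(match.group()):04X}")
        for match in NON_PRINTABLE_ASCII.finditer(text)
    ]
```

Run against an exported copy of a suspicious value, this makes hidden carriage returns (U+000D) and line feeds (U+000A) visible alongside characters such as the em dash (U+2014).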
8 Comments
"REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal."
Based on the correct answer, but taking into account ASCII control characters as well, the solution that worked for me is this:
SELECT * FROM `table` WHERE NOT `field` REGEXP "[\\x00-\\xFF]|^$";
It does the same thing: searches for violations of the ASCII range in a column, but lets you search for control characters too, since it uses hexadecimal notation for code points. Since there is no comparison or conversion (unlike @Ollie's answer), this should be significantly faster, too. (Especially if MySQL does early-termination on the regex query, which it definitely should.)
It also avoids returning fields that are zero-length. If you want a slightly-longer version that might perform better, you can use this instead:
SELECT * FROM `table` WHERE `field` <> "" AND NOT `field` REGEXP "[\\x00-\\xFF]";
It does a separate check for length to avoid zero-length results, without considering them for a regex pass. Depending on the number of zero-length entries you have, this could be significantly faster.
Note that if your default character set is something bizarre where 0x00-0xFF don't map to the same values as ASCII (is there such a character set in existence anywhere?), this would return a false positive. Otherwise, enjoy!
2 Comments
REGEXP is checking. Hence it is guaranteed to always match. Also ^$ is probably not what you wanted.
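The comment's objection can be checked with a byte-wise regex outside MySQL. The Python sketch below is an illustration under assumptions: it uses Python's `re` on `bytes`, not MySQL's regex engine (whose handling of `\x` escapes may differ), and the narrower class `[^\x00-\x7F]` is a swapped-in alternative, not the answer author's pattern.

```python
import re

# The class from the answer spans every possible byte value; with the
# `|^$` alternative it matches the empty string as well.
ANY_BYTE_OR_EMPTY = re.compile(rb"[\x00-\xFF]|^$")

# Alternative for comparison: any byte above 0x7F, i.e. outside ASCII.
NON_ASCII_BYTE = re.compile(rb"[^\x00-\x7F]")

# Hypothetical sample values: empty, ASCII, ASCII with CR/LF, UTF-8 em dash.
samples = [b"", b"abc", b"line\r\n", "\u2014".encode("utf-8")]

always_matches = all(ANY_BYTE_OR_EMPTY.search(s) for s in samples)
non_ascii = [s for s in samples if NON_ASCII_BYTE.search(s)]
```

Because the first pattern matches every sample, negating it filters out every row; the narrower class singles out only the value containing bytes above 0x7F.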