150

I'm working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?

4
  • 8
    Ollie Jones has a much better answer (check the bottom). Commented Nov 12, 2012 at 17:45
  • 1
    @JonathanArkell Not on the bottom anymore :) Commented May 22, 2014 at 22:02
  • Correction.. check the middle! ;) Commented May 23, 2014 at 15:26
  • This is the answer @Jonathan is talking about stackoverflow.com/a/11741314/792066 Commented Oct 8, 2018 at 19:53

9 Answers

308

MySQL provides comprehensive character set management that can help with this kind of problem.

SELECT whatever
  FROM tableName 
 WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)

The CONVERT(col USING charset) function turns the unconvertible characters into replacement characters, so the converted and unconverted text will be unequal.

See the MySQL manual for more discussion: https://dev.mysql.com/doc/refman/8.0/en/charset-repertoire.html

You can use any character set name you wish in place of ASCII. For example, to find out which characters won't render correctly in code page 1257 (Lithuanian, Latvian, Estonian), use CONVERT(columnToCheck USING cp1257).
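
The comparison above can be mimicked outside the database to see why it works. The following Python sketch is only an analogy, not MySQL's implementation: the helper name `has_non_ascii` is hypothetical, and Python's `errors="replace"` is assumed to stand in for the replacement behaviour described above.

```python
def has_non_ascii(text: str) -> bool:
    """Return True when a round trip through ASCII changes the text."""
    # Characters with no ASCII equivalent become '?' during encoding,
    # so the converted string no longer equals the original.
    converted = text.encode("ascii", errors="replace").decode("ascii")
    return converted != text


# Hypothetical sample values standing in for column contents.
rows = ["plain text", "em dash \u2014 here", "caf\u00e9", "line\r\nbreak"]
flagged = [row for row in rows if has_non_ascii(row)]
print(flagged)
```

Note that carriage returns and line feeds are themselves ASCII, so a round-trip comparison like this one does not flag them; only characters outside the target character set change.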


5 Comments

This is an excellent solution to this problem and much more robust.
This is also useful for finding accented characters (á, ä, etc.) or characters that do not belong to a given encoding.
much better than using REGEXP (which doesn't seem to work for me for finding accents) and also provides a simple mechanism for making everything ascii again...
This answer works wonderfully and will bring up strings that contain any non-ASCII characters rather than just strings that contain only non-ASCII characters. Thank you!
Outstanding solution!
94

You can define ASCII as all characters with a decimal value of 0 - 127 (0x00 - 0x7F) and find rows whose column contains non-ASCII characters using the following query:

SELECT * FROM TABLE WHERE NOT HEX(COLUMN) REGEXP '^([0-7][0-9A-F])*$';

This was the most comprehensive query I could come up with.
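
To see what the pattern actually tests, here is a rough Python sketch of the same check applied to raw bytes. The function name `hex_looks_ascii` is made up for illustration, and Python's `bytes.hex()` is assumed to play the role of MySQL's `HEX()`. The last call shows that the outcome depends on how the column is encoded.

```python
import re

# Each byte is two hex digits; a leading digit of 0-7 means the byte
# lies in the range 0x00-0x7F.
ASCII_HEX = re.compile(r"^([0-7][0-9A-F])*$")


def hex_looks_ascii(raw: bytes) -> bool:
    """Apply the hex pattern to the uppercase hex dump of raw bytes."""
    return ASCII_HEX.match(raw.hex().upper()) is not None


print(hex_looks_ascii("abc".encode("utf-8")))          # every byte is below 0x80
print(hex_looks_ascii("\u2014".encode("utf-8")))       # em dash, bytes E2 80 94
print(hex_looks_ascii("\u0101".encode("utf-16-be")))   # U+0101 is stored as 01 01
```

The third value passes the test even though it is not an ASCII character, so a check like this is only meaningful when the stored encoding is ASCII-compatible, such as latin1 or utf8.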

3 Comments

Best answer so far, but it's even easier like this: SELECT * FROM table WHERE LENGTH(column) != CHAR_LENGTH(column)
-1 This can yield erroneous results. Suppose, for example, that one has a UTF-16 column containing 'ā' (encoded by the byte sequence 0x0101) - it would be deemed "ASCII" using this test: a false negative; indeed, some character sets do not encode ASCII characters within 0x00 to 0x7f whereupon this solution would yield a false positive. DO NOT RELY UPON THIS ANSWER!
@sun: That doesn't help at all - many character sets are fixed-length and so LENGTH(column) will be a constant multiple of CHAR_LENGTH(column) irrespective of the value.
75

It depends exactly what you're defining as "ASCII", but I would suggest trying a variant of a query like this:

SELECT * FROM tableName WHERE columnToCheck NOT REGEXP '[A-Za-z0-9]';

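As written, that query returns only the rows where columnToCheck contains no alphanumeric character at all, because NOT REGEXP succeeds only when the pattern matches nowhere in the value. To return rows that contain at least one character outside the set, negate the character class instead, as in columnToCheck REGEXP '[^A-Za-z0-9]'. If other characters are acceptable, add them to the character class. For example, if periods, commas, and hyphens are OK, the query above becomes the following (the same caveat applies):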

SELECT * FROM tableName WHERE columnToCheck NOT REGEXP '[A-Za-z0-9.,-]';

The most relevant page of the MySQL documentation is probably 12.5.2 Regular Expressions.
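
The difference between "contains no allowed character" and "contains some disallowed character" is easy to miss. A small Python sketch can make it concrete; the helper names are hypothetical, and Python's `re` module is assumed to behave like MySQL's REGEXP for these simple character classes.

```python
import re

ALNUM = re.compile(r"[A-Za-z0-9]")
NON_ALNUM = re.compile(r"[^A-Za-z0-9]")


def lacks_any_alnum(value: str) -> bool:
    """Mirror of: value NOT REGEXP '[A-Za-z0-9]'."""
    return ALNUM.search(value) is None


def has_other_char(value: str) -> bool:
    """Mirror of: value REGEXP '[^A-Za-z0-9]'."""
    return NON_ALNUM.search(value) is not None


# Pure alphanumeric, mixed, and pure non-alphanumeric sample values.
for value in ["abc123", "abc\u2014123", "\u2014\u2014"]:
    print(repr(value), lacks_any_alnum(value), has_other_char(value))
```

Only the second helper flags the mixed value `abc` + em dash + `123`, which is the case the question is about.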

9 Comments

Shouldn't you escape the hyphen and period? (Since they do have special meanings in a regular expression.) SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9\.,\-]';
@Tooony No, inside of a set, a period just means itself and the dash only has special meaning between other characters. At the end of the set, it means only itself.
This query only finds all lines in tableName that do not contain an alphanumeric character. This does not answer the question.
That is for columns that don't have any ascii characters at all, so it will miss those with a mix of ascii and non-ascii characters. The answer below from zende checks for one or more non-ascii characters. This helped me for the most part SELECT * FROM tbl WHERE colname NOT REGEXP '^[A-Za-z0-9\.,@&\(\) \-]*$';
This only works (for me anyway) to find strings that contain NONE of those characters. It does not find strings that contain a mix of ASCII and non-ASCII characters.
56

This is probably what you're looking for:

select * from TABLE where COLUMN regexp '[^ -~]';

It should return all rows where COLUMN contains non-ASCII characters (or non-printable ASCII characters such as newline).
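
For anyone who wants to see which characters trip the pattern, the same character class can be tried in Python. This is a sketch only: `find_offenders` is a hypothetical helper, and Python's `re` works on code points, whereas older MySQL versions match byte by byte.

```python
import re
from typing import List

# Anything outside the range from space (0x20) to tilde (0x7E).
NON_PRINTABLE_ASCII = re.compile(r"[^ -~]")


def find_offenders(value: str) -> List[str]:
    """List every character that falls outside printable ASCII."""
    return NON_PRINTABLE_ASCII.findall(value)


print(find_offenders("line one\r\nline two"))   # hidden CR and LF
print(find_offenders("price \u2014 10"))        # em dash
print(find_offenders("clean"))                  # nothing to report
```

Because control characters sit below the space character, this class catches the hidden carriage returns and line feeds from the question as well as characters such as the em dash.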

8 Comments

Works great for me. "regexp '[^ -~]'" means has a character that is before space " " or after "~" or ASCII 32 - 126. All letters, numbers, and symbols, but no unprintable things.
You can even get it as a tee-shirt ;) catonmat.net/blog/my-favorite-regex
Note the warning in the documentation: "The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal."
Thanks for this. What I'm wondering is how to replace a replacement character, e.g. �
@mars-o - the black diamond indicates an invalid utf8 character. More discussion here
15

One character missing from everyone's examples above is the NUL terminator (\0). It is invisible in the MySQL console output and is not discoverable by any of the queries mentioned so far. The query to find it is simply:

select * from TABLE where COLUMN like '%\0%';


4

Based on the correct answer, but taking into account ASCII control characters as well, the solution that worked for me is this:

SELECT * FROM `table` WHERE NOT `field` REGEXP  "[\\x00-\\xFF]|^$";

It does the same thing: searches for violations of the ASCII range in a column, but lets you search for control characters too, since it uses hexadecimal notation for code points. Since there is no comparison or conversion (unlike @Ollie's answer), this should be significantly faster, too. (Especially if MySQL does early-termination on the regex query, which it definitely should.)

It also avoids returning fields that are zero-length. If you want a slightly-longer version that might perform better, you can use this instead:

SELECT * FROM `table` WHERE `field` <> "" AND NOT `field` REGEXP  "[\\x00-\\xFF]";

It does a separate check for length to avoid zero-length results, without considering them for a regex pass. Depending on the number of zero-length entries you have, this could be significantly faster.

Note that if your default character set is something bizarre where 0x00-0xFF don't map to the same values as ASCII (is there such a character set in existence anywhere?), this would return a false positive. Otherwise, enjoy!

2 Comments

00-FF includes all possible 8-bit values, which is what REGEXP is checking. Hence it is guaranteed to always match. Also ^$ is probably not what you wanted.
Definitely the best REGEXP solution for finding all 8-bit characters, but not as good as the CONVERT(col USING charset) solution, which also allows control characters while limiting display characters to a specific charset.
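
The point made in the first comment can be checked with a short Python sketch. It assumes the class is interpreted as the byte range 0x00-0xFF (whether MySQL's regex engine honours `\x` escapes depends on the version); the helper names are hypothetical, and the narrower 0x00-0x7F class is shown only for contrast.

```python
import re

# Byte class covering every possible 8-bit value.
ANY_BYTE = re.compile(rb"[\x00-\xFF]")
# Narrower class limited to the 7-bit ASCII range, for comparison.
ASCII_ONLY = re.compile(rb"[\x00-\x7F]*")


def matches_any_byte(raw: bytes) -> bool:
    """True when at least one byte falls in 0x00-0xFF."""
    return ANY_BYTE.search(raw) is not None


def is_all_ascii(raw: bytes) -> bool:
    """True when every byte falls in 0x00-0x7F."""
    return ASCII_ONLY.fullmatch(raw) is not None


samples = [b"abc", "\u2014".encode("utf-8"), b"\r\n", b""]
print([matches_any_byte(s) for s in samples])
print([is_all_ascii(s) for s in samples])
```

Every non-empty value matches the 0x00-0xFF class, so negating that match can only ever leave empty values; the 7-bit class is the one that separates ASCII from non-ASCII bytes.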
2

Try using this query to search for records that contain special characters:

SELECT *
FROM tableName
WHERE fieldName REGEXP '[^a-zA-Z0-9@:. \'\-`,\&]'


1

@zende's answer was the only one that covered columns with a mix of ASCII and non-ASCII characters, but it also had that problematic hex thing. I used this:

SELECT * FROM `table` WHERE NOT `column` REGEXP '^[ -~]+$' AND `column` !=''


-2

In Oracle, we can use the following:

SELECT * FROM TABLE_A WHERE ASCIISTR(COLUMN_A) <> COLUMN_A;

