Replace non-ASCII characters with a single space

Question

I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:

def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i) < 128)

And this one replaces each non-ASCII character with as many spaces as there are bytes in the character's encoded form (i.e. the – character is replaced with 3 spaces):

import re

def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)

How can I replace all non-ASCII characters with a single space?

Of the myriad similar SO questions, none addresses character replacement as opposed to stripping, and none addresses all non-ASCII characters rather than one specific character.

wow, you really took good efforts to show so many links. +1 as soon as the day renews!
– shad0w_wa1k3r, Commented Nov 19, 2013 at 18:20
You seem to have missed this one stackoverflow.com/questions/1342000/…
– Stuart, Commented Nov 19, 2013 at 18:35
@Stuart: Thanks, but that is the very first one that I mention.
– dotancohen, Commented Nov 20, 2013 at 9:08
@dstromberg: I mention a problematic example character in the question: –. It's this guy.
– dotancohen, Commented Nov 20, 2013 at 11:52
@jubilatious1 At this stage of the question's life, perhaps sed, awk, and perl answers would be interesting even if they are OT. But I would recommend putting them all in a single "X/Y answer", not separate answers. Usually a sed, awk, or perl answer could replace a Python answer if the code is running from e.g. a bash CLI where all four are generally available, not where actual Python scripts are running.
– dotancohen, Commented Jun 19, 2022 at 5:48

Martijn Pieters · Accepted Answer · 2013-11-19 18:11:35Z

317

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
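A minimal sketch comparing the two approaches above on a Python 3 str. The function names and the sample string are illustrative only and are not part of the answer:

```python
import re

def replace_each(text):
    # One space per non-ASCII character.
    return ''.join([c if ord(c) < 128 else ' ' for c in text])

def replace_runs(text):
    # One space per run of consecutive non-ASCII characters.
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = 'ABC\u9a6c\u514bdef \u2013 ghi'  # 'ABC马克def – ghi'
print(repr(replace_each(sample)))
print(repr(replace_runs(sample)))
```

The two non-ASCII characters after ABC become two spaces in the first version and one space in the second. The lone – is a run of length one, so both versions treat it the same way.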

answered Nov 19, 2013 at 18:11

Martijn Pieters

4 Comments

Martijn Pieters Over a year ago

@dstromberg: slower; str.join() needs a list (it'll pass over the values twice), and a generator expression will first be converted to one. Giving it a list comprehension is simply faster. See this post.

Mark Ransom Over a year ago

The first piece of code will insert multiple blanks per character if you feed it a UTF-8 byte string.

Martijn Pieters Over a year ago

@MarkRansom: I was assuming this to be Python 3.
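A small sketch of the byte-string concern discussed in these comments, assuming Python 3 and UTF-8 input. The helper name replace_non_ascii_bytes is hypothetical; the point is only that decoding first makes each non-ASCII character a single code point:

```python
import re

def replace_non_ascii_bytes(data, encoding='utf-8'):
    # Decode first so that each non-ASCII character is one code point,
    # not several bytes, then replace it with a single space.
    text = data.decode(encoding)
    return re.sub(r'[^\x00-\x7F]', ' ', text)

raw = 'a \u2013 b'.encode('utf-8')  # the en dash takes 3 bytes in UTF-8

# Working on the raw bytes yields one space per byte of the dash.
per_byte = re.sub(rb'[^\x00-\x7F]', b' ', raw)
print(per_byte)
print(repr(replace_non_ascii_bytes(raw)))
```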

jfs Over a year ago

"– character is replaced with 3 spaces" in the question implies that the input is a bytestring (not Unicode) and therefore Python 2 is used (otherwise ''.join would fail). If OP wants a single space per Unicode codepoint then the input should be decoded into Unicode first.

do-me · Accepted Answer · 2023-05-06 23:07:34Z

72

To get the closest ASCII representation of your original string, I recommend the unidecode module:

Python 2

from unidecode import unidecode
def remove_non_ascii(text):
    return unidecode(unicode(text, encoding = "utf-8"))

Then you can use it on a string:

remove_non_ascii("Ceñía")
Cenia

Python 3

from unidecode import unidecode
unidecode("Ceñía")

edited May 6, 2023 at 23:07

do-me

answered Feb 18, 2016 at 20:50

Alvaro Fuentes

7 Comments

jxramos Over a year ago

Interesting suggestion, but it assumes the user wants non-ASCII characters to become whatever unidecode's rules produce. It does raise a follow-up question for the asker, though: why insist on spaces rather than some other replacement character?

dotancohen Over a year ago

Thank you, this is a good answer. It doesn't work for the purpose of this question because most of the data that I'm dealing with does not have an ASCII-like representation. Such as דותן. However, in the general sense this is great, thank you!

Alvaro Fuentes Over a year ago

Yes, I know this does not work for this question, but I landed here trying to solve that problem, so I thought I’d just share my solution to my own problem, which I think is very common for people as @dotancohen who deal with non-ascii characters all the time.

Igor Savinkin Over a year ago

@AlvaroFuentes, how should your wonderful code be rewritten for Python 3, given this error? NameError: global name 'unicode' is not defined

rjurney Over a year ago

This works for Python3 - if you use unidecode(text). I got some quotation marks from funny unicode characters during a crawl this way.

Mark Tolonen · Accepted Answer · 2013-11-19 21:26:15Z

32

For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

But note you will still have a problem if your string contains decomposed Unicode characters (separate character and combining accent marks, for example):

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'

edited Nov 19, 2013 at 21:26

answered Nov 19, 2013 at 18:29

Mark Tolonen

3 Comments

dotancohen Over a year ago

Thank you, this is an important observation. If you do find a logical way to handle the case of combining-marks, I would happily add a bounty to the question. I suppose that simply removing the combining mark yet leaving the uncombined character alone would be best.

Mark Tolonen Over a year ago

A partial solution is to use ud.normalize('NFC',s) to combine marks, but not all combining combinations are represented by single codepoints. You'd need a smarter solution looking at the ud.category() of the character.

jfs Over a year ago

@dotancohen: there is a notion of "user-perceived character" in Unicode that may span several Unicode codepoints. \X (eXtended grapheme cluster) regex (supported by regex module) allows to iterate over such characters (note: "graphemes are not necessarily combining character sequences, and combining character sequences are not necessarily graphemes").
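One possible way to combine the suggestions in this thread, offered as a sketch rather than a complete answer: normalize to NFC, drop any combining marks that are still left over (checked via unicodedata.category()), and replace the remaining non-ASCII characters with a space. The function name is hypothetical, and this does not handle full grapheme clusters as \X would:

```python
import unicodedata

def replace_non_ascii_normalized(text, replacement=' '):
    # Compose base characters and marks where a precomposed form exists.
    composed = unicodedata.normalize('NFC', text)
    out = []
    for ch in composed:
        if ord(ch) < 128:
            out.append(ch)
        elif unicodedata.category(ch).startswith('M'):
            # A leftover combining mark: drop it and keep the base character.
            continue
        else:
            out.append(replacement)
    return ''.join(out)

nfc = 'ma\u00f1ana'                        # single code point for the n with tilde
nfd = unicodedata.normalize('NFD', nfc)    # n followed by a combining tilde
print(repr(replace_non_ascii_normalized(nfc)))
print(repr(replace_non_ascii_normalized(nfd)))
```

With this approach both spellings of the word give the same output, which the plain regular expression does not.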

AXO · Accepted Answer · 2017-01-03 11:12:33Z

If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode():

"""Test the performance of different non-ASCII replacement methods."""


import re
from timeit import timeit


# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000


print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
    """,
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
    """,
    number=1000,
    globals=globals(),
))

Results:

0.7208260721400134
0.009975979187503592

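Replace the ? with another character or a space afterwards if needed, and it would still be faster. Bear in mind that such a replacement also changes any literal ? that was already in the original text.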
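If a space is needed directly, one option is a custom codec error handler, which keeps the work inside encode() and leaves literal question marks alone. This is only a sketch: the handler name 'space_replace' and both function names are made up for illustration, and I have not timed it against the numbers above.

```python
import codecs

def _space_handler(err):
    # Only handle encoding failures; re-raise anything else.
    if not isinstance(err, UnicodeEncodeError):
        raise err
    # One space for each unencodable character in the reported range,
    # then resume encoding right after that range.
    return (' ' * (err.end - err.start), err.end)

codecs.register_error('space_replace', _space_handler)

def replace_with_spaces(text):
    return text.encode('ascii', 'space_replace').decode('ascii')

print(repr(replace_with_spaces('ABC\u9a6c\u514bdef?')))
```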

parsecer · Accepted Answer · 2016-08-20 22:35:18Z

9

What about this one?

def replace_trash(unicode_string):
    chars = list(unicode_string)
    for i, char in enumerate(chars):
        try:
            char.encode("ascii")
        except UnicodeEncodeError:
            # means it's non-ASCII
            chars[i] = " "  # replacing it with a single space
    return "".join(chars)

answered Aug 20, 2016 at 22:35

parsecer

5 Comments

dotancohen Over a year ago

Though this is rather inelegant, it is very readable. Thank you.

qneill Over a year ago

+1 for unicode handling... @dotancohen IMNSHO "readable" implies "practical" which adds to "elegant", so i'd say "a bit inelegant"

axolotl Over a year ago

notional -1 for calling non-ascii characters "trash"

parsecer Over a year ago

@axolotl I meant no offense. If I recall correctly when I was writing it I was indeed dealing with very weird characters that are not from any alphabet.

axolotl Over a year ago

I know :) it's a light hearted comment

Kasravnd · Accepted Answer · 2018-01-23 14:39:32Z

9

As a native and efficient approach, you don't need to use ord or any loop over the characters. Just encode with ascii and ignore the errors.

The following will just remove the non-ascii characters:

new_string = old_string.encode('ascii',errors='ignore')

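Now if you want to pad the result with one space per removed character (note that the spaces are appended at the end, not inserted where the characters were), just do the following: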

final_string = new_string + b' ' * (len(old_string) - len(new_string))

answered Jan 23, 2018 at 14:39

Kasravnd

2 Comments

Kyle Gibson Over a year ago

In python3, this encode will return a bytestring, so keep that in mind. Also, this method won't strip out characters such as newline.

Hamid Fadishei Over a year ago

new_string = old_string.encode('ascii', errors='ignore').decode()
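A short Python 3 sketch of what the encode/ignore approach produces, assuming the .decode() fix from the comment above. The variable names are illustrative; the padding step restores the original length but puts the spaces at the end of the string:

```python
old_string = 'Ce\u00f1\u00eda ma\u00f1ana'  # 'Ceñía mañana'

# Drop every character that cannot be encoded as ASCII.
stripped = old_string.encode('ascii', errors='ignore').decode('ascii')

# Pad with one space per dropped character (appended, not in place).
padded = stripped + ' ' * (len(old_string) - len(stripped))

print(repr(stripped))
print(repr(padded))
```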

Yunnosch · Accepted Answer · 2020-12-23 08:54:02Z

When we use ascii(), it escapes non-ASCII characters and leaves ordinary ASCII characters unchanged. So I iterate through the string and check whether each character was changed. If it was, I replace it with the replacer you supply.
For example: ' ' (a single space) or '?' (a question mark).

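def remove(x, replacer):
    for i in x:
        # ascii() also renders single quotes, backslashes and control
        # characters differently, so those ASCII characters are replaced too.
        if f"'{i}'" != ascii(i):
            x = x.replace(i, replacer)
    return x

remove('hái', ' ')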

Result: "h i" (with single space between).

Syntax: remove(str, non_ascii_replacer)
str = the string you want to work with.
non_ascii_replacer = the string that replaces every non-ASCII character.

Nice edit, adding an explanation. :-) And now that I get the idea of your code I like the approach. (And as promised I did my best with formatting it for you; I hope you like it.)

sklimkovitch · Accepted Answer · 2022-09-23 14:35:11Z

2

def filterSpecialChars(strInput):
    result = []
    for character in strInput:
        ordVal = ord(character)
        if ordVal < 0 or ordVal > 127:
            result.append(' ')
        else:
            result.append(character)
    return ''.join(result)

And call it like this:

result = filterSpecialChars('Ceñía mañana')
print(result)

answered Sep 23, 2022 at 14:35

sklimkovitch

1 Comment

dotancohen Over a year ago

Why are you checking if ord() returns a negative number? Unicode code points are all non-negative integers, but I'll be happy to learn something new. I do agree that it is a good defensive measure, but before that I'd try to catch e.g. a TypeError exception.

smoquet · Accepted Answer · 2021-06-10 10:21:24Z

0

My problem was that my string contained things like BelgiÃ for België and &#x20AC for the € sign. I didn't want to replace them with spaces, but with the right symbol itself.

My solution was string.encode('Latin1').decode('utf-8')
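A sketch of that round trip, assuming the text is UTF-8 that was mistakenly decoded as Latin-1. The function name fix_mojibake is hypothetical, and the call raises a UnicodeEncodeError or UnicodeDecodeError when the input is not that kind of garbling. The html.unescape() line is my own addition for the entity case and uses a well-formed entity with a trailing semicolon:

```python
import html

def fix_mojibake(text):
    # Turn the mis-decoded characters back into the original bytes,
    # then decode those bytes as UTF-8.
    return text.encode('latin-1').decode('utf-8')

garbled = 'Belgi\u00c3\u00ab'  # what UTF-8 'België' looks like read as Latin-1
print(fix_mojibake(garbled))

# HTML character references are a separate problem with a separate fix.
print(html.unescape('&#x20AC;'))
```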

answered Jun 10, 2021 at 10:21

smoquet

jubilatious1 · Accepted Answer · 2022-06-19 02:41:00Z

Pre-processing using Raku (formerly known as Perl_6)

~$ raku -pe 's:g/ <:!ASCII>+ / /;' file

Sample Input:

Peace be upon you
السلام عليكم
שלום עליכם
Paz sobre vosotros

Sample Output:

Peace be upon you


Paz sobre vosotros

Note, you can get extensive information on the matches using the following code:

~$ raku -ne 'say s:g/ <:!ASCII>+ / /.raku;' file
$( )
$(Match.new(:orig("السلام عليكم"), :from(0), :pos(6)), Match.new(:orig("السلام عليكم"), :from(7), :pos(12)))
$(Match.new(:orig("שלום עליכם"), :from(0), :pos(4)), Match.new(:orig("שלום עליכם"), :from(5), :pos(10)))
$( )
$( )

Or more simply, you can just visualize the replacement blank spaces:

~$ raku -ne 'say S:g/ <:!ASCII>+ / /.raku;' file
"Peace be upon you"
"   "
"   "
"Paz sobre vosotros"
""

https://docs.raku.org/language/regexes#Unicode_properties
https://www.codesections.com/blog/raku-unicode/
https://raku.org

Thank you jubilatious. I've upvoted because this is very useful knowledge for me in general, even though it is OT for this Python question. You've been very helpful with Raku / Perl questions and I appreciate that very much!

seaders · Accepted Answer · 2019-04-08 15:03:03Z

Potentially for a different question, but I'm providing my version of @Alvaro's answer (using unidecode). I want to do a "regular" strip on my strings, i.e. strip whitespace characters from the beginning and end of my string, and then replace only other whitespace characters with a "regular" space, i.e.

"Ceñíaㅤmañanaㅤㅤㅤㅤ"

to

"Ceñía mañana"

,

from unidecode import unidecode

def safely_stripped(s: str):
    return ' '.join(
        stripped for stripped in
        (bit.strip() for bit in
         ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
        if stripped)

We first replace every character that unidecode maps to an empty string with a regular space (and join the result back together),

''.join((c if unidecode(c) else ' ') for c in s)

And then we split that again, with python's normal split, and strip each "bit",

(bit.strip() for bit in s.split())

And lastly join those back again, but only if the string passes an if test,

' '.join(stripped for stripped in s if stripped)

And with that, safely_stripped('ㅤㅤㅤㅤCeñíaㅤmañanaㅤㅤㅤㅤ') correctly returns 'Ceñía mañana'.

flurry_pa · Accepted Answer · 2021-12-09 21:06:47Z

-1

To replace all non-ASCII characters (anything outside \x00-\x7F) with a space:

''.join(map(lambda x: x if ord(x) in range(0, 128) else ' ', text))

To replace everything except visible ASCII characters (printable and not whitespace), try this:

import string

''.join(map(lambda x: x if x in string.printable and x not in string.whitespace else ' ', text))

This gives the same result for everything except the DEL character (\x7f), which it keeps:

''.join(map(lambda x: x if ord(x) in range(32, 128) else ' ', text))
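A quick check, offered as a sketch, of where the last two expressions agree over the ASCII range. The wrapper names keep_visible_a and keep_visible_b are made up for illustration:

```python
import string

def keep_visible_a(text):
    # Keep printable, non-whitespace ASCII; replace everything else.
    return ''.join(map(lambda x: x if x in string.printable and x not in string.whitespace else ' ', text))

def keep_visible_b(text):
    # Keep code points 32..127; replace everything else.
    return ''.join(map(lambda x: x if ord(x) in range(32, 128) else ' ', text))

# Code points below 128 on which the two versions disagree.
differing = [cp for cp in range(128)
             if keep_visible_a(chr(cp)) != keep_visible_b(chr(cp))]
print(differing)
```

The only disagreement is code point 127 (DEL), which is not in string.printable. Using range(33, 127) in the second version would keep exactly the visible ASCII characters.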

edited Dec 9, 2021 at 21:06

answered Dec 6, 2021 at 21:01

flurry_pa
