Unicode Substitutions using Regex, Python

Question

I have a string as follows:

str1 = "heylisten\uff08there is something\uff09to say \uffa9"

I need to surround each Unicode value matched by my regex with a space on either side.

Desired output string:

out = "heylisten \uff08 there is something \uff09 to say  \uffa9 "

I used re.findall to get all the matches and then replaced each one in a loop. It looks like:

import re

p1 = re.findall(r'\uff[0-9a-e][0-9]', str1, flags=re.U)
out = str1
for item in p1:
    print item
    print out
    out = re.sub(item, r" " + item + r" ", out)

And this outputs:

'heylisten\\ uff08 there is something\\ uff09 to say \\ uffa9 '

Why does this print an extra "\" and also separate it from uff? I also tried re.search, but it seems to only separate \uff08. Is there a better way?

I didn't get you. I want spaces on either side of each match, but the \ seems to get separated.
– Hypothetical Ninja, Commented Nov 5, 2014 at 8:59

Community · Accepted Answer · 2020-06-20 09:12:55Z

I have a string as follows:
str1 = "heylisten\uff08there is something\uff09to say \uffa9"
I need to replace the unicode values ...

You don't have any unicode values. You have a bytestring.

str1 = u"heylisten\uff08there is something\uff09to say \uffa9"
 ...
p1 = re.sub(ur'([\uff00-\uffe9])', r' \1 ', str1)
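As a rough sketch of the same idea in Python 3 (an assumption, since the code above is Python 2): there every str literal is already Unicode, so no u or ur prefix is needed, and the \uXXXX escapes in the literal become real characters. The helper name add_spaces is made up for illustration.

```python
import re

# Assumes Python 3, where str literals are Unicode and \uXXXX is decoded.
str1 = "heylisten\uff08there is something\uff09to say \uffa9"


def add_spaces(text):
    # Hypothetical helper: wrap every character in U+FF00..U+FFE9 with spaces.
    return re.sub(r'([\uff00-\uffe9])', r' \1 ', text)


out = add_spaces(str1)
print(out)
```

Because the character class matches the actual code points, text without any of those characters passes through unchanged.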

edited Jun 20, 2020 at 9:12

answered Nov 5, 2014 at 9:03

Ignacio Vazquez-Abrams

2 Comments

Hypothetical Ninja Over a year ago

It isn't working. It outputs 'heylisten\\uff08there is something\\uff09to say \\uffa9'

Hypothetical Ninja Over a year ago

Yeah, I read it, and I guess I framed the example wrong. It works with the u prefix outside the string.

Community · Accepted Answer · 2020-06-20 09:12:55Z

print re.sub(r"(\\uff[0-9a-e][0-9])", r" \1 ", x)

You can use this re.sub directly. See the demo:

http://regex101.com/r/sU3fA2/67

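import re

# Match the literal text "\uffXX": a backslash followed by "uff" and two more characters.
p = re.compile(r'(\\uff[0-9a-e][0-9])')
# Plain (byte) string: in Python 2 the \uXXXX sequences stay as literal text here.
test_str = "heylisten\uff08there is something\uff09to say \uffa9"
# Raw string, so \1 is a backreference to group 1 rather than the character \x01.
subst = r" \1 "

result = p.sub(subst, test_str)
print result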

Output:

heylisten \uff08 there is something \uff09 to say  \uffa9
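To illustrate the difference between matching escaped text and matching real characters, here is a small sketch, assuming Python 3 (the answers in this thread are Python 2). The variable names are made up. literal_text holds the backslash and the letters as ordinary text, which is what the escaped pattern \\uff... matches; real_chars holds the actual code points, which that pattern never sees. It also shows that a doubled backslash in interpreter output is only how repr displays a single backslash.

```python
import re

# Raw string: each "\uff08" is six ordinary characters, backslash included.
literal_text = r"heylisten\uff08there is something\uff09to say \uffa9"
# Normal string: each \uXXXX escape is decoded to one Unicode character.
real_chars = "heylisten\uff08there is something\uff09to say \uffa9"

# In the regex, \\ matches one literal backslash.
escaped_pattern = re.compile(r"(\\uff[0-9a-e][0-9])")

spaced_literal = escaped_pattern.sub(r" \1 ", literal_text)
spaced_real = escaped_pattern.sub(r" \1 ", real_chars)

print(spaced_literal)              # spaces added around each literal \uffXX
print(spaced_real == real_chars)   # no literal backslashes, so nothing changes
print(repr(literal_text))          # repr shows every backslash doubled
print(literal_text.count("\\"))    # the string itself holds single backslashes
```

So which pattern to use depends on what the string really contains: escaped text calls for the \\uff... form, while decoded characters call for a character class such as [\uff00-\uffe9].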

edited Jun 20, 2020 at 9:12

answered Nov 5, 2014 at 8:58

vks

2 Comments

Hypothetical Ninja Over a year ago

Your import re code outputs this: u'heylisten\uff08there is something\uff09to say \uffa9'

vks Over a year ago

@Swordy Directly use print re.sub(r"(\\uff[0-9a-e][0-9])", r" \1 ", x), where x is your string.
