Following the question and answer in "UTF-8 coding in Python", I can use the binascii module to decode a string that holds UTF-8 bytes written as hex pairs, each prefixed with '_'.
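import binascii

def toUtf(r):
    try:
        # Drop the '_' separators, leaving only hex digits
        rhexonly = r.replace('_', '')
        rbytes = binascii.unhexlify(rhexonly)
        rtext = rbytes.decode('utf-8')
    except TypeError:
        # Python 2: unhexlify raises TypeError on odd-length or non-hex input
        rtext = r
    return rtext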
This code works fine when the string contains only hex-encoded UTF-8 bytes:
r = '_ed_8e_b8'
print toUtf(r)
>> 편
However, it does not work when the string also contains plain ASCII characters, which can appear anywhere in the string:
r = '_2f119_ed_8e_b8'
print toUtf(r)
>> doesn't work - _2f119_ed_8e_b8
>> this should be '/119편'
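A small sketch of why this input fails may help here. It is written for Python 3, where the exception is binascii.Error rather than the TypeError that Python 2 raises; the variable names hex_only and failed are just for illustration. Once the underscores are removed, the literal "119" is mixed in with the hex digits and the result has an odd length, so it cannot be read as hex pairs.

```python
import binascii

r = '_2f119_ed_8e_b8'

# Removing '_' merges the literal "119" with the real hex digits.
hex_only = r.replace('_', '')
print(hex_only)       # 2f119ed8eb8
print(len(hex_only))  # 11, an odd number of characters

try:
    binascii.unhexlify(hex_only)
    failed = False
except binascii.Error as exc:  # Python 2 raises TypeError here instead
    failed = True
    print('unhexlify failed:', exc)
```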
Maybe I could use a regular expression to extract the UTF-8 parts and the ASCII parts and reassemble them after the conversion, but I wonder if there is an easier way to do this. Is there a good solution?
flags=re.I, not re.I after r, or the regex is run case-sensitively, and won't do more than two replacements (because oops, turns out re.sub takes an optional count argument before the flags argument). Also, the outermost parens in the pattern are only needed for the re.split approach; for re.sub, they can (and should, for minor performance gains) be omitted.
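The re.sub approach described in that comment could look roughly like the sketch below. It is written for Python 3 and is not necessarily the exact code from the answer: the names to_utf, decode_run and _ESCAPED_RUN are made up for illustration, and it assumes that every '_' followed by two hex digits is an escaped byte while everything else is literal text. The flag is passed by keyword; since re.I has the integer value 2, passing it positionally to re.sub would be taken as count=2.

```python
import re

# One or more consecutive "_hh" escapes, where hh is two hex digits.
# No outer capturing group is needed for re.sub, so a non-capturing group is used.
_ESCAPED_RUN = re.compile(r'(?:_[0-9a-f]{2})+', flags=re.I)


def to_utf(r):
    """Decode runs of _hh escapes as UTF-8; leave all other text unchanged."""
    def decode_run(match):
        hex_digits = match.group(0).replace('_', '')
        try:
            return bytes.fromhex(hex_digits).decode('utf-8')
        except UnicodeDecodeError:
            # The run is not valid UTF-8: keep the original text for it.
            return match.group(0)

    return _ESCAPED_RUN.sub(decode_run, r)


print(to_utf('_ed_8e_b8'))        # 편
print(to_utf('_2f119_ed_8e_b8'))  # /119편
```

Grouping consecutive escapes into one match matters because a multi-byte character such as 편 spans three escapes, and those bytes have to be decoded together. One limitation of this assumption: literal text that happens to look like '_' plus two hex digits would also be decoded.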