UTF-8 decoding with ascii code in it with Python

Question

Following the question and answer in "UTF-8 coding in Python", I can use the binascii module to decode a hex-encoded UTF-8 string in which each byte is prefixed with '_'.

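import binascii

def toUtf(r):
    try:
        # Drop the '_' separators, then turn the hex digits into bytes
        rhexonly = r.replace('_', '')
        rbytes = binascii.unhexlify(rhexonly)
        rtext = rbytes.decode('utf-8')
    except TypeError:
        # Not valid hex (Python 2 raises TypeError): return the input unchanged
        rtext = r
    return rtext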

This code works fine when the string contains only hex-encoded UTF-8 bytes:

r = '_ed_8e_b8'
print toUtf(r)
>> 편

However, it does not work when the string also contains plain ASCII characters, which can appear anywhere in the string.

r = '_2f119_ed_8e_b8'
print toUtf(r)
>> doesn't work - _2f119_ed_8e_b8
>> this should be '/119편'
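
A minimal sketch of why the mixed string falls through to the except branch, assuming Python 3 for the runnable form (the thread itself uses Python 2): once the underscores are removed, the plain `119` leaves an odd number of hex digits, which `binascii.unhexlify` rejects. The variable names `hex_only` and `error` are illustrative only.

```python
import binascii

r = '_2f119_ed_8e_b8'
hex_only = r.replace('_', '')   # '2f119ed8eb8', 11 hex digits

try:
    binascii.unhexlify(hex_only)
    error = None
except (TypeError, ValueError) as exc:
    # Python 2 raises TypeError here; Python 3 raises binascii.Error,
    # which is a subclass of ValueError.
    error = exc

print(len(hex_only), error is not None)
```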

Maybe I can use a regular expression to extract the UTF-8 and ASCII parts and reassemble them after the conversion, but I wonder if there is an easier way to do it. Any good solution?

You should probably ask @ShadowRanger in a comment on his answer.
– zondo, Commented Feb 3, 2016 at 2:26
As a rule, you shouldn't edit answers into questions. Also, note the edit I made to @chthonicdaemon's answer: you need to pass flags=re.I, not re.I as the next positional argument after r, or the regex runs case-sensitively and won't do more than two replacements (because, oops, it turns out re.sub takes an optional count argument before the flags argument). Also, the outermost parens in the pattern are only needed for the re.split approach; for re.sub, they can (and should, for minor performance gains) be omitted.
– ShadowRanger, Commented Feb 3, 2016 at 2:56
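
The count/flags mix-up described in the comment above can be sketched as follows. The sample string and the single-byte pattern are made up for illustration. Because `re.I` has the integer value 2, passing it in the fourth position of `re.sub` behaves like `count=2` with no flags set; the sketch spells that out with a keyword so it also runs on newer Python versions that warn about positional `count`.

```python
import re

pattern = r'_[0-9a-f]{2}'
s = '_AB_cd_ef_01'          # hypothetical input with four encoded bytes

# What re.sub(pattern, '#', s, re.I) actually means: count=2, no flags.
wrong = re.sub(pattern, '#', s, count=int(re.I))

# What was intended: case-insensitive matching, unlimited replacements.
right = re.sub(pattern, '#', s, flags=re.I)

print(wrong)   # _AB##_01
print(right)   # ####
```

In the first call the uppercase `_AB` is skipped (case-sensitive) and `_01` is left alone because the two allowed replacements are already used up.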

chthonicdaemon · Accepted Answer · 2016-02-03 03:46:35Z

2

Quite straightforward with re.sub:

import re

bytegroup = r'(_[0-9a-z]{2})+'

def replacer(match):
    return toUtf(match.group())

rtext = re.sub(bytegroup, replacer, r, flags=re.I)

edited Feb 3, 2016 at 3:46

answered Feb 3, 2016 at 2:39

chthonicdaemon


4 Comments

ShadowRanger Over a year ago

I should have remembered re.sub with a function as the shorthand for re.split + post-processing + ''.join. It's a little more magical, but it's the better solution. Unless you object, I'll copy it in as the short form in my answer to make it complete (the spread-out version illustrates the pieces, re.sub is the all-in-one). I'm upvoting you regardless.

chthonicdaemon Over a year ago

No problem with copying it.

prosseek Over a year ago

I checked that r'(_[0-9a-z]{2})+' also works fine, do you have any reason to use r'(?:_[0-9a-z]{2})+'?

chthonicdaemon Over a year ago

@prosseek The non-capturing group was necessary for one of my early iterations on the solution, but not anymore. I've edited it out.
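
A small sketch of where the group style matters, assuming the same sample string as the question. With `re.sub` and `m.group()`, the callback sees the whole match either way; with `re.split`, every capturing group adds entries to the result, so an extra capturing group changes the list layout. The bracket-wrapping lambda is only there to make the matches visible.

```python
import re

r = '_2f119_ed_8e_b8'
mark = lambda m: '[' + m.group() + ']'

# re.sub: capturing vs non-capturing gives the same result.
sub_cap = re.sub(r'(_[0-9a-f]{2})+', mark, r, flags=re.I)
sub_noncap = re.sub(r'(?:_[0-9a-f]{2})+', mark, r, flags=re.I)

# re.split: the inner capturing group leaks into the output list.
split_noncap = re.split(r'((?:_[0-9a-f]{2})+)', r, flags=re.I)
split_cap = re.split(r'((_[0-9a-f]{2})+)', r, flags=re.I)

print(sub_cap == sub_noncap)   # True
print(split_noncap)            # ['', '_2f', '119', '_ed_8e_b8', '']
print(split_cap)               # ['', '_2f', '_2f', '119', '_ed_8e_b8', '_b8', '']
```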

Community · Accepted Answer · 2017-05-23 12:31:01Z

That is some truly terrible input you've got. It's still fixable though. First, split the string into the plain ASCII parts and the hex-encoded parts, then decode only the encoded parts:

import re

r = '_2f119_ed_8e_b8'

# Split so you have even entries in the list as ASCII, odd as hex encodings
rsplit = re.split(r'((?:_[0-9a-fA-F]{2})+)', r)   # ['', '_2f', '119', '_ed_8e_b8', '']

# Process the hex encoded UTF-8 with your existing function, leaving
# ASCII untouched
rsplit[1::2] = map(toUtf, rsplit[1::2])  # ['', '/', '119', '편', '']

rtext = ''.join(rsplit)  # '/119편'

The above is a verbose version that shows the individual steps, but as chthonicdaemon's answer points out, it can be shortened dramatically. You use the same regular expression with re.sub instead of re.split, and pass a function to perform the replacement instead of a replacement pattern string:

# One-liner equivalent to the above with no intermediate lists
rtext = re.sub(r'(?:_[0-9a-f]{2})+', lambda m: toUtf(m.group()), r, flags=re.I)

You can package that up as a function itself, so you have one function that deals with purely hex encoded UTF-8, and a second general function that uses the first function as part of processing mixed non-encoded ASCII and hex encoded UTF-8 data.
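
One way that packaging might look, as a sketch rather than the answer's actual code. The names `decode_hex_run`, `decode_mixed` and `_HEX_RUN` are hypothetical, and the sketch assumes Python 3, so it catches `ValueError` in addition to the `TypeError` used in the question. Note that this also swallows invalid UTF-8 (`UnicodeDecodeError` is a `ValueError`), which the original `toUtf` does not.

```python
import binascii
import re

# Hypothetical names; the thread does not name these functions.
_HEX_RUN = re.compile(r'(?:_[0-9a-f]{2})+', re.I)

def decode_hex_run(run):
    """Decode a pure run such as '_ed_8e_b8'; return it unchanged on failure."""
    try:
        return binascii.unhexlify(run.replace('_', '')).decode('utf-8')
    except (TypeError, ValueError):
        # TypeError: Python 2 unhexlify errors. ValueError: Python 3
        # binascii.Error, and UnicodeDecodeError for invalid UTF-8.
        return run

def decode_mixed(text):
    """Decode every hex run in text and leave everything else untouched."""
    return _HEX_RUN.sub(lambda m: decode_hex_run(m.group()), text)

result = decode_mixed('_2f119_ed_8e_b8')  # '/119편'
```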

Mind you, this won't necessarily work all that well if the non-encoded ASCII might contain _ normally; the regex tries to be as targeted as possible, but you've got a problem here where no matter how finely you target your heuristics, some ASCII data will be mistaken for encoded UTF-8 data.
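
To make that ambiguity concrete, here is a short sketch with made-up plain-ASCII inputs (Python 3 assumed). Any underscore followed by two hex digits is picked up by the pattern, whether or not it was meant as an encoded byte.

```python
import binascii
import re

pattern = re.compile(r'(?:_[0-9a-f]{2})+', re.I)

# Hypothetical plain-ASCII strings, none of them meant to be encoded.
samples = ['report_41', 'my_data', 'file_name']
hits = dict((s, pattern.findall(s)) for s in samples)
# {'report_41': ['_41'], 'my_data': ['_da'], 'file_name': []}

# '_41' decodes cleanly, so 'report_41' would silently become 'reportA'.
decoded = binascii.unhexlify('41').decode('utf-8')

# '_da' is a lone UTF-8 lead byte, so decoding it raises instead.
try:
    binascii.unhexlify('da').decode('utf-8')
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False
```

The second case is worth noting: the question's `toUtf` only catches `TypeError`, so a false match like `_da` would raise `UnicodeDecodeError` instead of leaving the text alone.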

@Kevin: It's not even vs odd characters, it's even vs. odd split results. The re.split return value does give you even->ASCII, odd->encoded automatically. I'll add the example intermediate values.
