0

Based on the question and answer in "UTF-8 coding in Python", I can use the binascii package to decode a hex-encoded UTF-8 string with '_' in it.

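import binascii

def toUtf(r):
    try:
        # Drop the '_' separators, then turn the hex digits into bytes
        rhexonly = r.replace('_', '')
        rbytes = binascii.unhexlify(rhexonly)
        rtext = rbytes.decode('utf-8')
    except TypeError:
        # Not valid hex (e.g. an odd number of digits): return the input unchanged
        rtext = r
    return rtext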

This code works fine when the string contains only hex-encoded UTF-8 bytes:

r = '_ed_8e_b8'
print toUtf(r)
>> 편 

However, this code does not work when the string also contains plain ASCII characters, which can appear anywhere in the string.

r = '_2f119_ed_8e_b8'
print toUtf(r)
>> doesn't work - _2f119_ed_8e_b8
>> this should be '/119편'

Maybe I can use a regular expression to extract the UTF-8 parts and the ASCII parts and reassemble them after the conversion, but I wonder if there is an easier way to do the conversion. Any good solution?

2
  • You should probably ask @ShadowRanger in a comment on his answer. Commented Feb 3, 2016 at 2:26
  • As a rule, you shouldn't edit answers into questions. Also, note the edit I made to @chthonicdaemon's answer; you need to pass flags=re.I, not re.I after r, or the regex is run case-sensitively, and won't do more than two replacements (because oops, turns out re.sub takes an optional count argument before the flags argument). Also, the outermost parens in the pattern are only needed for the re.split approach; for re.sub, they can (and should, for minor performance gains) be omitted. Commented Feb 3, 2016 at 2:56
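
The point about count and flags in the comment above can be illustrated with a small sketch. The input string and the 'X' replacement are made up for the demonstration; the explicit count=int(re.I) call is what passing re.I as the fourth positional argument to re.sub amounts to.

```python
import re

s = '_ab_cd_ef_AB'            # hypothetical input with four hex groups
pattern = r'_[0-9a-f]{2}'

# re.sub(pattern, repl, string, count=0, flags=0): a fourth positional
# argument is taken as count. re.I has the integer value 2, so the
# positional form behaves like this call: case-sensitive, at most 2 replacements.
wrong = re.sub(pattern, 'X', s, count=int(re.I))

# Passing the flag by keyword gives the intended case-insensitive,
# unlimited replacement.
right = re.sub(pattern, 'X', s, flags=re.I)

print(wrong)   # XX_ef_AB
print(right)   # XXXX
```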

2 Answers

2

Quite straightforward with re.sub:

import re

bytegroup = r'(_[0-9a-z]{2})+'

def replacer(match):
    return toUtf(match.group())

rtext = re.sub(bytegroup, replacer, r, flags=re.I)

4 Comments

I should have remembered that re.sub with a function is the shorthand for re.split + post-processing + ''.join. It's a little more magical, but it's the better solution. Unless you object, I'll copy it in as the short form in my answer to make it complete (the spread-out version illustrates the pieces, re.sub is the all-in-one). I'm upvoting you regardless.
No problem with copying it.
I checked that r'(_[0-9a-z]{2})+' also works fine, do you have any reason to use r'(?:_[0-9a-z]{2})+'?
@prosseek The non-capturing group was necessary for one of my early iterations on the solution, but not anymore. I've edited it out.
1

That is some truly terrible input you've got. It's still fixable though. First off, split the non-"encoded" stuff from the hex-encoded stuff, and decode only the latter:

import re

r = '_2f119_ed_8e_b8'

# Split so you have even entries in the list as ASCII, odd as hex encodings
rsplit = re.split(r'((?:_[0-9a-fA-F]{2})+)', r)   # ['', '_2f', '119', '_ed_8e_b8', '']

# Process the hex encoded UTF-8 with your existing function, leaving
# ASCII untouched
rsplit[1::2] = map(toUtf, rsplit[1::2])  # ['', '/', '119', '편', '']

rtext = ''.join(rsplit)  # '/119편'

The above is a verbose version that shows the individual steps, but as chthonicdaemon's answer points out, it can be shortened dramatically. You use the same regular expression with re.sub instead of re.split, and pass a function to perform the replacement instead of a replacement pattern string:

# One-liner equivalent to the above with no intermediate lists
rtext = re.sub(r'(?:_[0-9a-f]{2})+', lambda m: toUtf(m.group()), r, flags=re.I)

You can package that up as a function itself, so you have one function that deals with purely hex encoded UTF-8, and a second general function that uses the first function as part of processing mixed non-encoded ASCII and hex encoded UTF-8 data.
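
A minimal sketch of that two-function packaging might look like the following. The names decode_hex_run and decode_mixed are made up for illustration, and the code assumes Python 3, where binascii.unhexlify reports bad input with binascii.Error (a ValueError subclass) rather than the TypeError caught in the question's Python 2 code, so both are caught here.

```python
import binascii
import re

# Runs of '_xx' groups, where xx is two hex digits (case-insensitive)
_HEX_RUN = re.compile(r'(?:_[0-9a-f]{2})+', flags=re.I)

def decode_hex_run(run):
    """Decode a purely hex-encoded run such as '_ed_8e_b8'.

    Returns the input unchanged if it is not valid hex or not valid UTF-8.
    """
    try:
        return binascii.unhexlify(run.replace('_', '')).decode('utf-8')
    except (TypeError, ValueError):
        return run

def decode_mixed(text):
    """Decode every hex-encoded run in text, leaving plain ASCII untouched."""
    return _HEX_RUN.sub(lambda m: decode_hex_run(m.group()), text)

print(decode_mixed('_2f119_ed_8e_b8'))  # /119편
```

Keeping the pure-hex decoder separate means it can still be used on its own for inputs known to be fully encoded.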

Mind you, this won't necessarily work all that well if the non-encoded ASCII might contain _ normally; the regex tries to be as targeted as possible, but you've got a problem here where no matter how finely you target your heuristics, some ASCII data will be mistaken for encoded UTF-8 data.
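
To make that caveat concrete, here is a small sketch with made-up inputs, assuming Python 3 and a hypothetical helper decode_mixed that follows the re.sub approach above. A literal '_41' in otherwise plain ASCII is indistinguishable from an encoded 'A', while '_ca' only survives because the lone byte 0xCA happens not to be valid UTF-8 and the helper falls back to the original text.

```python
import binascii
import re

def decode_mixed(text):
    """Hypothetical helper: decode '_xx' hex runs, leave everything else alone."""
    def repl(match):
        try:
            hex_digits = match.group().replace('_', '')
            return binascii.unhexlify(hex_digits).decode('utf-8')
        except (TypeError, ValueError):
            # Invalid hex or invalid UTF-8: keep the original text
            return match.group()
    return re.sub(r'(?:_[0-9a-f]{2})+', repl, text, flags=re.I)

# '_41' looks like an encoded byte 0x41, so it is silently turned into 'A'
print(decode_mixed('user_41'))      # userA

# '_ca' also matches, but byte 0xCA alone is not valid UTF-8, so it is kept
print(decode_mixed('snake_case'))   # snake_case
```

Note that the question's toUtf catches only TypeError, so an invalid UTF-8 sequence like the second case would raise UnicodeDecodeError there instead of being passed through.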

1 Comment

@Kevin: It's not even vs odd characters, it's even vs. odd split results. The re.split return value does give you even->ASCII, odd->encoded automatically. I'll add the example intermediate values.
