# -*- coding: utf-8 -*-

import urllib.request as request
import re

url = "http://jjo.kr/users/38281748"

raw_data = request.urlopen(url).read()  # bytes
decoded = raw_data.decode("utf-8")      # str
print(decoded)

I was trying to fetch the HTML of that URL, but I got an error message:

UnicodeEncodeError: 'cp949' codec can't encode character '\ufeff' in position 2313: illegal multibyte sequence

Am I misunderstanding the function decode()?

According to the Python 3.5.2 Standard Library documentation, decode() should "Return a string decoded from the given bytes."

But the error mentions cp949, as if I got a cp949 string instead of a UTF-8 one.

Can anyone tell me what's wrong with my code?

    From which line does the exception come? I assume it's from the print, that tries to convert to cp949 to work with your terminal? Commented Nov 3, 2016 at 8:52

2 Answers


Decoding the bytes gave you a Unicode string (str), so decode() worked as documented.

The error occurs when you try to print it: Python encodes the string with your stdout encoding (sys.stdout.encoding), which is cp949 here.

The string contains \ufeff (ZERO WIDTH NO-BREAK SPACE), which cannot be represented in cp949.

>>> import unicodedata
>>> unicodedata.name('\ufeff')
'ZERO WIDTH NO-BREAK SPACE'
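To see the failure without fetching the page, here is a minimal sketch that checks whether a string survives encoding to a given codec. The helper name `can_encode` is hypothetical (not part of the standard library), and the sketch assumes the `cp949` codec shipped with CPython:

```python
def can_encode(text, encoding):
    """Return True if every character in text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        # Same exception type that print() raised in the question.
        return False


# Hangul is representable in cp949, U+FEFF is not.
hangul_ok = can_encode("한글", "cp949")
feff_ok = can_encode("\ufeff", "cp949")
# UTF-8 can encode any Unicode character, including U+FEFF.
feff_utf8_ok = can_encode("\ufeff", "utf-8")
```

This suggests the decode() call is not the problem: the string is valid, but the console's codec cannot represent one of its characters.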

You can drop or replace such characters by encoding with the 'ignore' or 'replace' error handler:

import sys

decoded = raw_data.decode("utf-8")
decoded = decoded.encode(sys.stdout.encoding, 'ignore').decode(sys.stdout.encoding)
print(decoded)

3 Comments

This is correct, but why encode to stdout's encoding first and then decode again?
OMG, it actually works! You must be a genius. Thank you for solving my problem.
@RemcoGerlich, you need to encode using stdout's encoding (which is what print(..) uses); otherwise it will raise UnicodeEncodeError, as the OP saw. Without the decode step, Python 3 would print the unwanted b'...' form (Python 3.x's representation of a bytes object).

The decoded string contains a \uFEFF character, which is a byte order mark. I have no idea why it occurs in the middle of the page, but encoding it doesn't work.

Remove it with:

decoded = decoded.replace('\ufeff', '')

And it will probably work.
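As a rough sketch of this approach: the `utf-8-sig` codec strips only a leading BOM, so a U+FEFF in the middle of the data still has to be removed by hand. The helper `strip_bom_chars` and the sample bytes are made up for illustration:

```python
def strip_bom_chars(text):
    """Remove every U+FEFF character, wherever it appears in text."""
    return text.replace("\ufeff", "")


# Hypothetical payload: a BOM at the start and another one mid-stream.
raw = b"\xef\xbb\xbfabc\xef\xbb\xbfdef"

via_sig = raw.decode("utf-8-sig")               # strips the leading BOM only
cleaned = strip_bom_chars(raw.decode("utf-8"))  # strips both occurrences

# After cleaning, the text encodes to cp949 without errors.
cp949_bytes = cleaned.encode("cp949")
```

Note that this handles U+FEFF only; any other character missing from cp949 would still raise UnicodeEncodeError on print.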

10 Comments

Thanks a lot! But I think \ufeff is not the only character that's causing an error.
@JakSa: unfortunately I can't test that because I don't have the cp949 encoding installed. falsetru's answer is more general.
u'\ufeff' != b'\xfeff'
'\ufeff' is not BOM, but ZERO WIDTH NO-BREAK SPACE.
@falsetru: I think that changed in later versions of Unicode, en.wikipedia.org/wiki/Byte_order_mark : "If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage is deprecated in favour of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be only used as a BOM."