Python 3.x about encoding

Question

# -*- coding: utf-8 -*-
import urllib.request as request
import re

url = "http://jjo.kr/users/38281748"

raw_data = request.urlopen(url).read()  # bytes
decoded = raw_data.decode("utf-8")
print(decoded)

I was trying to fetch the HTML of that URL, but I got this error message:

UnicodeEncodeError: 'cp949' codec can't encode character '\ufeff' in position 2313: illegal multibyte sequence

Am I misunderstanding the function decode()?

According to the Python 3.5.2 Standard Library documentation, decode() will "Return a string decoded from the given bytes."

But I seem to get cp949 instead of a UTF-8 string.

Can anyone tell me what's wrong with my code?

From which line does the exception come? I assume it's from the print, which tries to convert to cp949 to work with your terminal?
– RemcoGerlich, Commented Nov 3, 2016 at 8:52

falsetru · Accepted Answer · 2016-11-03 09:15:48Z

1

You got a Unicode string (str) by decoding the bytes, so decode() worked as documented.

But when you print it, Python encodes it with cp949, because that is your stdout encoding (sys.stdout.encoding).

The string contains \ufeff (ZERO WIDTH NO-BREAK SPACE), which cannot be represented in cp949.

>>> import unicodedata
>>> unicodedata.name('\ufeff')
'ZERO WIDTH NO-BREAK SPACE'
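
To see the failure without fetching the page, here is a minimal sketch. It assumes nothing about the real page content: `text` is a hypothetical stand-in for the decoded HTML, and the encode step is what print() would do implicitly on a cp949 console.

```python
import sys

# The encoding print() uses for this console (locale dependent).
print(sys.stdout.encoding)

# Hypothetical stand-in for the decoded page: plain text with one U+FEFF.
text = "abc\ufeffdef"

failure = None
try:
    # Encode explicitly to cp949, as print() would on a cp949 console.
    text.encode("cp949")
except UnicodeEncodeError as exc:
    # Record the codec name, the failing index, and the offending character.
    failure = (exc.encoding, exc.start, exc.object[exc.start])

print(failure)
```

The exception is raised by the encode step, not by decode(), which is why the traceback mentions cp949 even though the bytes were decoded as UTF-8.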

You can drop or replace such characters by encoding with the 'ignore' or 'replace' error handler:

import sys

decoded = raw_data.decode("utf-8")
decoded = decoded.encode(sys.stdout.encoding, 'ignore').decode(sys.stdout.encoding)
print(decoded)
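
As a rough comparison of the error handlers, the sketch below round-trips a sample string through cp949 with three of the standard handlers. The helper name `printable` and the sample `text` are made up for illustration, and cp949 is hard-coded instead of sys.stdout.encoding so the result does not depend on the console.

```python
# Hypothetical sample containing one character cp949 cannot represent.
text = "abc\ufeffdef"


def printable(s, encoding, errors="replace"):
    """Round-trip s through encoding so only encodable characters remain."""
    return s.encode(encoding, errors).decode(encoding)


ignored = printable(text, "cp949", "ignore")            # drops U+FEFF
replaced = printable(text, "cp949", "replace")          # substitutes "?"
escaped = printable(text, "cp949", "backslashreplace")  # keeps a visible escape

# Without the final decode you would print the bytes repr instead of text.
shown_as_bytes = str(text.encode("cp949", "ignore"))

print(ignored, replaced, escaped, shown_as_bytes)
```

'ignore' silently loses data, so 'replace' or 'backslashreplace' may be preferable when you want to notice which characters were affected.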

answered Nov 3, 2016 at 9:15

falsetru


3 Comments

RemcoGerlich Over a year ago

This is correct, except why first encode to stdout's encoding and then decode again?

user7108823 Over a year ago

OMG, it actually works! You must be a genius. Thank you for solving my problem.

falsetru Over a year ago

@RemcoGerlich, you need to encode using stdout's encoding, which is what print(..) uses. Otherwise it will raise UnicodeEncodeError, as the OP saw. And without decoding again, print would show an unwanted b'...' (Python 3.x's representation of a bytes object).

RemcoGerlich · Accepted Answer · 2016-11-03 08:56:26Z

1

The decoded string contains a \uFEFF character, which is a byte order mark. I have no idea why it occurs in the middle of the page, but it cannot be encoded to cp949.

Remove it with:

decoded = decoded.replace('\ufeff', '')

And it will probably work.
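
A small sketch of why replace() is used here instead of the utf-8-sig codec: utf-8-sig strips only a BOM at the very start of the data, while the character in the question sits at position 2313. The bytes in `raw` are a hypothetical sample, not the real page.

```python
# Hypothetical bytes with a BOM at the start and another one mid-stream.
raw = b"\xef\xbb\xbfhead\xef\xbb\xbftail"

# utf-8-sig removes only the leading BOM; the later one survives.
with_sig = raw.decode("utf-8-sig")

# Plain utf-8 plus replace() removes every occurrence.
cleaned = raw.decode("utf-8").replace("\ufeff", "")

print(repr(with_sig))
print(repr(cleaned))

# The cleaned text now encodes to cp949 without raising.
encoded = cleaned.encode("cp949")
```

This only handles U+FEFF; any other character outside cp949 would still raise on print.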

answered Nov 3, 2016 at 8:56

RemcoGerlich


10 Comments

user7108823 Over a year ago

Thanks a lot! But I think \ufeff is not the only character that's causing an error.

RemcoGerlich Over a year ago

@JakSa: unfortunately I can't test that because I don't have the cp949 encoding installed. falsetru's answer is more general.

falsetru Over a year ago

u'\ufeff' != b'\xfeff'

falsetru Over a year ago

'\ufeff' is not BOM, but ZERO WIDTH NO-BREAK SPACE.

RemcoGerlich Over a year ago

@falsetru: I think that changed in later versions of Unicode, en.wikipedia.org/wiki/Byte_order_mark : "If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage is deprecated in favour of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be only used as a BOM."
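
For reference, the two views in these comments can be checked from the standard library. This is only a sketch of what Python reports, not a ruling on the terminology: U+FEFF keeps the formal Unicode name ZERO WIDTH NO-BREAK SPACE, and it is also the code point that the UTF-8 BOM bytes decode to.

```python
import codecs
import unicodedata

# The UTF-8 BOM bytes decode to the single code point U+FEFF.
bom_char = codecs.BOM_UTF8.decode("utf-8")

# Formal Unicode names of U+FEFF and of its suggested successor U+2060.
names = {
    "U+FEFF": unicodedata.name("\ufeff"),
    "U+2060": unicodedata.name("\u2060"),
}

print(bom_char == "\ufeff")
print(names)

# The big-endian UTF-16 BOM is the two bytes FE FF, written b'\xfe\xff'.
print(codecs.BOM_UTF16_BE)
```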
