python: unicode problem

Question

I am trying to decode a string I read from a file:

file = open ("./Downloads/lamp-post.csv", 'r')
data = file.readlines()
data[0]

'\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00\t\x00C\x00o\x00m\x00p\x00e\x00t\x00i\x00t\x00i\x00o\x00n\x00\t\x00G\x00l\x00o\x00b\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\t\x00D\x00e\x00c\x00 \x002\x000\x001\x000\x00\t\x00N\x00o\x00v\x00 \x002\x000\x001\x000\x00\t\x00O\x00c\x00t\x00 \x002\x000\x001\x000\x00\t\x00S\x00e\x00p\x00 \x002\x000\x001\x000\x00\t\x00A\x00u\x00g\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00l\x00 \x002\x000\x001\x000\x00\t\x00J\x00u\x00n\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00y\x00 \x002\x000\x001\x000\x00\t\x00A\x00p\x00r\x00 \x002\x000\x001\x000\x00\t\x00M\x00a\x00r\x00 \x002\x000\x001\x000\x00\t\x00F\x00e\x00b\x00 \x002\x000\x001\x000\x00\t\x00J\x00a\x00n\x00 \x002\x000\x001\x000\x00\t\x00A\x00d\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00S\x00e\x00a\x00r\x00c\x00h\x00 \x00s\x00h\x00a\x00r\x00e\x00\t\x00E\x00s\x00t\x00i\x00m\x00a\x00t\x00e\x00d\x00 \x00A\x00v\x00g\x00.\x00 \x00C\x00P\x00C\x00\t\x00E\x00x\x00t\x00r\x00a\x00c\x00t\x00e\x00d\x00 \x00F\x00r\x00o\x00m\x00 \x00W\x00e\x00b\x00 \x00P\x00a\x00g\x00e\x00\t\x00L\x00o\x00c\x00a\x00l\x00 \x00M\x00o\x00n\x00t\x00h\x00l\x00y\x00 \x00S\x00e\x00a\x00r\x00c\x00h\x00e\x00s\x00\n'

Adding "ignore" does not really help:

In [69]: data[2]
Out[69]: u'\u6700\u6100\u7200\u6400\u6500\u6e00\u2000\u6c00\u6100\u6d00\u7000\u2000\u7000\u6f00\u7300\u7400\u0900\u3000\u2e00\u3900\u3400\u0900\u3800\u3800\u3000\u0900\u2d00\u0900\u3300\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3900\u3000\u0900\u3400\u3800\u3000\u0900\u3500\u3900\u3000\u0900\u3500\u3900\u3000\u0900\u3700\u3200\u3000\u0900\u3700\u3200\u3000\u0900\u3300\u3900\u3000\u0900\u3300\u3200\u3000\u0900\u3200\u3600\u3000\u0900\u2d00\u0900\u2d00\u0900\ua300\u3200\u2e00\u3100\u3800\u0900\u2d00\u0900\u3400\u3800\u3000\u0a00'

In [70]: data[2].decode("utf-8", "replace")
---------------------------------------------------------------------------
Traceback (most recent call last)

/Users/oleg/ in ()

/opt/local/lib/python2.5/encodings/utf_8.py in decode(input, errors)
     14
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

: 'ascii' codec can't encode characters in position 0-87: ordinal not in range(128)

My answer works without the error, but it depends on whether you want to ignore or replace the undecodable characters.
– orlp, Commented Jan 19, 2011 at 13:21

Sven Marnach · Accepted Answer · 2011-01-19 13:30:58Z

20

This looks like UTF-16 data. So try

data[0].rstrip("\n").decode("utf-16")

Edit (for your update): Try to decode the whole file at once, that is

data = open(...).read()
data.decode("utf-16")

The problem is that a line break in little-endian UTF-16 is the two-byte sequence "\n\x00", but readlines() splits at the "\n" byte, leaving the "\x00" byte at the start of the next line.
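The splitting problem can be reproduced without the original file. The following is a minimal sketch in Python 3 (the thread itself uses Python 2). The sample text is made up for illustration, and `io.BytesIO` stands in for a file opened in binary mode.

```python
import codecs
import io

# Made-up sample standing in for the CSV contents (assumption, not the real data).
text = "Keyword\tCompetition\ngarden lamp post\t0.94\n"

# Build UTF-16-LE bytes with a leading BOM, like the file in the question.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Binary-mode readlines() splits after each b"\n" byte, which cuts the
# two-byte newline b"\n\x00" in half.
chunks = io.BytesIO(raw).readlines()

# The first chunk has an odd length, and the second starts with the orphaned
# b"\x00", so neither can be decoded cleanly on its own.
first_is_odd = len(chunks[0]) % 2 == 1
second_starts_with_nul = chunks[1].startswith(b"\x00")

# Decoding the whole byte string first, then splitting, works as expected.
decoded = raw.decode("utf-16")
lines = decoded.splitlines()
```

Decoding first and splitting afterwards keeps every two-byte code unit intact, which is why reading the whole file at once avoids the error.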

edited Jan 19, 2011 at 13:30

answered Jan 19, 2011 at 13:10

Sven Marnach



1 Comment

Oleg Tarasenko Over a year ago

Strange, it fails for next line:

tzot · Accepted Answer · 2011-02-13 14:15:08Z

11

This file is a UTF-16-LE encoded file, with an initial BOM.

import codecs

# The "utf-16" codec uses the BOM to pick the byte order and strips it from the result.
fp = codecs.open("a", "r", "utf-16")
lines = fp.readlines()

edited Feb 13, 2011 at 14:15

answered Feb 13, 2011 at 11:42

tzot


1 Comment

John Machin Over a year ago

-1 balderdash.

>>> raw = '\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00'
>>> raw.decode('utf_16le')
u'\ufeffKeyword'
>>> raw.decode('utf_16')
u'Keyword'
>>>
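The same comparison can be sketched in Python 3, where the raw data is a `bytes` literal. The variable names `with_bom` and `without_bom` are illustrative only.

```python
# First bytes of the header line from the question, as a Python 3 bytes literal.
raw = b"\xff\xfeK\x00e\x00y\x00w\x00o\x00r\x00d\x00"

# "utf-16-le" fixes the byte order up front, so the BOM is decoded as an
# ordinary character (U+FEFF) and stays in the result.
with_bom = raw.decode("utf-16-le")

# "utf-16" reads the BOM to detect the byte order and then drops it.
without_bom = raw.decode("utf-16")

print(repr(with_bom))     # '\ufeffKeyword'
print(repr(without_bom))  # 'Keyword'
```

For a file that starts with a BOM, the plain "utf-16" codec is therefore usually the more convenient choice.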

orlp · Accepted Answer · 2011-01-19 13:20:19Z

3

EDIT

Since you posted 2.7 this is the 2.7 solution:

file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "replace") for line in file]

Ignoring undecodable characters:

file = open("./Downloads/lamp-post.csv", "r")
data = [line.decode("utf-16", "ignore") for line in file]

edited Jan 19, 2011 at 13:20

answered Jan 19, 2011 at 13:08

orlp


7 Comments

Oleg Tarasenko Over a year ago

In [21]: file = open ("./Downloads/lamp-post.csv", 'r')

In [22]: data = [line.decode() for line in file]
---------------------------------------------------------------------------
<type 'exceptions.UnicodeDecodeError'>    Traceback (most recent call last)

/Users/oleg/<ipython console> in <module>()

<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

In [23]: data = [line.decode() for line in file]

orlp Over a year ago

Ohh, do you want to ignore those invalid characters or replace them? Edited my answer assuming replacement.

Thomas K Over a year ago

In Python 3, files are opened in text (unicode) mode by default, so the lines read from them are already str objects and have no decode method.

Thomas K Over a year ago

I undid the downvote. But there's still a better way in Python 3: use the encoding argument for open. open("Downloads/lamp-post.csv", encoding="utf-16").
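A minimal sketch of the `encoding` argument in Python 3 follows. The sample file is written to a temporary directory so the snippet is self-contained. The file name `sample.csv` and its contents are assumptions for illustration, not the original lamp-post.csv.

```python
import os
import tempfile

# Made-up tab-separated sample (assumption, not the real file contents).
sample = "Keyword\tCompetition\ngarden lamp post\t0.94\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")  # hypothetical file name

    # Writing with encoding="utf-16" emits a BOM followed by the encoded text.
    with open(path, "w", encoding="utf-16", newline="") as f:
        f.write(sample)

    # Reading with encoding="utf-16" decodes before splitting into lines,
    # so each line arrives as a clean str with the BOM already removed.
    with open(path, encoding="utf-16") as f:
        rows = [line.rstrip("\n").split("\t") for line in f]

print(rows)
```

Because decoding happens inside the file object, the lines are split on decoded newline characters rather than on raw bytes.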

Oleg Tarasenko Over a year ago

Strange, the data does not seem to have changed; e.g. I still see the same array of UTF-16 when I inspect data.
