
I'm reading a file containing Romanian words in Python with file.readline(), and I'm having encoding problems with many of the characters.

Example:

>>> a = "aberație"  # type 'str'
>>> a
'abera\xc8\x9bie'
>>> print sys.stdin.encoding
UTF-8

I've tried encode() with utf-8, cp500, etc., but it doesn't work.

I can't figure out which character encoding I should use.

Thanks in advance.

Edit: The aim is to store the words from the file in a dictionary and, when printing one, to get aberație rather than 'abera\xc8\x9bie'.

1 Answer

What are you trying to do?

This is a sequence of bytes:

BYTES = 'abera\xc8\x9bie'

It's the UTF-8 encoding of the string "aberație". You decode the bytes to get a unicode string:

>>> BYTES 
'abera\xc8\x9bie'
>>> print BYTES 
aberație
>>> abberation = BYTES.decode('utf-8')
>>> abberation 
u'abera\u021bie'
>>> print abberation 
aberație
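The decode/encode round trip can be checked directly. A minimal sketch (runs under Python 2 and 3; `BYTES` and `text` are illustrative names mirroring the example above):

```python
# -*- coding: utf-8 -*-
BYTES = b'abera\xc8\x9bie'            # raw UTF-8 bytes, as read from the file
text = BYTES.decode('utf-8')          # decode -> unicode string

print(text)                            # aberație
print(text.encode('utf-8') == BYTES)  # True: the round trip is lossless
```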

If you want to store the unicode string to a file, then you have to encode it to a particular byte format of your choosing:

>>> abberation.encode('utf-8')
'abera\xc8\x9bie'
>>> abberation.encode('utf-16')
'\xff\xfea\x00b\x00e\x00r\x00a\x00\x1b\x02i\x00e\x00'
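For the original use case (reading words from a file into a dictionary), here's a minimal sketch, assuming the file is UTF-8 encoded; `words.txt` is a hypothetical path, and `io.open` (available in both Python 2 and 3) decodes each line for you so no manual `.decode('utf-8')` is needed:

```python
# -*- coding: utf-8 -*-
import io

# Write a sample UTF-8 file as a stand-in for the real word list.
with io.open('words.txt', 'w', encoding='utf-8') as f:
    f.write(u'aberație\n')

words = {}
# Opening with an explicit encoding yields unicode strings line by line.
with io.open('words.txt', 'r', encoding='utf-8') as f:
    for line in f:
        word = line.strip()
        words[word] = len(word)

print(words)  # the key prints as aberație, not as escaped bytes
```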