Python encoding issue while reading a file

Question

I am trying to read a file that contains the character "ë". The problem is that I cannot figure out how to read it, no matter what I try to do with the encoding. When I look at the file manually in TextEdit, it is listed as an unknown 8-bit file. If I try changing it to UTF-8, UTF-16, or anything else, it either does not work or messes up the entire file. I tried reading the file with plain Python file calls as well as with codecs, and cannot come up with anything that reads it correctly. I will include a code sample of the read below. Does anyone have any clue what I am doing wrong? This is Python 2.7.10, by the way.

import codecs
readFile = codecs.open("FileName", encoding='utf-8')

The line I am trying to read is this, with nothing else on it:

Aeëtes

Here are some of the errors I get:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 0: invalid start byte

UnicodeError: UTF-16 stream does not start with BOM (I know this one just means that it is not a UTF-16 file.)

UnicodeDecodeError: 'ascii' codec can't decode byte 0x91 in position 0: ordinal not in range(128)

If I don't use a codec, the word comes in as Ae?tes, which then crashes later in the program. Just to be clear, neither the suggested questions nor anything else I found on the net has pointed to an answer. One other detail that might help: I am using OS X, not Windows.
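
The failures above can be reproduced without the file. This is only a sketch: it assumes the line is stored as the bytes `Ae\x91tes`, which is inferred from the 0x91 in the error messages, since the file itself was not posted. The helper name `try_decode` is made up for illustration.

```python
# Assumed file contents: the problem word with its accented letter
# stored as the single byte 0x91 (inferred from the error messages).
raw = b"Ae\x91tes"

def try_decode(data, encoding):
    """Return the decoded text, or the exception raised while decoding."""
    try:
        return data.decode(encoding)
    except UnicodeError as exc:
        return exc

# 0x91 cannot start a UTF-8 sequence and is outside the ASCII range,
# so both of these decoders reject it.
utf8_result = try_decode(raw, "utf-8")
ascii_result = try_decode(raw, "ascii")
print(type(utf8_result).__name__)
print(type(ascii_result).__name__)

# Replacing undecodable bytes avoids the exception but loses the character.
lossy = raw.decode("utf-8", "replace")
print(repr(lossy))
```

Neither decoder is wrong here; the byte simply belongs to a different single-byte encoding, which is why guessing UTF-8 or UTF-16 cannot work.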

Please provide some error messages or unexpected results. There is also the "utf-8-sig" encoding, which might help.
– C.LECLERC, Commented Sep 9, 2016 at 16:32
Where was the file written? Was it perhaps in an environment that uses some weird encoding like Windows legacy code pages? See a similar question here: stackoverflow.com/q/6344853/2988730
– Mad Physicist, Commented Sep 9, 2016 at 17:14
Have you tried constructing a list of possible encodings and looping through them until one works?
– Mad Physicist, Commented Sep 9, 2016 at 18:30
So I searched for an encoding in which 0x91 represents the character ë. As for your "One More Thing" note: imagine my surprise when this turned out to be the case in the Mac Roman encoding.
– Jongware, Commented Sep 9, 2016 at 20:15
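
The two suggestions in the comments (loop over candidate encodings, and look for one in which 0x91 is ë) can be combined in a short sketch. The candidate list and the function name `encodings_mapping` are hypothetical, chosen only for illustration; extend the list with whatever encodings seem plausible for the file's origin.

```python
# Hypothetical shortlist of encodings to test; not exhaustive.
CANDIDATES = ["utf-8", "latin-1", "cp1252", "cp437", "cp850", "mac_roman"]

def encodings_mapping(byte, char, candidates=CANDIDATES):
    """Return the candidate encodings in which `byte` decodes to `char`."""
    matches = []
    for name in candidates:
        try:
            decoded = byte.decode(name)
        except (UnicodeError, LookupError):
            # The byte is invalid in this encoding, or this Python
            # build does not provide the codec at all.
            continue
        if decoded == char:
            matches.append(name)
    return matches

# u"\xeb" is U+00EB, the accented letter from the question.
matches = encodings_mapping(b"\x91", u"\xeb")
print(matches)
```

Checking for a known character is stricter than "loop until one works": single-byte encodings such as latin-1 decode any byte without raising an error, so a decode that merely succeeds proves very little.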

Jongware · Accepted Answer · 2016-09-09 22:34:23Z

1

Credit for this answer goes to RadLexus for figuring out the proper encoding, and also to Mad Physicist, who pointed me in the right direction even though I did not consider all possible encodings.

The issue, apparently, is that the Mac saved the .txt file in the Mac Roman encoding (mac_roman in Python). If you read the file with that encoding, it works perfectly.

This is the code that I used to read it:

import codecs
readFile = codecs.open("FileName", encoding='mac_roman')
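
A slightly fuller sketch of the same idea follows. It uses `io.open`, which accepts an `encoding` argument on both Python 2.6+ and Python 3, and it writes a temporary sample file so that it can run on its own. The sample bytes and the helper name `read_mac_roman` are assumptions for illustration, not part of the original program.

```python
import io
import os
import tempfile

def read_mac_roman(path):
    """Read a text file saved in Mac Roman and return it as unicode text."""
    with io.open(path, encoding="mac_roman") as handle:
        return handle.read()

# Build a sample file holding the assumed bytes of the problem line.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as sample:
    sample.write(b"Ae\x91tes\n")

text = read_mac_roman(sample_path)
os.remove(sample_path)
print(repr(text))

# Re-encode as UTF-8 if later code or other tools expect that encoding.
utf8_bytes = text.encode("utf-8")
```

Once the text is decoded, re-saving it as UTF-8 is one way to avoid the guesswork the next time the file is opened.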

answered Sep 9, 2016 at 22:23 by Jimmy
edited Sep 9, 2016 at 22:34 by Jongware
