Read file with unknown encoding

Question

I'm trying to load the columns of a file with a strange encoding. Windows appears to have no issues opening it, but Linux complains, and I have only been able to open it with the Atom text editor (other editors show either a blank file or the data still in encoded form).

The command:

file -i data_file.tit

returns:

application/octet-stream; charset=binary

Opening the file in binary mode and reading the first 400 bytes gives:

'0905077U1- a\r\nIntegration time: 19,00 ms\r\nAverage: 25 scans\r\nNr of pixels used for smoothing: 2\r\nData measured with spectrometer name: 0905077U1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\r\nWave ;Dark ;Ref ;Sample ;Absolute Irradiance ;Photon Counts\r\n[nm] ;[counts] ;[counts] ;[counts] ;[\xb5Watt/cm\xb2/nm] ;[\xb5Mol/s/m\xb2/nm]\r\n247,40;-1,0378;18,713;10,738;21,132;0,4369\r\n247,'

The rest of the file consists only of ASCII numbers separated by semicolons.

I tried the following ways to load the file:

with open('data_file.tit') as f:
    bytes = f.read() # (1)
    # bytes = f.read().decode('???')  # (2)
    # bytes = np.genfromtxt(f)  # (3)
    print bytes

(1) Sort of works but skips the first several hundred lines.

(2) Failed with every encoding I tried with the error:

codec can't decode byte 0xb5 in position 315: unexpected special character

(3) Complains about ValueError: Some errors were detected ! and shows for each line something similar to Line #3 (got 3 columns instead of 2).

How can I load this data file?

We cannot possibly know. You have random data, we are not clairvoyants I am afraid.
– Martijn Pieters, Commented Sep 30, 2014 at 13:25
@MartijnPieters what do you mean? I posted a link to the file, I'm not hiding it.
– Gabriel, Commented Sep 30, 2014 at 13:26
Your question needs to be self-contained however; don't expect people to download random data from the internets! And guessing at the encoding of your file is not going to be helpful to anyone else.
– Martijn Pieters, Commented Sep 30, 2014 at 13:28
How else can I share the data file? If I paste the contents here, won't they be overwritten or the encoding changed by the page? And finding a way to tell the encoding is part of the question.
– Gabriel, Commented Sep 30, 2014 at 13:29
repr() can give you Python representations of the data. Open the file in binary mode ('rb') and give us a sample perhaps.
– Martijn Pieters, Commented Sep 30, 2014 at 13:30

Martijn Pieters · Accepted Answer · 2014-09-30 13:48:26Z

6

You have a codepage 1252 (cp1252) encoded text file, with one line containing NULL bytes. The file command classified the file as binary data because of those NULLs; I made an educated guess at the encoding from the \xb2 and \xb5 bytes, which decode to the ² and µ characters in cp1252.
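As a quick check of that guess, here is a minimal sketch (assuming Python 3) that decodes the units line from the sample in the question. The variable names are only for illustration.

```python
# Bytes copied from the units line of the sample in the question.
raw = b"[\xb5Watt/cm\xb2/nm] ;[\xb5Mol/s/m\xb2/nm]"

# 0xb5 and 0xb2 are continuation bytes in UTF-8, so they are invalid
# on their own and a UTF-8 decode fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# In cp1252 they map to the micro sign and superscript two.
text = raw.decode("cp1252")

# Latin-1 maps these two bytes to the same characters, so the sample
# alone cannot tell cp1252 and Latin-1 apart.
same_in_latin1 = raw.decode("latin-1") == text
```

Since Windows opens the file without trouble, cp1252 is the more plausible of the two, but it remains a guess.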

To open, just decode from that encoding:

import io

with io.open(filename, 'r', encoding='cp1252') as f:
    for line in f:
        print(line.rstrip('\n\x00'))

The first 10 lines are then:

0905077U1- a
Integration time: 19,00 ms
Average: 25 scans
Nr of pixels used for smoothing: 2
Data measured with spectrometer name: 0905077U1
Wave   ;Dark     ;Ref      ;Sample   ;Absolute Irradiance  ;Photon Counts
[nm]   ;[counts] ;[counts] ;[counts] ;[µWatt/cm²/nm]       ;[µMol/s/m²/nm]
247,40;-1,0378;18,713;10,738;21,132;0,4369
247,57;3,0793;19,702;9,5951;11,105;0,2298
247,74;-0,9414;19,929;8,8908;16,567;0,3430

The NULLs were stripped from the Data measured with spectrometer name: 0905077U1 line. The spectrometer name is 9 bytes long; together with the 55 NULLs that makes 64 bytes, so it looks like the name is stored in a fixed 64-byte field and the file writer didn't bother to strip the padding.
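To get from decoded lines to actual columns, something along these lines might work. It is only a sketch, assuming Python 3, and it assumes that every line whose semicolon-separated fields all parse as decimal-comma numbers is a data row. The helper names parse_rows and load_columns are made up for this example.

```python
import io

def parse_rows(lines):
    """Yield one list of floats per data row; skip header and units lines."""
    for line in lines:
        fields = line.rstrip("\r\n\x00").split(";")
        try:
            # The file uses a decimal comma, so swap it for a point.
            row = [float(field.replace(",", ".")) for field in fields]
        except ValueError:
            continue  # not a purely numeric line
        yield row

def load_columns(path, encoding="cp1252"):
    """Read the file at `path` and return its numeric data as columns."""
    with io.open(path, "r", encoding=encoding) as f:
        return [list(col) for col in zip(*parse_rows(f))]

# Demo on lines taken from the output above instead of a real file.
sample = [
    "Data measured with spectrometer name: 0905077U1" + "\x00" * 55 + "\n",
    "Wave   ;Dark     ;Ref      ;Sample   ;Absolute Irradiance  ;Photon Counts\n",
    "[nm]   ;[counts] ;[counts] ;[counts] ;[\xb5Watt/cm\xb2/nm]       ;[\xb5Mol/s/m\xb2/nm]\n",
    "247,40;-1,0378;18,713;10,738;21,132;0,4369\n",
    "247,57;3,0793;19,702;9,5951;11,105;0,2298\n",
]
columns = [list(col) for col in zip(*parse_rows(sample))]
wave, dark = columns[0], columns[1]
print(wave)  # [247.4, 247.57]
print(dark)  # [-1.0378, 3.0793]
```

For the real file, load_columns('data_file.tit') would return the six columns as lists. Note that zip truncates to the shortest row, so a malformed row would silently shorten the result.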

edited Sep 30, 2014 at 13:48

answered Sep 30, 2014 at 13:34

Martijn Pieters


7 Comments

Gabriel Over a year ago

Thanks @Martijn. When I try this I get: UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 39: ordinal not in range(128). The issue is with the $\mu$ chars in line 7 I believe. Is there a way to skip reading these?

Martijn Pieters Over a year ago

@Gabriel: you are trying to decode data that is already Unicode. Don't do that.

Gabriel Over a year ago

I'm just trying to read the file using your answer, nothing else I swear. I don't know why I get that error, I've tried stripping those characters using \xb5 instead of \n\x00 in your answer but it doesn't work.

Martijn Pieters Over a year ago

@Gabriel: try print line.rstrip('\n\x00').encode('ascii', 'replace'); that'll force an encoding to ASCII, replacing the 4 non-ASCII characters with question marks. That way you can at least see the file contents.

Martijn Pieters Over a year ago

@Gabriel: then your console or terminal can only handle ASCII output.
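The ASCII fallback suggested in the comments above can be sketched like this, assuming Python 3, where encode returns bytes that need decoding again before printing. The sample line is shortened from the units line of the file.

```python
# A decoded line with two non-ASCII characters and trailing NUL padding.
line = "[nm]   ;[\xb5Watt/cm\xb2/nm]\x00\x00\r\n"

# 'replace' substitutes a question mark for every character that ASCII
# cannot represent; decoding again gives a str that any terminal can show.
safe = line.rstrip("\r\n\x00").encode("ascii", "replace").decode("ascii")
print(safe)  # [nm]   ;[?Watt/cm?/nm]
```

This only changes what is displayed. The numeric columns are pure ASCII, so they are unaffected either way.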


ojii · Answer · 2014-09-30 13:26:20Z

6

Guessing an encoding can be really hard. Luckily, there's a library that tries to help with that: https://pypi.python.org/pypi/chardet
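A minimal usage sketch, assuming Python 3 and that the third-party chardet package is installed (the import is guarded so the snippet still runs without it). The wrapper name guess_encoding is made up for this example. chardet.detect takes bytes and returns a dict that includes 'encoding' and 'confidence' keys.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None

def guess_encoding(raw):
    """Return (encoding, confidence) for a bytes object, or (None, 0.0)."""
    if chardet is None:
        return None, 0.0
    result = chardet.detect(raw)
    return result["encoding"], result["confidence"]

# Bytes taken from the sample in the question.
raw = b"Integration time: 19,00 ms\r\n[\xb5Watt/cm\xb2/nm] ;[\xb5Mol/s/m\xb2/nm]\r\n"
encoding, confidence = guess_encoding(raw)
print(encoding, confidence)
```

The result is a statistical guess, not a certainty. With only a handful of non-ASCII bytes to go on, as in this file, the reported encoding may well be wrong, so treat it as a starting point.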

answered Sep 30, 2014 at 13:26

ojii


2 Comments

Bill Over a year ago

Maybe a stupid question, but can't you use try: except: repeatedly with different guesses at the encoding until you get no errors? Would this work? I'm writing code that reads csv files, but for a client who will be creating his own files, so I have no idea what encoding they might be in. Thanks.

tbc0 Over a year ago

Not stupid, but also not Pythonic. chardet does the trick, so just use it.
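For reference, the try/except approach asked about above could look roughly like this, using only the standard library and assuming Python 3. The function name and the candidate list are made up for this sketch. The main caveat is that "no errors" does not mean "correct": Latin-1 assigns a character to every byte, so it never fails, and cp1252 rejects only a few byte values.

```python
def decode_first_match(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")

# Bytes from the file in the question: invalid UTF-8, so cp1252 is used.
text, used = decode_first_match(b"[\xb5Watt/cm\xb2/nm]")

# Valid UTF-8 is picked up by the first, strictest candidate.
text2, used2 = decode_first_match(b"caf\xc3\xa9")

# The same bytes also decode as Latin-1 without any error, but wrongly.
mojibake = b"caf\xc3\xa9".decode("latin-1")
```

Order matters for that reason: strict encodings such as UTF-8 go first, and the permissive single-byte ones go last as a fallback.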
