I want to read files with special file names in Python (2.7). But whatever I try, it always fails to open them. The filenames are

F\xA8\xB9hrerschein

and

Gro\xDFhandel

I know the names were encoded with one of several code pages. I could find out which one and convert them, but I would rather avoid all that.

Can't I somehow tell Python to open the file without going through all that encoding stuff? I mean, open the file by its raw name in bytes?

7
  • Why don't you want to name them like "Fuehrerschein" or "Grosshandel"? Commented Nov 12, 2015 at 18:54
  • @palsch not everything maps to ASCII, and assuming so is culturally insensitive. People should be able to name their files in their preferred language, and programs should be able to deal with that. Commented Nov 12, 2015 at 18:57
  • @amon right, but in this case... Commented Nov 12, 2015 at 18:57
  • ...it's not the answer, but maybe helping in this case. Commented Nov 12, 2015 at 18:58
  • @fr00tyl00p have you tried looking at the list returned by os.listdir to see what the filename looks like to Python? Commented Nov 12, 2015 at 19:04

4 Answers

1

In the end, I fixed it with

reload(sys)
sys.setdefaultencoding('utf-8')

and setting the environment variable

LANG="C.UTF-8"

Thanks for the hints.

1 Comment

Downvote for use of sys.setdefaultencoding('utf-8'). This is a nasty hack that will mask further issues. Having the correct LANG will go a long way to help Python, which depends on a healthy locale.
0

One way is to use os.listdir(). See the following example.

Add some data to a file with the non-ASCII byte 0xdf in its name:

$ echo abcd > `printf "A\xdfA"`

Check that the file name contains a non-ASCII character:

$ ls A*
A?A

Start Python, read the directory and open the first file (which is the one with the non-ascii character):

$ python
>>> import os
>>> d = os.listdir('.')
>>> d
['A\xdfA']
>>> f = open(d[0])
>>> f.readline()
'abcd\n'
>>> 
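
The same idea can be sketched as a small script. This is only a minimal illustration, not a definitive recipe: the function name read_first_lines is hypothetical, and passing a byte-string directory is my own choice so that the names come back as raw bytes on both Python 2.7 and Python 3.

```python
import os


def read_first_lines(directory=b"."):
    """Return {raw_file_name: first_line} for each regular file in directory.

    The directory is given as a byte string, so os.listdir() returns the
    names as raw bytes and no decoding of the file names takes place.
    """
    result = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            # Open by the exact bytes the filesystem reported.
            with open(path, "rb") as handle:
                result[name] = handle.readline()
    return result
```

Because the names never leave byte form, the code page they were written in does not matter for opening the files.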

0

If you have source code like

with open('Großhandel') as input:
    #stuff

you should look at Source Code Encodings (PEP 263) and write

 #!python2
 # -*- coding: utf-8 -*-
 with open('Großhandel') as input:
 …

It is worth mentioning that the authors of PEP 263 are Marc-André Lemburg and Martin von Löwis, which I suppose makes the push toward a defined source encoding back in 2002 slightly more understandable.

2 Comments

The file name comes from the filesystem. It's not written in the code.
So show us the code you have and what is failing, otherwise your question is unanswerable.
0

Under Linux, filenames can be encoded in any character encoding. When opening a file, you must pass the name as exactly the same bytes that are stored on disk.

For example, if the filename Großhandel.txt is stored in UTF-8, the byte string passed to open() must be Gro\xc3\x9fhandel.txt.
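
To make the difference concrete, here is a short sketch that encodes the same name under two encodings. The variable names are illustrative only, and Latin-1 is just one example of a non-UTF-8 code page, not necessarily the one used for the files in the question.

```python
# u"\xdf" is the letter ß, written as an escape so that this snippet
# does not depend on the encoding of the source file itself.
name = u"Gro\xdfhandel.txt"

utf8_name = name.encode("utf-8")      # ß becomes the two bytes \xc3\x9f
latin1_name = name.encode("latin-1")  # ß becomes the single byte \xdf

# The two byte strings differ, so on Linux they refer to different files.
print(repr(utf8_name))
print(repr(latin1_name))
```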

If you pass a Unicode string to open(), the user's locale is used to encode the filename, which may or may not match the bytes of the filename on disk.

Under OS X, UTF-8 encoding is enforced. Under Windows, the character encoding is abstracted by the I/O drivers. On these operating systems you should always pass a Unicode object to open(), and it will be converted appropriately.

If you're reading filenames from the filesystem, it is useful to get decoded Unicode filenames that can be passed straight to open(). You can get them by passing a Unicode string to os.listdir().

E.g.

Locale: LANG=en_GB.UTF-8

A directory with the following files, with their filenames encoded to UTF-8:

test.txt
€.txt

Calling os.listdir() in Python 2.7 with a byte string path:

>>> os.listdir(".")
['\xe2\x82\xac.txt', 'test.txt']

Using a Unicode path:

>>> os.listdir(u".")
[u'\u20ac.txt', u'test.txt']
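
If the locale does not match a name's encoding, the name cannot be decoded. As far as I know, Python 2.7 then leaves such names as byte strings in the list returned by os.listdir(u"."). The sketch below imitates that behaviour by hand for a list of raw names. The helper decode_names is hypothetical, and the third name is made up to show the fallback.

```python
def decode_names(raw_names, encoding):
    """Decode byte-string file names, keeping undecodable ones as bytes."""
    decoded = []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Keep the raw bytes so the file can still be opened
            # by its exact on-disk name.
            decoded.append(raw)
    return decoded


# The first two names come from the listing above; b"A\xdfA" is not
# valid UTF-8 and therefore stays a byte string.
names = decode_names([b"\xe2\x82\xac.txt", b"test.txt", b"A\xdfA"], "utf-8")
print(names)
```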
