
My Python script creates an XML file under Windows XP, but the filename doesn't get the right encoding for Spanish characters such as 'ñ' or some accented letters.

First of all, the filename is read from an Excel sheet with the following code (I use the xlrd library to read the Excel file):

filename = excelsheet.cell_value(rowx=first_row, colx=5)

Then I tried several encodings to generate the file with the right name, without success:

filename = filename[:-1].encode("utf-8")
filename = filename[:-1].encode("latin1")
filename = filename[:-1].encode("windows-1252")

Using "windows-1252" I get bad encoding for the letters 'ñ', 'í' and 'é'. For example, the file that should be named BAJO ARAGÓN_Alcañiz.xml comes out with garbled characters in place of 'Ó' and 'ñ'.

Thanks in advance for your help

  • Does the file-system support unicode? (Try to make a file with unicode chrs in explorer or whatever) Commented Oct 23, 2012 at 14:17
  • Aw, sorry, wrong understanding of .encode(). Try unicode(filename)? Commented Oct 23, 2012 at 14:25
  • Did you try to use chardet to guess the encoding? Commented Oct 23, 2012 at 16:33

4 Answers


You should use unicode strings for your filenames. In general operating systems support filenames that contain arbitrary Unicode characters. So if you do:

fn = u'ma\u00d1o'  # maÑo
f = open(fn, "w")
f.close()
f = open(fn, "r")
f.close()

it should work just fine. What you see in your terminal when you list the contents of the directory where that file lives is a different matter. If the terminal encoding is UTF-8 you will see the filename maÑo, but if it is, for instance, iso-8859-1 you will see something like maÃo. Even if you see these strange characters, you should still be able to open the file from Python as described above.
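As a rough illustration of why the listing can look wrong while the file itself is fine, the sketch below encodes the name as UTF-8 and then decodes the same bytes as ISO-8859-1, which is effectively what a mismatched terminal does. It is written in Python 3 syntax (an assumption, since this thread uses Python 2), and the variable names are made up for the example.

```python
# The filename as a Unicode string (Python 3 str).
name = "ma\u00d1o"  # maÑo

# Bytes a UTF-8 system would use for this name.
utf8_bytes = name.encode("utf-8")

# A terminal that assumes ISO-8859-1 interprets each byte separately.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# A terminal that assumes UTF-8 recovers the original name.
as_utf8 = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes))
print(repr(as_latin1))
print(as_utf8 == name)
```

The second byte of the encoded 'Ñ' maps to an invisible control character in ISO-8859-1, which is why only the 'Ã' shows up on screen.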

In summary, do not encode the output of

filename = excelsheet.cell_value(rowx=first_row, colx=5)

instead make sure it is a unicode string.
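A minimal sketch of that advice, assuming Python 3 (where `str` is the Unicode type) and a filesystem that accepts non-ASCII names. The helper `write_named_file` is hypothetical and not part of xlrd; the literal filename stands in for the value read from the sheet.

```python
import os
import tempfile


def write_named_file(directory, cell_text, content="test"):
    """Create a file whose name is a text (Unicode) value, without encoding it."""
    if isinstance(cell_text, bytes):
        # Refuse byte strings so that no implicit encoding guess is made.
        raise TypeError("expected text, got bytes; decode it first")
    path = os.path.join(directory, cell_text)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return path


# Use a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    created = write_named_file(tmp, "BAJO ARAGÓN_Alcañiz.xml")
    found = os.listdir(tmp)
    print(found)
```

The name is passed to `open()` as text; the `encoding="utf-8"` argument only affects the file's contents, not its name.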

Reading the Unicode filenames section of the Python Unicode HOWTO can be helpful for you.


1 Comment

I don't think this is really true cross-platform. Unix filenames are byte-strings. Using a unicode filename when running in Unix causes the default encoding (ASCII) to be applied. However, a file named u'ma\u00d1o'.encode('UTF-8') is perfectly OK under Unix.

Trying your answers, I found a quick solution: porting my script from Python 2.7 to Python 3.3, because Python 3 works with Unicode strings by default.

I had to make a few small changes to my code, starting with the xlrd import (I had to install xlrd3 first):

import xlrd3 as xlrd

Also, I had to convert the content from 'bytes' to 'string' using str instead of encode()

filename = str(filename[:-1])
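A small caveat, sketched under the assumption of Python 3: `str()` leaves a value that is already text unchanged, but calling it on a real `bytes` object produces the printable representation rather than the decoded text. Whether xlrd3 returned `bytes` or `str` here is not shown in the post, so the values below are purely illustrative.

```python
raw = "Alcañiz".encode("utf-8")   # a bytes object, for illustration only

wrong = str(raw)                  # text of the repr, starts with b'
right = raw.decode("utf-8")       # the decoded text
also_right = str(raw, "utf-8")    # same as raw.decode("utf-8")

already_text = str("Alcañiz")     # str() on text returns it unchanged

print(wrong)
print(right)
```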

Now my script works perfectly and generates the files on Windows XP without strange characters.


First, if you have not already, please read http://www.joelonsoftware.com/articles/Unicode.html

Now, "latin-1" should work for Spanish under Windows. There are two hypotheses here. The first is that the strings you are trying to encode are not Unicode strings, but are already in some encoding. That, however, would more likely give you a UnicodeDecodeError than strange characters, though it might happen in some corner case.

The more likely case is that you are checking your files using the Windows command prompt, AKA "CMD". For some reason, Microsoft Windows uses two different encodings for the system: one inside "native" Windows programs, which should be compatible with latin1, and another for legacy DOS programs, a category that includes the command prompt. For Portuguese, this second encoding is "cp852" (looking around, cp852 does not define "ñ", but cp850 does).

So, this happens:

>>> print u"Aña".encode("latin1").decode("cp850")
A±a
>>> 

So, if you want your filenames to appear correctly from the DOS prompt, you should encode them using "cp850". If you want them to look right from Windows programs, encode them using "cp1252" (or "latin1" or "iso-8859-15"; they are almost the same, give or take the "€" symbol).

Of course, instead of guessing and picking one that looks good, which will fail if someone runs your program in Norway, Russia, or on a POSIX system, you should just do
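The mismatch can be reproduced for several code pages at once. This is only a sketch in Python 3 syntax (an assumption; the session above is Python 2), and `render` is a hypothetical helper name.

```python
def render(text, stored_as, shown_as):
    """Encode text with one code page and display the bytes with another."""
    return text.encode(stored_as).decode(shown_as, errors="replace")


# The same latin1 bytes, viewed through three different code pages.
for shown in ("cp1252", "latin1", "cp850"):
    print(shown, render("Aña", "latin1", shown))

mismatch = render("Aña", "latin1", "cp850")  # stored as latin1, shown as cp850
match = render("Aña", "cp850", "cp850")      # stored and shown as cp850
```

Only the combination where the storing and displaying code pages disagree about the byte for "ñ" produces the stray "±".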

import sys
encoding = sys.getfilesystemencoding()

(This should return one of the above for you. Again, the filename will look right if seen from a Windows program, not from a DOS shell.)
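One possible way to use that value, sketched in Python 3 syntax (an assumption), is to check whether a name can be represented in a given encoding before creating the file. The helper `encodable` is hypothetical, and the value printed for the filesystem encoding depends on the platform and Python version.

```python
import sys


def encodable(name, encoding=None):
    """Return True if name can be represented in the given encoding.

    Defaults to the filesystem encoding reported by the interpreter.
    """
    encoding = encoding or sys.getfilesystemencoding()
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(sys.getfilesystemencoding())
print(encodable("Alcañiz.xml", "cp1252"))
print(encodable("Alcañiz.xml", "ascii"))
```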


In Windows, the file system uses UTF-16, so no explicit encoding is required. Just use a Unicode string for the filename, and make sure to declare the encoding of the source file.

# coding: utf8
with open(u'BAJO ARAGÓN_Alcañiz.xml','w') as f:
    f.write('test')

Also, even though, for example, Ó isn't supported by the cp437 encoding of my US Windows system, my console font supports the character and it still displays correctly on my console. The console supports Unicode, but non-Unicode programs can only read/write code page characters.
