5

I have a function that builds a string composed of UTF-8 encoded characters. The output files are opened with the 'w+' and "utf-8" arguments.

However, when I call x.write(string) I get: UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 1: ordinal not in range(128)

I assume this is because normally you would write, for example, `print(u'something')`. But I need to use a variable, and the quotation marks in the u'...' syntax get in the way of that...

Any suggestions?

EDIT: Actual code here:

import codecs

source = codecs.open("actionbreak/" + target + '.csv','r', "utf-8")
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', "utf-8")
x = str(actionT(splitList[0], splitList[1]))
outTarget.write(x)

Essentially all this is supposed to be doing is building me a large amount of strings that look similar to this:

[日木曜 Deliverables]= CASE WHEN things = 11 THEN C ELSE 0 END

4 Comments

  • Did you try doing x.write(string.encode('utf-8'))? Commented Jul 10, 2013 at 18:04
  • Please show the actual code you're using to open the file, and where you're getting u'\ufeff' from. Commented Jul 10, 2013 at 18:07
  • Thanks, added more details. Commented Jul 10, 2013 at 18:15
  • Is the error occurring on the write line or the str one? Commented Jul 10, 2013 at 19:01

3 Answers

5

Are you using codecs.open()? Python 2.7's built-in open() does not take an encoding argument, which means you have to manually encode non-ASCII strings before writing them (as others have noted), but codecs.open() does, and it would probably be easier to drop in than manually encoding all the strings.
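
A minimal sketch of the difference (the file names here are just examples, not from your code):

# Python 2: built-in open() vs codecs.open()
import codecs

text = u'[日木曜 Deliverables]= CASE WHEN things = 11 THEN C ELSE 0 END'

# Built-in open() knows nothing about encodings, so a unicode string
# with non-ASCII characters must be encoded by hand before writing:
with open('plain.txt', 'w') as f:
    f.write(text.encode('utf-8'))

# codecs.open() takes an encoding and does the conversion for you,
# as long as you pass it unicode strings:
with codecs.open('codecs.txt', 'w', 'utf-8') as f:
    f.write(text)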


As you are actually using codecs.open(), going by your added code, and after a bit of looking things up myself, I suggest opening the input and/or output file with the encoding "utf-8-sig", which automatically handles the BOM for UTF-8 (see http://docs.python.org/2/library/codecs.html#encodings-and-unicode, near the bottom of the section). I would think that should only matter for the input file, but if none of those combinations (utf-8-sig/utf-8, utf-8/utf-8-sig, utf-8-sig/utf-8-sig) works, then the most likely situation is that your input file is encoded in a different Unicode format with a BOM, as Python's default UTF-8 codec interprets the BOM as a regular character, so the input would not raise an error but the output could.
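
With the file names from your snippet, that would look something like this (treat it as a sketch; which combination you actually need depends on where the BOM is coming from):

import codecs

# Assumption: the input .csv carries a UTF-8 BOM (e.g. it was saved from Excel).
# 'utf-8-sig' strips the BOM when reading; plain 'utf-8' is fine for the output.
source = codecs.open("actionbreak/" + target + '.csv', 'r', 'utf-8-sig')
outTarget = codecs.open("actionbreak/" + newTarget, 'w+', 'utf-8')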


Just noticed this, but... when you use codecs.open(), it expects a Unicode string, not an encoded one; try x = unicode(actionT(splitList[0], splitList[1])).
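
A tiny reproduction of why str() is the likely culprit here (the value is made up, but the behaviour is standard Python 2):

>>> s = u'日木曜'
>>> str(s)        # implicit encode with the ascii codec -> the error you see
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> unicode(s)    # stays a unicode string, which the codecs file's write() expects
u'\u65e5\u6728\u66dc'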

Your error can also occur when attempting to decode a unicode string (see http://wiki.python.org/moin/UnicodeEncodeError), but I don't think that should be happening unless actionT() or your list-splitting does something to the Unicode strings that causes them to be treated as non-Unicode strings.


8 Comments

Yes, I am using codecs.open(). Even so, it seems like I am running into this error still.
@RazzleDazzle It's a BOM issue, then. I made some additions to my answer, try them out and see what works (if anything does).
I feel we are getting closer... Following your suggestion to use the encoding that skips the BOM, I now receive this error after about 3 strings are successfully passed into the output file: UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)
This appears to be the string tripping up execution at runtime: [テスト]= CASE WHEN T THEN c ELSE 0 END. So it seems the Japanese characters are causing an encoding error. The input file's encoding is under my control; I have been using a UTF-8 encoded .csv as the input in order to preserve the Japanese characters.
@RazzleDazzle It seems I may have discovered the second part of the issue, see my once-again updated answer.
5

In Python 2.x there are two types of string: byte strings and Unicode strings. The first contains bytes, the second Unicode code points. It is easy to tell which type a string is: a Unicode string starts with u:

# byte string
>>> 'abc'
'abc'

# unicode string:
>>> u'abc абв'
u'abc \u0430\u0431\u0432'

The 'abc' characters are the same because they are in the ASCII range. \u0430 is a Unicode code point; it is outside the ASCII range. A "code point" is Python's internal representation of a Unicode character, and code points can't be saved to a file directly; they need to be encoded to bytes first. Here is what an encoded Unicode string looks like (as it is encoded, it becomes a byte string):

>>> s = u'abc абв'
>>> s.encode('utf8')
'abc \xd0\xb0\xd0\xb1\xd0\xb2'

This encoded string can now be written to a file:

>>> s = u'abc абв'
>>> with open('text.txt', 'w+') as f:
...     f.write(s.encode('utf8'))

Now, it is important to remember what encoding we used when writing to the file, because to read the data back we need to decode the content. Here is what the data looks like without decoding:

>>> with open('text.txt', 'r') as f:
...     content = f.read()
>>> content
'abc \xd0\xb0\xd0\xb1\xd0\xb2'

As you can see, we've got encoded bytes, exactly the same as from s.encode('utf8'). To decode them, we need to provide the codec name:

>>> content.decode('utf8')
u'abc \u0430\u0431\u0432'

After decoding, we've got our Unicode string with Unicode code points back.

>>> print content.decode('utf8')
abc абв
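
For completeness, the same round trip can be written without the explicit encode()/decode() calls by letting codecs.open() do the conversion (just a sketch of the alternative):

>>> import codecs
>>> s = u'abc абв'
>>> with codecs.open('text.txt', 'w+', 'utf8') as f:
...     f.write(s)
>>> with codecs.open('text.txt', 'r', 'utf8') as f:
...     print f.read()
abc абв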

2 Comments

He's using codecs.open(), explicit encoding/decoding isn't needed.
While this doesn't solve my issue, I am grateful for the lesson. This is very interesting information that I didn't know before.
1

xgord is right, but for further edification it's worth noting exactly what \ufeff means. It's known as a BOM, or byte order mark, and it's basically a throwback to the early days of Unicode, when people couldn't agree on which order they wanted their bytes to go. Now Unicode documents are prefaced with either \ufeff or \uffef, depending on which order they decide to arrange their bytes in.

If you hit an error on those characters right at the start of the data, you can be fairly sure the issue is in how you are decoding the file, and the file itself is probably still fine.
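
If you control how the file is read, the simplest fix is usually to decode with the BOM-aware codec so the \ufeff never reaches your data (a sketch; the file name is hypothetical):

import codecs

# Plain 'utf-8' leaves the BOM in as the first character of the data...
with codecs.open('report.csv', 'r', 'utf-8') as f:
    text = f.read()        # starts with u'\ufeff' if the file has a BOM

# ...while 'utf-8-sig' strips it transparently.
with codecs.open('report.csv', 'r', 'utf-8-sig') as f:
    text = f.read()        # no BOM, just the real content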

2 Comments

You're slightly incorrect, the BOM will always be \ufeff. The actual encoding of the BOM will differ, but the codepoint is always U+FEFF. If you're reading it as \uffef, you've got your endianness flipped.
(And if it's b'\xef\xbb\xbf' then the encoding is UTF-8, of course.)
