I am creating a dictionary that requires each letter of a string to be separated by whitespace, so I am using join. The problem arises when the string contains non-ASCII characters: join breaks each of them into separate bytes and the result is garbage.

Example:

>>> word = 'məsjø'
>>> ' '.join(word)

Gives me:

'm \xc9 \x99 s j \xc3 \xb8'

When what I want is:

'm ə s j ø'

Or even:

'm \xc9\x99 s j \xc3\xb8'
  • If this is Python 2.x, you need to define that as a Unicode string literal. Commented Jan 26, 2012 at 17:44
  • On my machine, the ' '.join() works flawlessly with Python 3.x. Can you specify which OS/version of Python you're using? Commented Jan 26, 2012 at 17:54
  • Was using 2.7. Just installed 3.2 and ' '.join() works with no problems! Thx. Commented Jan 26, 2012 at 18:14

1 Answer

You should use unicode strings, i.e.

word = u'məsjø'

And don't forget to declare the encoding of your Python source file on its first or second line with

# -*- coding: UTF-8 -*-

(Don't even think about using something other than UTF-8. ;))

Update: This only applies to Python < 3. If you're using Python >= 3, you would probably not have run into these problems in the first place. So if upgrading to 3.x is an option, it's the way to go -- it might not be in some cases because of library dependencies etc., unfortunately.

As mentioned in the comments, encoding issues might also result from a differently configured terminal, although that was not the problem here, apparently.
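To illustrate why the unicode literal helps, here is a minimal sketch of the difference between counting UTF-8 bytes and counting characters. It assumes the source file is saved as UTF-8 and that you are on Python 2.7 or Python 3.3+; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-

# A unicode literal: a sequence of characters (code points).
word = u'məsjø'

# The same text as UTF-8 bytes. A plain 'məsjø' literal in a UTF-8
# encoded Python 2 source file holds exactly these bytes.
encoded = word.encode('utf-8')

print(len(word))     # 5 characters
print(len(encoded))  # 7 bytes, because ə and ø take two bytes each

# Joining the unicode string puts spaces between characters,
# so the multi-byte characters stay intact.
spaced = u' '.join(word)
print(spaced == u'm ə s j ø')
```

Joining a Python 2 byte string puts the space between individual bytes instead, which is where the split `\xc9 \x99` pairs in the question come from.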


8 Comments

Or if the word is read from somewhere else, use word.decode('utf-8') to turn it into unicode.
In Python 3, this restriction is removed. Also, it doesn't expressly answer the question.
I was assuming the OP does not use Python 3 because then this error would be unlikely... But you're right, would be nice to know for sure.
@Makoto: If the asker has run that code and got that result, he/she must be using Python 2. And in that situation, using a unicode literal is a perfectly good answer.
decode / encode worked for my 2.7 installation. Installed 3.2 and didn't need any decoding/encoding lines. Thx.
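As a rough sketch of the decode / encode approach mentioned in these comments: decode incoming bytes to text, join on characters, and encode again only if bytes are needed. The `raw` value is hypothetical sample data standing in for whatever is read from a file or socket, and the sketch assumes the input really is UTF-8.

```python
# Bytes as they might arrive from a UTF-8 file or socket (hypothetical data).
raw = b'm\xc9\x99sj\xc3\xb8'

# Decode to text before doing any per-character work.
word = raw.decode('utf-8')

# Join on characters, not bytes.
spaced = u' '.join(word)

# Encode back to UTF-8 only when a byte string is required.
spaced_bytes = spaced.encode('utf-8')
print(repr(spaced_bytes))
```

The resulting bytes keep each two-byte sequence together, which is the second acceptable form listed in the question.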