When printing a formatted string with a fixed field width (e.g., %20s), the resulting width differs between a UTF-8 string and a plain ASCII string:

>>> str1="Adam Matan"
>>> str2="אדם מתן"
>>> print "X %20s X" % str1
X           Adam Matan X
>>> print "X %20s X" % str2
X        אדם מתן X

Note the difference:

X           Adam Matan X
X        אדם מתן X

Any ideas?

3 Answers

You need to specify that the second string is Unicode by putting u in front of the string:

>>> str1="Adam Matan"
>>> str2=u"אדם מתן"
>>> print "X %20s X" % str1
X           Adam Matan X
>>> print "X %20s X" % str2
X              אדם מתן X

With the u prefix the string is a unicode object, so Python counts the field width in characters rather than bytes.

1 Comment

+1 for the nice explanation. You may want to check out this tutorial for a better understanding: sebsauvage.net/python/snyppets/#unicode

In Python 2, unprefixed string literals are of type str, which is a byte string: it stores arbitrary bytes, not characters. UTF-8 encodes some characters with more than one byte, so str2 contains more bytes than characters, which produces the unexpected but perfectly valid behaviour in string formatting. If you look at the actual byte content of these strings (use repr instead of print), you'll see that in both cases the field is actually 20 bytes (not characters!) long.
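A minimal sketch of that check, written so that it should run under both Python 2 and Python 3.5+. The variable names are illustrative, and the byte string is built explicitly with .encode() so the result does not depend on how the interpreter treats unprefixed literals:

```python
# -*- coding: utf-8 -*-
# The Hebrew text from the question, spelled with escapes so the
# character count is unambiguous: 6 letters + 1 space = 7 characters.
text = u"\u05d0\u05d3\u05dd \u05de\u05ea\u05df"
raw = text.encode("utf-8")      # the same text as UTF-8 bytes

# Each Hebrew letter takes 2 bytes in UTF-8, the space takes 1.
print(len(text))                # 7
print(len(raw))                 # 13

# Formatting the byte string pads to 20 *bytes*: only 7 spaces are added.
padded_bytes = b"%20s" % raw
print(repr(padded_bytes))
print(len(padded_bytes))        # 20

# Formatting the unicode string pads to 20 *characters*: 13 spaces are added.
padded_text = u"%20s" % text
print(len(padded_text))         # 20
print(len(padded_text) - len(text))  # 13
```

The 6-space gap between the two paddings (13 versus 7) matches the misalignment seen in the question's output.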

As already mentioned, the solution is to use unicode strings. When working with strings in Python, you absolutely need to understand the difference between unicode and byte strings.

Try it this way, decoding the UTF-8 bytes into a unicode object:

>>> str1="Adam Matan"
>>> str2=unicode("אדם מתן", "utf8")
>>> print "X %20s X" % str2
X              אדם מתן X
>>> print "X %20s X" % str1
X           Adam Matan X
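If the strings arrive as bytes from elsewhere (a file, a socket), the same idea can be wrapped in a small helper. This is only a sketch: pad_left is a hypothetical name, and it assumes the input bytes are valid UTF-8 unless another encoding is passed in.

```python
# -*- coding: utf-8 -*-

def pad_left(value, width, encoding="utf-8"):
    """Right-align value in a field of `width` characters.

    Hypothetical helper: byte strings are decoded first so that the
    width is counted in characters, not bytes.
    """
    if isinstance(value, bytes):
        value = value.decode(encoding)
    # %*s takes the field width from the argument tuple.
    return u"%*s" % (width, value)

# The Hebrew text from the question, as UTF-8 bytes.
raw = u"\u05d0\u05d3\u05dd \u05de\u05ea\u05df".encode("utf-8")

# Both fields are now 20 characters wide, so the columns line up.
print(u"X %s X" % pad_left(raw, 20))
print(u"X %s X" % pad_left(u"Adam Matan", 20))
```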
