Python string formatting + UTF-8 strange behaviour

Question

When printing a formatted string with a fixed field width (e.g., %20s), the padding differs between a UTF-8 string and a plain ASCII string:

>>> str1="Adam Matan"
>>> str2="אדם מתן"
>>> print "X %20s X" % str1
X           Adam Matan X
>>> print "X %20s X" % str2
X        אדם מתן X

Note the difference:

X           Adam Matan X
X        אדם מתן X

Any ideas?

tghw · Accepted Answer · 2010-09-20 13:46:34Z

7

You need to tell Python that the second string is Unicode by putting a u prefix in front of the string literal:

>>> str1="Adam Matan"
>>> str2=u"אדם מתן"
>>> print "X %20s X" % str1
X           Adam Matan X
>>> print "X %20s X" % str2
X              אדם מתן X

With the u prefix, str2 is a unicode object, so %20s pads according to its 7 characters rather than the 13 bytes of its UTF-8 encoding.
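
To make the counting concrete, here is a minimal sketch. It assumes Python 3, where str is the text type and bytes plays the role of Python 2's str; the variable names and the FIELD_WIDTH constant are only illustrative.

```python
# Assumption: Python 3, where str holds characters and bytes holds raw bytes.
text = "אדם מתן"              # text string, like u"..." in Python 2
raw = text.encode("utf-8")    # byte string, like a plain "..." literal in Python 2

FIELD_WIDTH = 20              # the width used in %20s

char_count = len(text)        # 7 characters
byte_count = len(raw)         # 13 bytes: each Hebrew letter takes 2 bytes, the space takes 1

# Padding is the field width minus whatever len() reports for the value.
pad_for_text = FIELD_WIDTH - char_count   # 13 spaces
pad_for_raw = FIELD_WIDTH - byte_count    # 7 spaces

print(char_count, byte_count, pad_for_text, pad_for_raw)
```

The six missing spaces in the question's output are exactly the six extra bytes that the Hebrew letters take up in UTF-8.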

1 Comment

rubayeet Over a year ago

+1 for the nice explanation. You may want to check out this tutorial for a better understanding: sebsauvage.net/python/snyppets/#unicode

user355252 · Accepted Answer · 2010-09-20 13:51:24Z

In Python 2, unprefixed string literals are of type str, which is a byte string. It stores arbitrary bytes, not characters. UTF-8 encodes some characters with more than one byte, so str2 contains more bytes than characters, which produces the unexpected but perfectly valid behaviour in string formatting. If you look at the actual byte content of the formatted strings (use repr instead of print), you'll see that in both of them the field is exactly 20 bytes (not characters!) long.
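
The repr check can be sketched as follows. This assumes Python 3.5 or later, where bytes objects support %-formatting and stand in for Python 2's str; the variable names are illustrative.

```python
# Assumption: Python 3.5+, using bytes to mimic Python 2's byte-string str.
raw_ascii = "Adam Matan".encode("utf-8")
raw_hebrew = "אדם מתן".encode("utf-8")

# Format both values into a 20-wide field, at the byte level.
field_ascii = b"%20s" % raw_ascii
field_hebrew = b"%20s" % raw_hebrew

# repr shows the raw bytes, including the \x.. escapes for the Hebrew letters.
print(repr(field_ascii))
print(repr(field_hebrew))

# Both fields are exactly 20 bytes long.
print(len(field_ascii), len(field_hebrew))

# Decoded back to text, the Hebrew field holds fewer characters, so it looks narrower.
chars_ascii = len(field_ascii.decode("utf-8"))
chars_hebrew = len(field_hebrew.decode("utf-8"))
print(chars_ascii, chars_hebrew)
```

Measured in bytes the two fields are identical; the difference only appears once the terminal decodes the bytes into characters.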

As already mentioned, the solution is to use unicode strings. When working with strings in Python, you absolutely need to understand the difference between unicode strings and byte strings.

Michał Kwiatkowski · Accepted Answer · 2010-09-20 13:45:59Z

1

Try it this way:

>>> str1="Adam Matan"
>>> str2=unicode("אדם מתן", "utf8")
>>> print "X %20s X" % str2
X              אדם מתן X
>>> print "X %20s X" % str1
X           Adam Matan X
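
The same decode-before-formatting idea, sketched for Python 3 as an assumption: there the unicode() built-in no longer exists, and str(raw, "utf-8") or raw.decode("utf-8") does the equivalent job. The variable names are illustrative.

```python
# Assumption: Python 3, where bytes must be decoded to str before text formatting.
raw = "אדם מתן".encode("utf-8")   # stands in for bytes read from a file or socket

# Decode once, as early as possible; equivalent to raw.decode("utf-8").
text = str(raw, "utf-8")

# Now %20s pads by characters, giving 13 spaces of padding for 7 characters.
line = "X %20s X" % text
print(line)
print(len(line))   # 24 characters: "X " + 20-wide field + " X"
```

Decoding at the boundary where the bytes enter the program keeps every later formatting step working in characters.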
