1

I have just started using Python, and I have a problem with non-English (Vietnamese in particular) input. When I run this code:

# -*- coding: unicode-escape -*-
s = raw_input()
print(s)

s = "hiển thị 15 dòng"
print(s)

and type exactly the same string in the terminal, it prints:

hiển thị 15 dòng
hi\xe1\xbb\x83n th\xe1\xbb\x8b 15 d\xc3\xb2ng

It also makes a difference when I use these two kinds of strings in other functions: I found that the first one didn't work but the second one did. Could anyone give me some hints? Thank you!

4
  • Xin chào (hello). I suggest you save yourself some confusion: if you're a newcomer to Python, you should be using Python 3, which is very mature by now and has proper Unicode support built in. Commented Apr 13, 2015 at 5:26
  • Thanks, but my project forces me to use Python 2.7 :) Commented Apr 13, 2015 at 5:31
  • What platform are you on? If you're on Windows, your terminal may not be able to support printing Unicode strings, so even after you fix things (as Raniz's answer shows) you still may not get to see what you want. If you're on any *nix besides Mac OS X, your terminal probably can support Unicode strings, but Python 2 may still not guess the right encoding, causing similar problems… Commented Apr 13, 2015 at 5:38
  • @AnttiHaapala I just look at the output when I run it in the terminal. Commented Apr 13, 2015 at 5:38

3 Answers

2

The problem is that you are using # -*- coding: unicode-escape -*- in your source file. It causes Python to replace every non-ASCII byte (0x80 and above) in the UTF-8 representation of your string literal with a literal \xnn hex escape, turning your string into

'hi\xe1\xbb\x83n th\xe1\xbb\x8b 15 d\xc3\xb2ng'

Thus with # -*- coding: unicode-escape -*-:

s = "hiển thị 15 dòng"

will become

s = 'hi\\xe1\\xbb\\x83n th\\xe1\\xbb\\x8b 15 d\\xc3\\xb2ng'

The cause, of course, is using unicode-escape as the source encoding; use utf-8 instead:

# -*- coding: utf-8 -*-
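
The escaping described in this answer can be reproduced outside of a source file. The following is only a sketch, assuming the file was saved as UTF-8 and that each raw byte is treated as one Latin-1 character before being escaped; the variable names are illustrative. It also shows one possible way to turn an already-escaped string back into the original text.

```python
# The text from the question, written with escapes so this snippet stays pure ASCII.
text = u"hi\u1ec3n th\u1ecb 15 d\xf2ng"

# The bytes a UTF-8 editor would write to disk for that text.
utf8_bytes = text.encode("utf-8")

# Assumption: treat each raw byte as one Latin-1 character, then write the
# result back out with the unicode_escape codec.
escaped = utf8_bytes.decode("latin-1").encode("unicode_escape")
print(escaped.decode("ascii"))  # hi\xe1\xbb\x83n th\xe1\xbb\x8b 15 d\xc3\xb2ng

# Reversing the same steps recovers the original text.
repaired = escaped.decode("unicode_escape").encode("latin-1").decode("utf-8")
print(repaired == text)  # True
```

The printed line contains real backslash characters, which is why the second print in the question shows escapes instead of Vietnamese letters.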

2 Comments

By the way, what is the default codec for I/O in the terminal?
It will depend on the locale settings; you can see them with the locale command. If your locale is, for example, en_US.UTF-8 or vi_VN.UTF-8, things should mostly work.
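
The locale-dependent encoding mentioned in this comment can also be inspected from inside Python. This is a minimal sketch: guess_terminal_encoding is a hypothetical helper name, and falling back to UTF-8 is an assumption, not something Python does for you.

```python
import locale
import sys


def guess_terminal_encoding(default="utf-8"):
    """Return a best guess for the encoding of text typed in the terminal."""
    # sys.stdin.encoding is usually set when stdin is an interactive terminal,
    # but it can be None or missing when input is piped or replaced.
    encoding = getattr(sys.stdin, "encoding", None)
    if not encoding:
        # Fall back to the encoding implied by the locale settings.
        encoding = locale.getpreferredencoding()
    # Assumption: use UTF-8 if nothing else could be determined.
    return encoding or default


print(guess_terminal_encoding())
```

The returned name could then be passed to .decode() instead of a hard-coded "utf-8", if the terminal is not guaranteed to use UTF-8.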
2

I assume that you're using Python 2.x?

If so, put the following at the top of your file:

# -*- coding: utf-8 -*-

And ensure that your strings are unicode strings:

s = raw_input().decode("utf-8")
print(s)

s = u"hiển thị 15 dòng"
print(s)
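
To see why decoding matters, the two strings above can be compared without a terminal. This is a sketch under one assumption: the terminal sends UTF-8, so the byte string below stands in for what raw_input() would return in Python 2. The variable names are illustrative.

```python
# Stand-in for the bytes raw_input() returns on a UTF-8 terminal (assumption).
raw = b"hi\xe1\xbb\x83n th\xe1\xbb\x8b 15 d\xc3\xb2ng"

# Decoding turns the bytes into a unicode string.
typed = raw.decode("utf-8")

# The same text as a unicode literal, written with escapes to stay pure ASCII.
literal = u"hi\u1ec3n th\u1ecb 15 d\xf2ng"

print(len(raw))          # 21 bytes
print(len(typed))        # 16 characters
print(typed == literal)  # True
```

Once both values are unicode strings they compare equal, so functions that receive either of them should behave the same way.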

9 Comments

Thanks, I added "# -*- coding: unicode-escape -*-" at the top of the code. But what concerns me is why it makes a difference, and how to make the first kind of input work the way the second one does :)
@HưngCaoXuân: Why did you add unicode-escape instead of utf-8? If you use unicode-escape, Python won't be able to handle things like s = u"hiển thị 15 dòng" in your code; the only thing that will work will be s = u"hi\u1ec3n th\u1ecb 15 d\xf2ng" instead. (And if you really want to write that, it will work even without an encoding declaration.)
Python 2.x assumes that your source code is in either ASCII or Latin-1 (can't remember which one). When Python reads your source file, it encounters the byte sequences for your Vietnamese characters and tries to read them using the default encoding, resulting in mojibake.
@Raniz: raw_input doesn't use the locale at all; it just reads the bytes as bytes, and you have to decide how to decode them manually.
@Raniz: No, not there either; decoding happens in the .decode('utf-8') that you explicitly put in your code to fix his problem. :)
2

You may try replacing # -*- coding: unicode-escape -*- with # -*- coding: utf-8 -*- at the beginning of the file. This line declares the encoding of the source file, which should match the encoding your system or editor uses when saving it.

4 Comments

You have to do that, and also do the other fixes Raniz suggested.
The thing is, if no coding is specified, the Vietnamese text will prevent the code from running at all, so that is not the case here.
It seems strange: raw_input() returns a str, which should be exactly the same as the second s below. The first s is printed correctly but the second isn't.
Indeed, this answer was correct, unlike what I first thought (the reason being that the code excerpt was not a self-contained example).
