Input in Python

Question

I have just started using Python, and I have a problem with non-English (Vietnamese in particular) input. When I run this code:

# -*- coding: unicode-escape -*-
s = raw_input()
print(s)

s = "hiển thị 15 dòng"
print(s)

and type exactly the same string in the terminal, it prints:

hiển thị 15 dòng
hi\xe1\xbb\x83n th\xe1\xbb\x8b 15 d\xc3\xb2ng

It also makes a difference when I use these two kinds of strings in other functions: I found that the first one didn't work but the second one did. Could anyone give me some hints? Thank you!

Xin chào. I suggest saving yourself some confusion: if you're a newcomer to Python, you should be using Python 3, which is very mature by now and has proper Unicode support built in.
– Antti Haapala, Commented Apr 13, 2015 at 5:26
What platform are you on? If you're on Windows, your terminal may not be able to support printing Unicode strings, so even after you fix things (as Raniz's answer shows) you still may not get to see what you want. If you're on any *nix besides Mac OS X, your terminal probably can support Unicode strings, but Python 2 may still not guess the right encoding, causing similar problems…
– abarnert, Commented Apr 13, 2015 at 5:38
@AnttiHaapala I am just looking at the output when I run it in the terminal.
– Hưng Cao Xuân, Commented Apr 13, 2015 at 5:38

Antti Haapala · Accepted Answer · 2015-04-13 05:53:18Z

2

The problem is that you are using # -*- coding: unicode-escape -*- in your source file. It causes Python to replace every byte of the UTF-8 representation whose value is 128 or above with a \xnn hex escape, turning your string into

'hi\xe1\xbb\x83n th\xe1\xbb\x8b 15 d\xc3\xb2ng'

Thus with # -*- coding: unicode-escape -*-:

s = "hiển thị 15 dòng"

will become

s = 'hi\\xe1\\xbb\\x83n th\\xe1\\xbb\\x8b 15 d\\xc3\\xb2ng'

The cause of course is using unicode-escape as a codec for coding; use utf-8 instead:

# -*- coding: utf-8 -*-
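A rough way to see the effect described above, without editing a source file, is to push the UTF-8 bytes of the string through the unicode_escape codec by hand. This is only a sketch: the round trip approximates what happens to a byte-string literal under that declaration, it is not the interpreter's actual code path, and the variable names are made up for illustration.

```python
# The question's sample string, written with escapes so this file stays pure ASCII.
text = u"hi\u1ec3n th\u1ecb 15 d\xf2ng"

# The bytes a UTF-8 editor would actually store in the source file.
utf8_bytes = text.encode("utf-8")

# unicode_escape treats non-escape bytes as Latin-1: one byte -> one code point.
misread = utf8_bytes.decode("unicode_escape")

# Encoding back with the same codec turns every non-ASCII code point into a
# literal backslash escape, which is the text the second print() displayed.
escaped = misread.encode("unicode_escape")

print(escaped.decode("ascii"))
```

The printed line should match the second line of output in the question, backslashes included.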

answered Apr 13, 2015 at 5:53

Antti Haapala


2 Comments

Hưng Cao Xuân Over a year ago

By the way what is the default codec of I/O in terminal?

Antti Haapala Over a year ago

It will depend on the locale settings; you can see them with the locale command. If your locale is, for example, en_US.UTF-8 or vi_VN.UTF-8, things should mostly work.
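From inside Python, you can also ask what encodings the interpreter has picked up from those settings. The sketch below uses only sys and locale from the standard library; the helper name io_encodings is made up for this example, and the values it reports will differ from machine to machine.

```python
import locale
import sys


def io_encodings():
    """Return the encodings Python reports for the terminal and the locale."""
    return {
        # May be None when the stream is a pipe or a file rather than a terminal.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings (LANG, LC_ALL, ...).
        "locale": locale.getpreferredencoding(),
    }


for name, encoding in sorted(io_encodings().items()):
    print("%s: %s" % (name, encoding))
```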

Raniz · Accepted Answer · 2015-04-13 05:40:32Z

2

I assume that you're using Python 2.x?

If so, put the following at the top of your file:

# -*- coding: utf-8 -*-

And ensure that your strings are unicode strings:

s = raw_input().decode("utf-8")
print(s)

s = u"hiển thị 15 dòng"
print(s)
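To see why the decode step matters, here is a small sketch that stands in for raw_input() with a hard-coded byte string, assuming the terminal sends UTF-8. The names raw, literal and decoded are only for illustration.

```python
# Bytes as raw_input() would return them in Python 2 when the terminal sends UTF-8.
raw = b"hi\xe1\xbb\x83n th\xe1\xbb\x8b 15 d\xc3\xb2ng"

# The same text as a unicode literal, written with escapes to stay pure ASCII.
literal = u"hi\u1ec3n th\u1ecb 15 d\xf2ng"

decoded = raw.decode("utf-8")

print(len(raw))            # 21: counts bytes
print(len(decoded))        # 16: counts characters
print(decoded == literal)  # True once both sides are unicode text
```

Until the bytes are decoded, they and the unicode literal are different kinds of object, which is one way the two strings can behave differently in other functions.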

edited Apr 13, 2015 at 5:40

answered Apr 13, 2015 at 5:27

Raniz


9 Comments

Hưng Cao Xuân Over a year ago

Thanks, I added "# -*- coding: unicode-escape -*-" at the top of the code. But my concern here is why it makes a difference, and how to make the first type of input work the way the second one does :)

abarnert Over a year ago

@HưngCaoXuân: Why did you add unicode-escape instead of utf-8? If you use unicode-escape, Python won't be able to handle things like s = u"hiển thị 15 dòng" in your code; the only thing that will work will be s = u"hi\u1ec3n th\u1ecb 15 d\xf2ng" instead. (And if you really want to write that, it will work even without an encoding declaration.)

Raniz Over a year ago

Python 2.x assumes that your source code is in either ASCII or Latin-1 (can't remember which one). When Python reads your source file, it encounters the byte sequence for your Vietnamese characters and tries to read it using the default encoding, resulting in mojibake.

abarnert Over a year ago

@Raniz: raw_input doesn't use the locale at all; it just reads the bytes as bytes, and you have to decide how to decode them manually.

abarnert Over a year ago

@Raniz: No, not there either; decoding happens in the .decode('utf-8') that you explicitly put in your code to fix his problem. :)
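As a sketch of what "deciding how to decode manually" can look like, the helper below takes bytes and an optional encoding, falling back to whatever Python reports for stdin and then to UTF-8. The name decode_input and the fallback order are assumptions made for this example, not anything raw_input does itself. Decoding the same bytes as Latin-1 shows the kind of mojibake mentioned above.

```python
import sys


def decode_input(raw, encoding=None):
    """Decode bytes read from the terminal (hypothetical helper).

    Falls back to the encoding Python reports for stdin, then to UTF-8.
    """
    if encoding is None:
        encoding = getattr(sys.stdin, "encoding", None) or "utf-8"
    return raw.decode(encoding)


raw = b"d\xc3\xb2ng"  # UTF-8 bytes of the last word in the question's string

right = decode_input(raw, "utf-8")    # 4 characters
wrong = decode_input(raw, "latin-1")  # 5 characters: each byte became a character

print(repr(right))
print(repr(wrong))
```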


Antti Haapala · Accepted Answer · 2015-04-13 05:45:53Z

2

You may try replacing # -*- coding: unicode-escape -*- with # -*- coding: utf-8 -*- at the beginning of the file to specify the encoding of the source file; the right value depends on your system's default file encoding.
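For reference, the declaration is just a specially formatted comment on the first or second line of the file. The sketch below looks for it with the regular expression as I understand it from PEP 263. The helper name declared_encoding is made up, and the real interpreter applies a few extra rules (for example about what the first line may contain) that this simplified version ignores.

```python
import re

# Pattern for the encoding declaration comment, as given in PEP 263.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_lines):
    """Return the encoding named in the first two lines, or None (hypothetical helper)."""
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


print(declared_encoding(["# -*- coding: utf-8 -*-", "s = raw_input()"]))
print(declared_encoding(["# -*- coding: unicode-escape -*-", "print(s)"]))
print(declared_encoding(["s = raw_input()", "print(s)"]))
```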

edited Apr 13, 2015 at 5:45

Antti Haapala


answered Apr 13, 2015 at 5:27

zhangwt


4 Comments

abarnert Over a year ago

You have to do that, and also do the other fixes Raniz suggested.

Antti Haapala Over a year ago

The thing is, if coding is not specified, the Vietnamese text will make the code not run at all, so that is not the case.

zhangwt Over a year ago

It seems strange that raw_input() returns a str, which should be exactly the same as the second s below. The first s is printed correctly but the second isn't.

Antti Haapala Over a year ago

Indeed this answer was correct unlike I first thought (and the reason behind that was that the code excerpt was not a self-contained example)
