Python string encode and decode

Question

In JS, encoding means converting a string with special characters into an escaped, usable string. For example, encodeURIComponent converts spaces to %20 and so on, so the result can be used in URIs.

So encoding here means converting to a particular format.

In Python 2.7, I have a string: 奥多比. To convert it into UTF-8 format, however, I need to use the decode() function, like: `"奥多比".decode("utf-8") == u'\u5965\u591a\u6bd4'`

I want to understand how the meaning of encode and decode changes from one language to another. To me, it seems I should be doing "奥多比".encode("utf-8").

What am I missing here?

Your console or terminal is set to UTF-8, so typing in "奥多比" sends UTF-8 bytes to the Python interactive interpreter process. Decoding then creates a Unicode object from the UTF-8 bytes.
– Martijn Pieters, Commented Jan 8, 2018 at 13:12
@MartijnPieters: So when this is part of a script and I write str = "奥多比." and then str.decode("utf-8"), does that mean str is essentially UTF-8 already? However, when I append it to the URL of an API call, it is sent as "奥多比." only and not in the encoded format.
– Bhumi Singhal, Commented Jan 8, 2018 at 13:16
URLs are not UTF-8 encoded. They are percent encoded, often using UTF-8 as a starting point. In Python 2, use import urllib, then urllib.quote() to create URL percent-encoded data. Start with UTF-8 bytes.
– Martijn Pieters, Commented Jan 8, 2018 at 13:21
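
As a rough illustration of the percent-encoding step described in the comment above, here is a minimal sketch. It assumes Python 3, where the equivalent of Python 2's urllib.quote() is urllib.parse.quote(); the variable names are only for illustration.

```python
from urllib.parse import quote, unquote

text = "奥多比"                       # Unicode text (str in Python 3)
utf8_bytes = text.encode("utf-8")    # start with UTF-8 bytes, 3 per character here
percent_encoded = quote(utf8_bytes)  # each non-ASCII byte becomes %XX

print(utf8_bytes)        # b'\xe5\xa5\xa5\xe5\xa4\x9a\xe6\xaf\x94'
print(percent_encoded)   # %E5%A5%A5%E5%A4%9A%E6%AF%94

# Reversing the process: unquote() decodes the percent-escapes as UTF-8 by default.
restored = unquote(percent_encoded)
print(restored == text)  # True
```

The percent-encoded form is plain ASCII, which is what makes it safe to append to a URL.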

Martijn Pieters · Accepted Answer · 2018-01-08 13:29:44Z

You appear to be confusing Unicode text (represented in Python 2 as the unicode type, indicated by the u prefix on the literal syntax), with one of the standard Unicode encodings, UTF-8.

You are not creating UTF-8; you created a Unicode text object by decoding from a UTF-8 byte stream.

The byte string literal `"奥多比"` is a sequence of binary data, bytes. You either entered these in a text editor and saved the file as UTF-8 (and told Python to treat your source code as UTF-8 by starting the file with a PEP 263 codec header), or you typed it into the Python interactive prompt in a terminal that was configured to send UTF-8 data.
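
The distinction between bytes and Unicode text can be sketched as follows. This is an illustration that assumes Python 3, where the `bytes` type plays the role of Python 2's `str` and `str` plays the role of Python 2's `unicode`; the byte values are the UTF-8 encoding of 奥多比 written out explicitly.

```python
# The 9 bytes a UTF-8 terminal or UTF-8 source file would supply for 奥多比.
utf8_bytes = b"\xe5\xa5\xa5\xe5\xa4\x9a\xe6\xaf\x94"

# decode: bytes -> Unicode text (three code points)
text = utf8_bytes.decode("utf-8")
print(text == "\u5965\u591a\u6bd4")  # True
print(len(utf8_bytes), len(text))    # 9 3

# encode: Unicode text -> bytes, which gives back the original byte sequence
round_tripped = text.encode("utf-8")
print(round_tripped == utf8_bytes)   # True
```

So decode() moves from the byte representation to text, and encode() moves from text back to a byte representation.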

I strongly urge you to read more about the difference between bytes, codecs and Unicode text. The following links are highly recommended:

Ned Batchelder's Pragmatic Unicode
The Python Unicode HOWTO
Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

edited Jan 8, 2018 at 13:29

answered Jan 8, 2018 at 13:16

3 Comments

variable Over a year ago

You mention that "奥多比" is a byte literal. I think they are Unicode characters, so there is no purpose in decoding them, whereas if we encode them we get a byte representation. The byte representation can be transferred over a network or written to a file, and the receiver can decode the bytes to get the original "奥多比" value. Have I got this concept right?

Martijn Pieters Over a year ago

@variable: are you using Python 2? If not, then this is not something you need to worry about nearly as much. In any case, read the links I included, especially Ned Batchelder's. Try out the concepts in your interactive interpreter. Bytes are the lingua franca of data exchange; everything is bytes. Decoding to a text type (unicode in Python 2, str in Python 3) is turning bytes into a more useful object type, like using `datetime.strptime()`, `int()` or `json.load()`.

variable Over a year ago

I'm using Python 3 and have read that str in Python 3+ is Unicode.

Tomáš Linhart · Accepted Answer · 2018-01-08 13:15:22Z

In Python 2, the literal is of type str, i.e. a sequence of bytes. To convert it to a Unicode string, you need to decode this sequence of bytes using a codec. Simply put, the codec specifies how the bytes should be converted to a sequence of Unicode code points. See the Unicode HOWTO for a more in-depth article on this.
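
To show why the codec matters, here is a small sketch. It assumes Python 3 (where `bytes` corresponds to Python 2's `str`), and the choice of UTF-16-LE and Latin-1 as comparison codecs is only for illustration.

```python
text = "奥多比"

# The same three code points produce different bytes under different codecs.
as_utf8 = text.encode("utf-8")       # 9 bytes
as_utf16 = text.encode("utf-16-le")  # 6 bytes
print(len(as_utf8), len(as_utf16))   # 9 6

# Decoding with the matching codec recovers the text in both cases.
print(as_utf8.decode("utf-8") == text)       # True
print(as_utf16.decode("utf-16-le") == text)  # True

# Decoding with the wrong codec does not. Latin-1 maps every byte to one
# code point, so it never raises, but the result is 9 unrelated characters.
wrong = as_utf8.decode("latin-1")
print(len(wrong), wrong == text)     # 9 False
```

The bytes alone do not say which codec produced them, which is why decode() has to be told.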

answered Jan 8, 2018 at 13:15