
Encoding in JS means converting a string with special characters into an escaped, usable string. For example, encodeURIComponent converts spaces to %20 and so on, so that the string is usable in URIs.

So encoding here means converting to a particular format.

In Python 2.7, I have the string "奥多比". To convert it to UTF-8 format, however, I need to use the decode() function, like this: "奥多比".decode("utf-8") == u'\u5965\u591a\u6bd4'

I want to understand how the meaning of encode and decode changes between languages. To me, it seems I should essentially be doing "奥多比".encode("utf-8").

What am I missing here?

  • You convert from UTF-8 to a Unicode object. Commented Jan 8, 2018 at 13:12
  • Your console or terminal is set to UTF-8, so typing in "奥多比" sends UTF-8 bytes to the Python interactive interpreter process. Decoding then creates a Unicode object from the UTF-8 bytes. Commented Jan 8, 2018 at 13:12
  • @MartijnPieters: So when this is part of a script and I write str = "奥多比." and then str.decode("utf-8"), does that mean str is essentially UTF-8 already? However, when I append it to the URL of an API call, it is sent as "奥多比." and not in the encoded format. Commented Jan 8, 2018 at 13:16
  • So are you really asking how to send UTF-8 bytes in a URL? Commented Jan 8, 2018 at 13:19
  • URLs are not UTF-8 encoded. They are percent encoded, often using UTF-8 as a starting point. In Python 2, use import urllib, then urllib.quote() to create URL percent-encoded data. Start with UTF-8 bytes. Commented Jan 8, 2018 at 13:21
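The two steps described in the comments above (text to UTF-8 bytes, then bytes to percent-encoding) can be sketched as follows. This is a minimal illustration written for Python 3, where the function lives in `urllib.parse`; in Python 2 the comment's `urllib.quote()` plays the same role. The variable names are made up for the example.

```python
from urllib.parse import quote  # Python 3 location; Python 2 uses urllib.quote

text = "奥多比"                        # Unicode text (a Python 3 str)
utf8_bytes = text.encode("utf-8")     # step 1: text -> UTF-8 bytes
percent_encoded = quote(utf8_bytes)   # step 2: bytes -> percent-encoded ASCII

print(percent_encoded)                # %E5%A5%A5%E5%A4%9A%E6%AF%94
print(quote("a b"))                   # a%20b, similar to encodeURIComponent
```

The percent-encoded form is what would be appended to the URL, not the raw characters.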

2 Answers


You appear to be confusing Unicode text (represented in Python 2 as the unicode type, indicated by the u prefix on the literal syntax), with one of the standard Unicode encodings, UTF-8.

You are not creating UTF-8; you created a Unicode text object by decoding from a UTF-8 byte stream.

The byte string literal `"奥多比"` is a sequence of binary data, bytes. You either entered these in a text editor and saved the file as UTF-8 (and told Python to treat your source code as UTF-8 by starting the file with a PEP 263 codec header), or you typed it into the Python interactive prompt in a terminal that was configured to send UTF-8 data.
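As a rough illustration of the point above, here is a small sketch in Python 3, where the bytes/text split is explicit. The `b"..."` literal stands in for what a Python 2 `str` literal holds when the source or terminal is UTF-8; the names `raw` and `text` are chosen just for this example.

```python
# The nine UTF-8 bytes that a UTF-8 terminal or source file supplies for 奥多比.
raw = b"\xe5\xa5\xa5\xe5\xa4\x9a\xe6\xaf\x94"

# Decoding interprets those bytes as UTF-8 and yields Unicode text.
text = raw.decode("utf-8")

print(len(raw))    # 9 bytes
print(len(text))   # 3 characters
print(text == "\u5965\u591a\u6bd4")   # True

# Encoding is the reverse direction: Unicode text back to UTF-8 bytes.
print(text.encode("utf-8") == raw)    # True
```

So `decode()` goes from bytes to text, and `encode()` goes from text to bytes; the string in the question was already the encoded form.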

I strongly urge you to read more about the difference between bytes, codecs and Unicode text. The following links are highly recommended:


3 Comments

You mention that "奥多比" is a byte literal. I think these are Unicode characters, so there is no purpose in decoding them, whereas if we encode them we get a byte representation. The byte representation can be transferred over a network or written to a file, and the receiver can decode the bytes to get the original "奥多比" value. Have I got this concept right?
@variable: are you using Python 2? If not, then this is not something you need to worry about nearly as much. In any case, read the links I included, especially Ned Batchelder's. Try out the concepts in your interactive interpreter. Bytes are the lingua franca of data exchange; everything is bytes. Decoding to a text type (unicode in Python 2, str in Python 3) is turning bytes into a more useful object type, like using `datetime.strptime()` or int() or json.load().
I'm using Python 3 and have read that str in Python 3+ is Unicode.

In Python 2, it is of type str, i.e. a sequence of bytes. To convert it to a Unicode string, you need to decode this sequence of bytes using a codec. Simply put, the codec specifies how the bytes should be converted to a sequence of Unicode code points. See the Unicode HOWTO for a more in-depth article on this.
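To make the role of the codec concrete, here is a small Python 3 sketch (an illustration, not part of the original answer) that decodes the same bytes with two different codecs. The variable names are hypothetical.

```python
raw = "奥多比".encode("utf-8")   # the byte sequence, as a Python 2 str would hold it

# Decoding with the matching codec recovers the three intended code points.
code_points = [hex(ord(ch)) for ch in raw.decode("utf-8")]
print(code_points)              # ['0x5965', '0x591a', '0x6bd4']

# Decoding with a different codec maps the same bytes to different code points.
wrong = raw.decode("latin-1")   # latin-1 maps every byte to one character
print(len(wrong))               # 9
```

The bytes are identical in both cases; only the codec decides which code points they stand for.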
