If I encode a string using utf-16be and then decode the encoded bytes using utf-8, I don't get any error and the output appears to print correctly on the screen, but I'm still not able to convert the decoded string into a Python object using the json module.
import json

str = '{"foo": "bar"}'                         # note: this shadows the built-in str
encoded_str = str.encode("utf-16be")           # two bytes per character, e.g. b'\x00{\x00"...'
decoded_str = encoded_str.decode('utf-8')      # no error raised here
print(decoded_str)                             # appears to print {"foo": "bar"}
print(json.JSONDecoder().decode(decoded_str))  # raises json.JSONDecodeError
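For reference, print() hides what is going on; checking the same objects with repr() shows this on my run:

print(repr(decoded_str))
# '\x00{\x00"\x00f\x00o\x00o\x00"\x00:\x00 \x00"\x00b\x00a\x00r\x00"\x00}'
try:
    json.loads(decoded_str)
except json.JSONDecodeError as exc:
    print(exc)   # Expecting value: line 1 column 1 (char 0)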
I know that an encoded string should be decoded using the same encoding, but I'm trying to understand why it behaves this way. I want to know:

1. Why does encoding str with utf-16be and then decoding encoded_str with utf-8 not result in an error?
2. Since the encode/decode round trip raises no error and decoded_str appears to be valid JSON (as the print statement shows), why does decode(decoded_str) result in an error?
3. Why does writing the output to a file and viewing the file through the less command show it as a binary file (see the check after this list)?

   file = open("data.txt", 'w')
   file.write(decoded_str)

   When using the less command to view data.txt, it says: "data.txt" may be a binary file. See it anyway?
4. If decoded_str is invalid JSON or something else, how can I view it in its original form? (print() is printing it as valid JSON.)
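Here is the check mentioned in question 3: reading data.txt back in binary mode (assuming the default locale encoding is UTF-8, as it is on this Ubuntu setup) shows the raw bytes that end up on disk:

with open("data.txt", "w") as f:
    f.write(decoded_str)

with open("data.txt", "rb") as f:
    print(f.read())
# b'\x00{\x00"\x00f\x00o\x00o\x00"\x00:\x00 \x00"\x00b\x00a\x00r\x00"\x00}'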
I'm using Python 3.10.12 on Ubuntu 22.04.4 LTS
hexdump data.txt: you see (effectively) utf-16be-encoded text; note that decoded_str contains '\x00{\x00"\x00f\x00o\x00o\x00"\x00:\x00 \x00"\x00b\x00a\x00r\x00"\x00}'
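One way to get the original form back (a sketch; it works here only because every character is plain ASCII, so re-encoding as utf-8 reproduces the original utf-16be bytes exactly):

original_bytes = decoded_str.encode("utf-8")    # same bytes as encoded_str above
recovered = original_bytes.decode("utf-16be")   # '{"foo": "bar"}'
print(json.loads(recovered))                    # {'foo': 'bar'}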