Recently, using Python 3.8, I ran into an encoding issue. I have reduced it to a few lines of code. Maybe someone from the Python community can shed some light on the behavior I see:
import os, sys
c = chr(146) # character hex 92 dec 146, end quote mark in cp1252
a = "Don" + c + "t" # Don't with end quote instead of apostrophe
ae = a.encode('cp1252', errors='replace')
print(ae)
print(a)
sys.stdout.reconfigure(encoding='cp1252')
print(a)
OUTPUT:
b'Don?t'
Dont
Traceback (most recent call last):
  File "c:/1data/DEV/MyPy/Test/test_e1.py", line 8, in <module>
    print(a)
  File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 3: character maps to <undefined>
So, since \x92 is a valid character in cp1252, why is it replaced by '?' in the first line of output? Without errors='replace', the encode call would raise an exception. And why does printing to standard output with cp1252 raise an exception when printing with utf-8 doesn't?
ord(b'\x92'.decode('cp1252')) -> 8217, which is the code point of 'RIGHT SINGLE QUOTATION MARK'.
bs = b'Don' + b'\x92' + b't' -> b'Don\x92t'
bs.decode('cp1252') -> 'Don’t'
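
To spell out the distinction the comment above points at, here is a minimal sketch. chr(146) produces the Unicode code point U+0092, a C1 control character, which is not the same thing as the cp1252 byte 0x92. The names control, quote, good, bad, buffer and stream are only illustrative, and the in-memory stream is an assumption standing in for a reconfigured sys.stdout so that the example does not touch the real console.

```python
import io
import unicodedata

# chr(146) is the code point U+0092 (a C1 control character),
# not the cp1252 byte 0x92.
control = chr(146)
# Decoding the byte 0x92 as cp1252 yields the intended character.
quote = b"\x92".decode("cp1252")

print(f"U+{ord(control):04X}", unicodedata.category(control))  # U+0092 Cc
print(f"U+{ord(quote):04X}", unicodedata.name(quote))  # U+2019 RIGHT SINGLE QUOTATION MARK

# The real quotation mark round-trips through cp1252.
good = "Don" + quote + "t"
print(good.encode("cp1252"))  # b'Don\x92t'

# U+0092 has no cp1252 mapping, so strict encoding fails,
# while UTF-8 has an encoding for it.
bad = "Don" + control + "t"
try:
    bad.encode("cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 failed:", exc.reason)
print(bad.encode("utf-8"))  # b'Don\xc2\x92t'

# print() encodes text with the stream's encoding; this in-memory
# stream is a stand-in for sys.stdout after reconfigure().
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="cp1252", newline="\n")
print(good, file=stream)
stream.flush()
print(buffer.getvalue())  # b'Don\x92t\n'
```

If this reading is right, both questions have the same answer: the string in the original snippet contains U+0092 rather than U+2019, so cp1252 has nothing to map it to (hence the '?' with errors='replace' and the exception on print), while utf-8 encodes it without complaint. Building the string from "\u2019", or by decoding the byte as shown, should avoid the error.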