Python 3.8 encoding issue

Question

Recently, using Python 3.8, I ran into an encoding issue. I have reduced it to a few lines of code. Maybe someone from the Python community could shed some light on the behavior I see:

import os, sys
c = chr(146)            # character hex 92 dec 146, end quote mark in cp1252
a = "Don" + c + "t"     # Don't with end quote instead of apostrophe
ae = a.encode('cp1252', errors='replace')
print(ae)
print(a)
sys.stdout.reconfigure(encoding='cp1252')
print(a)

OUTPUT:

b'Don?t'
Dont
Traceback (most recent call last):
  File "c:/1data/DEV/MyPy/Test/test_e1.py", line 8, in <module>
    print(a)
  File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 3: character maps to <undefined>

Since \x92 is a valid character in cp1252, why is it replaced by '?' in the first line of output? Without errors='replace', the encode call raises an exception. And why does printing to standard output raise an exception with cp1252 when it does not with utf-8?

Your terminal does not support cp1252, that's really all there is to it. You prove it by being able to print UTF-8 encoded strings. You can only set one encoding to a terminal.
– Jongware, Commented May 4, 2020 at 17:41
Python 3 strings are composed of Unicode code points, not character set byte values: ord(b'\x92'.decode('cp1252')) -> 8217, which is the code point of 'RIGHT SINGLE QUOTATION MARK'
– snakecharmerb, Commented May 4, 2020 at 17:43
I am running this on Windows, so cp1252 is definitely supported. The output is from the terminal in Visual Studio Code. The CMD terminal shows the same behavior.
– Christopher B., Commented May 4, 2020 at 17:43
Is there a way in Python to deal with strings as in C, with single-byte characters? Is Python 2.x like that?
– Christopher B., Commented May 4, 2020 at 17:50
Python 2 is more like that, but it sounds as if you want to work with bytes rather than str? bs = b'Don' + b'\x92' + b't' -> b'Don\x92t'; bs.decode('cp1252') -> 'Don’t'
– snakecharmerb, Commented May 4, 2020 at 17:53
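
The bytes-versus-str distinction raised in these comments can be sketched as follows. This is only an illustration using built-in types; the variable names are made up for the example.

```python
# Bytes hold raw single-byte values, much like a C char array.
raw = b"Don" + b"\x92" + b"t"

# Decoding interprets those bytes through a code page and yields a str
# made of Unicode code points.
text = raw.decode("cp1252")

print(raw)           # the byte 0x92 is still present
print(len(raw))      # 5 bytes
print(ord(text[3]))  # 8217, i.e. U+2019 RIGHT SINGLE QUOTATION MARK

# Encoding with the same code page gives the original bytes back.
round_trip = text.encode("cp1252")
print(round_trip == raw)
```

If single-byte handling is what is wanted, staying in bytes until the text has to be displayed is probably the closest match to the C model.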

Błotosmętek · Accepted Answer · 2020-05-04 17:58:23Z

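From https://en.wikipedia.org/wiki/Unicode :

"Sixty-five code points (U+0000–U+001F and U+007F–U+009F) are reserved as control codes, and correspond to the C0 and C1 control codes defined in ISO/IEC 6429. U+0009 (Tab), U+000A (Line Feed), and U+000D (Carriage Return) are widely used in Unicode-encoded texts. In practice the C1 code points are often improperly-translated (Mojibake) legacy Windows-1252 characters used by some English and Western European texts with Windows technologies."

So chr(146) is the C1 control character U+0092; it does not represent the ’ character. The byte 0x92 means ’ only in cp1252, where it decodes to U+2019. Because cp1252 has no byte for U+0092, encoding it fails, or yields '?' with errors='replace'. UTF-8 can encode every code point, so printing to a UTF-8 stdout succeeds, and the terminal simply shows nothing for the control character.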
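
As a quick check of the point about control codes, here is a minimal sketch that uses only the standard library. The variable names are illustrative and not from the question.

```python
import unicodedata

c = chr(146)  # U+0092, the value used in the question

# 'Cc' is the Unicode general category for control characters.
print(unicodedata.category(c))  # Cc

# cp1252 has no byte for U+0092, so strict encoding raises.
try:
    c.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False

# With errors='replace' the codec substitutes '?', as in the question.
replaced = ("Don" + c + "t").encode("cp1252", errors="replace")
print(replaced)  # b'Don?t'

# UTF-8 can encode every code point, including control characters.
utf8 = c.encode("utf-8")
print(utf8)  # b'\xc2\x92'
```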

To get the ’ character in a Python 3 (Unicode) string, you can do any of the following:

convert from the bytes type: b'Don\x92t'.decode('cp1252')
use the correct Unicode code point for ’, which is 8217 decimal or 2019 hex (U+2019): 'Don\u2019t'
just type the character: 'Don’t' (Python 3 accepts Unicode characters in source files)
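
The three options should all produce the same string, and that string can be written through a cp1252 stream without error. A small sketch under two assumptions: fake_stdout is a hypothetical in-memory stand-in for sys.stdout after reconfigure(encoding='cp1252'), and the typed literal assumes this file is saved as UTF-8.

```python
import io

from_bytes = b"Don\x92t".decode("cp1252")
from_escape = "Don\u2019t"
typed = "Don’t"  # assumes the source file is saved as UTF-8

print(from_bytes == from_escape == typed)

# In-memory stand-in for a stdout configured with encoding='cp1252'.
# newline="\n" disables newline translation so the bytes are the same
# on every platform.
buffer = io.BytesIO()
fake_stdout = io.TextIOWrapper(buffer, encoding="cp1252", newline="\n")
print(from_escape, file=fake_stdout)
fake_stdout.flush()

written = buffer.getvalue()
print(written)  # b'Don\x92t\n'
```

Here U+2019 encodes to the single byte 0x92, which is what the question expected chr(146) to do.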

6 Comments

Mark Ransom Over a year ago

Yes, Python 3 accepts Unicode characters in source files, but you may need a special encoding comment to tell it how your source file is encoded.

Błotosmętek Over a year ago

@MarkRansom isn't UTF-8 the default now, in the absence of a # -*- coding: utf-8 -*- line? I vaguely recall seeing that somewhere.

Mark Ransom Over a year ago

I'm not sure, but even if it is there's no guarantee the source was saved as UTF-8. Especially given the details of the question.

Błotosmętek Over a year ago

@MarkRansom true.
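
For what it's worth, the source-encoding point can be checked with the standard tokenize module. This is a rough sketch: source_encoding is a hypothetical helper written for this example, and the byte strings stand in for files saved in different encodings.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding tokenize detects for these source bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _ = tokenize.detect_encoding(readline)
    return encoding


plain = b'print("hello")\n'
declared = b'# -*- coding: cp1252 -*-\ns = "Don\x92t"\n'
undeclared = b's = "Don\x92t"\n'  # cp1252 bytes with no declaration

print(source_encoding(plain))     # utf-8
print(source_encoding(declared))  # cp1252

# Without a declaration the bytes are read as UTF-8, and 0x92 alone is
# not valid UTF-8, so detection fails.
try:
    source_encoding(undeclared)
    detected = True
except SyntaxError:
    detected = False
print(detected)  # False

# Decoding with the declared encoding recovers U+2019.
decoded = declared.decode(source_encoding(declared))
print("\u2019" in decoded)  # True
```

So UTF-8 is the default when no declaration is present, but a file saved as cp1252 still needs the coding comment.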

Christopher B. Over a year ago

I thank you all for your input. For me (new to Python, with extra time because of the quarantine) it's hard to deal with strings that are not just bytes.
