
Recently, using Python 3.8, I ran into an encoding issue. I simplified it to a few lines of code. Maybe someone from the Python community could shed some light on the behavior I see:

import os, sys
c = chr(146)            # character hex 92 dec 146, end quote mark in cp1252
a = "Don" + c + "t"     # Don't with end quote instead of apostrophe
ae = a.encode('cp1252', errors='replace')
print(ae)
print(a)
sys.stdout.reconfigure(encoding='cp1252')
print(a)

OUTPUT:

b'Don?t'
Dont
Traceback (most recent call last):
  File "c:/1data/DEV/MyPy/Test/test_e1.py", line 8, in <module>
    print(a)
  File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 3: character maps to <undefined>

So, since \x92 is a valid character in "cp1252", why is \x92 replaced by '?' in the first line of output? If I did not use errors="replace", it would raise an exception. And why does printing to standard output with "cp1252" raise an exception when printing to standard output with 'utf-8' doesn't?

  • Your terminal does not support cp1252, that's really all there is to it. You prove it by being able to print UTF-8 encoded strings. You can only set one encoding to a terminal. Commented May 4, 2020 at 17:41
  • Python3 strings are composed of unicode codepoints, not character set byte values: ord(b'\x92'.decode('cp1252')) -> 8217, which is the codepoint of 'RIGHT SINGLE QUOTATION MARK' Commented May 4, 2020 at 17:43
  • I am running this on Windows so cp1252 is definitely supported. The output is from the terminal in Visual Studio Code. The CMD terminal shows the same behavior. Commented May 4, 2020 at 17:43
  • Is there a way in Python to deal with strings like in "C", with single-byte characters? Is Python 2.x like that? Commented May 4, 2020 at 17:50
  • Python2 is more like that, but it sounds as if you want to work with bytes rather than str? bs = b'Don' + b'\x92' + b't' -> b'Don\x92t'; bs.decode('cp1252') -> 'Don’t' Commented May 4, 2020 at 17:53

1 Answer

From https://en.wikipedia.org/wiki/Unicode :

Sixty-five code points (U+0000–U+001F and U+007F–U+009F) are reserved as control codes, and correspond to the C0 and C1 control codes defined in ISO/IEC 6429. U+0009 (Tab), U+000A (Line Feed), and U+000D (Carriage Return) are widely used in Unicode-encoded texts. In practice the C1 code points are often improperly-translated (Mojibake) legacy Windows-1252 characters used by some English and Western European texts with Windows technologies.

So chr(146) in Unicode is the C1 control code U+0092, not the character ’. cp1252 has no mapping for that code point, which is why encoding it either fails or, with errors='replace', yields '?'.
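As a rough illustration of that distinction, here is a minimal sketch that compares the code point U+0092 with what the cp1252 byte 0x92 decodes to. The variable names are only for illustration, and the UTF-8 line is my own addition to show why printing with 'utf-8' does not fail: UTF-8 can encode every code point, including control codes.

```python
import unicodedata

c1 = chr(146)                        # code point U+0092, a C1 control code
quote = b'\x92'.decode('cp1252')     # byte 0x92 interpreted as cp1252

print(hex(ord(c1)), unicodedata.category(c1))      # category Cc = control
print(hex(ord(quote)), unicodedata.name(quote))    # RIGHT SINGLE QUOTATION MARK

# cp1252 has no byte for U+0092, so the error handler substitutes '?'
replaced = c1.encode('cp1252', errors='replace')

# The real quotation mark (U+2019) maps back to the single byte 0x92
encoded = quote.encode('cp1252')

# UTF-8 can encode any code point, so U+0092 becomes two bytes
utf8 = c1.encode('utf-8')

print(replaced, encoded, utf8)
```

So the number 146 means different things depending on whether it is a cp1252 byte value or a Unicode code point.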

To get the character in a Python3 (Unicode) string you could either:

  • convert from bytes type: b'Don\x92t'.decode('cp1252')
  • find the correct Unicode codepoint for ’, which is 8217 dec or 2019 hex: 'Don\u2019t'
  • just type the character: 'Don’t' - Python3 accepts unicode characters in source files
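The three options above should all produce the same string. A small sketch to check that, assuming the source file is saved as UTF-8 (the variable names are made up for the example):

```python
from_bytes = b'Don\x92t'.decode('cp1252')   # decode the legacy byte string
from_escape = 'Don\u2019t'                  # explicit Unicode escape
from_literal = 'Don’t'                      # character typed directly in the source

# All three spellings yield the same five-character str
same = from_bytes == from_escape == from_literal
print(same, len(from_escape))

# Encoding back to cp1252 gives the original single byte 0x92
round_trip = from_escape.encode('cp1252')
print(round_trip)
```

With a string built this way, encoding to cp1252 no longer needs errors='replace', because U+2019 has a cp1252 mapping.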

6 Comments

Yes Python 3 accepts Unicode characters in source files, but you may need a special encoding comment to tell it how your source file is encoded.
@MarkRansom isn't UTF-8 the default now, in lack of # -*- coding: utf-8 -*- string? I vaguely recall seeing that somewhere.
I'm not sure, but even if it is there's no guarantee the source was saved as UTF-8. Especially given the details of the question.
@MarkRansom true.
I thank you all for your input. For me (new to Python, with extra time because of the quarantine) it's hard to deal with strings other than bytes.