4

I have a Python 3 program that reads some strings from a Windows-1252 encoded file:

with open(file, 'r', encoding="cp1252") as file_with_strings:
    # save some strings

Which I later want to write to stdout. I've tried to do:

print(some_string)
# => UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 180: ordinal not in range(128)

print(some_string.decode("utf-8"))
# => AttributeError: 'str' object has no attribute 'decode'

sys.stdout.buffer.write(some_string)
# => TypeError: 'str' does not support the buffer interface

print(some_string.encode("cp1252").decode("utf-8"))
# => UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 180: invalid continuation byte

print(some_string.encode("cp1252"))
# => has the unfortunate result of printing b'<my string>' instead of just the string

I'm scratching my head here. I'd like to print the string exactly as it appears in the file, in cp1252. (In my terminal, when I do more $file, these characters appear as question marks, so my terminal is probably ASCII.)

Would love some clarification! Thanks!

4
  • What does string_to_print = some_string.decode('utf-8'); print(string_to_print) do? Commented Mar 3, 2016 at 3:30
  • It's just a str, so I get AttributeError: 'str' object has no attribute 'decode' Commented Mar 3, 2016 at 3:34
  • "(In my terminal, when I do more $file, these characters appear as question marks, so my terminal is probably ascii.)" <- No. Since your answer writes cp1252, it is more likely that your terminal encoding doesn't match your locale. Commented Mar 7, 2016 at 19:02
  • I'm voting to close this question as off-topic because the actual problem is too localised - it's caused by an incorrectly configured environment and/or by usage but is not properly described. Commented Mar 8, 2016 at 17:34

5 Answers

10

Since Python 3.7, you can change the encoding of all text written to sys.stdout with the reconfigure method:

import sys

sys.stdout.reconfigure(encoding="cp1252")

That could be helpful if you need to change the encoding for all output from your program.
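
A minimal sketch of what reconfigure does, using an in-memory text stream in place of sys.stdout so the resulting bytes can be inspected. The helper name encode_through_stream is hypothetical, and the stream starts out as ASCII only to mimic the environment described in the question.

```python
import io


def encode_through_stream(text, encoding="cp1252"):
    """Write text through a text stream after switching its encoding.

    Hypothetical helper: an in-memory stream stands in for sys.stdout.
    """
    raw = io.BytesIO()
    # Start as ASCII; newline="\n" disables platform newline translation.
    stream = io.TextIOWrapper(raw, encoding="ascii", newline="\n")
    # Same call as sys.stdout.reconfigure(...), available since Python 3.7.
    stream.reconfigure(encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


if __name__ == "__main__":
    print(encode_through_stream("caf\xe9"))
```

Without the reconfigure call, the write would raise the same UnicodeEncodeError shown in the question, because the stream would still be using the ASCII codec.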


2

To anybody out there with the same problem, I ended up doing:

to_print = (some_string + "\n").encode("cp1252")
sys.stdout.buffer.write(to_print)
sys.stdout.flush() # I write a ton of these strings, and segfaulted without flushing
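
The same idea could be wrapped in a small function, sketched here under the assumption that the target is any binary stream with write and flush. The name write_encoded and its parameters are hypothetical and not part of the original answer.

```python
import io
import sys


def write_encoded(text, binary_stream=None, encoding="cp1252"):
    """Encode text plus a trailing newline and write the bytes to a binary stream.

    Hypothetical helper; falls back to sys.stdout.buffer when no stream is given.
    """
    if binary_stream is None:
        binary_stream = sys.stdout.buffer
    binary_stream.write((text + "\n").encode(encoding))
    binary_stream.flush()


if __name__ == "__main__":
    # Demonstrate with an in-memory buffer so the bytes can be inspected.
    demo = io.BytesIO()
    write_encoded("caf\xe9", demo)
    print(demo.getvalue())
```

Because the bytes bypass the text layer of sys.stdout, the encoding Python picked for stdout no longer matters. The characters will still only display correctly if the terminal, or whatever reads the redirected file, interprets the bytes as cp1252.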


1

When you encode with cp1252, you have to decode with the same.

E.g.:

import sys
txt = ("hi hello\n").encode("cp1252")
#print((txt).decode("cp1252"))
sys.stdout.buffer.write(txt)
sys.stdout.flush()

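This writes the cp1252-encoded bytes of "hi hello\n" directly to stdout's binary buffer. The commented-out line shows the alternative of decoding before printing, which must use the same codec.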

4 Comments

Printing after decode just tries to print a Unicode string, which leads you right back where you started. Your example only works because it only contains ASCII characters.
Yeah, agreed. Buffer writer has to be used.
This helped me a lot. I was reading from a STDIN, and writing to a file worked, as you can set the encoding in open(), but printing was a nightmare.
If different codecs are used for encoding and decoding (e.g. print(txt.encode("utf-8").decode("cp1252"))), the result is not identical to the original but may still be printable. The translation errors can actually be helpful to find the offending characters.

0

You're either piping to your script or your locale is broken. You should fix your environment, rather than fixing your script to your environment, as this will make your script very brittle.

If you're piping, Python assumes the output should be "ASCII" and sets the encoding of stdout to "ASCII".

Under normal conditions, Python uses the locale to work out what encoding to apply to stdout. If your locale is broken (not installed or corrupt), Python will default to "ASCII". A locale of "C" will also give you an encoding of "ASCII".

Check your locale by typing locale and ensure no errors are returned. E.g.

$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=

If all else fails or you're piping, you can override Python's locale detection by setting the PYTHONIOENCODING environment variable. E.g.

$ PYTHONIOENCODING=utf-8 ./my_python.sh

Remember that your shell has a locale and your terminal has an encoding. They both need to be set correctly.
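
To see what Python actually decided from inside the script, something like the following sketch could help. The function name describe_output_encoding is hypothetical; it only reports values and does not change anything.

```python
import locale
import os
import sys


def describe_output_encoding():
    """Collect the settings that influence how text reaches stdout.

    Hypothetical diagnostic helper; it reads settings and modifies nothing.
    """
    return {
        # Encoding Python chose for stdout (None for some replaced streams).
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the current locale settings.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Explicit override, if one is set in the environment.
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        # False when output is piped or redirected to a file.
        "stdout_isatty": sys.stdout.isatty() if hasattr(sys.stdout, "isatty") else None,
    }


if __name__ == "__main__":
    for name, value in describe_output_encoding().items():
        print(f"{name}: {value}")
```

Running it once directly and once with output redirected to a file shows whether the encoding changes between the two cases.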

5 Comments

Not piping, but it's also not my environment - it's a program that I have to run on school servers, which have ascii terminals. I could change my personal environment or use a different terminal, but I can't guarantee that the graders will.
It's Debian, I'm handing in a .py file that will be run with python3 by someone on a different computer, but reading from the same files, and always trying to write to ascii stdout
If your terminals really are ASCII (they probably aren't), why is your answer encoding to "cp1252"?
I have to encode to cp1252 to maintain the accent marks that were in the original data. This script's output will be redirected to a file, and I want that file to have those accent marks. My locale has nothing set for LANG/LANGUAGE or ALL, and everything else is "POSIX", fwiw
1) Your terminals are not ASCII if they're displaying cp1252. 2) Your environment is not set up correctly if you don't have a LANG defined. That is why "more" is failing. You may find your students have correctly configured environments or a different encoding setup, meaning your brittle code will break.

0

This does not work:

plt.savefig(sys.stdout.buffer)

Use sys.stdout.encoding instead of the buffer:

plt.savefig(sys.stdout.encoding)
