
Before someone says this is a duplicate question, I just want to let you know that the error I am getting from running this program in command line is different from all the other related questions I've seen.

I am trying to run a very short script in Python:

from bs4 import BeautifulSoup
import urllib.request

# Fetch the page as bytes, then decode it to str before parsing.
html = urllib.request.urlopen("http://dictionary.reference.com/browse/word?s=t").read()
dhtml = html.decode("utf-8")
soup = BeautifulSoup(dhtml, "html.parser")
print(soup.prettify())

But I keep getting an error when I run this program with python.exe: UnicodeEncodeError: 'charmap' codec can't encode character '\u025c'. I have tried many ways to get around this, and I managed to isolate it to the problem of converting bytes to strings. When I run this program in IDLE, I get the HTML as expected. What is it that IDLE is automatically doing? Can I use IDLE's interpretation program instead of python.exe? Thanks!

EDIT:

My problem is caused by print(soup.prettify()) but type(soup.prettify()) returns str?

RESOLVED:

I finally decided to use encode() and decode() because of the trouble this has caused. If someone knows how to actually resolve this problem, please do; also, thank you for all your answers.
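One way such an encode()/decode() round trip could look is sketched below. The helper name console_safe and the choice of the backslashreplace error handler are assumptions for illustration, not necessarily what was used here; cp437 stands in for whatever code page the console actually reports.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks written as escapes."""
    # Fall back to the current stdout encoding, or ASCII if it is unknown.
    encoding = encoding or sys.stdout.encoding or "ascii"
    # Unencodable characters become \uXXXX escapes instead of raising an error.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

print(console_safe("b\u025crd", "cp437"))
```

Wrapping the argument of print() this way, for example print(console_safe(soup.prettify())), avoids the exception at the cost of showing escapes rather than the real characters.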

  • Not seeing a character encoding declared on that page. Commented Jul 18, 2015 at 18:51
  • Do a ctrl+f for charset please Commented Jul 18, 2015 at 18:53
  • I think it is the first meta tag in head Commented Jul 18, 2015 at 18:53
  • You can also find out the encoding from BeautifulSoup Commented Jul 18, 2015 at 18:54
  • Sorry-you're right. Anyway, html validator shows 77 errors. validator.w3.org/nu/… Commented Jul 18, 2015 at 18:58

3 Answers

UnicodeEncodeError: 'charmap' codec can't encode character '\u025c'

The console's character encoding can't represent '\u025c', i.e., the Unicode character "ɜ" (U+025C LATIN SMALL LETTER REVERSED OPEN E).
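A minimal sketch of that failure, independent of any console: the helper name can_encode is hypothetical, and cp437 is only an assumed example of a Windows console code page (the real one depends on the system).

```python
def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

ch = "\u025c"  # LATIN SMALL LETTER REVERSED OPEN E
print(can_encode(ch, "cp437"))   # legacy code page: the character is missing
print(can_encode(ch, "utf-8"))   # UTF-8 can encode any Unicode character
```

print() performs the same encoding step on the way to the console, which is why the error only appears at output time even though the value is an ordinary str.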

What is it that IDLE is automatically doing?

IDLE displays Unicode characters directly (BMP characters only), provided the font in use supports them.

Can I use IDLE's interpretation program instead of python.exe

Yes, run:

T:\> py -midlelib -r your_script.py

Note: you can write arbitrary Unicode characters to the Windows console if the Unicode API is used:

T:\> py -mpip install win-unicode-console
T:\> py -mrun your_script.py

See What's the deal with Python 3.4, Unicode, different languages and Windows?

9 Comments

I'd change "display arbitrary Unicode characters in Windows console" to something like "write Unicode to the console". Available fonts depend on the Windows locale, and the console window doesn't support mixing halfwidth characters with fullwidth characters (CJK), i.e. a character can't map to 2 cells. It's also limited to the BMP because each cell stores a single wchar_t code, which excludes using UTF-16 surrogate pairs.
However, those limits are due to how conhost.exe works, not the console API itself. You can actually hide the window that conhost.exe creates and instead display the console screen buffer in a window that has more flexible font support. That's what ConEmu does.
@eryksun: yes, astral characters are displayed as boxes even if the font supports the characters. If you copy the boxes and paste into e.g., notepad then the characters are shown correctly.
One more question, I have both Python 3 and Python27, how do I access the Python 3 program?
Configure py to start Python 3 by default (unless you've changed it; it is probably the default already). Or specify the shebang or call py -3 explicitly.

I just want to let you know that the error I am getting from running this program in command line is different from all the other related questions I've seen.

Not really. You have PrintFails like everyone else.

The Windows console can't print Unicode. (This isn't strictly true, but going into exactly why, when and how you can get Unicode out of the console is a painful exercise and not usually worth it.) Trying to print a character that isn't in the console's limited encoding can't work, so Python gives you an error.

print them out (which I need an easier solution to because I cannot do .encode("utf-8") for a lot of elements

You could run the command set PYTHONIOENCODING=utf-8 before running the script to tell Python to use an encoding that can represent any character (so no errors), but any non-ASCII output will still come out garbled, because that encoding won't match the console's actual code page.

(Or indeed just use IDLE.)
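To see the effect of PYTHONIOENCODING without touching the current shell, the sketch below starts a child interpreter with the variable set and asks it which encoding its stdout uses. The helper name child_stdout_encoding is hypothetical; the approach only illustrates the setting and does not fix how the console renders the bytes.

```python
import os
import subprocess
import sys

def child_stdout_encoding(io_encoding):
    """Run a child Python with PYTHONIOENCODING set and report its stdout encoding."""
    # Copy the current environment and add the override for the child only.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()

print(child_stdout_encoding("utf-8"))
```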

1 Comment

My problem is that I need this on a localhost server and my data is in bytes

I finally decided to use encode() and decode() because of the trouble this has caused. If someone knows how to actually resolve this problem, please do; also, thank you for all your answers.
