Django - UnicodeDecodeError: weird character "�"

Question

I am using Goose engine to extract article text from a url using the following code:

from goose import Goose

g = Goose()
article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")

It looks like this URL is problematic for some reason, because I am getting the following error:

'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
The string that could not be encoded/decoded was: �

I am correctly specifying utf-8 as my codec at the top of my file like so:

# -*- coding: utf-8 -*-

How can I solve this issue?

EDIT: Stack Trace:

Environment:


Request Method: GET
Request URL: http://localhost:3000/scansources/

Django Version: 1.5.1
Python Version: 2.7.2
Installed Applications:
('django.contrib.auth',
 'django.contrib.contenttypes',
 'django.contrib.sessions',
 'django.contrib.sites',
 'django.contrib.messages',
 'django.contrib.staticfiles',
 'summaries',
 'sources_scan')
Installed Middleware:
('django.middleware.common.CommonMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'django.middleware.csrf.CsrfViewMiddleware',
 'django.contrib.auth.middleware.AuthenticationMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware')


Traceback:
File "/Library/Python/2.7/site-packages/django/core/handlers/base.py" in get_response
  115.                         response = callback(request, *callback_args, **callback_kwargs)
File "/Users/yonatanoren/Documents/python/summarizer/sources_scan/views.py" in scan_sources
  183.              article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/__init__.py" in extract
  53.         return self.crawl(cc)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/__init__.py" in crawl
  60.         article = crawler.crawl(crawl_candiate)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/crawler.py" in crawl
  90.         article.top_node = extractor.calculate_best_node(article)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/extractors.py" in calculate_best_node
  248.             text_node = self.parser.getText(node)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/parsers.py" in getText
  179.         txts = [i for i in node.itertext()]

Exception Type: UnicodeDecodeError at /scansources/
Exception Value: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte

Thanks.

EDIT: Using the python shell I get the same error with this code:

>>> g = Goose()
>>> article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")

I also updated all of my files to use the following, and still get the error.

#encoding=utf-8

I believe this may be a problem with Goose itself, because Goose handles the text and returns it. How would I solve it in this case?

EDIT: the following doesn't make a difference either

text = unicode(article.cleaned_text,'utf-8')

Are you sure that the error is caused by the g.extract call? Or does it happen when you try to convert its result to a string later?
– Johannes Charra, Commented Sep 18, 2013 at 6:30
To help with a solution, please post the full stack trace.
– dani herrera, Commented Sep 18, 2013 at 6:33
The coding comment at the top only applies to how the Python compiler interprets the source code; data read from elsewhere, or sent elsewhere, is encoded and decoded according to different rules altogether.
– Martijn Pieters, Commented Sep 18, 2013 at 6:51
@JohannesCharra looking at the stack trace it seems like the error is caused by the extracted article text (being converted to a string?).
– TheProofIsTrivium, Commented Sep 18, 2013 at 7:26
@MartijnPieters How can I solve this problem? It seems like you may know a solution, thanks.
– TheProofIsTrivium, Commented Sep 21, 2013 at 19:15
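
To illustrate the comment above about the coding line, here is a minimal sketch that is not taken from the thread. It shows that the declaration at the top of a file has no effect on how bytes are decoded at runtime, and that a lone 0xa0 byte is not valid UTF-8. The sample bytes are hypothetical. That the page actually contains Latin-1 or Windows-1252 data is an assumption, not something the traceback proves.

```python
# -*- coding: utf-8 -*-
# The coding line above only tells Python how to read THIS source file.
# It does not change how bytes obtained at runtime are decoded.

# Hypothetical sample: byte 0xa0 at position 1, matching the error message.
raw = b"A\xa0B"

try:
    raw.decode("utf-8")
    utf8_ok = True
    error_message = ""
except UnicodeDecodeError as exc:
    # 0xa0 is a UTF-8 continuation byte, so it cannot start a character.
    utf8_ok = False
    error_message = str(exc)

# In Latin-1 (and Windows-1252), 0xa0 is a non-breaking space (U+00A0).
as_latin1 = raw.decode("latin-1")
```

If the page really is in a single-byte encoding, the fix has to happen where the bytes are decoded, not in the source file header.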

thinker3 · Accepted Answer · 2013-09-18 07:13:24Z

1

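You may try raw_html extraction, as described under known issues in the python-goose README: https://github.com/grangier/python-goose#known-issues

That way you can do the encoding/decoding of the raw HTML yourself before handing it to Goose.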
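
A rough sketch of what "doing the decoding yourself" could look like. The helper name decode_html and the list of fallback encodings are assumptions made for illustration; they are not part of Goose. The idea is to try UTF-8 first and fall back to a single-byte encoding, then pass the resulting text to g.extract(raw_html=...). Whether that avoids the error for this particular page is untested here.

```python
def decode_html(raw_bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used) for the first encoding that decodes cleanly.

    Hypothetical helper: the fallback order is a guess, not a Goose feature.
    """
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if the caller supplied a list without latin-1,
    # since latin-1 can decode any byte sequence.
    return raw_bytes.decode("utf-8", "replace"), "utf-8 (with replacements)"


# Intended use with Goose (left as comments so the sketch runs without it):
#   import urllib2
#   from goose import Goose
#   raw = urllib2.urlopen(url).read()
#   html, used = decode_html(raw)
#   article = Goose().extract(raw_html=html)
```

Trying UTF-8 first matters: latin-1 accepts any input, so putting it first would silently mis-decode pages that really are UTF-8.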

OBu · Accepted Answer · 2013-09-18 20:34:19Z

0

Maybe it helps to use unicode for all strings: insert from __future__ import unicode_literals at the very first line of your Python file and retry.

2 Comments

OBu Over a year ago

Did you try the article extraction without Django, in a simple toy project? Do you get the same error?

OBu Over a year ago

I don't know whether it's good or bad news, but I can reproduce your error. It fails on the command line when extracting this specific URL (it works for some other URLs, and for some it fails silently). This sounds like it's time for a bug report. E.g. nytimes.com/2013/09/22/technology/… is not extracted correctly (but raises no exceptions).

yuvi · Accepted Answer · 2013-09-18 20:42:17Z

0

Try adding a little u before the string. I don't see any weird characters there, but I usually use Hebrew in my Django code, and the coding declaration at the top is not always enough.

article = g.extract(url=u"http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")

2 Comments

TheProofIsTrivium Over a year ago

I think it's actually the text being extracted, not the URL. How can I solve it in this case?

yuvi Over a year ago

I think you're right. A few things to look into: 1. Change the coding line at the top from the -*- coding: utf-8 -*- form to #encoding=utf-8. 2. Use unicode(x, 'utf-8') and other encoding/decoding tools (including ugettext and such), play around with them, and see what happens.

user637644 · Accepted Answer · 2014-10-13 16:26:53Z

0

Even though I can't reproduce the error with this URL, I had similar problems with python-goose. Try:

from goose.configuration import Configuration
from goose import Goose


config = Configuration()
config.parser_class = 'soupparser' # this helped me
g = Goose(config)
article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")
