0

I am using the Goose engine to extract article text from a URL with the following code:

g = Goose()
article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")

It looks like this URL is problematic for some reason, because I am getting the following error:

'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
The string that could not be encoded/decoded was: �

I am correctly specifying utf-8 as my codec at the top of my file like so:

# -*- coding: utf-8 -*-

How can I solve this issue?

EDIT: Stack Trace:

Environment:


Request Method: GET
Request URL: http://localhost:3000/scansources/

Django Version: 1.5.1
Python Version: 2.7.2
Installed Applications:
('django.contrib.auth',
 'django.contrib.contenttypes',
 'django.contrib.sessions',
 'django.contrib.sites',
 'django.contrib.messages',
 'django.contrib.staticfiles',
 'summaries',
 'sources_scan')
Installed Middleware:
('django.middleware.common.CommonMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'django.middleware.csrf.CsrfViewMiddleware',
 'django.contrib.auth.middleware.AuthenticationMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware')


Traceback:
File "/Library/Python/2.7/site-packages/django/core/handlers/base.py" in get_response
  115.                         response = callback(request, *callback_args, **callback_kwargs)
File "/Users/yonatanoren/Documents/python/summarizer/sources_scan/views.py" in scan_sources
  183.              article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/__init__.py" in extract
  53.         return self.crawl(cc)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/__init__.py" in crawl
  60.         article = crawler.crawl(crawl_candiate)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/crawler.py" in crawl
  90.         article.top_node = extractor.calculate_best_node(article)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/extractors.py" in calculate_best_node
  248.             text_node = self.parser.getText(node)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/parsers.py" in getText
  179.         txts = [i for i in node.itertext()]

Exception Type: UnicodeDecodeError at /scansources/
Exception Value: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte

Thanks.

EDIT: Using the python shell I get the same error with this code:

>>> g = Goose()
>>> article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")

I also updated all of my files to use the following, and still get the error.

#encoding=utf-8

I believe this may be a problem with Goose itself, because Goose handles the text and returns it. How would I solve it in this case?

EDIT: the following doesn't make a difference either

text = unicode(article.cleaned_text,'utf-8')
6
  • Are you sure that the error is caused by the g.extract call? Or does it happen when you try to convert its result to a string later? Commented Sep 18, 2013 at 6:30
  • In order to help with solution, please, post all stack trace error. Commented Sep 18, 2013 at 6:33
  • The coding comment at the top only applies to how the Python compiler interprets the source code; data read from elsewhere, or sent elsewhere, is encoded and decoded according to different rules altogether. Commented Sep 18, 2013 at 6:51
  • @JohannesCharra looking at the stack trace it seems like the error is caused by the extracted article text (being converted to a string?). Commented Sep 18, 2013 at 7:26
  • @MartijnPieters How can I solve this problem? it seems like you may know a solution, thanks. Commented Sep 21, 2013 at 19:15
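
The point made in the comments about the coding line can be illustrated with a minimal sketch. It assumes nothing about Goose; the variable names are made up for illustration, and the byte is the one named in the error message. The sketch only shows that 0xa0 is not valid as the start of a UTF-8 character, while the same byte is a non-breaking space in Latin-1, which suggests (but does not prove) that the page contains text in a different encoding than the one being assumed.

```python
# -*- coding: utf-8 -*-
# The coding line above only tells Python how to read this source file.
# It has no effect on bytes that arrive from the network at run time.

raw = b"\xa0"  # the byte from the error message

try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    # 0xa0 is a continuation byte in UTF-8, so it cannot start a character.
    decoded_as_utf8 = False

# In Latin-1 and Windows-1252 the same byte is a non-breaking space (U+00A0).
as_latin1 = raw.decode("latin-1")
```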

4 Answers

1

You could try raw_html extraction, as described here: https://github.com/grangier/python-goose#known-issues

That way you can do the encoding/decoding on the raw HTML yourself before handing it to Goose.
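
A rough sketch of that idea follows. The helper decode_html and its list of fallback encodings are hypothetical, not part of Goose, and whether decoding up front avoids the error for this particular page is an assumption that has not been verified here.

```python
def decode_html(raw_bytes, encodings=("utf-8", "cp1252")):
    """Decode fetched bytes with the first encoding that succeeds.

    Hypothetical helper: the order of encodings is a guess, not something
    taken from the page or from Goose.
    """
    for enc in encodings:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes with U+FFFD.
    return raw_bytes.decode("utf-8", "replace")


# Possible use with python-goose on Python 2 (not executed here):
#   import urllib2
#   from goose import Goose
#   raw = urllib2.urlopen(url).read()
#   article = Goose().extract(raw_html=decode_html(raw))
```

The extract(raw_html=...) call is the one shown in the known-issues section linked above. Falling back to cp1252 is only a guess based on the 0xa0 byte in the error message.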


0

Maybe it helps to use unicode for all strings: insert from __future__ import unicode_literals at the very first line of your Python file and retry.

2 Comments

Did you try the article extraction without django in a simple toy project? Do you get the same error?
I don't know whether it's good or bad news, but I can reproduce your error... It fails in command line when extracting this specific url (and works for some other urls, for some it fails silently). This sounds like it's time for a bug report. E.g. nytimes.com/2013/09/22/technology/… is not extracted correctly (but raises no exceptions)
0

Try adding a little u before the string. I don't see any weird characters there, but I usually use Hebrew in my Django code, and the coding line at the top is not always enough.

article = g.extract(url=u"http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")

2 Comments

I think it's actually the text being extracted, not the URL. How can I solve it in this case?
I think you're right. A few things to look into: 1. Change the coding line at the top to the form Django uses (you're using Python's; try #encoding=utf-8). 2. Use unicode(x, 'utf-8') and other encoding/decoding tools (including ugettext and such) and play around with it to see what happens.
0

Even though I can't reproduce the error with this URL, I had similar problems with python-goose. Try:

from goose.configuration import Configuration
from goose import Goose


config = Configuration()
config.parser_class = 'soupparser' # this helped me
g = Goose(config)
article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")
