Python Decode UTF-8 Not working

Question

I am using Scrapy for scraping a Persian website.

title = response.xpath('//*[@id="news"]/div/div[2]/div[2]/div[2]/div[2]/div[2]/h1/a/text()').extract()

When I extract the title from the site, it gives me an encoded string like this:

[u' \t\t\u0628\u06cc\u0645\u0647 10 \u0633\u0627\u0644\u0647\u200c \u062f\u0631 \u062e\u0637 \u062d\u0645\u0644\u0647\u200c\u06cc \u062a\u06cc\u0645 \u0645\u0644\u06cc \t']

After searching for how to decode a string in Python, I found this approach:

title = response.xpath('//*[@id="news"]/div/div[2]/div[2]/div[2]/div[2]/div[2]/h1/a/text()').extract()

print(title[0].decode('utf-8'))

When I run this code it shows me this:

  print(title[0].decode('utf-8'))
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
  return codecs.utf_8_decode(input, errors, True)

What is the problem?

Stefano Sanfilippo · Accepted Answer · 2015-09-28 09:30:25Z


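Your string is already fine. It is only displayed with Unicode escapes rather than actual glyphs because you are looking at the repr of a list, which keeps the output readable on ASCII-only consoles. There is nothing left to decode: in Python 2, calling .decode('utf-8') on a unicode object first encodes it implicitly with the ASCII codec, and that step fails on the Persian characters. Print the element itself instead: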

>>> x = [u' \t\t\u0628\u06cc\u0645\u0647 10 \u0633\u0627\u0644\u0647\u200c \u062f\u0631 \u062e\u0637 \u062d\u0645\u0644\u0647\u200c\u06cc \u062a\u06cc\u0645 \u0645\u0644\u06cc \t']
>>> print x[0]
        بیمه 10 ساله‌ در خط حمله‌ی تیم ملی
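
If the goal is a clean title or UTF-8 bytes for writing to a file, the direction to go is encode, not decode. The following is only a minimal sketch using the sample value from the question; `clean_title` and `to_utf8_bytes` are hypothetical helper names, not part of Scrapy.

```python
# -*- coding: utf-8 -*-
# Sketch only: runs on Python 2.7 and Python 3.3+.

def clean_title(extracted):
    """Hypothetical helper: first extracted string, surrounding whitespace removed."""
    return extracted[0].strip()

def to_utf8_bytes(text):
    """Hypothetical helper: unicode text -> UTF-8 encoded bytes (e.g. for file output)."""
    return text.encode('utf-8')

# Sample value copied from the question; extract() returned a list of unicode strings.
title = [u' \t\t\u0628\u06cc\u0645\u0647 10 \u0633\u0627\u0644\u0647\u200c \u062f\u0631 '
         u'\u062e\u0637 \u062d\u0645\u0644\u0647\u200c\u06cc \u062a\u06cc\u0645 '
         u'\u0645\u0644\u06cc \t']

text = clean_title(title)          # unicode, tabs and spaces trimmed
utf8_bytes = to_utf8_bytes(text)   # bytes, suitable for a binary file or socket

# decode() belongs on bytes: it turns them back into the same unicode text.
round_trip = utf8_bytes.decode('utf-8')
```

Printing `text` on a UTF-8 capable terminal should show the Persian glyphs, while printing the whole list `title` shows the escaped repr seen in the question.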

