I am using Scrapy for scraping a Persian website.

title = response.xpath('//*[@id="news"]/div/div[2]/div[2]/div[2]/div[2]/div[2]/h1/a/text()').extract()

When I extract the title from the site, it gives me an encoded string like this:

[u' \t\t\u0628\u06cc\u0645\u0647 10 \u0633\u0627\u0644\u0647\u200c \u062f\u0631 \u062e\u0637 \u062d\u0645\u0644\u0647\u200c\u06cc \u062a\u06cc\u0645 \u0645\u0644\u06cc \t']

After searching for how to decode a string in Python, I found this approach:

title = response.xpath('//*[@id="news"]/div/div[2]/div[2]/div[2]/div[2]/div[2]/h1/a/text()').extract()

print(title[0].decode('utf-8'))

When I run this code it shows me this:

  print(title[0].decode('utf-8'))
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
  return codecs.utf_8_decode(input, errors, True)

What is the problem?

1 Answer

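Your string is already fine. It is a unicode object, and because you are looking at the repr of a list, its characters are shown as Unicode escapes rather than actual glyphs, so that it can be displayed in ASCII consoles as well. There is nothing left to decode: in Python 2, calling .decode('utf-8') on a unicode object first encodes it implicitly with the ASCII codec, which fails on the Persian characters. Try printing it: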

>>> x = [u' \t\t\u0628\u06cc\u0645\u0647 10 \u0633\u0627\u0644\u0647\u200c \u062f\u0631 \u062e\u0637 \u062d\u0645\u0644\u0647\u200c\u06cc \u062a\u06cc\u0645 \u0645\u0644\u06cc \t']
>>> print x[0]
        بیمه 10 ساله‌ در خط حمله‌ی تیم ملی
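
If you also want to drop the surrounding tabs and spaces, or need bytes for writing to a file, a minimal sketch could look like the following. The helper name clean_title and the variable extracted are made up for illustration (they are not part of Scrapy), and extracted simply stands in for the list that .extract() returned in the question.

```python
# Hypothetical stand-in for the list returned by .extract() in the question.
extracted = [u' \t\t\u0628\u06cc\u0645\u0647 10 \u0633\u0627\u0644\u0647\u200c \u062f\u0631 \u062e\u0637 \u062d\u0645\u0644\u0647\u200c\u06cc \u062a\u06cc\u0645 \u0645\u0644\u06cc \t']


def clean_title(values):
    """Return the first extracted value without surrounding whitespace.

    clean_title is an illustrative helper, not a Scrapy function.
    Returns an empty string when nothing was extracted.
    """
    if not values:
        return u''
    return values[0].strip()


# The text is already decoded, so only the whitespace needs removing.
title = clean_title(extracted)

# Encode only when bytes are actually needed (for example, writing to a file).
title_bytes = title.encode('utf-8')

# Decoding those bytes gives back the same text.
assert title_bytes.decode('utf-8') == title
```

Printing title afterwards should show the Persian text without the leading tabs, provided your terminal can display it.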