
I'm writing a website scraper (using lxml and py3k on Windows 8) for http://www.delfi.lt - the goal is to output certain information to a .txt file. Obviously ASCII can't work as an encoding due to the website being in Lithuanian, so I attempt to print it in UTF-8. However, not all of the non-ASCII characters are being printed out to the file correctly.

An example of this is where I get DELFI Å½inios > Dienos naujienos > UÅ¾sienyje as opposed to DELFI Žinios > Dienos naujienos > Užsienyje.
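(Editor's note: that garbling is the classic symptom of UTF-8 bytes being decoded as Latin-1; every two-byte UTF-8 sequence becomes two Latin-1 characters. It can be reproduced in isolation:)

```python
# "Ž" is U+017D, which UTF-8 encodes as the two bytes C5 BD.
# Decoding those bytes as Latin-1 yields the two characters "Å" and "½".
text = "Žinios"
mangled = text.encode("utf-8").decode("latin-1")
print(mangled)  # Å½inios
```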

Here is as far as I've gotten with the scraper:

from lxml import html
import sys

# Takes in command line input, namely the URL of the story and (optionally) the name of the CSV file that will store all of the data
# Outputs a list consisting of two strings, the first will be the URL, and the second will be the name if given, otherwise it'll be an empty string
def accept_user_input():
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        raise type('IncorrectNumberOfArgumentsException', (Exception,), {})('Should have at least one, up till two, arguments.')
    if len(sys.argv) == 2:
        return [sys.argv[1], '']
    else:
        return sys.argv[1:]

def main():
    url, name = accept_user_input()
    page = html.parse(url)

    title = page.find('//h1[@itemprop="headline"]')
    category = page.findall('//span[@itemprop="title"]')

    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write((title.text) + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

if __name__ == "__main__":
    main()

An example run: python scraper.py http://www.delfi.lt/news/daily/world/ukraina-separatistai-siauteja-o-turcynovas-atnaujina-mobilizacija.d?id=64678799 results in a file called output.txt containing

Ukraina: separatistai siautÄja, O. TurÄynovas atnaujina mobilizacijÄ
DELFI Å½inios > Dienos naujienos > UÅ¾sienyje

as opposed to

Ukraina: separatistai siautėja, O. Turčynovas atnaujina mobilizaciją
DELFI Žinios > Dienos naujienos > Užsienyje

How do I make the script output all of the text correctly?

1 Answer


Using requests and BeautifulSoup, and letting BeautifulSoup detect the encoding from the raw bytes in .content, works for me:

import requests
from bs4 import BeautifulSoup

def main():
    url, name = "http://www.delfi.lt/news/daily/world/ukraina-separatistai-siauteja-o-turcynovas-atnaujina-mobilizacija.d?id=64678799", "foo.csv"
    r = requests.get(url)

    # r.content is the raw bytes; BeautifulSoup sniffs the encoding itself
    page = BeautifulSoup(r.content, "html.parser")

    title = page.find("h1", {"itemprop": "headline"})
    category = page.find_all("span", {"itemprop": "title"})

    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write(title.text + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

main()

Output:

Ukraina: separatistai siautėja, O. Turčynovas atnaujina mobilizacijąnaujausi susirėmimų vaizdo įrašai
DELFI Žinios > Dienos naujienos > Užsienyje
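Why .content helps: r.text decodes the response using the HTTP headers (which may declare the wrong charset, or none at all), while passing the raw bytes lets BeautifulSoup detect the encoding itself. A minimal standalone sketch (the markup string here is illustrative, not fetched from delfi.lt):

```python
from bs4 import BeautifulSoup

# UTF-8 bytes, as r.content would hand them over
raw = '<html><body><span itemprop="title">Užsienyje</span></body></html>'.encode("utf-8")

# BeautifulSoup detects the encoding of raw bytes on its own;
# from_encoding makes the hint explicit if detection ever guesses wrong.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
print(soup.find("span", {"itemprop": "title"}).text)  # Užsienyje
```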

Changing the parser encoding also works:

parser = etree.HTMLParser(encoding="utf-8")
page = html.parse(url,parser)
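This works because, without an explicit encoding, libxml2 has to guess how the incoming bytes are encoded, and it guesses wrong for this page; encoding="utf-8" removes the guesswork. A minimal sketch with hard-coded bytes rather than the live URL:

```python
from lxml import etree, html

# UTF-8 bytes with no <meta charset> declaration, so the parser
# cannot learn the encoding from the document itself.
raw = "<html><body><h1>Užsienyje</h1></body></html>".encode("utf-8")

# Telling HTMLParser the encoding up front keeps the text intact.
parser = etree.HTMLParser(encoding="utf-8")
doc = html.fromstring(raw, parser=parser)
print(doc.findtext(".//h1"))  # Užsienyje
```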

So change your code to the following :

from lxml import html,etree
import sys

# Takes in command line input, namely the URL of the story and (optionally) the name of the CSV file that will store all of the data
# Outputs a list consisting of two strings, the first will be the URL, and the second will be the name if given, otherwise it'll be an empty string
def accept_user_input():
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        raise type('IncorrectNumberOfArgumentsException', (Exception,), {})('Should have at least one, up till two, arguments.')
    if len(sys.argv) == 2:
        return [sys.argv[1], '']
    else:
        return sys.argv[1:]

def main():
    url, name = accept_user_input()
    parser = etree.HTMLParser(encoding="utf-8")
    page = html.parse(url, parser)

    title = page.find('//h1[@itemprop="headline"]')
    category = page.findall('//span[@itemprop="title"]')

    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write((title.text) + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

if __name__ == "__main__":
    main()