I'm writing a website scraper (using lxml and Python 3 on Windows 8) for http://www.delfi.lt - the goal is to output certain information to a .txt file. ASCII can't work as an encoding because the website is in Lithuanian, so I write the file as UTF-8. However, not all of the non-ASCII characters end up in the file correctly.
An example of this is where I get DELFI Å½inios > Dienos naujienos > UÅ¾sienyje as opposed to DELFI Žinios > Dienos naujienos > Užsienyje.
Here is as far as I've gotten with the scraper:
from lxml import html
import sys

# Takes in command line input, namely the URL of the story and (optionally) the name of the CSV file that will store all of the data
# Outputs a list consisting of two strings, the first will be the URL, and the second will be the name if given, otherwise it'll be an empty string
def accept_user_input():
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        raise type('IncorrectNumberOfArgumentsException', (Exception,), {})('Should have at least one, up till two, arguments.')
    if len(sys.argv) == 2:
        return [sys.argv[1], '']
    else:
        return sys.argv[1:]

def main():
    url, name = accept_user_input()
    page = html.parse(url)
    title = page.find('//h1[@itemprop="headline"]')
    category = page.findall('//span[@itemprop="title"]')
    with open('output.txt', encoding='utf-8', mode='w') as f:
        f.write((title.text) + "\n")
        f.write(' > '.join([x.text for x in category]) + '\n')

if __name__ == "__main__":
    main()
An example run: python scraper.py http://www.delfi.lt/news/daily/world/ukraina-separatistai-siauteja-o-turcynovas-atnaujina-mobilizacija.d?id=64678799 results in a file called output.txt containing
Ukraina: separatistai siautÄja, O. TurÄynovas atnaujina mobilizacijÄ
DELFI Å½inios > Dienos naujienos > UÅ¾sienyje
as opposed to
Ukraina: separatistai siautėja, O. Turčynovas atnaujina mobilizaciją
DELFI Žinios > Dienos naujienos > Užsienyje
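For what it's worth, the stray "Ä" characters in the title look consistent with UTF-8 bytes being decoded as Latin-1 somewhere before the text is written out. That is only my guess at the mechanism, not something I have confirmed in lxml. Below is a minimal sketch that reproduces the pattern; `garble` and `ungarble` are hypothetical helper names I made up for illustration, not part of the scraper.

```python
# Sketch of the suspected mechanism (an assumption, not confirmed):
# text that is really UTF-8 gets decoded as Latin-1 at some point.

def garble(text):
    """Encode text as UTF-8, then wrongly decode those bytes as Latin-1."""
    return text.encode('utf-8').decode('latin-1')


def ungarble(text):
    """Reverse the mistake: recover the bytes via Latin-1, decode as UTF-8."""
    return text.encode('latin-1').decode('utf-8')


expected = 'DELFI Žinios > Dienos naujienos > Užsienyje'
broken = garble(expected)

# ascii() keeps the output printable on any console encoding.
print(ascii(broken))

# The round trip restores the original text, so no information is lost;
# the bytes are simply being interpreted with the wrong encoding.
print(ungarble(broken) == expected)
```

If this is what is happening, then the file encoding passed to open() is not the problem, because the strings would already be wrong before they are written.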
How do I make the script output all of the text correctly?