I have the following function that retrieves a given webpage and returns the number of adverts on the page, using a shortened version of EasyList (17,000 rules). Using multiprocessing, this scraped 18,000 pages in just over two days (which was fine at the time). However, I now have a dataset that is 10x larger, so this runtime isn't particularly ideal. I suspect it's running quadratically due to the line `result = len(document.xpath(rule))` inside the for loop.

I'm not very familiar with XPath or lxml, so any advice on how to make this more efficient would be appreciated, or at least some indication of whether it can be made to run much faster.
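For context, each row in the CSV is an EasyList element-hiding rule, i.e. `##` followed by a CSS selector, which is why the code below strips the first two characters before translating the selector to XPath. A quick illustration with a made-up rule:

```
import cssselect

translator = cssselect.HTMLTranslator()

rule = "##.ad-banner"  # hypothetical EasyList hide rule
xpath = translator.css_to_xpath(rule[2:])  # drop the "##" prefix, translate the CSS part
print(xpath)
# roughly: descendant-or-self::*[@class and contains(
#   concat(' ', normalize-space(@class), ' '), ' ad-banner ')]
```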
```
import lxml.html
import requests
import cssselect
import pandas as pd
from multiprocessing import Pool


def count_ads(url):
    # EasyList element-hiding rules, one per line, e.g. "##.ad-banner"
    rules_file = pd.read_csv("easylist_general_hide.csv", sep="\t", header=None)
    try:
        html = requests.get(url, timeout=5).text
    except:
        print(f"Page not found or timed out: {url}")
        return
    count = 0
    translator = cssselect.HTMLTranslator()
    for rule in rules_file[0]:
        try:
            # Strip the leading "##", translate the CSS selector to XPath,
            # then count the matching elements on the page.
            rule = translator.css_to_xpath(rule[2:])
            document = lxml.html.document_fromstring(html)
            result = len(document.xpath(rule))
            if result > 0:
                count = count + result
        except:
            pass
    return count
```
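For reference, the function is driven from a `multiprocessing.Pool` over the list of URLs, roughly like this (simplified sketch; the URL list and pool size here are placeholders, the real dataset comes from elsewhere):

```
from multiprocessing import Pool

# Placeholder URL list; in practice this is read from the dataset.
urls = ["https://example.com", "https://example.org"]

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        # One count per URL (None where the request failed or timed out).
        counts = pool.map(count_ads, urls)
```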