I have the following function that retrieves a given webpage and returns the number of adverts on the page, using a shortened version of EasyList (17,000 rules). Using multiprocessing, this scraped 18,000 pages in just over two days (which was fine at the time). However, I now have a dataset that is 10x larger, so this runtime isn't particularly ideal. I suspect it's running quadratically due to the line `result = len(document.xpath(rule))` inside the for loop.

I'm not very familiar with XPath or lxml, so any advice on how to make this more efficient would be appreciated, or at least some indication of whether it can be made to run much faster.
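For context, each row in the CSV is an EasyList element-hiding rule, i.e. `##` followed by a CSS selector, which is why the code below strips the first two characters before translating the selector to XPath. A quick illustration with a made-up rule:

```
import cssselect

translator = cssselect.HTMLTranslator()

rule = "##.ad-banner"  # hypothetical EasyList hide rule
xpath = translator.css_to_xpath(rule[2:])  # drop the "##" prefix, translate the CSS part
print(xpath)
# roughly: descendant-or-self::*[@class and contains(
#   concat(' ', normalize-space(@class), ' '), ' ad-banner ')]
```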
```
import lxml.html
import requests
import cssselect
import pandas as pd
from multiprocessing import Pool


def count_ads(url):
    # EasyList element-hiding rules, one per line, e.g. "##.ad-banner"
    rules_file = pd.read_csv("easylist_general_hide.csv", sep="\t", header=None)
    try:
        html = requests.get(url, timeout=5).text
    except:
        print(f"Page not found or timed out: {url}")
        return
    count = 0
    translator = cssselect.HTMLTranslator()
    for rule in rules_file[0]:
        try:
            # Strip the leading "##", translate the CSS selector to XPath,
            # then count the matching elements on the page.
            rule = translator.css_to_xpath(rule[2:])
            document = lxml.html.document_fromstring(html)
            result = len(document.xpath(rule))
            if result > 0:
                count = count + result
        except:
            pass
    return count
```
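For reference, the function is driven from a `multiprocessing.Pool` over the list of URLs, roughly like this (simplified sketch; the URL list and pool size here are placeholders, the real dataset comes from elsewhere):

```
from multiprocessing import Pool

# Placeholder URL list; in practice this is read from the dataset.
urls = ["https://example.com", "https://example.org"]

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        # One count per URL (None where the request failed or timed out).
        counts = pool.map(count_ads, urls)
```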