Skip to main content
Tweeted twitter.com/StackCodeReview/status/1543837013106892800
spelling
Source Link
Reinderien
  • 71.1k
  • 5
  • 76
  • 256

This function takes as an argument a jsonJSON file (could contain anything in jsonJSON format, since I scrapscrape hundreds of random pages) and returns a list of dictionaries where a URL is mapped to its corresponding headers, based on the extraction of headers using beautifulsoupBeautifulSoup and a regex pattern.

This function takes as an argument a json file (could contain anything in json format, since I scrap hundreds of random pages) and returns a list of dictionaries where a URL is mapped to its corresponding headers, based on the extraction of headers using beautifulsoup and a regex pattern.

This function takes as an argument a JSON file (could contain anything in JSON format, since I scrape hundreds of random pages) and returns a list of dictionaries where a URL is mapped to its corresponding headers, based on the extraction of headers using BeautifulSoup and a regex pattern.

json example added.
Source Link
oba2311
  • 197
  • 1
  • 1
  • 8

Example of json input:

[["<body class=\" catalog-category-view categorypath-sale-html category-sale\">\n<script type=\"text/javascript\">\n//<![CDATA[\nif (typeof(Varien.searchForm) !== 'undefined') {\n    Varien.searchForm.prototype._selectAutocompleteItem = function(element) {\n        var link = element.down();\n        if (link && link.tagName == 'A') {\n            setLocation(link.href);\n        } else {\n            if (element.title){\n                this.field.value = element.title;\n            }\n            this.form.submit();\n        }\n    };\n    Varien.searchForm.prototype.initAutocomplete = function(url, destinationElement) {\n        new Ajax.Autocompleter(\n            this.field,\n            destinationElement,\n            url,\n            {\n                paramName: this.field.name,\n                method: 'get',\n                minChars: 2,\n                frequency: .3,\n                updateElement: this._selectAutocompleteItem.bind(this),\n                onShow : function(element, update) {\n                    if(!update.style.position || update.style.position=='absolute') {\n                        update.style.position = 'absolute';\n                        Position.clone(element, update, {\n                            setHeight: false,\n                            offsetTop: element.offsetHeight\n                        });\n                    }\n                    Effect.Appear(update,{duration:0});\n                }\n\n            }\n        );\n    };\n    Autocompleter.Base.prototype.markPrevious = function() {\n        if (this.index > 0) {\n            this.index--;\n        } else {\n            this.index = this.entryCount - 1;\n        }\n        var entry = this.getEntry(this.index);\n        if (entry.select('a').length === 0) {\n            this.markPrevious(); // Ignore items that don't have link\n        }\n    };\n    Autocompleter.Base.prototype.markNext = function() {\n        if (this.index < this.entryCount - 1) {\n   

Example of json input:

[["<body class=\" catalog-category-view categorypath-sale-html category-sale\">\n<script type=\"text/javascript\">\n//<![CDATA[\nif (typeof(Varien.searchForm) !== 'undefined') {\n    Varien.searchForm.prototype._selectAutocompleteItem = function(element) {\n        var link = element.down();\n        if (link && link.tagName == 'A') {\n            setLocation(link.href);\n        } else {\n            if (element.title){\n                this.field.value = element.title;\n            }\n            this.form.submit();\n        }\n    };\n    Varien.searchForm.prototype.initAutocomplete = function(url, destinationElement) {\n        new Ajax.Autocompleter(\n            this.field,\n            destinationElement,\n            url,\n            {\n                paramName: this.field.name,\n                method: 'get',\n                minChars: 2,\n                frequency: .3,\n                updateElement: this._selectAutocompleteItem.bind(this),\n                onShow : function(element, update) {\n                    if(!update.style.position || update.style.position=='absolute') {\n                        update.style.position = 'absolute';\n                        Position.clone(element, update, {\n                            setHeight: false,\n                            offsetTop: element.offsetHeight\n                        });\n                    }\n                    Effect.Appear(update,{duration:0});\n                }\n\n            }\n        );\n    };\n    Autocompleter.Base.prototype.markPrevious = function() {\n        if (this.index > 0) {\n            this.index--;\n        } else {\n            this.index = this.entryCount - 1;\n        }\n        var entry = this.getEntry(this.index);\n        if (entry.select('a').length === 0) {\n            this.markPrevious(); // Ignore items that don't have link\n        }\n    };\n    Autocompleter.Base.prototype.markNext = function() {\n        if (this.index < this.entryCount - 1) {\n   
Source Link
oba2311
  • 197
  • 1
  • 1
  • 8

Performance and Readability Improvements for HTML Parser with BeautifulSoup

This function takes as an argument a json file (could contain anything in json format, since I scrap hundreds of random pages) and returns a list of dictionaries where a URL is mapped to its corresponding headers, based on the extraction of headers using beautifulsoup and a regex pattern.

I'm looking for suggestions regarding performance readability and clarity.

Following my 1st iteration I improved my code and here is the result:

import json
from tqdm import tqdm
import re
from bs4 import BeautifulSoup
import csv
import string

"Load HTML body, and fetch headers"
def get_headers_from_json(local_path):
    """
    The function takes a json file with html_body and returns a list of headers.
    It parses the titles, based on tags starting with 'h' + num.
    """
    data = json.loads(open(local_path).read())
    pattern = re.compile(r"^h[0-9]$")
    headers_urls = []
    printable = set(string.printable)
    for dict in tqdm(data):
        headers = []
        for val in dict.values():
            soup = BeautifulSoup(val, 'html.parser')
            url = dict.values()[0]
        for element in soup.find_all(pattern):
            element = element.get_text().strip().encode('utf-8')
            element = filter(lambda word: word in printable, element)  
            headers.append(element)
        cleaned_data = {"url": url, "headers": headers}
        headers_urls.append(cleaned_data)
    return headers_urls