This function takes a JSON file as an argument (it could contain anything in JSON format, since I scrape hundreds of random pages) and returns a list of dictionaries, each mapping a URL to its corresponding headers, which are extracted with BeautifulSoup and a regex pattern.

I'm looking for suggestions on performance, readability, and clarity.

Following my first iteration I improved my code, and here is the result:
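For reference, the post doesn't show the JSON layout, so here is a hypothetical record shape that is consistent with the code below (the first value of each record is treated as the URL, and every value is fed to the HTML parser); the keys and URL are my own invention:

    [
        {
            "url": "https://example.com/page-1",
            "html_body": "<html><body><h1>Title</h1><h2>Subtitle</h2></body></html>"
        }
    ]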
import json
from tqdm import tqdm
import re
from bs4 import BeautifulSoup
import csv
import string

# Load HTML body, and fetch headers

def get_headers_from_json(local_path):
    """
    The function takes a json file with html_body and returns a list of headers.
    It parses the titles, based on tags starting with 'h' + num.
    """
    data = json.loads(open(local_path).read())
    pattern = re.compile(r"^h[0-9]$")
    headers_urls = []
    printable = set(string.printable)
    for dict in tqdm(data):
        headers = []
        for val in dict.values():
            soup = BeautifulSoup(val, 'html.parser')
            url = dict.values()[0]
            for element in soup.find_all(pattern):
                element = element.get_text().strip().encode('utf-8')
                element = filter(lambda word: word in printable, element)
                headers.append(element)
        cleaned_data = {"url": url, "headers": headers}
        headers_urls.append(cleaned_data)
    return headers_urls
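Note that this code is Python 2 only: under Python 3, `dict.values()[0]` raises a `TypeError` (dict views are not subscriptable), and `filter(...)` over the encoded bytes returns a lazy iterator rather than a string. As a point of comparison, here is a dependency-free sketch of the same h-tag extraction using the stdlib `html.parser` instead of BeautifulSoup; the class and function names are mine, not from the post:

```python
import re
import string
from html.parser import HTMLParser

HEADER_TAG = re.compile(r"^h[1-6]$")  # h1..h6; the post's h[0-9] also matches h0 and h7-h9
PRINTABLE = set(string.printable)


class HeaderExtractor(HTMLParser):
    """Collect the text of every <h1>..<h6> element."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._depth = 0       # > 0 while inside a header tag
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if HEADER_TAG.match(tag):
            self._depth += 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and HEADER_TAG.match(tag):
            text = "".join(self._buffer).strip()
            # Python 3 equivalent of the post's printable-character filter:
            self.headers.append("".join(ch for ch in text if ch in PRINTABLE))
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_headers(html):
    """Return the text of every h1..h6 element in an HTML string."""
    parser = HeaderExtractor()
    parser.feed(html)
    return parser.headers


print(extract_headers("<h1>Title</h1><p>body</p><h2> Sub </h2>"))
# → ['Title', 'Sub']
```

This keeps the per-record loop in `get_headers_from_json` unchanged: you would call `extract_headers(val)` where the original builds a `BeautifulSoup` object and iterates `find_all(pattern)`.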