I have written a script using Beautiful Soup to scrape some HTML, process it, and produce HTML back. However, I am not satisfied with my code and am looking for improvements.
Structure of my source HTML file:
<!DOCTYPE html>
<html>
...
<body>
...
<section id="article-section-1">
                <div id="article-section-1-icon" class="icon">
                    <img src="../images/introduction.jpg" />
                </div>  
                <div id="article-section-1-heading" 
                     class="heading">
                    Some Heading 1
                </div>
                <div id="article-section-1-content" 
                     class="content">   
                    This section can have p, img, or even div tags
                </div>
</section>
...
...
<section id="article-section-8">
                <div id="article-section-8-icon" class="icon">
                    <img src="../images/introduction.jpg" />
                </div>  
                <div id="article-section-8-heading" 
                     class="heading">
                    Some Heading
                </div>
                <div id="article-section-8-content" 
                     class="content">   
                    This section can have p, img, or even div tags
                </div>
</section>
...
</body>
</html>
My code:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(myhtml)
all_sections = soup.find_all('section',id=re.compile("article-section-[0-9]"))
for section in all_sections:
    heading = str(section.find_all('div',class_="heading")[0].text).strip()
    contents_list = section.find_all('div',class_="content")[0].contents
    content = ''
    for i in contents_list:
        if i != '\n':
            content = content+str(i)
    print('<html><body><h1>'+heading+'</h1><hr>'+content+'</body></html>')
My code has worked without any issues so far; however, I don't find it Pythonic. I believe it could be done in a much better/simpler way.
- contents_list is a list that contains items such as '\n'. I remove them with a loop over the list. Is there a better way?
- I am not interested in article icon, so I am ignoring it in my script.
- I am using the strip method to remove extra whitespace in the heading. Is there a better way?
- Other than newlines, the div element with class content can contain anything, even nested divs. So far, I have run my script over the few pages I have and it seems to work. Is there anything here I need to take care of?
- Lastly, is there a better way to generate HTML files? Once I have scraped the data, I will work on generating HTML files. These files will all have the same structure (CSS, JavaScript, etc.), and all I have to do is put the scraped data into it. Can the method I used above (building a string and inserting the headings and content) be improved in any way?
I am not looking for full code in answers; just give me subtle hints or point me in some direction.
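For reference, here is a minimal sketch of one possible direction for the points above, not a definitive implementation. It assumes the section structure shown earlier and uses the standard library's string.Template for the output page. The names SAMPLE_HTML, PAGE, extract_sections and render are hypothetical and only serve the illustration. Note that it strips whitespace only at the edges of the content, which is not identical to dropping every '\n' child as the loop above does.

```python
from string import Template

from bs4 import BeautifulSoup

# Hypothetical sample input mirroring the structure shown in the question.
SAMPLE_HTML = """
<html><body>
<section id="article-section-1">
  <div id="article-section-1-icon" class="icon">
    <img src="../images/introduction.jpg" />
  </div>
  <div id="article-section-1-heading" class="heading">
      Some Heading 1
  </div>
  <div id="article-section-1-content" class="content">
      <p>First paragraph</p>
      <div><p>Nested</p></div>
  </div>
</section>
</body></html>
"""

# Assumed page skeleton; a real one would also carry the shared CSS/JavaScript.
PAGE = Template('<html><body><h1>$heading</h1><hr>$content</body></html>')


def extract_sections(html):
    """Yield (heading, inner_html) for each article section."""
    soup = BeautifulSoup(html, 'html.parser')
    # CSS attribute selector: every <section> whose id starts with the prefix.
    for section in soup.select('section[id^="article-section-"]'):
        # find() returns the first match, so no find_all(...)[0] is needed;
        # get_text(strip=True) trims the surrounding whitespace.
        heading = section.find('div', class_='heading').get_text(strip=True)
        content_div = section.find('div', class_='content')
        # decode_contents() renders the children (nested divs included)
        # as one string, so no manual concatenation loop is needed.
        content = content_div.decode_contents().strip()
        yield heading, content


def render(heading, content):
    """Fill the page skeleton with one section's data."""
    return PAGE.substitute(heading=heading, content=content)


if __name__ == '__main__':
    for heading, content in extract_sections(SAMPLE_HTML):
        print(render(heading, content))
```

Keeping the skeleton in one template separates the page layout from the scraping logic. For larger pages, a dedicated templating library might be a better fit than string.Template.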