Python For loop and BeautifulSoup results into CSV

Question

I have scraped address data from a site using a for loop that prints each result. I would like to output the results of every iteration into separate columns of a CSV file. When I try to write the output, I only get the last iteration of the loop, and all of the bulky HTML code is included. The print address.text call prints all of the text inside every span tag that has itemprop="address".

Here is my code so far:

soup = BeautifulSoup(response, "lxml")

for address in soup.find_all('span', {'itemprop': 'address'}):
    print(address.text)

HTML in question:

<span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress" id="yui_3_15_0_1_1405702066072_1576"><a href="/homedetails/403-James-Toney-Dr-Elon-NC-27244/96315551_zpid/" class="hdp-link routable" title="403 James Toney Dr, Elon, NC Real Estate" id="yui_3_15_0_1_1405702066072_1575"><span itemprop="streetAddress" id="yui_3_15_0_1_1405702066072_1574">403 James Toney Dr</span>, <span itemprop="addressLocality">Elon</span>, <span itemprop="addressRegion" id="yui_3_15_0_1_1405702066072_1580">NC</span><span itemprop="postalCode" class="hide">27244</span></a></span>

This prints all of the data in order. However, I need to get each instance of address into a CSV column. Is this possible? I was thinking I needed a way to store each address in a variable as the loop iterates, but I've read that this might not be a good solution. I've looked on several websites for a solution. I feel like it should be pretty simple, but I just can't figure it out.

Edit: I was able to use only 1 loop to get all of the info I need. Hopefully this makes the problem simpler.

Show the relevant part of html source or share the link to the website.
– alecxe, Commented Jul 18, 2014 at 17:27
I am not sure why you think this makes it simpler. Could you show an example of the HTML?
– Krumelur, Commented Jul 18, 2014 at 17:33
@alecxe I've included the source HTML I'm interested in, is there enough info there?
– Steve, Commented Jul 18, 2014 at 17:41
@Krumelur I posted the HTML. I think it's simpler to deal with 1 loop rather than 4, that was my logic.
– Steve, Commented Jul 18, 2014 at 17:42

Krumelur · Accepted Answer · 2014-07-18 17:31:49Z

1

In general, you need to provide an example of the HTML to get a correct answer.

In the special case where there is always exactly the same number of each element and they correspond in order, you could use the zip function, e.g.

streets = [street.text for street in soup.find_all(...)]
towns = ....
states = ....
zips = ....

recs = zip(streets,towns,states,zips)

c = csv.writer(...)
for rec in recs:
    c.writerow(rec)
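
A note on the "exactly the same number" condition: zip() stops at the shortest input, so a single missing value silently drops rows. Below is a minimal sketch, assuming Python 3. The sample lists are made up for illustration (they are not from the real page), and an in-memory buffer stands in for the output file.

```python
import csv
import io
from itertools import zip_longest

# Hypothetical, already-extracted columns (not taken from the real page).
streets = ["403 James Toney Dr", "12 Example St"]
towns = ["Elon", "Sampletown"]
states = ["NC", "NC"]
zips = ["27244"]  # one postal code is missing on purpose

# zip() stops at the shortest list, so the second address is dropped.
truncated = list(zip(streets, towns, states, zips))

# zip_longest() keeps every row and pads the gaps with fillvalue.
padded = list(zip_longest(streets, towns, states, zips, fillvalue=""))

buffer = io.StringIO()  # stands in for an open CSV file
csv.writer(buffer).writerows(padded)
print(buffer.getvalue())
```

Keep in mind that zip_longest pads at the end of the short list, so if the gap was really in the middle, values can still land next to the wrong address. That is why parallel lists are only safe when every address has every field.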

answered Jul 18, 2014 at 17:31


3 Comments

Steve Over a year ago

I've updated my question to include only the 1 loop. I also included the part of the HTML source I'm interested in. I think that makes the process easier, can you explain how this would work with just 1 For loop?

Krumelur Over a year ago

Looking at the HTML, I think the best way is to nest the lookups: loop over the "address" spans, then find each element inside the current address span.

Steve Over a year ago

That is exactly what I did. I used a list, addresses = [], looped through the address spans, and appended each address.text to it. Then I simply used writer.writerow(addresses) to get the CSV output I needed. Thanks.
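
For reference, the nested approach from the comments could look roughly like the sketch below. It assumes Python 3 and uses the built-in "html.parser" so that lxml is not required (that swap is my assumption, not part of the thread). The HTML is a trimmed copy of the snippet from the question, and a StringIO buffer stands in for the real file.

```python
import csv
import io
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (ids and the link removed).
html = (
    '<span itemprop="address">'
    '<span itemprop="streetAddress">403 James Toney Dr</span>, '
    '<span itemprop="addressLocality">Elon</span>, '
    '<span itemprop="addressRegion">NC</span>'
    '<span itemprop="postalCode" class="hide">27244</span>'
    '</span>'
)
fields = ["streetAddress", "addressLocality", "addressRegion", "postalCode"]

soup = BeautifulSoup(html, "html.parser")  # "lxml" also works if installed
rows = []
for address in soup.find_all("span", itemprop="address"):
    row = []
    for field in fields:
        # Search only inside this address span; tolerate a missing field.
        element = address.find("span", itemprop=field)
        row.append(element.get_text(strip=True) if element else "")
    rows.append(row)

buffer = io.StringIO()  # stands in for open("output.csv", "w", newline="")
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Because each field is looked up inside its own address span, a missing field leaves an empty cell instead of shifting the values of later addresses.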

alecxe · Accepted Answer · 2014-07-18 17:47:09Z

Here's the complete solution.

The idea is to use the csv.DictWriter class and rely on the fact that all of the fields you need in the CSV are contained in span tags with an itemprop attribute:

import csv
from bs4 import BeautifulSoup


data = """your html here"""
fieldnames = ['streetAddress', 'addressLocality', 'addressRegion', 'postalCode']

soup = BeautifulSoup(data, "lxml")

# On Python 3, newline='' stops the csv module from writing blank rows on Windows
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames)
    for address in soup.find_all('span', itemprop='address'):
        writer.writerow({element['itemprop']: element.text
                         for element in address.find_all('span', itemprop=True)})

After executing, the contents of output.csv for the HTML source you've provided are:

403 James Toney Dr,Elon,NC,27244
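
Two things may be worth knowing if the pages are less uniform than the sample: by default DictWriter fills missing keys with an empty string (restval) but raises ValueError on keys that are not in fieldnames (extrasaction='raise'). The sketch below assumes Python 3. It uses hand-written dicts shaped like the ones built in the loop above (the second row is invented for illustration), and adds a header row with writeheader().

```python
import csv
import io

fieldnames = ["streetAddress", "addressLocality", "addressRegion", "postalCode"]

# Hypothetical rows shaped like the dicts built from the itemprop spans.
rows = [
    {"streetAddress": "403 James Toney Dr", "addressLocality": "Elon",
     "addressRegion": "NC", "postalCode": "27244"},
    # No postalCode, plus a key that is not listed in fieldnames.
    {"streetAddress": "12 Example St", "addressLocality": "Sampletown",
     "addressRegion": "NC", "name": "ignored"},
]

buffer = io.StringIO()  # stands in for the open output file
writer = csv.DictWriter(buffer, fieldnames, restval="", extrasaction="ignore")
writer.writeheader()    # first line: the column names
writer.writerows(rows)  # missing keys become "", extra keys are skipped
print(buffer.getvalue())
```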

Thanks for the detailed response. I ultimately ended up using a different answer for now, but as my project expands I might end up using this method.

Steve · Accepted Answer · 2014-07-18 18:07:32Z

I was able to get the output I wanted using this:

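import csv
from bs4 import BeautifulSoup

addresses = []

# `response` is the HTML fetched earlier
soup = BeautifulSoup(response, "lxml")

for address in soup.find_all('span', {'itemprop': 'address'}):
    addresses.append(address.text)

filename = 'output.csv'
# Python 3: text mode with newline='' (the original Python 2 code used 'wb')
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(addresses)  # one row, one address per column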

Thanks everyone for the help! The addresses are not all the same length, but it didn't seem to make a difference. All of the addresses were written out in full.
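
A small sketch of why the differing lengths don't matter, and of the difference between one address per column and one address per row. It assumes Python 3. The address strings are made up, and StringIO buffers stand in for files.

```python
import csv
import io

# Hypothetical address strings of different lengths.
addresses = ["403 James Toney Dr, Elon, NC", "12 Example St, Sampletown, NC"]

# writerow(list): ONE row, each list item becomes its own column.
across = io.StringIO()
csv.writer(across).writerow(addresses)

# writerows(list of one-item rows): one row PER address, a single column.
down = io.StringIO()
csv.writer(down).writerows([a] for a in addresses)

print(across.getvalue())
print(down.getvalue())
```

Each address is a single field no matter how long it is, and the writer puts quotes around any field that contains a comma, so the commas inside an address do not split it into extra columns.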

heinst · Accepted Answer · 2014-07-18 18:09:30Z

You could do something like this, using parallel lists or a dictionary of lists (up to you). This will only work if the lists are all the same size, though. You didn't specify, so this should work.

from bs4 import BeautifulSoup
import csv
response = '<span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress" id="yui_3_15_0_1_1405702066072_1576"><a href="/homedetails/403-James-Toney-Dr-Elon-NC-27244/96315551_zpid/" class="hdp-link routable" title="403 James Toney Dr, Elon, NC Real Estate" id="yui_3_15_0_1_1405702066072_1575"><span itemprop="streetAddress" id="yui_3_15_0_1_1405702066072_1574">403 James Toney Dr</span>, <span itemprop="addressLocality">Elon</span>, <span itemprop="addressRegion" id="yui_3_15_0_1_1405702066072_1580">NC</span><span itemprop="postalCode" class="hide">27244</span></a></span>'

soup = BeautifulSoup(response, "lxml")

addressList = soup.find_all('span', {'itemprop' : 'address'})

with open('addresses.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for address in addressList:
        writer.writerow([address.text])

I updated my question. I only need 1 loop. Would this work using just the 1 loop?
This is almost exactly what I needed. This solution outputted a CSV file with a column for each character in the address. I adapted your answer from before and it worked great. I'll post my answer.
@Steve Sorry, I wrote this on my phone so I didn't get to test it. And cool! I'd like to see how you did it
No worries, I wouldn't have figured out the solution had you not made that original posting! My answer is now posted.
Ah I know why! @Steve I forgot to put square brackets somewhere! Now it should be all good
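
For anyone who hits the same "one column per character" output: writerow() iterates over whatever it is given, and iterating over a string yields its characters. A minimal sketch, assuming Python 3 and using an in-memory buffer in place of a file:

```python
import csv
import io

text = "Elon"

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(text)    # a bare string is iterated character by character
writer.writerow([text])  # a one-item list is written as a single field

lines = buffer.getvalue().splitlines()
print(lines)
```

The first line comes out as E,l,o,n and the second as Elon, which is why wrapping address.text in square brackets fixes the output.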
