
My goal is to scrape data from the PGA website to extract all the golf course locations in the USA. From each of the 907 result pages I want to capture the name, address, ownership, phone number, and website.

I have created the script below, but the CSV it produces is wrong: it repeats data from only the first few pages of the website instead of containing the data from all 907 pages.

How can I fix my script so that it will scrape all 907 pages and produce a CSV with all the golf courses listed on the PGA website?

Below is my script:

import csv
import requests 
from bs4 import BeautifulSoup

for i in range(907):      # Number of pages plus one 
     url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
     r = requests.get(url)
     soup = BeautifulSoup(r.content)
g_data2=soup.find_all("div",{"class":"views-field-nothing"})

courses_list=[]

for item in g_data2:
     try:
          name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
     except:
          name=''
     try:
          address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
     except:
          address1=''
     try:
          address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
     except:
          address2=''
     try:
          website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
     except:
          website=''   
     try:
          Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
     except:
          Phonenumber=''      

     course=[name,address1,address2,website,Phonenumber]
     courses_list.append(course)

     with open ('PGA_Data.csv','a') as file:
          writer=csv.writer(file)
          for row in courses_list:
               writer.writerow(row)
  • I don’t understand. You are fetching page contents 907 times, but only processing it the last time? Commented Jun 27, 2015 at 4:29
  • I am trying to extract data from 907 pages of the PGA website, and I am trying to do it in one process by creating a loop that goes through the website and collects all the data. There are about 907 pages' worth of data I need to collect, but my loop is not working. Commented Jun 27, 2015 at 4:32
  • Yes, but every time you call soup = BeautifulSoup(r.content) you are losing all the data of the previous page. You need to parse the current webpage before fetching a new one. Commented Jun 27, 2015 at 4:34
  • By parse, I mean collect all the information and save it to the csv file (the second part of your script) Commented Jun 27, 2015 at 4:35
  • That's what I thought I did. How can I go about it without losing the data? Can you help me build the script? Commented Jun 27, 2015 at 4:36

1 Answer


Here is the code you want. It parses the current page before going on to the next one. (There are some blank rows; I hope you can fix those yourself.)

import csv
import requests 
from bs4 import BeautifulSoup


def encode(l):
    # Replace every non-ASCII character with a space so Python 2's csv.writer,
    # which cannot handle Unicode, only ever sees plain ASCII strings.
    out = []
    for i in l:
        text = i.encode('utf-8')  # BeautifulSoup returns unicode; turn it into a UTF-8 byte string
        out.append(''.join([c if ord(c) < 128 else ' ' for c in text]))  # taken from Martijn Pieters' answer
        # http://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space/20078869#20078869
    return out

courses_list = []
for i in range(5):      # only the first 5 pages for testing; increase to cover all ~907 pages
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)

    g_data2=soup.find_all("div",{"class":"views-field-nothing"})

    for item in g_data2:
        try:
              name = item.contents[1].find_all("div",{"class":"views-field-title"})[0].text

        except:
              name=''
        try:
              address1= item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
        except:
              address1=''
        try:
              address2= item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
        except:
              address2=''
        try:
              website= item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
        except:
              website=''   
        try:
              Phonenumber= item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
        except:
              Phonenumber=''      

        course=[name,address1,address2,website,Phonenumber]

        courses_list.append(encode(course))


with open('PGA_Data.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)

EDIT: After the inevitable problems of Unicode encoding/decoding, I have modified the answer and it will (hopefully) work now. But see this: http://nedbatchelder.com/text/unipain.html
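
If you run the script under Python 3 instead (an assumption; the code above is written for Python 2), the csv module handles Unicode natively, so the encode() helper becomes unnecessary. A minimal sketch of just the writing step:

import csv

# Hypothetical sample row standing in for the scraped data
courses_list = [
    ["Example Course", "123 Fairway Dr", "Town, ST 00000", "example.com", "555-0100"],
]

# newline='' avoids blank lines between rows on Windows; utf-8 keeps non-ASCII characters intact
with open('PGA_Data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(courses_list)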


2 Comments

I get this error. How do I fix it? /Final_PGA2.py", line 44, in <module> writer.writerow(row) UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 35: ordinal not in range(128)
@Gonzalo68 Yes, it is a problem with the csv writer; it cannot handle Unicode properly. I am modifying my answer. Check it out.
