
My goal is to scrape data from the PGA website to extract all the golf course locations in the USA. From each of the 907 result pages I want to capture the name, address, ownership, phone number, and website.

I have created the script below, but the CSV it produces is wrong: it repeats data from only the first few pages of the website instead of containing the data from all 907 pages.

How can I fix my script so that it will scrape all 907 pages and produce a CSV with all the golf courses listed on the PGA website?

Below is my script:

import csv
import requests 
from bs4 import BeautifulSoup

for i in range(907):      # Number of pages plus one 
     url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
     r = requests.get(url)
     soup = BeautifulSoup(r.content)
g_data2=soup.find_all("div",{"class":"views-field-nothing"})

courses_list=[]

for item in g_data2:
     try:
          name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
     except:
          name=''
     try:
          address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
     except:
          address1=''
     try:
          address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
     except:
          address2=''
     try:
          website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
     except:
          website=''   
     try:
          Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
     except:
          Phonenumber=''      

     course=[name,address1,address2,website,Phonenumber]
     courses_list.append(course)

     with open ('PGA_Data.csv','a') as file:
          writer=csv.writer(file)
          for row in courses_list:
               writer.writerow(row)
  • I don’t understand. You are fetching page contents 907 times, but only processing it the last time? Commented Jun 27, 2015 at 4:29
  • I am trying to extract data from 907 pages of the PGA website, and I am trying to do it in one process by creating a loop that goes through the website and collects all the data. There are about 907 pages' worth of data I need to collect, but my loop is not working. Commented Jun 27, 2015 at 4:32
  • Yes, but every time you call soup = BeautifulSoup(r.content) you are losing all the data of the previous page. You need to parse the current webpage before fetching a new one. Commented Jun 27, 2015 at 4:34
  • By parse, I mean collect all the information and save it to the csv file (the second part of your script) Commented Jun 27, 2015 at 4:35
  • That's what I thought I did. How can I go about it without losing the data? Can you help me build the script? Commented Jun 27, 2015 at 4:36

1 Answer


Here is the code you want. It parses the current page before going on to the next one. (There are some blank rows; I hope you can fix those yourself.)

import csv
import requests 
from bs4 import BeautifulSoup


def encode(l):
    # Replace every non-ASCII character with a space so Python 2's csv.writer,
    # which cannot handle Unicode, only ever sees plain ASCII strings.
    out = []
    for i in l:
        text = i.encode('utf-8')  # BeautifulSoup returns unicode; turn it into a UTF-8 byte string
        out.append(''.join([c if ord(c) < 128 else ' ' for c in text]))  # taken from Martijn Pieters' answer
        # http://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space/20078869#20078869
    return out

courses_list = []
for i in range(5):      # only the first 5 pages for testing; increase to cover all ~907 pages
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)

    g_data2=soup.find_all("div",{"class":"views-field-nothing"})

    for item in g_data2:
        try:
              name = item.contents[1].find_all("div",{"class":"views-field-title"})[0].text

        except:
              name=''
        try:
              address1= item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
        except:
              address1=''
        try:
              address2= item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
        except:
              address2=''
        try:
              website= item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
        except:
              website=''   
        try:
              Phonenumber= item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
        except:
              Phonenumber=''      

        course=[name,address1,address2,website,Phonenumber]

        courses_list.append(encode(course))


with open('PGA_Data.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow(row)

EDIT: After the inevitable problems of Unicode encoding/decoding, I have modified the answer and it will (hopefully) work now. But see this: http://nedbatchelder.com/text/unipain.html
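
If you run the script under Python 3 instead (an assumption; the code above is written for Python 2), the csv module handles Unicode natively, so the encode() helper becomes unnecessary. A minimal sketch of just the writing step:

import csv

# Hypothetical sample row standing in for the scraped data
courses_list = [
    ["Example Course", "123 Fairway Dr", "Town, ST 00000", "example.com", "555-0100"],
]

# newline='' avoids blank lines between rows on Windows; utf-8 keeps non-ASCII characters intact
with open('PGA_Data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(courses_list)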


2 Comments

I get this error. How do I fix it? /Final_PGA2.py", line 44, in <module> writer.writerow(row) UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 35: ordinal not in range(128)
@Gonzalo68 Yes, it is a problem with the csv writer; it cannot handle Unicode properly. I am modifying my answer. Check it out.
