
I am trying to crawl data from a website, but the page shows only 50 records at a time and has a load-more button to view the next 50, so I have to keep clicking it until the records end.

At the moment I am only able to fetch 50 names and addresses; I need to fetch all of them until load-more is exhausted.

To click the button dynamically I am using Selenium with Python.

I want to find the name, address and contact number of all the retailers, city-wise.

My Try:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

url = "https://www.test.in/chemists/medical-store/gujarat/surat"
browser = webdriver.Chrome()
browser.get(url)

time.sleep(1)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")

try:
    for row in soup.find_all("div", {"class": "listing"}):
        #print(row.get_text())
        name = row.h3.a.string
        address = row.p.get_text()
        # contact number is only visible after clicking the retailer name
        print(name)
        print(address)

    button = browser.find_element_by_id("loadmore")
    button.click()

except TimeoutException as ex: 
    isrunning = 0

#browser.close()
#browser.quit()

1 Answer

If you inspect the network calls that are made when you hit load more, you can see that it is a POST request whose parameters are the city, the state and the page number. So instead of driving the page with Selenium, you can do it with the plain requests module. For example, this function will do the load-more for you as you iterate through the pages.

import requests

def hitter(page):
    url = "https://www.healthfrog.in/importlisting.html"

    payload = "page="+str(page)+"&mcatid=chemists&keyword=medical-store&state=gujarat&city=surat"
    headers = {
        'content-type': "application/x-www-form-urlencoded",
        'connection': "keep-alive",
        'cache-control': "no-cache",
        'postman-token': "d4baf979-343a-46e6-f53b-2f003d04da82"
    }

    response = requests.request("POST", url, data=payload, headers=headers)
    return response.text

The above function fetches the HTML of one page, which contains the names and addresses. Now you can iterate through the pages until you find one that returns no content. For example, if you try the state Karnataka and the city Mysore, you will notice the difference between the third and fourth pages. That tells you where to stop.
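For instance, a minimal sketch of that loop (assuming the hitter function above, and that an exhausted page comes back as an empty or whitespace-only body, which is how the full scraper below detects the end):

import re

page, pages_html = 1, []
html = hitter(page)
# keep requesting pages until the response body is blank
while not re.match(r'\A\s*\Z', html):
    pages_html.append(html)
    page += 1
    html = hitter(page)
# pages_html now holds the listing HTML for every page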

To get the phone numbers, you can request the HTML of the pages linked in the <h3> tags of the bulk listing response (the previous response). Example HTML:

<div class="listing">
    <h3>
        <a href="https://www.healthfrog.in/chemists/sunny-medical-store-surat-v8vcr3alr.html">Sunny Medical Store</a>
    </h3>
    <p>
        <i class="fa fa-map-marker"></i>155 - Shiv Shakti Society,  Punagam, , Surat, Gujarat- 394210,India
    </p>
</div>

You will need to parse that HTML and find where the phone number is; then you can populate it. You can request this example page using:

html = requests.get('https://www.healthfrog.in/chemists/sunny-medical-store-surat-v8vcr3alr.html').text

You can now parse the HTML with BeautifulSoup as you did earlier.
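As a minimal sketch (assuming the detail page marks the number with an <i class="fa fa-mobile"> icon followed by a ten-digit number, which is what the full scraper below relies on):

import re
import requests
from bs4 import BeautifulSoup

detail_url = 'https://www.healthfrog.in/chemists/sunny-medical-store-surat-v8vcr3alr.html'
soup = BeautifulSoup(requests.get(detail_url).text, "html.parser")

# the text node right after the mobile icon should contain the phone number
icon = soup.find('i', class_='fa fa-mobile')
match = re.search(r'(\d{10})', icon.next_element) if icon else None
phone = match.group(1) if match else None
print(phone)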

Doing it with requests instead of Selenium has several advantages here: you do not need to open and close a window every time you need a phone number, and you avoid the element going stale each time you hit load more. It is also much faster.


Please note: If you are doing scraping like this, please abide by the rules set by the site. Do not crash it by sending too many requests.

Edit: Working scraper.

import requests, time, re
from bs4 import BeautifulSoup

def hitter(page, state="Gujarat", city="Surat"):
    # replicate the POST request that the load-more button fires
    url = "https://www.healthfrog.in/importlisting.html"

    payload = "page=" + str(page) + "&mcatid=chemists&keyword=medical-store&state=" + state + "&city=" + city
    headers = {
        'content-type': "application/x-www-form-urlencoded",
        'connection': "keep-alive",
        'cache-control': "no-cache"
    }

    response = requests.request("POST", url, data=payload, headers=headers)
    return response.text

def getPhoneNo(link):
    # fetch the retailer's detail page and pull the 10-digit number next to the mobile icon
    time.sleep(3)  # be polite; do not hammer the site
    soup1 = BeautifulSoup(requests.get(link).text, "html.parser")
    f = soup1.find('i', class_='fa fa-mobile').next_element
    try:
        phone = re.search(r'(\d{10})', f).group(1)
    except AttributeError:
        phone = None
    return phone

def getChemists(soup):
    # extract name, address and phone for every listing on one page
    stores = []
    for row in soup.find_all("div", {"class": "listing"}):
        dummy = {
            'name': row.h3.string,
            'address': row.p.get_text(),
            'phone': getPhoneNo(row.h3.a.get_attribute_list('href')[0])
        }
        print(dummy)
        stores.append(dummy)

    return stores

if __name__ == '__main__':
    page, chemists = 1, []
    city, state = 'Gulbarga', 'Karnataka'
    html = hitter(page, state, city)
    # an empty (whitespace-only) response means there are no more pages
    condition = not re.match(r'\A\s*\Z', html)
    while condition:
        soup = BeautifulSoup(html, 'html.parser')
        chemists += getChemists(soup)
        page += 1
        html = hitter(page, state, city)
        condition = not re.match(r'\A\s*\Z', html)
    print(chemists)

5 Comments

What's new in your script? My script does the same thing, and yours also fetches only 50 rows. The problem is the load-more button; that's why I used Selenium, to crawl all the rows.
Don't use Selenium when you can hit a simple request. Just process one page at a time, until the response for a page is blank.
But your script returns only 50 records. If you look, there is a load-more button; how do I deal with that?
Iterate through the pages until you get an empty response. Load more returns HTML that gets rendered by the front end. What you should do is look at the request that is sent and change its page number with each iteration.
How do I do that?
