I wanted to create a scraper in Python that can fetch the data I need from LinkedIn. I tried several different approaches, but I could not get it to work until I switched to Selenium. In the end I got it working the way I wanted.
The most difficult part of building this crawler was that the hundreds of profile pages it visits come in roughly three layouts, each needing a different XPath pattern. I managed to combine the three patterns into a single locator, and it is now working well.
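In short, the trick was to OR the three class checks inside one predicate, and to use the XPath union operator | for the fields inside the card, so a single locator matches whichever layout the page happens to use. Pulled out of the script below, the pattern looks like this:

    # One locator for all three top-card layouts: the class checks are
    # OR-ed inside a single predicate.
    top_card = ("//div[contains(@class,'pv-top-card-section__information')"
                " or contains(@class,'org-top-card-module__details')"
                " or @class='org-top-card-module__main-column']")

    # Inside a card, the union operator "|" picks whichever variant exists.
    name_xpath = ".//h1[@title]|.//h1[contains(@class,'pv-top-card-section__name')]"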
The scraper first clicks the "view all recommendations" tab on the home page, then parses 200 profiles (200 is just the number I chose for this run) by visiting each profile's main page. I've tried to make it error-free. Here is what I've done:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def producing_links(driver, wait):
    # Log in to LinkedIn with the credentials typed into the login form.
    driver.get('https://www.linkedin.com/')
    driver.find_element(By.XPATH, '//*[@id="login-email"]').send_keys('someusername')
    driver.find_element(By.XPATH, '//*[@id="login-password"]').send_keys('somepassword')
    driver.find_element(By.XPATH, '//*[@id="login-submit"]').click()

    # Open the "view all recommendations" panel from the feed.
    wait.until(EC.visibility_of_element_located(
        (By.XPATH, "//a[contains(@class,'feed-s-follows-module__view-all')]")))
    driver.find_element(By.XPATH, "//a[contains(@class,'feed-s-follows-module__view-all')]").click()

    # Keep scrolling until 200 recommendation links have been loaded.
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(EC.visibility_of_element_located(
            (By.XPATH, "//a[contains(@class,'feed-s-follow-recommendation-card__profile-link')]")))
        links = [item.get_attribute("href") for item in driver.find_elements(
            By.XPATH, "//a[contains(@class,'feed-s-follow-recommendation-card__profile-link')]")]
        if len(links) == 200:
            break

    for link in links:
        get_docs(driver, wait, link)


def get_docs(driver, wait, name_link):
    # Visit the profile page and read the name and headline from whichever
    # top-card layout the page uses (personal or company profile).
    driver.get(name_link)
    try:
        for item in driver.find_elements(
                By.XPATH,
                "//div[contains(@class,'pv-top-card-section__information')"
                " or contains(@class,'org-top-card-module__details')"
                " or @class='org-top-card-module__main-column']"):
            name = item.find_element(
                By.XPATH,
                ".//h1[@title]|.//h1[contains(@class,'pv-top-card-section__name')]").text
            title = item.find_element(
                By.XPATH,
                ".//span[contains(@class,'company-industries')]"
                "|.//h2[contains(@class,'pv-top-card-section__headline')]").text
    except Exception as e:
        print(e)
    finally:
        try:
            print(name, title)
        except Exception as ex:
            print(ex)


if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    try:
        producing_links(driver, wait)
    finally:
        driver.quit()
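One part I'm not fully sure about is the open-ended while True scroll loop: if the feed never yields exactly 200 recommendation cards, it will scroll forever. A rough sketch of how I might cap it, using the same imports as above (the max_scrolls value and the >= comparison are just my own guesses, not something I've settled on):

    def collect_links(driver, wait, target=200, max_scrolls=50):
        # Scroll until `target` profile links are loaded, but give up after
        # `max_scrolls` attempts so the loop cannot hang indefinitely.
        links = []
        for _ in range(max_scrolls):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            wait.until(EC.visibility_of_element_located(
                (By.XPATH, "//a[contains(@class,'feed-s-follow-recommendation-card__profile-link')]")))
            links = [a.get_attribute("href") for a in driver.find_elements(
                By.XPATH, "//a[contains(@class,'feed-s-follow-recommendation-card__profile-link')]")]
            if len(links) >= target:
                break
        return links[:target]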