
I want to collect articles from this particular website. I was using BeautifulSoup alone earlier, but it was not grabbing the links, so I tried Selenium. I have now written the code below, but it gives the output 'None'. I have never used Selenium before, so I don't have much idea about it. What should I change in this code to make it work and give the desired results?

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait

base = 'https://metro.co.uk'
url = 'https://metro.co.uk/search/#gsc.tab=0&gsc.q=cybersecurity&gsc.sort=date&gsc.page=7'

browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
wait = WebDriverWait(browser, 10)
browser.get(url)

link = browser.find_elements_by_class_name('gs-title')
for links in link:
    links.get_attribute('href')
    soup = BeautifulSoup(browser.page_source, 'lxml')
    date = soup.find('span', {'class': 'post-date'})
    title = soup.find('h1', {'class':'headline'})
    content = soup.find('div',{'class':'article-body'})
    print(date)
    print(title)
    print(content)

    time.sleep(3)
browser.close()

I want to collect the date, title, and content from all the articles on this page, and on the other pages as well (pages 7 to 18).

Thank you.

2 Answers


Instead of using Selenium to get the anchors, I extracted the page source with Selenium first and then ran Beautiful Soup on it.

So, to put it in perspective:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

base = 'https://metro.co.uk'
url = 'https://metro.co.uk/search/#gsc.tab=0&gsc.q=cybersecurity&gsc.sort=date&gsc.page=7'

browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
#wait = WebDriverWait(browser, 10) #Not actually required
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser') #Get the Page Source
anchors = soup.find_all("a", class_ = "gs-title") #Now find the anchors

for anchor in anchors:
    browser.get(anchor['href']) #Connect to the news link and extract its page source
    sub_soup = BeautifulSoup(browser.page_source, 'html.parser')
    date = sub_soup.find('span', {'class': 'post-date'})
    title = sub_soup.find('h1', {'class': 'post-title'}) #Note that the class attribute for the heading is 'post-title' and not 'headline'
    content = sub_soup.find('div', {'class': 'article-body'})
    #.string returns None for a tag with more than one child, so use get_text() for the article body
    print([date.string, title.string, content.get_text(strip=True)])

    #time.sleep(3) #Even this I don't believe is required
browser.close()

With this modification, I believe you can get your required contents.


2 Comments

Thanks, it is working. But why is it printing each article twice?
If you look at the page source for that URL, you'll find that within every result there are two places where a tag with class="gs-title" appears. Both sit inside divs, but those divs differ in their classes: one has class="gsc-thumbnail-inside" and the other has class="gs-title gsc-table-cell-thumbnail gsc-thumbnail-left". This can easily be resolved by checking at the start of each loop iteration whether the current anchor's href matches the previous one.
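The de-duplication suggested above can be sketched by tracking the hrefs already seen. This is a minimal, standalone example: the anchors list is made-up sample data standing in for the BeautifulSoup results, where each article's link appears twice.

```python
# Hypothetical sketch: skip repeated "gs-title" result links.
# Each search result renders its anchor twice, so keep a set of
# hrefs already visited and only process new ones.
anchors = [
    {'href': 'https://metro.co.uk/a1'},
    {'href': 'https://metro.co.uk/a1'},  # duplicate thumbnail anchor
    {'href': 'https://metro.co.uk/a2'},
]

seen = set()
unique = []
for anchor in anchors:
    href = anchor.get('href')
    if not href or href in seen:
        continue  # already scraped this article
    seen.add(href)
    unique.append(href)

print(unique)  # → ['https://metro.co.uk/a1', 'https://metro.co.uk/a2']
```

In the real loop you would call browser.get(href) in place of appending to the list.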

You can use the same API that the page itself uses. Alter the parameters (e.g. start) to get all pages of results:

import requests
import json
import re

# Call the Google CSE endpoint the page queries behind the scenes.
# Note the cse_tok value is a short-lived token, so it will need refreshing.
r = requests.get('https://cse.google.com/cse/element/v1?rsz=filtered_cse&num=10&hl=en&source=gcsc&gss=.uk&start=60&cselibv=5d7bf4891789cfae&cx=012545676297898659090:wk87ya_pczq&q=cybersecurity&safe=off&cse_tok=AKaTTZjKIBzl-5fANH8dQ8f78cv2:1560500563340&filter=0&sort=date&exp=csqr,4229469&callback=google.search.cse.api3732')
# The response is JSONP; strip the callback wrapper to get the JSON payload
p = re.compile(r'api3732\((.*)\);', re.DOTALL)
data = json.loads(p.findall(r.text)[0])
links = [item['clicktrackUrl'] for item in data['results']]
print(links)
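The unwrapping step above can be tried offline. This sketch uses a made-up JSONP body standing in for the live response (the real cse_tok token expires, so the exact payload here is an assumption), and applies the same regex-then-json.loads approach:

```python
import json
import re

# Simulated JSONP response body; the live endpoint wraps its JSON
# in a callback named after the 'callback' query parameter.
text = 'google.search.cse.api3732({"results": [{"clicktrackUrl": "https://metro.co.uk/article-1"}]});'

# Capture everything between the callback's parentheses, then parse it as JSON.
p = re.compile(r'api3732\((.*)\);', re.DOTALL)
data = json.loads(p.findall(text)[0])
links = [item['clicktrackUrl'] for item in data['results']]
print(links)  # → ['https://metro.co.uk/article-1']
```

To page through results, the start parameter presumably advances in steps of num (10), so start=60 corresponds to page 7 and start=170 to page 18.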

