
I want to collect articles from this particular website. I was using BeautifulSoup alone earlier, but it was not grabbing the links, so I tried Selenium. I have now written the code below, but it gives the output 'None'. I have never used Selenium before, so I don't have much idea about it. What should I change in this code to make it work and give the desired results?

import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait

base = 'https://metro.co.uk'
url = 'https://metro.co.uk/search/#gsc.tab=0&gsc.q=cybersecurity&gsc.sort=date&gsc.page=7'

browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
wait = WebDriverWait(browser, 10)
browser.get(url)

link = browser.find_elements_by_class_name('gs-title')
for links in link:
    links.get_attribute('href')
    soup = BeautifulSoup(browser.page_source, 'lxml')
    date = soup.find('span', {'class': 'post-date'})
    title = soup.find('h1', {'class':'headline'})
    content = soup.find('div',{'class':'article-body'})
    print(date)
    print(title)
    print(content)

    time.sleep(3)
browser.close()

I want to collect the date, title, and content from all the articles on this page, and on the other pages as well (pages 7 to 18).

Thank you.

2 Answers


Instead of using Selenium to get the anchors, I extracted the page source with Selenium first and then ran Beautiful Soup on it.

So, to put it in perspective:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

base = 'https://metro.co.uk'
url = 'https://metro.co.uk/search/#gsc.tab=0&gsc.q=cybersecurity&gsc.sort=date&gsc.page=7'

browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
#wait = WebDriverWait(browser, 10) #Not actually required
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser') #Get the Page Source
anchors = soup.find_all("a", class_ = "gs-title") #Now find the anchors

for anchor in anchors:
    browser.get(anchor['href']) #Connect to the news link and extract its page source
    sub_soup = BeautifulSoup(browser.page_source, 'html.parser')
    date = sub_soup.find('span', {'class': 'post-date'})
    title = sub_soup.find('h1', {'class': 'post-title'}) #Note that the class attribute for the heading is 'post-title' and not 'headline'
    content = sub_soup.find('div', {'class': 'article-body'})
    #.string returns None for a tag with more than one child, so use get_text() for the article body
    print([date.string, title.string, content.get_text(strip=True)])

    #time.sleep(3) #Even this I don't believe is required
browser.close()

With this modification, I believe you can get your required contents.


2 Comments

Thanks, it is working. But why is it printing each article twice?
If you look at the page source for that URL, you'll find that within every result there are two places where a tag with class="gs-title" appears. Both sit inside divs, but those divs differ in their classes: one has class="gsc-thumbnail-inside" and the other has class="gs-title gsc-table-cell-thumbnail gsc-thumbnail-left". This can easily be resolved by checking at the start of each loop iteration whether the current anchor's href matches the previous one.
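The de-duplication suggested above can be sketched by tracking the hrefs already seen. This is a minimal, standalone example: the anchors list is made-up sample data standing in for the BeautifulSoup results, where each article's link appears twice.

```python
# Hypothetical sketch: skip repeated "gs-title" result links.
# Each search result renders its anchor twice, so keep a set of
# hrefs already visited and only process new ones.
anchors = [
    {'href': 'https://metro.co.uk/a1'},
    {'href': 'https://metro.co.uk/a1'},  # duplicate thumbnail anchor
    {'href': 'https://metro.co.uk/a2'},
]

seen = set()
unique = []
for anchor in anchors:
    href = anchor.get('href')
    if not href or href in seen:
        continue  # already scraped this article
    seen.add(href)
    unique.append(href)

print(unique)  # → ['https://metro.co.uk/a1', 'https://metro.co.uk/a2']
```

In the real loop you would call browser.get(href) in place of appending to the list.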

You can use the same API that the page itself uses. Alter the parameters (e.g. start) to get all pages of results:

import requests
import json
import re

# Call the Google CSE endpoint the page queries behind the scenes.
# Note the cse_tok value is a short-lived token, so it will need refreshing.
r = requests.get('https://cse.google.com/cse/element/v1?rsz=filtered_cse&num=10&hl=en&source=gcsc&gss=.uk&start=60&cselibv=5d7bf4891789cfae&cx=012545676297898659090:wk87ya_pczq&q=cybersecurity&safe=off&cse_tok=AKaTTZjKIBzl-5fANH8dQ8f78cv2:1560500563340&filter=0&sort=date&exp=csqr,4229469&callback=google.search.cse.api3732')
# The response is JSONP; strip the callback wrapper to get the JSON payload
p = re.compile(r'api3732\((.*)\);', re.DOTALL)
data = json.loads(p.findall(r.text)[0])
links = [item['clicktrackUrl'] for item in data['results']]
print(links)
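The unwrapping step above can be tried offline. This sketch uses a made-up JSONP body standing in for the live response (the real cse_tok token expires, so the exact payload here is an assumption), and applies the same regex-then-json.loads approach:

```python
import json
import re

# Simulated JSONP response body; the live endpoint wraps its JSON
# in a callback named after the 'callback' query parameter.
text = 'google.search.cse.api3732({"results": [{"clicktrackUrl": "https://metro.co.uk/article-1"}]});'

# Capture everything between the callback's parentheses, then parse it as JSON.
p = re.compile(r'api3732\((.*)\);', re.DOTALL)
data = json.loads(p.findall(text)[0])
links = [item['clicktrackUrl'] for item in data['results']]
print(links)  # → ['https://metro.co.uk/article-1']
```

To page through results, the start parameter presumably advances in steps of num (10), so start=60 corresponds to page 7 and start=170 to page 18.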

