
I am trying to scrape the date and URL of each article from here. While I do get the list of dates and the article headlines (as text), I am failing to get the URLs for them. This is how I am getting the headline text and the dates:

import time
from selenium import webdriver

browser = webdriver.Chrome()  # browser setup not shown in the question; Chrome assumed

def sb_rum():
    websites = ['https://www.thespiritsbusiness.com/tag/rum/']
    for spirits in websites:
        browser.get(spirits)
        time.sleep(1)

        # headline elements (the <h3> nodes) and their visible text
        news_links = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/h3')
        n_links = [ele.text for ele in news_links]
        # publication dates from the neighbouring <small> nodes
        dates = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/small')
        n_dates = [ele.text for ele in dates]
        print(n_links)
        print(n_dates)

This gives me an output like

['Harpalion Spirits expands UK distribution', 'Bacardí gets fruity with new tropical rum', 'The world’s biggest-selling rums', 'Havana Club releases Tributo 2021 rum', 'Ron Santiago de Cuba rum revamps range', 'Michael B Jordan to change rum name after backlash', 'WIRD recognised for sustainable sugarcane practices', 'Rockstar Spirits advocates for UK-Australia trade deal', 'Rum Brand Champion 2021: Tanduay', 'Dictador and Niepoort partner on new rum', 'Rockstar Spirits secures £25,000 Dragons’ Den funding', 'SB meets… Lucia Alliegro, Ron Carúpano', 'Bruno Mars debuts Selvarey Coconut rum', 'Diplomático launches Mixed Consciously cocktail comp', 'Foursquare Distillery backs rum history research', 'Ron Cabezon signs distribution with Gordon & MacPhail', 'Havana Club launches smoky rum finished in whisky casks', 'Ron Colón and Bacoo Rum expand distribution', 'Harpalion Spirits launches Pedro Ximénez cask-finished rum', 'Rum’s journey to premiumisation']
['July 13th, 2021', 'July 8th, 2021', 'July 6th, 2021', 'June 30th, 2021', 'June 29th, 2021', 'June 24th, 2021', 'June 21st, 2021', 'June 21st, 2021', 'June 21st, 2021', 'June 18th, 2021', 'June 11th, 2021', 'June 7th, 2021', 'June 4th, 2021', 'June 2nd, 2021', 'May 28th, 2021', 'May 28th, 2021', 'May 26th, 2021', 'May 26th, 2021', 'May 24th, 2021', 'May 20th, 2021']

But I also want to get the URL of each article. I am able to extract the link for a single article, but I fail to extract them for all of the articles. To get the links for all of them I tried something like:

n_links = [ele.get_attribute('href') for ele in news_links.find_elements_by_tag_name('a')]

This fails because news_links is a list of elements rather than a single element. How can it be done for all the articles? Please help.

  • You can use BeautifulSoup to parse the HTML; Selenium's built-in element lookups are slow and bring weird issues.

2 Answers


Working solution:

n_links = [ele.find_element_by_tag_name('a').get_attribute('href') for ele in news_links]
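Alternatively (a minimal sketch, assuming each headline <h3> wraps an <a>, as the one-liner above implies), you can target the anchors directly with one XPath and read their href attributes:

# grab the <a> tags inside the headline <h3> nodes, then read their hrefs
anchors = browser.find_elements_by_xpath('//*[@id="archivewrapper"]/div/div[2]/h3/a')
n_links = [a.get_attribute('href') for a in anchors]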


I don't think you need Selenium to scrape this webpage. I have used BeautifulSoup to scrape the data you need.

Here is the Code:

import bs4 as bs
import requests

url = 'https://www.thespiritsbusiness.com/tag/rum/'
resp = requests.get(url)
soup = bs.BeautifulSoup(resp.text, 'lxml')  # needs the lxml parser installed; 'html.parser' also works

# each article sits in a <div class="archiveEntry"> that contains an <a>, an <h3> and a <small>
divs = soup.find_all('div', class_='archiveEntry')
urls = []
titles = []
dates = []
for i in divs:
    urls.append(i.find('a')['href'].strip())
    titles.append(i.find('h3').text.strip())
    dates.append(i.find('small').text.strip())
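
To sanity-check the result, you can zip the three lists together and print the first few records (illustrative only):

# print the first three scraped records side by side
for date, title, link in list(zip(dates, titles, urls))[:3]:
    print(date, '|', title, '|', link)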

1 Comment

Thanks a lot, Ram, for your immediate help. I used Selenium simply because I wanted to scrape similar websites, although I have only mentioned one here. But your solution is working fine!
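
If the goal is to cover several similar pages, the same requests/BeautifulSoup approach from the answer can simply be wrapped in a loop. A sketch, assuming the other pages share the archiveEntry layout; only the rum tag URL comes from the question, the list is meant to be extended:

import bs4 as bs
import requests

# only the rum tag page is from the question; add other tag URLs as needed
pages = ['https://www.thespiritsbusiness.com/tag/rum/']

for page in pages:
    soup = bs.BeautifulSoup(requests.get(page).text, 'lxml')
    for entry in soup.find_all('div', class_='archiveEntry'):
        print(entry.find('small').text.strip(),
              entry.find('h3').text.strip(),
              entry.find('a')['href'].strip())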

