
I would like to scrape multiple URLs using Selenium, but only one URL ends up being scraped. What could be wrong with the code? Thank you!


    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException

    def __init__(self):
        
        #headless options
        options = Options()
        options.add_argument('--no-sandbox')
        options.add_argument("--headless")
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        options.add_experimental_option('useAutomationExtension', False)
        self.driver = webdriver.Chrome('path', options=options)

    
    def parse(self,response):

        start_urls = [
            'https://www.milieuproperties.com/search-results.aspx?paramb=ADVANCE%20SEARCH:%20Province%20(Western%20Cape),%20%20Area%20(Cape%20Town)',
            'https://www.milieuproperties.com/search-results-rent.aspx?paramb=ADVANCE%20SEARCH:%20Province%20(Western%20Cape),%20Rental%20To%20'
        ]
        links = []

        for url in start_urls:
            self.driver.get(url)
            current_page_number = self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text
            while True:
                links.extend([link.get_attribute('href') for link in self.driver.find_elements_by_css_selector('.hoverdetail a')])
                try: 
                    elem = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ContentPlaceHolder1_lvDataPager1"]/a[text()="Next" and not(@class)]')))
                    elem.click()
                except TimeoutException:
                    break
                WebDriverWait(self.driver, 10).until(lambda driver: self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text != current_page_number)
                current_page_number = self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text

1 Answer

For some reason the server returns a 500 status:

Server Error in '/' Application. Column '<SOME COLUMN NAME>' does not belong to table

when several requests are sent in the same session. My guess is it's some kind of anti-bot protection. As a workaround, open the URLs in separate sessions rather than a single one: move the self.driver definition inside the for loop and quit the driver at the end of each iteration:

    for url in start_urls:
        self.driver = webdriver.Chrome('path', options=options)
        self.driver.get(url)
        ...
        current_page_number = self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text
        self.driver.quit()
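The one-session-per-URL pattern can be sketched end to end. The FakeDriver below is a hypothetical stand-in for webdriver.Chrome so the control flow (fresh session per URL, links accumulated across sessions, driver always quit) can run without a browser; with Selenium installed you would construct webdriver.Chrome(options=options) inside the loop instead. The URLs and link-collection step are illustrative placeholders, not the site from the question:

```python
class FakeDriver:
    """Stand-in for webdriver.Chrome: pretends each URL serves two links."""
    def __init__(self):
        self.current_url = None

    def get(self, url):
        self.current_url = url

    def find_links(self):
        # Real code: [a.get_attribute('href') for a in
        #             self.find_elements_by_css_selector('.hoverdetail a, .ct-hover a')]
        return [f"{self.current_url}#listing-{i}" for i in (1, 2)]

    def quit(self):
        pass  # real driver would close the browser session here


def scrape(start_urls, driver_factory):
    """Collect links across URLs, opening a fresh session for each one."""
    links = []
    for url in start_urls:
        driver = driver_factory()  # new session per URL: the workaround
        try:
            driver.get(url)
            links.extend(driver.find_links())
        finally:
            driver.quit()          # always tear the session down
    return links


start_urls = ["https://example.com/sale", "https://example.com/rent"]
links = scrape(start_urls, FakeDriver)
print(len(links))  # two links per URL, both URLs scraped
```

The try/finally guarantees the session is closed even if a page fails to load, so a crash on one URL does not leak a browser process or block the remaining URLs.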

4 Comments

My list of links still gets filled only with the links from the first URL. I can see it navigates to the second URL, it just doesn't collect the links from there. Any idea why? @JaSON
@saraherceg It seems the element has different locators on the first and second URLs. Try replacing '.hoverdetail a' with '.hoverdetail a, .ct-hover a'
Hmm, still only links from one page even after replacing it. Thank you very much for your help, by the way! @JaSON
@saraherceg Are you sure? I got results from both URLs with [link.get_attribute('href') for link in driver.find_elements_by_css_selector('.hoverdetail a, .ct-hover a')]
