I have a list of links. For each link, I want to check whether the page it points to contains a specific sublink and, if it does, append that sublink to the initial list. I have this code:

def getAllLinks():
    i = 0
    baseUrl = 'http://www.cdep.ro/pls/legis/'
    sourcePaths = ['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0','legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0']
    while i < len(sourcePaths)+1:
        for path in sourcePaths:
            res = requests.get(f'{baseUrl}{path}')
            soup = BeautifulSoup(res.text)

            next_btn = soup.find(lambda e: e.name == 'td' and '1..99' in e.text)
            if next_btn:
                for a in next_btn.find_all('a', href=True):
                    linkNextPage = a['href']
                    sourcePaths.append(linkNextPage)
                    i += 1
                break

            else:
                i += 1
                continue
            break

    return sourcePaths

print(getAllLinks())

The first link in the list does not contain the sublink, so it's an else case. The code does this OK. However, the second link in the list does contain the sublink, but it gets stuck here:

for a in next_btn.find_all('a', href=True):
    linkNextPage = a['href']
    sourcePaths.append(linkNextPage)
    i += 1

The third link contains the sublink but my code does not get to look at that link. At the end I am getting a list containing the initial links plus 4 times the sublink of the second link.

I think I'm breaking incorrectly somewhere but I can't figure out how to fix it.

  • The break only exits the inner for loop, so the code gets "stuck" in the outer while loop. (Commented Aug 8, 2020 at 14:40)
  • What are you trying to achieve? (Commented Aug 8, 2020 at 14:42)
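
The behaviour described in the first comment can be reproduced without any network access. The sketch below replays the question's control flow, with a plain dict standing in for the page lookup; the function name, the dict and the short path names are all made up for illustration.

```python
def trace_original_loop(paths, next_page):
    """Replay the question's while/for/break structure offline.

    next_page is a hypothetical stand-in for the BeautifulSoup lookup:
    it maps a path to its sublink; paths missing from it have none.
    """
    paths = list(paths)
    i = 0
    while i < len(paths) + 1:
        for path in paths:
            link = next_page.get(path)
            if link:
                paths.append(link)
                i += 1
                break  # leaves the for loop only; the while restarts it
            else:
                i += 1
                continue
    return paths


result = trace_original_loop(["a", "b", "c"], {"b": "b-next", "c": "c-next"})
print(result)
# ['a', 'b', 'c', 'b-next', 'b-next', 'b-next', 'b-next']
```

Every pass of the while loop starts the for loop again from the first path, hits "b", appends its sublink and breaks, so "c" is never reached. The loop only stops because i eventually catches up with the growing list.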

2 Answers

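Remove the while loop. It's not needed, because a for loop over a list also visits items appended to that list while the loop is running. Also change the selectors: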

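import requests
from bs4 import BeautifulSoup

def getAllLinks():
    baseUrl = 'http://www.cdep.ro/pls/legis/'
    sourcePaths = ['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0','legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0']

    # Paths appended inside the loop are visited too, so no outer while is needed.
    for path in sourcePaths:
        res = requests.get(f'{baseUrl}{path}')
        soup = BeautifulSoup(res.text, "html.parser")

        # Look for the centered table inside the "headline" paragraph;
        # guard against pages that have no such paragraph.
        headline = soup.find("p", class_="headline")
        next_btn = headline.find("table", {"align": "center"}) if headline else None
        if next_btn:
            # Take the link from the last cell of that table, if there is one.
            cells = next_btn.find_all("td")
            anchor = cells[-1].find("a") if cells else None
            if anchor:
                sourcePaths.append(anchor["href"])
    return sourcePaths

print(getAllLinks())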

Output:

['legis_pck.lista_anuala?an=2012&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=1', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0', 'legis_pck.lista_anuala?an=2020&emi=3&tip=18&rep=0&nrc=100', 'legis_pck.lista_anuala?an=2010&emi=3&tip=18&rep=0&nrc=100']
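
The reason a single for loop is enough can be checked offline. This is a minimal sketch, not the code above: find_next is a hypothetical callable replacing the request and the HTML parsing, and fake_site with its short path names is invented test data. The seen set is an extra precaution (not in the answer) so that pages linking back to each other cannot make the loop run forever.

```python
def collect_links(start_paths, find_next):
    """Follow sublinks from each starting path.

    find_next is a hypothetical stand-in for "fetch the page and find
    its sublink"; it returns the next path, or None if there is none.
    """
    paths = list(start_paths)
    seen = set(paths)
    for path in paths:  # also visits the paths appended below
        nxt = find_next(path)
        if nxt and nxt not in seen:
            seen.add(nxt)
            paths.append(nxt)
    return paths


# Invented data: which path links to which sublink.
fake_site = {
    "p2020": "p2020&nrc=100",
    "p2010": "p2010&nrc=100",
    "p2010&nrc=100": "p2010&nrc=199",
}
links = collect_links(["p2012", "p2020", "p2010"], fake_site.get)
print(links)
```

Here the sublink found on "p2010&nrc=100" is picked up as well, because that path was appended while the loop was still running.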
1 Comment

:O this is beautiful. Thank you.

Your second break statement is never executed: the if branch has already left the for loop through the first break, and the else branch ends in continue, so control never reaches it. Neither break exits the while loop, so add a condition that ends the while loop instead.
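
One way to put such a condition on the while loop, as a rough sketch: keep the break, but let a flag decide whether another pass is needed. As before, find_next is a hypothetical placeholder for the request and parsing step, and the names and sample data are made up.

```python
def collect_with_flag(start_paths, find_next):
    """Keep the while/for/break shape, but end the while explicitly."""
    paths = list(start_paths)
    checked = set()
    finished = False
    while not finished:
        finished = True  # assume this is the last pass until a link is found
        for path in paths:
            if path in checked:
                continue  # already handled in an earlier pass
            checked.add(path)
            nxt = find_next(path)
            if nxt and nxt not in paths:
                paths.append(nxt)
                finished = False  # the list grew, so scan again
                break
    return paths


found = collect_with_flag(["a", "b", "c"], {"b": "b2", "c": "c2"}.get)
print(found)
# ['a', 'b', 'c', 'b2', 'c2']
```

The checked set is what keeps each pass from stopping at the same path again, which is the step the original loop was missing.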
