1

I am new to Python and am trying to scrape a website. I can log in to the site and retrieve an HTML page, but I don't need the whole page, only the hyperlinks in one specific table.

I have written the code below, but it prints every hyperlink on the page.

soup = BeautifulSoup(the_page)
for table in soup.findAll('table',{'id':'ctl00_Main_lvMyAccount_Table1'} ):
        for link in soup.findAll('a'):
                print link.get('href')

Can anyone help me see where I am going wrong?

Below is the HTML of the table:

<table id="ctl00_Main_lvMyAccount_Table1" width="680px">
 <tr id="ctl00_Main_lvMyAccount_Tr1">
    <td id="ctl00_Main_lvMyAccount_Td1">
                        <table id="ctl00_Main_lvMyAccount_itemPlaceholderContainer" border="1" cellspacing="0" cellpadding="3">
        <tr id="ctl00_Main_lvMyAccount_Tr2" style="background-color:#0090dd;">
            <th id="ctl00_Main_lvMyAccount_Th1"></th>
            <th id="ctl00_Main_lvMyAccount_Th2">

                                    <a id="ctl00_Main_lvMyAccount_SortByAcctNum" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctNum','')">
                                        <font color=white>
                                            <span id="ctl00_Main_lvMyAccount_AcctNum">Account number</span>
                                        </font>

                                        </a>
                                </th>
            <th id="ctl00_Main_lvMyAccount_Th4">
                                    <a id="ctl00_Main_lvMyAccount_SortByServAdd" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByServAdd','')">
                                    <font color=white>
                                        <span id="ctl00_Main_lvMyAccount_ServiceAddress">Service address</span>
                                    </font>
                                    </a>
                                </th>
            <th id="ctl00_Main_lvMyAccount_Th5">
                                    <a id="ctl00_Main_lvMyAccount_SortByAcctName" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctName','')">
                                    <font color=white>
                                        <span id="ctl00_Main_lvMyAccount_AcctName">Name</span>
                                    </font>
                                    </a>
                                </th>
            <th id="ctl00_Main_lvMyAccount_Th6">
                                    <a id="ctl00_Main_lvMyAccount_SortByStatus" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByStatus','')">
                                    <font color=white>
                                        <span id="ctl00_Main_lvMyAccount_AcctStatus">Account status</span>
                                    </font>
                                    </a>
                                </th>
            <th id="ctl00_Main_lvMyAccount_Th3"></th>
        </tr>


            <tr>
                <td>

Thanks in advance.

2
  • Which hyperlink of those do you need? Commented Nov 13, 2013 at 12:39
  • All the href values of all the anchor (a) tags. I have pasted only part of the HTML; the full list contains many more. Commented Nov 13, 2013 at 12:42

3 Answers

1

Well, this is the right way to do it.

soup = BeautifulSoup(the_page)
for table in soup.findAll('table', {'id': 'ctl00_Main_lvMyAccount_Table1'}):
    for link in table.findAll('a'):  # search for links only in the table
        print(link['href'])  # get the href attribute

Also, you can skip the parent loop, since there would be only one match for the specified id:

soup = BeautifulSoup(the_page)
table = soup.find('table', {'id': 'ctl00_Main_lvMyAccount_Table1'})
for link in table.findAll('a'):  # search for links only in the table
    print(link['href'])  # get the href attribute

Update: Noticed what @DSM said. Fixed a missing quote in the table assignment.
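
To make the difference between searching from soup and searching from table concrete, here is a minimal, self-contained sketch. The SAMPLE_HTML string and its hrefs are made up for illustration (they are not from the asker's page), and it assumes the bs4 package with Python's built-in "html.parser". It uses find_all, the bs4 spelling of findAll, and link.get('href') so that an anchor without an href yields None instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down page: one link outside the table, two inside it.
SAMPLE_HTML = """
<a href="/logout">Log out</a>
<table id="ctl00_Main_lvMyAccount_Table1">
  <tr><th><a href="/sort/number">Account number</a></th></tr>
  <tr><td><a href="/account/1">View</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "ctl00_Main_lvMyAccount_Table1"})

# Searching from the soup object walks the whole document.
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# Searching from the table tag walks only the table's descendants.
table_hrefs = [a.get("href") for a in table.find_all("a")]

print(all_hrefs)
print(table_hrefs)
```

On a real page, soup.find returns None when no table has that id, so it may be worth checking table before calling find_all on it.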


1 Comment

You're missing a ' before ctl.
0

Make sure your for loop searches the table element, not the soup variable, which holds the HTML of the whole page:

from bs4 import BeautifulSoup

page = BeautifulSoup(the_page)
table = page.find('table', {'id': 'ctl00_Main_lvMyAccount_Table1'})
links = table.findAll('a')

# Print the href of each link found in the table
for link in links:
    print(link['href'])

Result

In [8]: table = page.find('table', {'id' : 'ctl00_Main_lvMyAccount_Table1'})

In [9]: links = table.findAll('a')

In [10]: for link in links:
   ....:     print link['href']
   ....:     
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctNum','')
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByServAdd','')
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctName','')
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByStatus','')

2 Comments

I am getting this error: Traceback (most recent call last): File "C:\MiamiDade_Scraping\latest1.py", line 59, in <module> table = the_page.find('table', {'id': 'ctl00_Main_lvMyAccount_Table1'}) TypeError: slice indices must be integers or None or have an __index__ method
Which versions of Python and BeautifulSoup are you using?
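
One possible reading of the traceback in the comment above: it shows find being called on the_page rather than on the BeautifulSoup object. If the_page is the raw HTML string (an assumption, based on it being passed to BeautifulSoup in the question), that call is str.find, which treats its second argument as a start index and rejects a dict. A small sketch using only the standard library, with a made-up HTML string:

```python
# Hypothetical raw HTML string standing in for the downloaded page.
the_page = "<table id='ctl00_Main_lvMyAccount_Table1'></table>"

try:
    # This is str.find(sub, start): the dict lands in the "start" slot.
    the_page.find("table", {"id": "ctl00_Main_lvMyAccount_Table1"})
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

If that is what happened, calling find on the parsed object (page in the answer above) instead of on the_page should avoid the error.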
0

Your nested loop, for link in soup.findAll('a'):, searches the entire HTML page. To search for links only within the table, change that line to:

for link in table.findAll('a'):
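
As an alternative sketch, the same scoping can be written as a single CSS selector with select, which bs4 provides. The sample HTML below is hypothetical, and the a[href] part is an added choice that skips anchors without an href attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one link outside the table, and inside it one real
# link plus one anchor that has no href attribute.
SAMPLE_HTML = """
<a href="/logout">Log out</a>
<table id="ctl00_Main_lvMyAccount_Table1">
  <tr><td>
    <a href="/account/1">View</a>
    <a name="no-href-here">placeholder</a>
  </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "table#<id> a[href]" = anchors with an href that are descendants of that table.
links = soup.select("table#ctl00_Main_lvMyAccount_Table1 a[href]")
selected_hrefs = [a["href"] for a in links]

print(selected_hrefs)
```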

1 Comment

Thanks, it works. But I still need only some of the links: I am now getting all the links, but I need only the select links and not the delete links. The HTML for the select link is <a id="ctl00_Main_lvMyAccount_ctrl1_lnkSelect" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl1$lnkSelect','')">View </a>
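
For the follow-up in the comment above, one way to keep only the select links is to filter on the id attribute with a regular expression, which find_all accepts as an attribute value. This is a sketch under assumptions: the select link is copied from the comment, but the delete link's id (ending in lnkDelete) is a guess made up for the example, so check the real page for the actual pattern.

```python
import re

from bs4 import BeautifulSoup

# The first anchor is from the comment; the second (lnkDelete) is hypothetical.
SAMPLE_HTML = """
<table id="ctl00_Main_lvMyAccount_Table1">
  <tr><td>
    <a id="ctl00_Main_lvMyAccount_ctrl1_lnkSelect" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl1$lnkSelect','')">View </a>
    <a id="ctl00_Main_lvMyAccount_ctrl1_lnkDelete" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$ctrl1$lnkDelete','')">Delete </a>
  </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "ctl00_Main_lvMyAccount_Table1"})

# Keep only anchors whose id ends with "lnkSelect".
select_links = table.find_all("a", id=re.compile(r"lnkSelect$"))
select_hrefs = [a["href"] for a in select_links]

for href in select_hrefs:
    print(href)
```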
