Extracting a URL based on innerHTML text in Python

Question

I have multiple websites and I want to get the "Contact Us" URL for each of them. The URLs are not necessarily contained in the same class on every website. However, on all of the websites the link's innerHTML contains the word "contact".

Is there a way to extract a URL from a webpage if the innerHTML contains a specific word? For example, for the HTML below, I want to extract the URL if the innerHTML contains the word "contact" (case insensitive).

HTML = {
<a class="" style="COLOR: #000000; TEXT-DECORATION: none" href="http://www.candp.com/bin/index.asp?id=565B626C6C6A79504B575A4D626E" target="_parent">
   <font size="2">
      <strong>Contact Us</strong>
   </font>
</a>
}

Required output:

'http://www.candp.com/bin/index.asp?id=565B626C6C6A79504B575A4D626E'

This is the code I have so far, but it doesn't seem to work:

link=[]
driver.get(main_url)
elements = driver.find_elements_by_xpath("//a").get_attribute('href')   #  the href is not always contained in a tag
for el in elements:
    if 'contact'.casefold() in str(el.text):
         link.append(el.get_attribute('href'))

Any help is greatly appreciated.

Dharman · Accepted Answer · 2020-09-11 19:07:18Z

1

Try this:

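import requests
from bs4 import BeautifulSoup

# url and headers are assumed to be defined earlier
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
contact_links = []
for a in soup.find_all("a"):
    # get_text() includes text from nested tags such as <font> and <strong>
    if 'contact' in a.get_text().lower():
        contact_links.append(a.get('href'))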

The output for the URL you mentioned is:

<a href="http://www.candp.com/bin/index.asp?id=565B626C686E79504B575A4D626E" target="_blank"><font face="Verdana" size="1">Get more details</font></a>
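
For anyone who wants to try the same idea without installing anything, here is a minimal sketch that swaps BeautifulSoup for the standard library's html.parser. It is not the code from the answer above: the class name ContactLinkParser and the helper find_contact_links are hypothetical names made up for illustration, and sample_html simply combines the two anchors quoted in this thread. It gathers all text nested inside each <a> and keeps the href when that text contains the keyword, ignoring case.

```python
from html.parser import HTMLParser


class ContactLinkParser(HTMLParser):
    """Collect the href of every <a> whose nested text contains a keyword."""

    def __init__(self, keyword="contact"):
        super().__init__()
        self.keyword = keyword.casefold()
        self.links = []          # hrefs of matching anchors
        self._in_anchor = False  # True while an <a> is open
        self._href = None        # href of the currently open <a>
        self._text = []          # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Text inside <font>, <strong>, etc. also arrives here
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            text = "".join(self._text).casefold()
            if self.keyword in text and self._href:
                self.links.append(self._href)
            self._in_anchor = False
            self._href = None


def find_contact_links(html, keyword="contact"):
    """Hypothetical helper: return hrefs of anchors whose text has the keyword."""
    parser = ContactLinkParser(keyword)
    parser.feed(html)
    parser.close()
    return parser.links


# Sample markup assembled from the two anchors quoted in this thread
sample_html = """
<a class="" style="COLOR: #000000; TEXT-DECORATION: none"
   href="http://www.candp.com/bin/index.asp?id=565B626C6C6A79504B575A4D626E" target="_parent">
   <font size="2"><strong>Contact Us</strong></font>
</a>
<a href="http://www.candp.com/bin/index.asp?id=565B626C686E79504B575A4D626E" target="_blank">Get more details</a>
"""

print(find_contact_links(sample_html))
# ['http://www.candp.com/bin/index.asp?id=565B626C6C6A79504B575A4D626E']
```

Only the first anchor is returned, because "Get more details" does not contain "contact". For a live page you would pass the downloaded HTML text instead of sample_html.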

edited Sep 11, 2020 at 19:07

Dharman♦


answered Sep 11, 2020 at 19:01

ss_0708



frianH · Accepted Answer · 2020-09-11 20:22:30Z

1

Try the following code:

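from selenium.webdriver.common.by import By

link = []
# Selenium 4 spelling of find_elements_by_xpath; driver must already have loaded the page
elements = driver.find_elements(By.XPATH, "//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') , 'contact')]")
for el in elements:
    link.append(el.get_attribute("href"))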
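
To make the XPath predicate easier to read: "." is the string value of the <a> element (its own text plus the text of all its descendants), translate(...) lowercases the ASCII letters in it, and contains(...) checks for 'contact'. The sketch below expresses roughly the same check in plain Python with xml.etree.ElementTree, purely as an illustration. It assumes the markup is well-formed XML, which real pages often are not, and the wrapping <div> is added here only to give the snippet a single root. Note also that str.lower() handles more than ASCII, so it is not an exact equivalent of translate.

```python
import xml.etree.ElementTree as ET

# The two anchors quoted in this thread, wrapped in a <div> (an assumption)
snippet = """<div>
<a class="" href="http://www.candp.com/bin/index.asp?id=565B626C6C6A79504B575A4D626E" target="_parent">
   <font size="2"><strong>Contact Us</strong></font>
</a>
<a href="http://www.candp.com/bin/index.asp?id=565B626C686E79504B575A4D626E" target="_blank">Get more details</a>
</div>"""

root = ET.fromstring(snippet)

matches = []
for a in root.iter("a"):
    # Joining itertext() gives the element's own text plus all descendant
    # text, which is what "." refers to inside the XPath predicate.
    string_value = "".join(a.itertext())
    if "contact" in string_value.lower():
        matches.append(a.get("href"))

print(matches)
# ['http://www.candp.com/bin/index.asp?id=565B626C6C6A79504B575A4D626E']
```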

answered Sep 11, 2020 at 20:22

frianH


1 Comment

Renu sharma Over a year ago

it gives an empty list
