Using WebDriver to get all text from a source page in Python

Question

I am using Selenium WebDriver (Firefox) to crawl some data from a website. I just found that opening the web page is slower than opening the source of that page. In other words, it took much longer to go to 'www.google.com' than to go to 'view-source:www.google.com'.

So I was wondering whether I can use webdriver to get all text from a source page, rather than a normal page.

I tried using driver.page_source on the source page, but it returned a mess that I don't want.

Simon Kirsten · Accepted Answer · 2016-08-12 21:29:21Z


If you only need the source, use requests. Install it with pip:

pip install requests

And use it like so:

import requests

r = requests.get("http://google.com/")
# r.content (bytes), r.text (str), r.json(), and r.status_code can be used

For advanced usage, refer to the requests documentation.

Note: If you need to parse the HTML, use BeautifulSoup and pass it r.content.
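
To make the "get all text" step concrete, here is a minimal sketch of turning an HTML source string into plain text. The note above recommends BeautifulSoup; this sketch uses the standard library's html.parser instead, so that it runs without extra dependencies. The names TextExtractor and html_to_text and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def get_text(self):
        # One text node per line
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


# Illustrative sample markup, not taken from any real page
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

With the snippet above, this would be called as html_to_text(r.text). If the page has to be loaded in a browser first, the same helper should also accept the string returned by Selenium's driver.page_source.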


3 Comments

Marco Over a year ago

Yes, but I have to use WebDriver because I need to manually pass the reCAPTCHA check.

Simon Kirsten Over a year ago

This should provide you with options to get the source code. Also, to optimize load speeds you could disable images like here.

jpaugh Over a year ago

@user3182260 In order to pass the captcha check, you'll probably need to render the page, not just download the source. You might try PhantomJS instead of Selenium + browser. Or, it might render faster in another browser.
