Web Scraping Framework
Project description
What is Grab?
Grab is a Python web scraping framework. It provides a number of helpful methods to perform network requests, scrape web sites, and process the scraped content:
- Automatic cookies (session) support
- HTTP and SOCKS proxy with/without authorization
- Keep-Alive support
- IDN support
- Tools to work with web forms
- Easy multipart file uploading
- Flexible customization of HTTP requests
- Automatic charset detection
- Powerful API to extract data from the DOM tree of HTML documents with XPath queries
- Asynchronous API to make thousands of simultaneous queries. This part of the library is called Spider; see the list of Spider features below.
- Python 3 ready
Spider is a framework for writing web-site scrapers. Features:
- Rules and conventions to organize the request/parse logic in separate blocks of code
- Multiple parallel network requests
- Automatic processing of network errors (failed tasks go back to task queue)
- You can create network requests and parse responses with Grab API (see above)
- HTTP proxy support
- Caching network results in permanent storage
- Different backends for task queue (in-memory, redis, mongodb)
- Tools to debug and collect statistics
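Two of the Spider features listed above, dispatching each task to a handler named after it and returning failed tasks to the queue, can be illustrated with a small offline sketch. This is not Grab's implementation: `MiniTask`, `MiniSpider`, `DemoSpider`, `make_fake_fetch`, the `max_attempts` limit, and the `*.example` URLs are all hypothetical names invented for this illustration, and the loop is single-threaded, whereas Spider runs requests in parallel.

```python
from collections import deque


class MiniTask:
    """Hypothetical stand-in for a task: a handler name plus a URL."""

    def __init__(self, name, url):
        self.name = name
        self.url = url
        self.attempts = 0


class MiniSpider:
    """Toy task loop (an assumption-laden sketch, not Grab's Spider)."""

    max_attempts = 3  # hypothetical retry limit

    def __init__(self, fetch):
        self.fetch = fetch    # callable(url) -> body; may raise OSError
        self.queue = deque()  # in-memory task queue
        self.results = []
        self.failed = []

    def add_task(self, task):
        self.queue.append(task)

    def run(self):
        while self.queue:
            task = self.queue.popleft()
            try:
                body = self.fetch(task.url)
            except OSError:
                # Network error: put the task back unless it is out of attempts.
                task.attempts += 1
                if task.attempts < self.max_attempts:
                    self.queue.append(task)
                else:
                    self.failed.append(task.url)
                continue
            # Dispatch by naming convention: a task named "page" goes to task_page.
            getattr(self, 'task_' + task.name)(body, task)


class DemoSpider(MiniSpider):
    def task_page(self, body, task):
        self.results.append((task.url, body))


def make_fake_fetch():
    """Fake network layer: 'flaky' URLs fail once, 'dead' URLs always fail."""
    calls = {}

    def fetch(url):
        calls[url] = calls.get(url, 0) + 1
        if 'dead' in url:
            raise OSError('always fails')
        if 'flaky' in url and calls[url] == 1:
            raise OSError('fails on the first call')
        return 'body of ' + url

    return fetch


def demo():
    bot = DemoSpider(make_fake_fetch())
    for url in ('http://ok.example/', 'http://flaky.example/', 'http://dead.example/'):
        bot.add_task(MiniTask('page', url))
    bot.run()
    return bot.results, bot.failed
```

Calling `demo()` returns the parsed results together with the URLs that were abandoned after exhausting their attempts. The flaky URL succeeds on its second attempt because the failed task was re-queued instead of being dropped.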
Grab Example
    import logging

    from grab import Grab

    logging.basicConfig(level=logging.DEBUG)

    g = Grab()
    g.go('https://github.com/login')
    g.doc.set_input('login', '****')
    g.doc.set_input('password', '****')
    g.doc.submit()
    g.doc.save('/tmp/x.html')
    g.doc('//ul[@id="user-links"]//button[contains(@class, "signout")]').assert_exists()
    home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
    repo_url = home_url + '?tab=repositories'
    g.go(repo_url)
    for elem in g.doc.select('//h3[@class="repo-list-name"]/a'):
        print('%s: %s' % (elem.text(), g.make_url_absolute(elem.attr('href'))))
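The last step of the example, selecting links with an XPath query and turning relative `href` values into absolute URLs, can be tried offline with the standard library alone. The sketch below does not use Grab: it assumes `xml.etree.ElementTree` (which supports only a subset of XPath and requires well-formed markup) in place of `g.doc.select`, and `urllib.parse.urljoin` in place of `g.make_url_absolute`. The sample markup, the base URL, and the `extract_repo_links` helper are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed markup standing in for a fetched page.
SAMPLE_HTML = """
<html><body>
  <h3 class="repo-list-name"><a href="/example/repo-one">repo-one</a></h3>
  <h3 class="repo-list-name"><a href="/example/repo-two">repo-two</a></h3>
  <h3 class="other"><a href="/example/ignored">ignored</a></h3>
</body></html>
"""


def extract_repo_links(html, base_url):
    """Return (link text, absolute URL) pairs for links in matching headings."""
    root = ET.fromstring(html.strip())
    links = []
    # ElementTree paths are relative to the root element, hence the leading ".".
    for anchor in root.findall('.//h3[@class="repo-list-name"]/a'):
        links.append((anchor.text, urljoin(base_url, anchor.get('href'))))
    return links


if __name__ == '__main__':
    base = 'https://github.com/example?tab=repositories'  # hypothetical page URL
    for text, url in extract_repo_links(SAMPLE_HTML, base):
        print('%s: %s' % (text, url))
```

Real-world HTML is rarely well-formed XML, which is one reason a scraping library uses a forgiving HTML parser with full XPath support rather than `ElementTree`.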
Grab::Spider Example
    import logging

    from grab.spider import Spider, Task

    logging.basicConfig(level=logging.DEBUG)


    class ExampleSpider(Spider):
        def task_generator(self):
            for lang in 'python', 'ruby', 'perl':
                url = 'https://www.google.com/search?q=%s' % lang
                yield Task('search', url=url, lang=lang)

        def task_search(self, grab, task):
            print('%s: %s' % (task.lang,
                              grab.doc('//div[@class="s"]//cite').text()))


    bot = ExampleSpider(thread_number=2)
    bot.run()
Installation
$ pip install -U grab
For details about installing Grab on different platforms, see http://docs.grablib.org/en/latest/usage/installation.html
Documentation and Help
Documentation: http://docs.grablib.org/en/latest/
Mailing list (mostly Russian): http://groups.google.com/group/python-grab/
Contribution
To report a bug, please use the GitHub issue tracker: https://github.com/lorien/grab/issues
If you want to develop a new feature in Grab, please use the issue tracker to describe what you want to do, or contact me at lorien@lorien.name
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
| Filename, size | File type | Python version | Upload date | Hashes |
|---|---|---|---|---|
| grab-0.6.41.tar.gz (1.1 MB) | Source | None | | View |

