Data Extractor
Combine XPath, CSS selectors, and JSONPath for web data extraction.
Quickstarts
Installation
Install the stable version from PyPI.
pip install "data-extractor[jsonpath-extractor]" # for extracting JSON data
pip install "data-extractor[lxml]" # for extracting HTML data
Or install the latest version from GitHub.
pip install "data-extractor[jsonpath-extractor] @ git+https://github.com/linw1995/data_extractor.git@master"
Extract JSON data
JSON extraction currently relies on optional dependencies.
Install one of them, for example jsonpath-extractor via the extra shown under Installation, to extract JSON data.
Extract HTML(XML) data
HTML (XML) extraction currently relies on the following optional dependencies:
- lxml for using XPath
- cssselect for using CSS selectors
Usage
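It can help to see the transformation written out by hand before reading the declarative version. The sketch below is plain Python and does not use data_extractor at all. The function name `extract_users` is hypothetical, and the code is only an approximation of the result the example in this section asserts, not a description of how the library works internally.

```python
from typing import Any, Dict, List


def extract_users(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Hand-written stand-in for the declarative example (hypothetical helper)."""
    results = []
    # The JSONPath "data.users[*]" selects every element of payload["data"]["users"].
    for user in payload["data"]["users"]:
        results.append(
            {
                # The source key "name" is exposed under the output key "name".
                "name": user["name"],
                # Fall back to 17 when the "age" key is absent.
                "age": user.get("age", 17),
                # Nested record built from two counters on the same user object.
                "count": {
                    "followings": user["countFollowings"],
                    "fans": user["countFans"],
                },
            }
        )
    return results
```

With data_extractor, the same mapping is declared with Item and Field classes instead of a loop: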
from data_extractor import Field, Item, JSONExtractor


class Count(Item):
    followings = Field(JSONExtractor("countFollowings"))
    fans = Field(JSONExtractor("countFans"))


class User(Item):
    name_ = Field(JSONExtractor("name"), name="name")
    age = Field(JSONExtractor("age"), default=17)
    count = Count()


assert User(JSONExtractor("data.users[*]"), is_many=True).extract(
    {
        "data": {
            "users": [
                {
                    "name": "john",
                    "age": 19,
                    "countFollowings": 14,
                    "countFans": 212,
                },
                {
                    "name": "jack",
                    "description": "",
                    "countFollowings": 54,
                    "countFans": 312,
                },
            ]
        }
    }
) == [
    {"name": "john", "age": 19, "count": {"followings": 14, "fans": 212}},
    {"name": "jack", "age": 17, "count": {"followings": 54, "fans": 312}},
]
Changelog
v0.7.0
- 65d1fce Fix: Create JSONExtractor with wrong subtype
- 407cd78 New: Make lxml and cssselect optional (#61)

