A simple way to structure your web scraper.
- Define the request.
- Extract the data from the response.
- Validate the data against JSON Schema.
## Install

Using npm:

```
npm install yolo-scraper --save
```

## Usage
Define your scraper function.
```js
var yoloScraper = require('yolo-scraper');

var scraper = yoloScraper.createScraper({

  request: function (username) {
    return 'https://www.npmjs.com/~' + username.toLowerCase();
  },

  extract: function (response, body, $) {
    return $('.collaborated-packages li').toArray().map(function (element) {
      var $element = $(element);
      return {
        name: $element.find('a').text(),
        url: $element.find('a').attr('href'),
        version: $element.find('strong').text()
      };
    });
  },

  schema: {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "array",
    "items": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "name": { "type": "string" },
        "url": { "type": "string", "format": "uri" },
        "version": { "type": "string", "pattern": "^v\\d+\\.\\d+\\.\\d+$" }
      },
      "required": ["name", "url", "version"]
    }
  }

});
```

Then use it.
```js
scraper('masterT', function (error, data) {
  console.log(error || data);
});
```

## Documentation
### ValidationError

`Error` instance with an additional `Object` property, `errorObjects`, which contains all the error information; see the ajv error documentation.
### ListValidationError

`Error` instance with an additional `Array` property, `validationError`, containing `ValidationError` instances.
### createScraper(options)

Returns a scraper function defined by the given options.
```js
var yoloScraper = require('yolo-scraper');

var options = {
  // ...
};
var scraper = yoloScraper.createScraper(options);
```

#### options.paramsSchema
The JSON schema that defines the shape of the accepted arguments passed to options.request. When invalid, an Error will be thrown.
Optional
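For instance, the usage example above calls the scraper with a single username string, so a matching `paramsSchema` might look like the following (a hypothetical sketch using the same JSON Schema draft-04 conventions as the `schema` example; it is not part of the library itself):

```js
// Hypothetical paramsSchema: require the scraper to be called with a
// single non-empty string (the npm username from the usage example).
var paramsSchema = {
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "string",
  "minLength": 1
};
```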
#### options.request = function(params)
Function that takes the arguments passed to your scraper function and returns the options to pass to the request module to make the network request.
Required
#### options.extract = function(response, body, $)

Function that takes the request response, the response body (a `String`) and a cheerio instance. It returns the data you want to extract.
Required
#### options.schema
The JSON schema that defines the shape of your extracted data. When your data is invalid, an Error with the validation message will be passed to your scraper callback.
Required
#### options.cheerioOptions

The options to pass to cheerio when it loads the response body.

Optional, default: `{}`
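For example, when the endpoint you scrape returns XML rather than HTML, you could pass cheerio's `xmlMode` flag through this option (a sketch; see the cheerio documentation for the full list of load options):

```js
// Hypothetical options object: only cheerioOptions is shown here;
// request, extract and schema would be defined as in the usage example.
var options = {
  cheerioOptions: { xmlMode: true } // parse the response body as XML
};
```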
#### options.ajvOptions

The options to pass to ajv when it compiles the JSON schemas.

Optional, default: `{allErrors: true}`, which checks all rules and collects all errors.
#### options.validateList

Use this option to validate each item of the extracted data individually. When `true`, the extracted data must be an `Array`; otherwise an `Error` is passed to the callback.

Optional, default: `false`
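The per-item behaviour can be sketched in plain JavaScript: each item is checked on its own, invalid items produce errors while the valid ones are still returned. This is a simplified model of the behaviour described above, not the library's actual implementation; `isValid` stands in for the compiled ajv validator.

```js
// Simplified model of validateList: keep the valid items, collect an
// error for each invalid one.
function validateItems(items, isValid) {
  var errors = [];
  var data = [];
  items.forEach(function (item, index) {
    if (isValid(item)) {
      data.push(item);
    } else {
      errors.push(new Error('Invalid item at index ' + index));
    }
  });
  return { error: errors.length ? errors : null, data: data };
}

// Example: only objects with a string `name` count as valid.
var result = validateItems(
  [{ name: 'cheerio' }, { name: 42 }],
  function (item) { return typeof item.name === 'string'; }
);
// result.data contains only the valid item; result.error lists one error.
```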
### scraper function

To use your scraper function, pass it the params to send to `options.request` and a callback function.
```js
scraper(params, function (error, data) {
  if (error) {
    // handle the `error`
  } else {
    // do something with `data`
  }
});
```

#### callback(error, data)
- When a network request error occurs, the `error` argument will be an `Error` and `data` will be `null`.
- When `options.validateList = false` and a validation error occurs, `error` will be a `ValidationError` and `data` will be `null`. Otherwise, `error` will be `null` and `data` will be the value returned by `options.extract`.
- When `options.validateList = true` and validation errors occur, `error` will be a `ListValidationError`; otherwise it will be `null`. If the value returned by `options.extract` is not an `Array`, `error` will be an instance of `Error`. `data` will always be an `Array` containing only the valid items returned by `options.extract`. Note that `error` being a `ListValidationError` does not mean there is no `data`!
## Dependencies
- request - Simplified HTTP request client.
- cheerio - Tiny, fast, and elegant implementation of core jQuery designed specifically for the server.
- ajv - Another JSON Schema Validator.
## Dev dependencies

## Test

```
npm test
```

## License
MIT


