A simple way to structure your web scraper.
- Define the request.
- Extract the data from the response.
- Validate the data against JSON Schema.
## Install

Using npm:

```
npm install yolo-scraper --save
```

## Usage
Define your scraper function.
```js
var yoloScraper = require('yolo-scraper');

var scraper = yoloScraper.createScraper({

  request: function (username) {
    return 'https://www.npmjs.com/~' + username.toLowerCase();
  },

  extract: function (response, body, $) {
    return $('.collaborated-packages li').toArray().map(function (element) {
      var $element = $(element);
      return {
        name: $element.find('a').text(),
        url: $element.find('a').attr('href'),
        version: $element.find('strong').text()
      };
    });
  },

  schema: {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "array",
    "items": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "name": { "type": "string" },
        "url": { "type": "string", "format": "uri" },
        "version": { "type": "string", "pattern": "^v\\d+\\.\\d+\\.\\d+$" }
      },
      "required": ["name", "url", "version"]
    }
  }

});
```

Then use it.
```js
scraper('masterT', function (error, data) {
  console.log(error || data);
});
```

## Documentation
### ValidationError

`Error` instance with an additional `Object` property, `errorObjects`, which contains all the error information; see the ajv error documentation.
### ListValidationError

`Error` instance with an additional `Array` property, `validationError`, containing `ValidationError` instances.
### createScraper(options)

Returns a scraper function defined by the given options.
```js
var yoloScraper = require('yolo-scraper');

var options = {
  // ...
};
var scraper = yoloScraper.createScraper(options);
```

#### options.paramsSchema
The JSON schema that defines the shape of the accepted arguments passed to options.request. When invalid, an Error will be thrown.
Optional
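For instance, the usage example above calls the scraper with a single username string, so a matching `paramsSchema` might look like the following (a hypothetical sketch using the same JSON Schema draft-04 conventions as the `schema` example; it is not part of the library itself):

```js
// Hypothetical paramsSchema: require the scraper to be called with a
// single non-empty string (the npm username from the usage example).
var paramsSchema = {
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "string",
  "minLength": 1
};
```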
#### options.request = function(params)
Function that takes the arguments passed to your scraper function and returns the options to pass to the request module to make the network request.
Required
#### options.extract = function(response, body, $)

Function that takes the request response, the response body (a `String`) and a cheerio instance. It returns the data you want to extract.
Required
#### options.schema
The JSON schema that defines the shape of your extracted data. When your data is invalid, an Error with the validation message will be passed to your scraper callback.
Required
#### options.cheerioOptions

The options to pass to cheerio when it loads the response body.

Optional, default: `{}`
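For example, when the endpoint you scrape returns XML rather than HTML, you could pass cheerio's `xmlMode` flag through this option (a sketch; see the cheerio documentation for the full list of load options):

```js
// Hypothetical options object: only cheerioOptions is shown here;
// request, extract and schema would be defined as in the usage example.
var options = {
  cheerioOptions: { xmlMode: true } // parse the response body as XML
};
```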
#### options.ajvOptions

The options to pass to ajv when it compiles the JSON schemas.

Optional, default: `{allErrors: true}`, which checks all rules and collects all errors.
#### options.validateList

Use this option to validate each item of the extracted data individually. When `true`, the extracted data must be an `Array`; otherwise an `Error` is passed to the callback.

Optional, default: `false`
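The per-item behaviour can be sketched in plain JavaScript: each item is checked on its own, invalid items produce errors while the valid ones are still returned. This is a simplified model of the behaviour described above, not the library's actual implementation; `isValid` stands in for the compiled ajv validator.

```js
// Simplified model of validateList: keep the valid items, collect an
// error for each invalid one.
function validateItems(items, isValid) {
  var errors = [];
  var data = [];
  items.forEach(function (item, index) {
    if (isValid(item)) {
      data.push(item);
    } else {
      errors.push(new Error('Invalid item at index ' + index));
    }
  });
  return { error: errors.length ? errors : null, data: data };
}

// Example: only objects with a string `name` count as valid.
var result = validateItems(
  [{ name: 'cheerio' }, { name: 42 }],
  function (item) { return typeof item.name === 'string'; }
);
// result.data contains only the valid item; result.error lists one error.
```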
### scraper function

To use your scraper function, pass it the params to send to `options.request` and a callback function.
```js
scraper(params, function (error, data) {
  if (error) {
    // handle the `error`
  } else {
    // do something with `data`
  }
});
```

#### callback(error, data)
- When a network request error occurs, the `error` argument will be an `Error` and `data` will be `null`.
- When `options.validateList = false` and a validation error occurs, `error` will be a `ValidationError` and `data` will be `null`. Otherwise, `error` will be `null` and `data` will be the value returned by `options.extract`.
- When `options.validateList = true` and validation errors occur, `error` will be a `ListValidationError`; otherwise it will be `null`. If the value returned by `options.extract` is not an `Array`, `error` will be an instance of `Error`. `data` will always be an `Array` containing only the valid items returned by `options.extract`. Note that `error` being a `ListValidationError` does not mean there is no `data`!
## Dependencies
- request - Simplified HTTP request client.
- cheerio - Tiny, fast, and elegant implementation of core jQuery designed specifically for the server.
- ajv - Another JSON Schema Validator.
## Dev dependencies

## Test

```
npm test
```

## License
MIT


