yolo-scraper

A simple way to structure your web scraper.


  • Define the request.
  • Extract the data from the response.
  • Validate the data against JSON Schema.

But what is web scraping?

install

Using NPM:

npm install yolo-scraper --save

usage

Define your scraper function.

var yoloScraper = require('yolo-scraper');

var scraper = yoloScraper.createScraper({

  request: function (username) {
    return 'https://www.npmjs.com/~' + username.toLowerCase();
  },

  extract: function (response, body, $) {
    return $('.collaborated-packages li').toArray().map(function (element) {
      var $element = $(element);
      return {
        name: $element.find('a').text(),
        url: $element.find('a').attr('href'),
        version: $element.find('strong').text()
      };
    });
  },

  schema: {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type" : "array",
    "items": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "name": { "type": "string" },
        "url": { "type": "string", "format": "uri" },
        "version": { "type": "string", "pattern": "^v\\d+\\.\\d+\\.\\d+$" }
      },
      "required": [ "name", "url", "version" ]
    }
  }

});

Then use it.

scraper('masterT', function (error, data) {
  console.log(error || data);
});

documentation

ValidationError

An Error instance with an additional Object property, errorObjects, which contains all the error information; see ajv error.

ListValidationError

An Error instance with an additional Array property, validationError, which contains ValidationError instances.

createScraper(options)

Returns a scraper function defined by the options.

var yoloScraper = require('yolo-scraper');

var options = {
  // ...
};
var scraper = yoloScraper.createScraper(options);

options.paramsSchema

The JSON schema that defines the shape of the accepted arguments passed to options.request. When the arguments are invalid, an Error is thrown.

Optional
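
As a minimal sketch, a paramsSchema for a scraper that takes a single username string could look like the following. The schema contents and the minLength rule are assumptions made for illustration; the other required options are left out.

```javascript
// Hypothetical schema: the scraper accepts one non-empty string (a username).
var paramsSchema = {
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "string",
  "minLength": 1
};

// Partial options object; `extract` and `schema` are omitted for brevity.
var options = {
  paramsSchema: paramsSchema,
  request: function (username) {
    return 'https://www.npmjs.com/~' + username.toLowerCase();
  }
};
```

With a schema like this, calling the scraper with a number or an empty string should be rejected before any network request is made.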

options.request = function(params)

Function that takes the arguments passed to your scraper function and returns the options to pass to the request module to make the network request.

Required
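
Besides a plain URL string, the request module also accepts an options object. A sketch of a request function that returns one is shown below; the endpoint, the query parameter name, and the User-Agent value are placeholders, not part of yolo-scraper.

```javascript
// Build the options for the request module from the scraper params.
// `url`, `qs` and `headers` are standard request options; the values
// used here are hypothetical.
function buildRequest(params) {
  return {
    url: 'https://example.com/search',
    qs: { q: params.query },
    headers: { 'User-Agent': 'my-scraper/1.0' }
  };
}

// Partial options object; `extract` and `schema` are omitted for brevity.
var options = {
  request: buildRequest
};
```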

options.extract = function(response, body, $)

Function that takes the request response, the response body (String), and a cheerio instance. It returns the data you want to extract.

Required
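
The extract function does not have to use cheerio. The sketch below reads a JSON body instead, assuming that response exposes statusCode the way the request module's response does. The payload shape (a packages array with name and version) is made up for the example.

```javascript
// Extract data from a JSON response body; `$` is unused here.
// Assumption: `response.statusCode` is available, as on the request
// module's response object.
function extract(response, body, $) {
  if (response.statusCode !== 200) {
    return [];
  }
  var payload = JSON.parse(body); // hypothetical shape: { packages: [...] }
  return payload.packages.map(function (pkg) {
    return { name: pkg.name, version: 'v' + pkg.version };
  });
}
```

Note that JSON.parse throws on a malformed body, so you may want to wrap it in a try/catch depending on how you prefer to surface that failure.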

options.schema

The JSON schema that defines the shape of your extracted data. When your data is invalid, an Error with the validation message will be passed to your scraper callback.

Required
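
For a scraper whose extract returns a single object rather than a list, the schema could be sketched as follows. The property names are assumptions. Since a JSON Schema pattern is a regular expression, it can be tried on its own before it is wired into a scraper.

```javascript
// Hypothetical schema for a single extracted object.
var schema = {
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "name": { "type": "string", "minLength": 1 },
    "version": { "type": "string", "pattern": "^v\\d+\\.\\d+\\.\\d+$" }
  },
  "required": [ "name", "version" ]
};

// Check the version pattern in isolation.
var versionPattern = new RegExp(schema.properties.version.pattern);
console.log(versionPattern.test('v1.2.3')); // true
console.log(versionPattern.test('1.2.3'));  // false (missing the "v" prefix)
```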

options.cheerioOptions

The options to pass to cheerio when it loads the response body.

Optional, default: {}
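
For example, when scraping an XML document such as an RSS feed, cheerio's xmlMode parser option may be useful. This fragment assumes the object is forwarded to cheerio unchanged.

```javascript
// Partial options object; `request`, `extract` and `schema` are omitted.
var options = {
  cheerioOptions: {
    xmlMode: true // parse the body as XML instead of HTML
  }
};
```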

options.ajvOptions

The options to pass to ajv when it compiles the JSON schemas.

Optional, default: {allErrors: true} - It checks all rules and collects all errors.
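
A possible custom value is sketched below. Whether a custom object is merged with the default or replaces it is not stated here, so allErrors is repeated explicitly; verbose is an ajv option that adds the schema and data references to each error.

```javascript
// Partial options object; `request`, `extract` and `schema` are omitted.
var options = {
  ajvOptions: {
    allErrors: true, // keep the default behaviour explicitly
    verbose: true    // include schema and data details in each error
  }
};
```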

options.validateList

Use this option to validate each item of the extracted data individually. When true, the extracted data must be an Array; otherwise an Error is passed to the callback.

Optional, default: false
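
To make the idea of per-item validation concrete, here is a stand-alone illustration. It is not the library's implementation: validateEach, isValid and invalidItems are hypothetical names, and a simple predicate stands in for the JSON schema check.

```javascript
// Illustration only: split items into valid and invalid ones.
// `isValid` is a hypothetical predicate playing the role of the schema.
function validateEach(items, isValid) {
  if (!Array.isArray(items)) {
    return { error: new Error('extracted data must be an Array'), data: [] };
  }
  var valid = [];
  var invalid = [];
  items.forEach(function (item) {
    (isValid(item) ? valid : invalid).push(item);
  });
  var error = null;
  if (invalid.length > 0) {
    error = new Error(invalid.length + ' item(s) failed validation');
    error.invalidItems = invalid; // hypothetical property, for illustration
  }
  return { error: error, data: valid };
}

var result = validateEach(
  [{ name: 'a' }, { name: 42 }],
  function (item) { return typeof item.name === 'string'; }
);
console.log(result.data);          // [ { name: 'a' } ]
console.log(result.error.message); // 1 item(s) failed validation
```

The point of the sketch is that one bad item does not discard the good ones.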

scraper function

To use your scraper function, pass the params to send to options.request, and a callback function.

scraper(params, function (error, data) {
  if (error) {
    // handle the `error`
  } else {
    // do something with `data`
  }
});

callback(error, data)

  • When a network request error occurs, the callback error argument will be an Error and data will be null.

  • When options.validateList = false and a validation error occurs, error will be a ValidationError and data will be null. Otherwise, error will be null and data will be the value returned by options.extract.

  • When options.validateList = true and validation errors occur, error will be a ListValidationError; otherwise it will be null. If the value returned by options.extract is not an Array, error will be an instance of Error. The data will always be an Array that contains only the valid items returned by options.extract. A ListValidationError does not mean there is no data!
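
The cases above can be told apart in a callback roughly as follows. This sketch relies only on the documented errorObjects and validationError properties rather than on instanceof checks; classify is a hypothetical helper.

```javascript
// Classify the callback arguments using the documented error properties.
function classify(error, data) {
  if (!error) {
    return 'success';
  }
  if (Array.isArray(error.validationError)) {
    // ListValidationError: `data` may still hold the valid items.
    var validCount = (data || []).length;
    return 'partial: ' + validCount + ' valid, ' +
      error.validationError.length + ' invalid';
  }
  if (error.errorObjects) {
    // ValidationError: the extracted data did not match the schema.
    return 'invalid';
  }
  // Any other Error, for example a network failure.
  return 'failed';
}
```

It could be used as `scraper(params, function (error, data) { console.log(classify(error, data)); });`.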

dependencies

  • request - Simplified HTTP request client.
  • cheerio - Tiny, fast, and elegant implementation of core jQuery designed specifically for the server.
  • ajv - Another JSON Schema Validator.

dev dependencies

  • jasmine - DOM-less simple JavaScript testing framework.
  • nock - HTTP server mocking for Node.js.

test

npm test

license

MIT
