Transforms PDF, Documents and Images into Enriched Structured Data



Turn your documents into data!

Français | Portuguese | Spanish | 中文

  • Parsr is a minimal-footprint document (image, PDF, DOCX, EML) cleaning, parsing, and extraction toolchain that generates readily available, organized, and usable data in JSON, Markdown (MD), CSV/Pandas DF, or TXT formats.

  • It provides analysts, data scientists, and developers with clean, structured, and label-enriched information sets for ready-to-use applications, ranging from data entry and document analysis automation to archival and many others.

  • Currently, Parsr can perform document cleaning, hierarchy regeneration (words, lines, paragraphs), detection of headings, tables, lists, ToCs, page numbers, headers/footers, links, and others. Check out all the features.

Getting Started

Installation

-- The advanced installation guide is available here --

The quickest way to install and run the Parsr API is through the Docker image:

docker pull axarev/parsr

If you also wish to install the GUI for sending documents and visualising results:

docker pull axarev/parsr-ui-localhost

Note: Parsr can also be installed bare-metal (not via Docker containers), the procedure for which is documented in the installation guide.

Usage

-- The advanced usage guide is available here --

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

which will launch it on http://localhost:3001.
Consult the documentation on the usage of the API.
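
As a rough sketch, the short Python script below checks that the container is accepting connections on the address given above (localhost, port 3001) before any document is sent. The helper names is_port_open and wait_for_port are hypothetical and are not part of Parsr or its Python client. An open port only shows that something is listening; it does not prove that the API has finished starting up.

```python
import socket
import time


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def wait_for_port(host: str, port: int, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll host:port until it accepts a connection or the attempts run out."""
    for _ in range(attempts):
        if is_port_open(host, port):
            return True
        time.sleep(delay)
    return False
```

For example, wait_for_port("localhost", 3001) returns True once the port published by the docker run command starts accepting connections, and False if it stays closed for all attempts.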

  1. To install the Python client for the Parsr API, issue:

    pip install parsr-client

    To try the Jupyter Notebook that uses the Python client, head over to the Jupyter demo.

  2. To use the GUI tool (the API must already be running), issue:
    docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
    Then, access it through http://localhost:8080.

Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.

API-based usage and command-line usage are documented in the advanced usage guide.

Documentation

All documentation files can be found here.

Contribute

Please refer to the contribution guidelines.

Third Party Licenses

Parsr depends on the following third-party libraries, listed with their licenses:

  1. QPDF: Apache 2.0 http://qpdf.sourceforge.net
  2. ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
  3. Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
  4. PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
  5. Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
  6. Camelot: MIT https://github.com/camelot-dev/camelot
  7. MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
  8. Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc

License

Copyright 2020 AXA Group Operations S.A.
Licensed under the Apache 2.0 license (see the LICENSE file).
