Parsr is a minimal-footprint document (image, PDF, DOCX, EML) cleaning, parsing and extraction toolchain that generates organized, readily usable data in JSON, Markdown (MD), CSV/Pandas DF or TXT formats.
It provides analysts, data scientists and developers with a clean, structured and label-enriched information set for ready-to-use applications ranging from data entry and document analysis automation to archival and many others.
Currently, Parsr can perform document cleaning, hierarchy regeneration (words, lines, paragraphs), detection of headings, tables, lists, ToCs, page numbers, headers/footers, links, and others. Check out all the features.

Getting Started

Installation

-- The advanced installation guide is available here --

The quickest way to install and run the Parsr API is through the Docker image:

docker pull axarev/parsr

If you also wish to install the GUI for sending documents and visualising results:

docker pull axarev/parsr-ui-localhost

Note: Parsr can also be installed bare-metal (not via Docker containers), the procedure for which is documented in the installation guide.

Usage

-- The advanced usage guide is available here --

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

which will launch it on http://localhost:3001.
Consult the documentation on the usage of the API.
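
Before sending documents, it can help to confirm that the container is actually listening on the published port. The following is a minimal sketch using only the Python standard library; the function name `is_api_reachable` is hypothetical and not part of Parsr, and the check only tests that a TCP connection to `localhost:3001` succeeds, not that the API itself is healthy.

```python
import socket


def is_api_reachable(host: str = "localhost", port: int = 3001,
                     timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: it only checks that something accepts connections on
    the port published by `docker run -p 3001:3001 ...`; it does not call any
    Parsr endpoint.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError, timeout)
        # when nothing is listening or the host is unreachable.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (uncomment once the container has been started):
# print("Parsr API reachable:", is_api_reachable())
```

If the function returns `False` right after `docker run`, the container may still be starting, so retrying after a few seconds is a reasonable next step.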

To install the Python client for the Parsr API, run:
```
pip install parsr-client
```
To try the Jupyter notebook that uses the Python client, head over to the Jupyter demo.

To use the GUI tool (the API needs to already be running), issue:
```
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
```
Then, access it through http://localhost:8080.

Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.

API-based usage and command-line usage are documented in the advanced usage guide.

Documentation

All documentation files can be found here.

Contribute

Please refer to the contribution guidelines.

Third Party Licenses

Licenses of the third-party libraries that Parsr depends on:

QPDF: Apache http://qpdf.sourceforge.net
ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
Camelot: MIT https://github.com/camelot-dev/camelot
MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc

License

Copyright 2020 AXA Group Operations S.A.
Licensed under the Apache 2.0 license (see the LICENSE file).
