The most widely used NLP library in the enterprise
Backed by O'Reilly's most recent "AI Adoption in the Enterprise" survey (February 2019)
100% Open Source
Including pre-trained models and pipelines
Natively scalable
The only NLP library built natively on Apache Spark
Multiple Languages
Full Python, Scala, and Java support
Quick and Easy
Spark NLP is available on PyPI, Conda, Maven, and Spark Packages
# Install Spark NLP from PyPI
$ pip install spark-nlp==2.5.1
# Install Spark NLP from Anaconda/Conda
$ conda install -c johnsnowlabs spark-nlp
# Load Spark NLP with Spark Shell
$ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1
# Load Spark NLP with PySpark
$ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1
# Load Spark NLP with Spark Submit
$ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1
# Load Spark NLP as an external JAR after compiling and building Spark NLP with `sbt assembly`
$ spark-shell --jars spark-nlp-assembly-2.5.1.jar
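You can also configure the session yourself instead of passing --packages on the command line. A minimal sketch in Python, assuming PySpark is already installed; tune the memory setting for your environment:
# Start a SparkSession with Spark NLP on the classpath
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .config("spark.driver.memory", "8g") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1") \
    .getOrCreate()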
Right Out of The Box
Spark NLP ships with many NLP features, pre-trained models and pipelines
NLP Features
- Tokenization
- Stop Words Removal
- Normalizer
- Stemmer
- Lemmatizer
- NGrams
- Regex Matching
- Text Matching
- Chunking
- Date Matcher
- Part-of-speech tagging
- Sentence Detector
- Dependency parsing (labeled/unlabeled)
- Sentiment Detection (ML models)
- Spell Checker (ML and DL models)
- Word Embeddings (GloVe and Word2Vec)
- BERT Embeddings
- ELMo Embeddings
- Universal Sentence Encoder
- Sentence Embeddings
- Chunk Embeddings
- Multi-class Text Classification (DL model)
- Multi-class Sentiment Analysis (DL model)
- Named entity recognition (DL model)
- Easy TensorFlow integration
- Full integration with Spark ML functions (see the pipeline sketch after this list)
- 90+ pre-trained models in 21 languages!
- 70+ pre-trained pipelines in 10 languages!
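Every feature above is an annotator that composes with regular Spark ML stages. A minimal sketch of that integration; the column names here are free choices, not fixed API requirements:
# Chain a few annotators into a standard Spark ML Pipeline
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer, Stemmer
import sparknlp

spark = sparknlp.start()

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
stemmer = Stemmer().setInputCols(["normalized"]).setOutputCol("stem")
finisher = Finisher().setInputCols(["stem"])  # turns annotations back into plain strings

pipeline = Pipeline(stages=[documentAssembler, tokenizer, normalizer, stemmer, finisher])
df = spark.createDataFrame([["Annotators compose like any other Spark ML stage."]]).toDF("text")
pipeline.fit(df).transform(df).select("finished_stem").show(truncate=False)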
# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
# Start Spark Session with Spark NLP
spark = sparknlp.start()
# Download a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_dl', lang='en')
# Your testing dataset
text = """
The Mona Lisa is a 16th century oil painting created by Leonardo.
It's held at the Louvre in Paris.
"""
# Annotate your testing dataset
result = pipeline.annotate(text)
# What's in the pipeline
list(result.keys())
Output: ['entities', 'stem', 'checked', 'lemma', 'document',
'pos', 'token', 'ner', 'embeddings', 'sentence']
# Check the results
result['entities']
Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris']
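The annotate call above runs locally on a single string. The same pre-trained pipeline also scales out by calling transform on a Spark DataFrame; a short sketch reusing the objects defined above:
# Run the pipeline distributed on a DataFrame
df = spark.createDataFrame([[text]]).toDF("text")
annotated = pipeline.transform(df)
annotated.select("entities.result").show(truncate=False)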
Benchmark
Spark NLP 2.5.x achieves the best accuracy reported in peer-reviewed academic results
Training NER
- State-of-the-art Deep Learning algorithms
- Achieve high accuracy within a few minutes
- Achieve high accuracy with a few lines of code (see the training sketch after this list)
- Blazing fast training
- Use CPU or GPU
- Easy to choose Word Embeddings
- Pre-trained GloVe models
- Pre-trained BERT models (TF Hub)
- Pre-trained ELMo models (TF Hub)
- Pre-trained ALBERT models (TF Hub)
- Pre-trained XLNet models
- Multi-lingual NER models in Dutch, English, French, German, Italian, Norwegian, Polish, Portuguese, Russian, Spanish
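As a sketch of what training in a few lines of code looks like, using the CoNLL reader and pre-trained GloVe embeddings (the eng.train path is a placeholder for your own CoNLL-2003 formatted file):
# Train a deep-learning NER model on CoNLL-formatted data
from sparknlp.training import CoNLL
from sparknlp.annotator import NerDLApproach, WordEmbeddingsModel
import sparknlp

spark = sparknlp.start()

# The CoNLL reader yields document, sentence, token, pos, and label columns
training_data = CoNLL().readDataset(spark, "eng.train")  # placeholder path

# Pre-trained GloVe embeddings; BERT/ELMo/ALBERT/XLNet models swap in the same way
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") \
    .setLabelColumn("label") \
    .setMaxEpochs(10)

ner_model = ner.fit(embeddings.transform(training_data))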
| SYSTEM | YEAR | LANGUAGE | ACCURACY |
|---|---|---|---|
| Spark NLP v2.4 | 2020 | Python/Scala/Java/R | 93.3 (test F1) - 95.9 (dev F1) |
| Spark NLP v2.x | 2019 | Python/Scala/Java/R | 93 |
| Spark NLP v1.x | 2018 | Python/Scala/Java/R | 92 |
| spaCy v2.x | 2017 | Python/Cython | 92.6 |
| spaCy v1.x | 2015 | Python/Cython | 91.8 |
| ClearNLP | 2015 | Java | 91.7 |
| CoreNLP | 2015 | Java | 89.6 |
| MATE | 2015 | Java | 92.5 |
| Turbo | 2015 | C++ | 92.4 |