The most widely used NLP library in the enterprise
Backed by O'Reilly's most recent "AI Adoption in the Enterprise" survey (February 2019)
100% Open Source
Including pre-trained models and pipelines
Natively scalable
The only NLP library built natively on Apache Spark
Multiple Languages
Full Python, Scala, and Java support
Quick and Easy
Spark NLP is available on PyPI, Conda, Maven, and Spark Packages
# Install Spark NLP from PyPI
$ pip install spark-nlp==2.5.1
# Install Spark NLP from Anaconda/Conda
$ conda install -c johnsnowlabs spark-nlp
# Load Spark NLP with Spark Shell
$ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1
# Load Spark NLP with PySpark
$ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1
# Load Spark NLP with Spark Submit
$ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1
# Load Spark NLP as an external JAR after compiling and building Spark NLP with `sbt assembly`
$ spark-shell --jars spark-nlp-assembly-2.5.1.jar
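You can also configure the session yourself instead of passing --packages on the command line. A minimal sketch in Python, assuming PySpark is already installed; tune the memory setting for your environment:
# Start a SparkSession with Spark NLP on the classpath
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .config("spark.driver.memory", "8g") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1") \
    .getOrCreate()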
Right Out of The Box
Spark NLP ships with many NLP features, pre-trained models and pipelines
NLP Features
- Tokenization
- Stop Words Removal
- Normalizer
- Stemmer
- Lemmatizer
- NGrams
- Regex Matching
- Text Matching
- Chunking
- Date Matcher
- Part-of-speech tagging
- Sentence Detector
- Dependency parsing (labeled/unlabeled)
- Sentiment Detection (ML models)
- Spell Checker (ML and DL models)
- Word Embeddings (GloVe and Word2Vec)
- BERT Embeddings
- ELMo Embeddings
- Universal Sentence Encoder
- Sentence Embeddings
- Chunk Embeddings
- Multi-class Text Classification (DL model)
- Multi-class Sentiment Analysis (DL model)
- Named entity recognition (DL model)
- Easy TensorFlow integration
- Full integration with Spark ML functions (see the pipeline sketch after this list)
- 90+ pre-trained models in 21 languages!
- 70+ pre-trained pipelines in 10 languages!
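Every feature above is an annotator that composes with regular Spark ML stages. A minimal sketch of that integration; the column names here are free choices, not fixed API requirements:
# Chain a few annotators into a standard Spark ML Pipeline
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer, Stemmer
import sparknlp

spark = sparknlp.start()

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
stemmer = Stemmer().setInputCols(["normalized"]).setOutputCol("stem")
finisher = Finisher().setInputCols(["stem"])  # turns annotations back into plain strings

pipeline = Pipeline(stages=[documentAssembler, tokenizer, normalizer, stemmer, finisher])
df = spark.createDataFrame([["Annotators compose like any other Spark ML stage."]]).toDF("text")
pipeline.fit(df).transform(df).select("finished_stem").show(truncate=False)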
# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
# Start Spark Session with Spark NLP
spark = sparknlp.start()
# Download a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_dl', lang='en')
# Your testing dataset
text = """
The Mona Lisa is a 16th century oil painting created by Leonardo.
It's held at the Louvre in Paris.
"""
# Annotate your testing dataset
result = pipeline.annotate(text)
# What's in the pipeline
list(result.keys())
Output: ['entities', 'stem', 'checked', 'lemma', 'document',
'pos', 'token', 'ner', 'embeddings', 'sentence']
# Check the results
result['entities']
Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris']
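The annotate call above runs locally on a single string. The same pre-trained pipeline also scales out by calling transform on a Spark DataFrame; a short sketch reusing the objects defined above:
# Run the pipeline distributed on a DataFrame
df = spark.createDataFrame([[text]]).toDF("text")
annotated = pipeline.transform(df)
annotated.select("entities.result").show(truncate=False)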
Benchmark
Spark NLP 2.5.x achieves the best accuracy reported in peer-reviewed academic results
Training NER
- State-of-the-art Deep Learning algorithms
- Achieve high accuracy within a few minutes
- Achieve high accuracy with a few lines of code (see the training sketch after this list)
- Blazing fast training
- Use CPU or GPU
- Easy to choose Word Embeddings
- Pre-trained GloVe models
- Pre-trained BERT models (TF Hub)
- Pre-trained ELMo models (TF Hub)
- Pre-trained ALBERT models (TF Hub)
- Pre-trained XLNet models
- Multi-lingual NER models in Dutch, English, French, German, Italian, Norwegian, Polish, Portuguese, Russian, Spanish
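As a sketch of what training in a few lines of code looks like, using the CoNLL reader and pre-trained GloVe embeddings (the eng.train path is a placeholder for your own CoNLL-2003 formatted file):
# Train a deep-learning NER model on CoNLL-formatted data
from sparknlp.training import CoNLL
from sparknlp.annotator import NerDLApproach, WordEmbeddingsModel
import sparknlp

spark = sparknlp.start()

# The CoNLL reader yields document, sentence, token, pos, and label columns
training_data = CoNLL().readDataset(spark, "eng.train")  # placeholder path

# Pre-trained GloVe embeddings; BERT/ELMo/ALBERT/XLNet models swap in the same way
embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") \
    .setLabelColumn("label") \
    .setMaxEpochs(10)

ner_model = ner.fit(embeddings.transform(training_data))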
| SYSTEM | YEAR | LANGUAGE | ACCURACY |
|---|---|---|---|
| Spark NLP v2.4 | 2020 | Python/Scala/Java/R | 93.3 (test F1) - 95.9 (dev F1) |
| Spark NLP v2.x | 2019 | Python/Scala/Java/R | 93 |
| Spark NLP v1.x | 2018 | Python/Scala/Java/R | 92 |
| spaCy v2.x | 2017 | Python/Cython | 92.6 |
| spaCy v1.x | 2015 | Python/Cython | 91.8 |
| ClearNLP | 2015 | Java | 91.7 |
| CoreNLP | 2015 | Java | 89.6 |
| MATE | 2015 | Java | 92.5 |
| Turbo | 2015 | C++ | 92.4 |