DEV Community

datatoinfinity

Stop Words Using Spacy - NLP

You are probably already familiar with stop words, and we have handled them with NLTK before. Now we are going to do the same with spaCy.

We are going to remove stop words with the help of the spaCy package.

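import spacy
# Load the small English pipeline (it has to be downloaded once: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
# Print the default English stop words, a blank line, and then how many there are
print(nlp.Defaults.stop_words)
print()
print(len(nlp.Defaults.stop_words))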
Output:
{'‘ve', 'below', 'anywhere', 'then', 'nowhere', 'around', 'been', 'whence', 'next', 'go', 'therein', 'while', 'became', 'also', 'whereafter', 'whole', 'about', 'regarding', 'who', 'by', 'ever', 'was', 'cannot', 'throughout', 'the', 'latter', 'please', 'although', "'ve", 'though', 'still', 'doing', 'beyond', 'sometimes', 'on', 'whereas', 'see', 'moreover', 'top', 'within', 'along', "n't", 'herein', 'hereby', 'them', 'always', 'themselves', 'why', 'someone', 'toward', 'should', 'her', 'n‘t', 'this', 'across', 'anyway', 'at', 'various', 'now', 'empty', 'rather', 'until', 'ten', 'mostly', 'already', 'one', 'amount', 'becoming', 'our', 'namely', 'fifty', 'with', 'just', 'two', 'wherever', 'from', 'your', "'ll", 'front', 'amongst', 'him', 'have', 'nine', 'might', 'no', '‘s', 'almost', 'own', 'name', 'too', 'four', 'which', 'yourselves', 'yourself', 'whose', 'behind', 'seem', 'afterwards', 'something', 'his', 'not', 'did', 'had', 'itself', 'enough', 'or', 'i', 'such', 'eight', 'least', "'d", 'unless', 'if', '’m', 'formerly', 'whatever', 'all', 'indeed', 'those', 'wherein', 'sixty', "'re", 'whenever', 'are', 'there', 'through', 're', 'against', 'being', 'show', 'keep', 'off', 'could', 'otherwise', 'therefore', 'does', 'a', 'either', 'give', 'most', 'thus', 'that', 'however', 'us', 'ourselves', 'their', 'get', 'former', 'he', 'it', 'anyone', 'thereby', 'call', 'per', '‘d', 'put', 'sometime', 'five', 'n’t', 'you', 'because', 'done', 'any', 'more', 'hers', '’s', 'above', 'back', 'both', 'nor', 'mine', 'than', 'these', 'beforehand', 'we', 'anything', 'further', "'m", 'seeming', 'many', 'nevertheless', 'becomes', 'six', 'never', 'alone', 'quite', 'take', 'neither', 'whereby', '‘m', 'whither', 'will', 'myself', 'hereupon', 'become', 'so', 'my', 'last', 'whereupon', 'latterly', 'since', 'were', 'they', 'whether', 'of', 'how', 'using', 'thereupon', 'herself', 'an', 'third', 'yet', 'ours', 'before', 'am', 'onto', 'here', 'noone', 'often', 'every', 'she', 'seemed', 'side', 'really', 
'‘ll', 'bottom', 'between', 'may', 'out', 'would', 'during', '‘re', 'towards', 'where', 'to', 'without', 'somehow', 'whoever', 'thence', 'down', 'is', 'seems', '’re', 'upon', 'very', 'everywhere', 'thru', 'each', 'twelve', 'via', 'somewhere', 'up', 'has', 'as', 'others', 'few', 'together', 'made', 'for', 'hereafter', 'else', 'other', 'hundred', 'can', 'yours', 'in', 'nobody', 'used', 'be', 'do', 'me', 'three', 'hence', 'must', 'nothing', 'and', 'but', 'serious', 'under', 'make', 'anyhow', 'meanwhile', 'much', 'even', 'fifteen', 'himself', 'some', 'what', 'into', 'thereafter', 'due', 'well', 'everyone', 'first', 'full', 'perhaps', 'over', 'several', "'s", 'everything', 'whom', 'again', 'elsewhere', '’d', 'forty', '’ll', 'less', 'none', 'except', 'say', 'once', 'besides', 'move', 'when', 'ca', 'its', 'after', 'another', 'same', 'beside', 'only', 'eleven', 'twenty', '’ve', 'among', 'part'}

326

These are all the default English stop words in the spaCy package, and there are 326 of them.
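If we only need the list itself, we do not have to load a pipeline at all. The sketch below assumes spaCy is installed and imports the English stop words directly from `spacy.lang.en.stop_words`. A set has no fixed order, so we sort it before printing a few entries. The exact count can differ between spaCy versions, so it may not match the number above.

```python
# The English stop words live in a plain Python set inside spaCy.
from spacy.lang.en.stop_words import STOP_WORDS

# Sets are unordered, so sort them to get a stable, readable listing.
sorted_stop_words = sorted(STOP_WORDS)

print(len(sorted_stop_words))   # how many stop words this spaCy version ships
print(sorted_stop_words[:10])   # the first ten in alphabetical order
```

This is the same set that `nlp.Defaults.stop_words` exposes for an English pipeline.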

Check whether a word is a stop word.

import spacy
nlp=spacy.load('en_core_web_sm')
print(nlp.vocab['How'].is_stop)
print(nlp.vocab['are'].is_stop)
print(nlp.vocab['you'].is_stop)
print(nlp.vocab['Data'].is_stop)
Output:
True
True
True
False

is_stop returns True for stop words and False for words that are not stop words.
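Notice that 'How' with a capital H was still reported as a stop word. Here is a small sketch that checks several spellings at once. It assumes a recent spaCy version, where the lookup ignores case, and it uses a blank English pipeline (`spacy.blank("en")`) so that no model download is needed. The helper name `stop_word_flags` is made up for this example.

```python
import spacy

# Assumption: a blank English pipeline is enough, because is_stop is a
# lexical attribute and does not depend on a trained model.
nlp = spacy.blank("en")


def stop_word_flags(words):
    """Return a dict that maps each word to its is_stop flag."""
    return {word: nlp.vocab[word].is_stop for word in words}


flags = stop_word_flags(["How", "how", "HOW", "Data"])
print(flags)
```

With `spacy.load('en_core_web_sm')` in place of the blank pipeline the helper should behave the same way.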

Adding stop words of our own.

print(nlp.vocab['i.e'].is_stop)    # not a stop word yet
nlp.vocab['i.e'].is_stop = True    # mark it as a stop word
print(nlp.vocab['i.e'].is_stop)    # now it is reported as a stop word
Output:
False
True

The first print gives False because we have not added i.e to our stop words yet. After we set is_stop to True, the second print gives True.
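When we have more than one word to add, a loop keeps it short. This is only a sketch: the word list is made up for illustration, and it assumes a blank English pipeline so it runs without a downloaded model. It also adds each word to `nlp.Defaults.stop_words`, so the printed list and the `is_stop` flag stay in sync.

```python
import spacy

# Assumption: a blank English pipeline, since stop-word flags do not
# need a trained model.
nlp = spacy.blank("en")

# Hypothetical custom stop words, chosen only for this example.
custom_stop_words = ["btw", "etc", "i.e"]

for word in custom_stop_words:
    # Keep the default set up to date so printing or counting it includes the word.
    nlp.Defaults.stop_words.add(word)
    # Set the flag on the vocabulary entry, which is what token.is_stop reads.
    nlp.vocab[word].is_stop = True

flags = {word: nlp.vocab[word].is_stop for word in custom_stop_words}
print(flags)
```

Keep in mind that `nlp.Defaults.stop_words` is shared, so words added this way stay in the set for the rest of the Python session.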

Let's find the stop words in a corpus.

txt='''Data science is the study of data. Like biological sciences is a study of biology, physical sciences, it's the study of physical reactions. Data is real, data has real properties, and we need to study them if we're going to work on them. Data Science involves data and some signs. It is a process, not an event. It is the process of using data to understand too many different things, to understand the world. Let Suppose when you have a model or proposed explanation of a problem, and you try to validate that proposed explanation or model with your data. It is the skill of unfolding the insights and trends that are hiding (or abstract) behind data. It's when you translate data into a story. So use storytelling to generate insight. And with these insights, you can make strategic choices for a company or an institution. We can also define data science as a field that is about processes and systems to extract data of various forms and from various resources whether the data is unstructured or structured.
The definition and the name came up in the 1980s and 1990s when some professors, IT Professionals, scientists were looking into the statistics curriculum, and they thought it would be better to call it data science and then later on data analytics derived.
'''
# Collapse line breaks and repeated spaces into single spaces, so that no two words get joined together
txt = ' '.join(txt.split())

doc = nlp(txt)

stop_words = set()

for token in doc:
    if token.is_stop:
        stop_words.add(token.text)

print(stop_words)
print(len(stop_words))
Output:
{'they', 'The', "'s", 'can', 'about', 'an', 'many', 'on', 'call', 'some', 'would', 'if', 'when', 'up', 'and', 'too', 'the', 'not', 'various', 'in', 'you', 'make', 'with', 'using', 'or', 'for', 'of', 'we', 'were', 'to', 'as', 'name', 'IT', 'And', 'also', 'whether', 'your', 'it', "'re", 'into', 'these', 'has', 'them', 'is', 'It', 'have', 'a', 'that', 'be', 'from', 'behind', 'then', 'are', 'So', 'We'}

55
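In the set above, 'The' and 'the', or 'It' and 'it', are counted as different entries because `token.text` keeps the original casing. If we want each stop word only once, and also how often it appears, we can lowercase with `token.lower_` and count with `collections.Counter`. The sketch below uses a short made-up sentence and a blank English pipeline. Both are assumptions for the sake of a small example.

```python
from collections import Counter

import spacy

# Assumption: a blank English pipeline is enough for tokenizing and is_stop.
nlp = spacy.blank("en")

# A short sample sentence, used here instead of the full corpus.
sample = "The data is real and the data has real properties."
doc = nlp(sample)

# Lowercase each stop word so that 'The' and 'the' are counted together.
stop_word_counts = Counter(token.lower_ for token in doc if token.is_stop)

print(stop_word_counts.most_common())
```

The same Counter line works on the `doc` built from the full corpus.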

Now print the corpus without stop words.

print(' '.join([token.text for token in doc if not token.is_stop]))
Output:
Data science study data . Like biological sciences study biology , physical sciences , study physical reactions . Data real , data real properties , need study going work . Data Science involves data signs . process , event . process data understand different things , understand world . Let Suppose model proposed explanation problem , try validate proposed explanation model data . skill unfolding insights trends hiding ( abstract ) data . translate data story . use storytelling generate insight . insights , strategic choices company institution . define data science field processes systems extract data forms resources data unstructured structured . definition came 1980s 1990s professors , Professionals , scientists looking statistics curriculum , thought better data science later data analytics derived .
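The output still contains punctuation such as '.' and ',' as separate tokens. If we want to drop those as well, we can also check `token.is_punct`. Below is a minimal sketch wrapped in a function. The name `remove_stop_words` and its `drop_punctuation` parameter are made up for this example, and it assumes a blank English pipeline.

```python
import spacy

# Assumption: a blank English pipeline, since is_stop and is_punct are
# lexical attributes and do not need a trained model.
nlp = spacy.blank("en")


def remove_stop_words(text, drop_punctuation=True):
    """Return the text without stop words and, optionally, without punctuation."""
    doc = nlp(text)
    kept = [
        token.text
        for token in doc
        if not token.is_stop and not (drop_punctuation and token.is_punct)
    ]
    return " ".join(kept)


sentence = "It is the process of using data to understand the world."
cleaned = remove_stop_words(sentence)
cleaned_with_punctuation = remove_stop_words(sentence, drop_punctuation=False)

print(cleaned)
print(cleaned_with_punctuation)
```

Passing `drop_punctuation=False` keeps the behaviour of the one-liner above, so we can compare both results.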
