datatoinfinity

Posted on Jun 18

Text Preprocessing using Regex - NLP

#nlp #machinelearning #devto #python

What is Preprocessing?

According to Google, preprocessing is the set of initial steps taken to prepare data for analysis or processing by a computer.
It involves cleaning, transforming, and organizing raw data into a usable format. This process is crucial for improving data quality, ensuring consistency, and making data more manageable for subsequent tasks like machine learning or data mining.

Now I'll give you my example:

First of all, data generally comes in the form of rows and columns.
Whenever you get data, there is a chance that a value is missing, that a value is wrong, or that the data type is off, for example a string where an integer should be.

A computer works on binary, and numbers are easy to convert to binary, so we convert text to numbers so that the computer can work with it.

There may also be duplicate values, which we need to remove.

This whole process is called preprocessing: we clean and transform the data so that it is easier to work with.
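
As a rough illustration of the cleaning steps just described (missing values, wrong types, duplicates), here is a minimal sketch in plain Python. The records, the field names `name` and `age`, and the helper `clean_records` are made up for this example; real projects often reach for a library such as pandas instead.

```python
# Hypothetical raw rows: one duplicate, one age stored as a string, one missing age.
raw_rows = [
    {"name": "Asha", "age": 24},
    {"name": "Ravi", "age": "31"},
    {"name": "Asha", "age": 24},
    {"name": "Meena", "age": None},
]

def clean_records(rows):
    """Drop rows with a missing age, cast age to int, and remove duplicates."""
    cleaned = []
    seen = set()
    for row in rows:
        if row["age"] is None:
            continue  # missing value: skip the row
        record = (row["name"], int(row["age"]))  # fix the data type
        if record in seen:
            continue  # duplicate: skip the row
        seen.add(record)
        cleaned.append({"name": record[0], "age": record[1]})
    return cleaned

cleaned_rows = clean_records(raw_rows)
print(cleaned_rows)
```

Dropping rows with missing values is only one possible choice here; filling them with a default is just as common.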

Text Preprocessing using Regex

1. Removing Special Characters

import re
txt="Hey I$hant &how!!! going$$$?"
print(re.findall(r'[^!$%*&?]+',txt))

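Inside the brackets [] we list the special characters to exclude; the leading ^ negates the set, so the pattern matches everything except those characters. Without a quantifier each match is a single character, and the result looks like ['H', 'e', 'y', ' ', 'I', 'h', 'a', 'n', 't', ' ', 'h', 'o', 'w', ' ', 'g', 'o', 'i', 'n', 'g']. To avoid that, add '+' after the brackets ([]+) so that each match covers a whole run of allowed characters.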

Output:
['Hey I', 'hant ', 'how', ' going']

As you can see, 'I$hant' is split into two pieces ('Hey I' and 'hant '). The solution is to join the pieces back together:

import re
txt="Hey I$hant &how!!! going$$$?"
print(''.join(re.findall(r'[^!$%*&?]+',txt)))

Output:
Hey Ihant how going
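
The same cleanup can also be written with re.sub, which replaces every match with a given string. This is a small sketch of an alternative rather than the only way to do it; the variable name `cleaned` is just for illustration.

```python
import re

txt = "Hey I$hant &how!!! going$$$?"
# Replace each run of the listed special characters with an empty string.
cleaned = re.sub(r'[!$%*&?]+', '', txt)
print(cleaned)  # Hey Ihant how going
```

Here the character set is not negated: we match the characters we want to drop instead of the ones we want to keep, so no join is needed.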

2. Exclusion

To extract a digit from text we use '\d', which matches a single digit. For a two-digit number we repeat it: '\d\d'.

import re
txt="I'm 24"
print(re.findall(r'\d\d',txt))

['24']
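
When the number of digits is not known in advance, writing one '\d' per digit does not scale. A sketch using the '+' quantifier (one or more digits) handles numbers of any length; the sample sentence is made up for this example.

```python
import re

txt = "I'm 24, my sister is 7 and we were born in 2001 and 2018"
# \d+ matches each run of one or more consecutive digits.
numbers = re.findall(r'\d+', txt)
print(numbers)  # ['24', '7', '2001', '2018']
```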

If we want to exclude the numbers and keep the text, we use '\D', which matches any character that is not a digit.

import re
txt="It took 24 year to make data to infinity"
print(''.join(re.findall(r'\D',txt)))

It took  year to make data to infinity
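
Notice that removing '24' leaves two spaces behind. One way to tidy that up, sketched here with re.sub, is to drop the digits and then collapse repeated whitespace; the variable names are only illustrative.

```python
import re

txt = "It took 24 year to make data to infinity"
no_digits = re.sub(r'\d+', '', txt)            # drop every run of digits
tidy = re.sub(r'\s+', ' ', no_digits).strip()  # collapse the leftover double space
print(tidy)  # It took year to make data to infinity
```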

3. Finding Patterns

To find alphanumeric characters (letters, digits, and the underscore) we use '\w', but here is the catch: a single '\w' matches only one character, and each extra '\w' we add ('\w\w', '\w\w\w', and so on) extends the match by one more character.

import re
txt="It took 24 year to make data-to-infinity"
print(re.findall(r'\w\w\w',txt))

['too', 'yea', 'mak', 'dat', 'inf', 'ini']

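It returns non-overlapping three-character slices, so only words with at least three characters show up, and a long word like 'infinity' can contribute more than one slice ('inf', 'ini').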
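
If the goal is whole words with at least three characters rather than three-character slices, one option is a counted quantifier with word boundaries. This is a sketch of an alternative, not what the snippet above does: '{3,}' means "three or more" and '\b' marks a word boundary.

```python
import re

txt = "It took 24 year to make data-to-infinity"
# \b anchors the match to word edges; \w{3,} needs three or more word characters.
long_words = re.findall(r'\b\w{3,}\b', txt)
print(long_words)  # ['took', 'year', 'make', 'data', 'infinity']
```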

Now, if I want the words that have a hyphen '-' at the end, we can do:

import re
txt="It took 24 year to make data-to-infinity"
print(re.findall(r'\w\w\w\w-',txt))

['data-']

'data' has four characters followed by a hyphen '-', so we write '\w\w\w\w-'.

The catch is that we have to write one '\w' for every character of the word. How do we solve that? Put '+' after '\w' to match one or more word characters:

import re
txt="It took 24 year to make data-to-infinity"
print(re.findall(r'[\w]+',txt))

['It', 'took', '24', 'year', 'to', 'make', 'data', 'to', 'infinity']

Now I want to match the pattern data-to-infinity.

import re
txt="It took 24 year to make data-to-infinity"
print(re.findall(r'[\w]+-[\w]+-[\w]+',txt))

['data-to-infinity']
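
The pattern above is tied to exactly two hyphens. To match hyphenated words with any number of parts, one possible sketch uses a repeated non-capturing group '(?:-\w+)+'; the extra words in the sample text are made up to show a second match.

```python
import re

txt = "It took 24 year to make data-to-infinity and a well-known story"
# \w+ is the first part; (?:-\w+)+ is one or more "-part" pieces after it.
hyphenated = re.findall(r'\w+(?:-\w+)+', txt)
print(hyphenated)  # ['data-to-infinity', 'well-known']
```

The group is non-capturing on purpose: with a plain capturing group, re.findall would return only the last captured piece instead of the whole match.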

import re
for number in ['657-3456-7890','345-789-4567','1234-987-3455']:
    print(re.findall(r'[\d]+-[\d]+-[\d]+',number)[0].replace('-',''))

Now you tell me: what's happening here?
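
If the aim is only to keep the digits of each number, the '\D' class from the Exclusion section offers another route. This sketch is an assumption about the intent, not a substitute for working through the question above.

```python
import re

numbers = ['657-3456-7890', '345-789-4567', '1234-987-3455']
# Replace every non-digit character with an empty string.
digits_only = [re.sub(r'\D', '', number) for number in numbers]
print(digits_only)  # ['65734567890', '3457894567', '12349873455']
```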
