Short Text Pre-processing

Question

For educational purposes, I am preprocessing multiple short texts that describe the symptoms of car faults. The texts are written by humans and contain many misspellings, inconsistent capitalization and other noise.

I wanted to write a short pre-processing function, and I have three questions:

1. Why do I get two different results depending on how I format the re.escape() call? (The first version below is the correct one.)
2. Can I adapt this section to f-string formatting: re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
3. Is there any way I can improve the readability and performance of this code?

example = "This, is just an example! Nothing serious :) "

#convert to lowercase, strip and remove punctuations
def preprocess(text):
    """convert to lowercase, strip and remove punctuations"""
    text = text.lower()
    text=text.strip()
    text=re.compile('<.*?>').sub('', text)
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    text = re.sub('\s+', ' ', text)
    text = re.sub(r'\[[0-9]*\]',' ',text)
    text=re.sub(r'[^\w\s]', '', str(text).lower().strip())
    text = re.sub(r'\d',' ',text)
    text = re.sub(r'\s+',' ',text)
    return text

The wrong one:

#convert to lowercase, strip and remove punctuations
def preprocess(text):
    """convert to lowercase, strip and remove punctuations"""
    text = text.lower()
    text=text.strip()
    text=re.compile('<.*?>').sub('', text)
    escaping = re.escape(string.punctuation)
    test = re.compile('[{}s]'.format(escaping)).sub(' ',text)
    text = re.sub('\s+', ' ', text)
    text = re.sub(r'\[[0-9]*\]',' ',text)
    text=re.sub(r'[^\w\s]', '', str(text).lower().strip())
    text = re.sub(r'\d',' ',text)
    text = re.sub(r'\s+',' ',text)
    return text

string.punctuation leads to NameError: name 'string' is not defined. Am I missing some imports?
– JosefZ, Commented Dec 17, 2022 at 20:28
I'm afraid this question does not match what this site is about. Code Review is about improving existing, working code. Code Review is not the site to ask for help in fixing or changing what your code does. Anyway, try print(preprocess(example + string.punctuation)) to see what's wrong in the 2nd output…
– JosefZ, Commented Dec 19, 2022 at 18:54
I probably phrased my question poorly, but points 2 and 3 are suitable for Code Review because the code works :) I left out the imports to be more concise, but I can add them. I agree that point 1 is off-topic and better suited to Stack Overflow.
– Andrea Ciufo, Commented Dec 30, 2022 at 18:54
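
For readers following point 1, here is a minimal sketch of where the two versions in the question diverge. It is an illustration only, not part of the original posts; the variable names `pattern_percent`, `pattern_format`, `pattern_fstring` and the sample strings are made up for the demonstration.

```python
import re
import string

escaped = re.escape(string.punctuation)

# First version: %-formatting consumes the whole "%s" placeholder.
pattern_percent = '[%s]' % escaped
# Second version: "{}" is the whole placeholder, so the "s" left over from
# "%s" stays in the character class as a literal letter.
pattern_format = '[{}s]'.format(escaped)
# f-string equivalent of the first version.
pattern_fstring = f'[{escaped}]'

print(pattern_percent == pattern_fstring)  # True
print(pattern_format == pattern_percent)   # False

sample = "misfires, stalls!"
# With the stray "s" in the class, every letter "s" is replaced as well.
print(repr(re.sub(pattern_percent, ' ', sample)))
print(repr(re.sub(pattern_format, ' ', sample)))

# The second version also stores the result in `test`, so `text` keeps
# its punctuation until the later r'[^\w\s]' step deletes it outright.
text = "won't start"
test = re.sub(pattern_format, ' ', text)
print(repr(text))  # unchanged
```

Because the assignment goes to `test`, the stray `s` never reaches the output of the second function. The visible difference comes from punctuation being deleted rather than replaced by a space, so "won't" ends up as "wont" instead of "won t".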

Polar Shift · Accepted Answer · 2022-12-23 06:09:41Z

To be PEP 8 compliant, you may wish to review your spacing. Specifically, text=text.strip() should become text = text.strip(), with spaces surrounding the assignment operator. This is done in some locations within your code, but not others - I would recommend consistency.

Some parts of your code are redundant. In the statement text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text) you are replacing square brackets (along with every other punctuation character) with spaces. In a later line, text = re.sub(r'\[[0-9]*\]',' ',text), you are removing digits which are surrounded by square brackets. But since the square brackets are already gone by then, that pattern will never find anything to match!
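
A small sketch of that redundancy, using a made-up sample string: once the punctuation step has run, the bracket pattern has nothing left to find.

```python
import re
import string

text = "error code [42] shown"

# Punctuation step: "[" and "]" are part of string.punctuation.
no_punct = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)
print(repr(no_punct))  # the brackets have become spaces

# The bracketed-digits pattern can no longer match anything.
match = re.search(r'\[[0-9]*\]', no_punct)
print(match)  # None
```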

Also, be aware that \ is an escape character in Python string literals. When you wish to use it as itself within a regular expression, either escape it as \\ or use a raw string. This applies to this line of code: text = re.sub('\s+', ' ', text)
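
To illustrate: '\s+' only works today because \s is not an escape sequence that Python itself recognises, and recent Python versions emit a warning for such invalid escapes. The sketch below, with illustrative variable names and sample text, shows that the doubled backslash and the raw string produce the same pattern, and why it matters for escapes Python does recognise.

```python
import re

escaped_literal = '\\s+'   # backslash doubled in a normal string literal
raw_literal = r'\s+'       # raw string: backslash is kept as written

print(escaped_literal == raw_literal)  # True

# For escapes Python does recognise, the difference is real:
print(len('\b'))    # 1: a single backspace character
print(len(r'\b'))   # 2: backslash + "b", the regex word boundary

collapsed = re.sub(raw_literal, ' ', 'engine   stalls\n\tat idle')
print(collapsed)    # engine stalls at idle
```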

re.compile() followed immediately by .sub() could just be re.sub(), since you are not saving the compiled regular expression to use again.
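
If you do want to keep re.compile(), the usual way to benefit from it is to compile once at module level and reuse the pattern on every call. A minimal sketch follows; the names PUNCTUATION_RE, WHITESPACE_RE and strip_punctuation are hypothetical, chosen only for this example.

```python
import re
import string

# Hypothetical module-level names; compiled once, reused on every call.
PUNCTUATION_RE = re.compile(f'[{re.escape(string.punctuation)}]')
WHITESPACE_RE = re.compile(r'\s+')


def strip_punctuation(text: str) -> str:
    """Replace punctuation with spaces, then collapse whitespace runs."""
    text = PUNCTUATION_RE.sub(' ', text)
    return WHITESPACE_RE.sub(' ', text).strip()


print(strip_punctuation("Brakes squeal, badly!"))  # Brakes squeal badly
```

The re module keeps its own cache of recently compiled patterns, so the speed difference is likely to be small for a handful of patterns; the main gain is that each pattern gets a descriptive name.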

Because some characters are replaced with spaces, new leading or trailing whitespace can appear after you have stripped - if your text ended in a number, it would be replaced by a space at the end of your string. You want text = text.strip() to be one of the last things your code does.
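
A quick sketch of the ordering issue, with a made-up input: stripping before the digit replacement leaves a trailing space behind, while stripping afterwards does not.

```python
import re

text = "fault code 7"

early = re.sub(r'\d', ' ', text.strip())  # strip first, replace after
late = re.sub(r'\d', ' ', text).strip()   # replace first, strip last

print(repr(early))  # ends with whitespace left by the replaced digit
print(repr(late))   # 'fault code'
```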

text=re.sub(r'[^\w\s]', '', str(text).lower().strip()) has redundancy - you already converted everything to lowercase, so you do not need another .lower() here. Your variable, text, is already a string, and so str(text) is converting it unnecessarily. As mentioned, you want .strip() at the end - if doing so, the one in this block of code is not needed.

You should use type hints: def preprocess(text: str) -> str: documents that the function takes a string and returns a string.

Reworked code:

import string
import re

def preprocess(text: str) -> str:
    """convert to lowercase, strip and remove punctuations"""

    text = text.lower()
    text = re.sub('<.*?>', '', text)
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'[\d\s]+', ' ', text)
    text = text.strip()

    return text
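
As a usage sketch, the reworked function can be run against the example from the question. The function is repeated here only so the snippet runs on its own; the second input string is made up for illustration.

```python
import re
import string


def preprocess(text: str) -> str:
    """convert to lowercase, strip and remove punctuations"""
    text = text.lower()
    text = re.sub('<.*?>', '', text)                                # drop tag-like spans
    text = re.sub(f'[{re.escape(string.punctuation)}]', ' ', text)  # punctuation -> space
    text = re.sub(r'[^\w\s]', '', text)                             # any other symbols
    text = re.sub(r'[\d\s]+', ' ', text)                            # digits/whitespace runs
    return text.strip()


example = "This, is just an example! Nothing serious :) "
print(preprocess(example))
# this is just an example nothing serious

print(preprocess("<b>Engine</b> stalls at 3000 RPM!!"))
# engine stalls at rpm
```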
