For educational purposes I am preprocessing multiple short texts that describe the symptoms of car faults. The text is written by humans and is full of misspellings, capital letters and other noise.
I wanted to write a short pre-processing function and I have three questions:
1. Why do I get two different results depending on how I format the `re.escape()` call? (The first version below is the correct piece of code.)
2. Can I adapt this section to f-string formatting: `re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)`
3. Is there any way I can improve the readability and performance of this code?
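On the f-string question, here is a minimal sketch, assuming Python 3.6 or later. The three formatting styles should build the same pattern string, provided the `s` from `%s` is not carried over into the braces. The names `escaped` and `punct_re` are only illustrative.

```python
import re
import string

# Escape the punctuation once so it is safe inside a character class.
escaped = re.escape(string.punctuation)

percent_pattern = '[%s]' % escaped        # original %-formatting
format_pattern = '[{}]'.format(escaped)   # str.format: note there is no "s" after the braces
fstring_pattern = f'[{escaped}]'          # f-string equivalent

# All three styles produce the same pattern text.
assert percent_pattern == format_pattern == fstring_pattern

# Compile once and reuse the compiled pattern.
punct_re = re.compile(fstring_pattern)
print(punct_re.sub(' ', 'engine stalls, then restarts!'))
```

In `%s` the `s` is part of the placeholder, whereas `{}` is already the complete placeholder in `str.format` and f-strings.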
    import re
    import string

    example = "This, is just an example! Nothing serious :) "

    #convert to lowercase, strip and remove punctuations
    def preprocess(text):
        """convert to lowercase, strip and remove punctuations"""
        text = text.lower()
        text=text.strip()
        text=re.compile('<.*?>').sub('', text)
        text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
        text = re.sub('\s+', ' ', text)
        text = re.sub(r'\[[0-9]*\]',' ',text)
        text=re.sub(r'[^\w\s]', '', str(text).lower().strip())
        text = re.sub(r'\d',' ',text)
        text = re.sub(r'\s+',' ',text)
        return text
The wrong one:
    #convert to lowercase, strip and remove punctuations
    def preprocess(text):
        """convert to lowercase, strip and remove punctuations"""
        text = text.lower()
        text=text.strip()
        text=re.compile('<.*?>').sub('', text)
        escaping = re.escape(string.punctuation)
        test = re.compile('[{}s]'.format(escaping)).sub(' ',text)
        text = re.sub('\s+', ' ', text)
        text = re.sub(r'\[[0-9]*\]',' ',text)
        text=re.sub(r'[^\w\s]', '', str(text).lower().strip())
        text = re.sub(r'\d',' ',text)
        text = re.sub(r'\s+',' ',text)
        return text
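Two things appear to differ between the versions. The pattern `'[{}s]'` keeps a literal `s` inside the character class, and the substitution result is bound to `test`, so `text` never receives it. The sketch below isolates only the first difference. The names `good`, `bad` and `sample` are hypothetical and not part of the original code.

```python
import re
import string

escaped = re.escape(string.punctuation)

# Pattern from the first version: the class contains punctuation only.
good = re.compile('[%s]' % escaped)

# Pattern from the second version: the "s" after the braces is a literal
# character, so the class matches punctuation and the letter "s".
bad = re.compile('[{}s]'.format(escaped))

sample = 'gears slip, sometimes'
print(repr(good.sub(' ', sample)))  # only the comma is replaced
print(repr(bad.sub(' ', sample)))   # every "s" is replaced as well
```

Because the second version stores the result in `test`, that substitution is discarded. The punctuation is then removed later by `[^\w\s]`, which replaces it with an empty string instead of a space.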
`string.punctuation` leads to `NameError: name 'string' is not defined`. Am I missing some imports?

`import string`

`print(preprocess(example + string.punctuation))` to see what's wrong in the 2nd output…
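On readability and performance, one possible direction is to compile each pattern once at module level and merge the overlapping steps. This is only a sketch under assumptions, not a drop-in replacement. It handles ASCII punctuation from `string.punctuation` plus digits in a single pass, and it strips the final result, so edge cases such as non-ASCII symbols may behave differently from the original. The constant names are hypothetical.

```python
import re
import string

# Compile the patterns once, at import time, instead of on every call.
TAG_RE = re.compile(r'<.*?>')
_ESCAPED = re.escape(string.punctuation)
PUNCT_OR_DIGIT_RE = re.compile(rf'[{_ESCAPED}\d]')
WHITESPACE_RE = re.compile(r'\s+')


def preprocess(text):
    """Lowercase, drop HTML-like tags, punctuation and digits, collapse whitespace."""
    text = TAG_RE.sub('', text.lower())
    # Replace punctuation and digits with a space so adjacent words stay separated.
    text = PUNCT_OR_DIGIT_RE.sub(' ', text)
    # Collapse runs of whitespace and trim both ends.
    return WHITESPACE_RE.sub(' ', text).strip()


example = "This, is just an example! Nothing serious :) "
print(repr(preprocess(example)))
```

Compiling up front mainly helps readability. The `re` module caches recently used patterns, so the speed gain over the original is likely to be modest.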