What Are The Pre-processing Steps for CNN Text Classification?

Question

I am working on a project that uses a CNN for text classification on tweet data, but I am unsure what pre-processing steps need to be taken before the actual model is coded. I can't seem to find any resources on what pre-processing needs to be done to the dataset itself.

So far I have removed special characters, lower-cased each tweet, and begun stop-word removal. I understand vectorization will need to be done, but I am unsure what other steps are needed or what order they should be done in.

This is being coded in Python.

Is this a practical question, or is it an exercise/homework/etc?
– D.W. ♦, Commented Nov 1, 2024 at 20:44

D.W. · Accepted Answer · 2024-11-03 18:15:14Z

There is no one answer on what preprocessing "needs" to be done. You can do anything that works.

In practice, people don't use CNNs for this today; they use large language models, because they perform better. When people used CNNs, one standard approach was to do no preprocessing and use a very large training set. It is also possible to do some preprocessing, and this might help or might hurt; whether it does is an empirical question.
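If you do want to try some preprocessing, here is a minimal sketch of one possible order: clean each tweet, build a vocabulary from the cleaned tokens, then turn each tweet into a fixed-length list of integer ids that an embedding layer in front of the CNN could consume. This is only an illustration, not a required recipe. The function names, the `<pad>`/`<unk>` tokens, the tiny stop-word set, and `max_len=6` are all assumptions made up for the example, and it uses only the standard library.

```python
import re
from collections import Counter

# Hypothetical reserved tokens: id 0 pads short tweets, id 1 marks unseen words.
PAD, UNK = "<pad>", "<unk>"

def clean_tweet(text, stop_words=frozenset()):
    """Lowercase, drop URLs and special characters, split on whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters
    return [tok for tok in text.split() if tok not in stop_words]

def build_vocab(token_lists, min_count=1):
    """Assign an integer id to every token seen at least min_count times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {PAD: 0, UNK: 1}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab, max_len):
    """Map tokens to ids, then truncate or right-pad to exactly max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

# Toy data standing in for the real tweet dataset.
tweets = ["Loving the new phone!! http://t.co/abc", "Worst. Service. EVER :("]
stop_words = frozenset({"the"})  # optional; pass nothing to skip this step

tokens = [clean_tweet(t, stop_words) for t in tweets]
vocab = build_vocab(tokens)  # build from training data only
batch = [encode(toks, vocab, max_len=6) for toks in tokens]
```

Each step is a separate function so you can switch one off (for example, skip stop-word removal) and compare validation accuracy, which is the only reliable way to tell whether it helps on your data.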

But since you say this is an exercise, your teacher may have something specific in mind. Therefore, it is best to ask your teacher.
