$\begingroup$

I am working on a project that uses a CNN for text classification on tweet data, but I am unsure which pre-processing steps to take before coding the actual model. I can't seem to find any resources on what pre-processing the dataset itself needs.

So far I have removed special characters, converted each tweet to lower case, and applied stop-word removal. I understand that vectorization will be needed, but I am unsure what other steps are required or what order they should be done in.

This is being coded in Python.

$\endgroup$
  • $\begingroup$ Is this a practical question, or is it an exercise/homework/etc? $\endgroup$ Commented Nov 1, 2024 at 20:44
  • $\begingroup$ It's for a project exercise. $\endgroup$ Commented Nov 3, 2024 at 16:21

1 Answer

$\begingroup$

There is no single answer to what preprocessing "needs" to be done. Anything that works is acceptable.

In practice, people don't use CNNs for this today; they use large language models, because those perform better. When people did use CNNs, one standard approach was to do no preprocessing at all and rely on a very large training set. It is also possible to do some preprocessing, which might help or might hurt; whether it does is an empirical question that you answer by comparing results with and without each step.

But since you say this is an exercise, your teacher may have something specific in mind. Therefore, it is best to ask your teacher.
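If you do decide to preprocess, the steps mentioned in the question could be ordered roughly as in the sketch below: clean and tokenize, drop stop words, build a vocabulary, then turn each tweet into a fixed-length sequence of integer ids that an embedding layer in front of the CNN can consume. This is only one possible arrangement, not a required recipe. The stop-word list, the sample tweets, the function names, and `max_len=6` are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "for", "on"}

PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary tokens


def clean(tweet):
    """Lowercase, replace non-alphanumeric characters with spaces, split on whitespace."""
    tweet = tweet.lower()
    tweet = re.sub(r"[^a-z0-9\s]", " ", tweet)
    return tweet.split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocab(token_lists, min_count=1):
    """Map each sufficiently frequent token to an integer id, starting at 2."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    vocab = {}
    for token, count in counts.most_common():
        if count >= min_count:
            vocab[token] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(t, UNK) for t in tokens][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Hypothetical sample data, standing in for the real tweet dataset.
tweets = ["The weather is GREAT today!!!", "Stuck in traffic again... #mondays"]
tokenized = [remove_stop_words(clean(t)) for t in tweets]
vocab = build_vocab(tokenized)
encoded = [encode(tokens, vocab, max_len=6) for tokens in tokenized]
```

The vocabulary should be built from the training split only, so that words seen only in the test data fall back to the `UNK` id. Each step is a separate function so that you can switch one off (for example stop-word removal) and check empirically whether accuracy changes.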

$\endgroup$
