I have a dataset X_train, which is an array where each entry is an email (a string of characters). There are 11,314 emails, each of which is about 500 characters long. (X_train is a processed version of the training data in the newsgroups dataset.)
Ultimately, my goal is to build a tf-idf function from scratch (knowledge of which is probably not necessary for answering my question). To get there, I have constructed a lexicon, lexicon_train, which contains each unique word in X_train once and only once. My lexicon has 211,441 elements. I also need an array frequency_train where each entry frequency_train[i] is the number of emails in which the term lexicon_train[i] appears.
I construct the frequency array as follows:
    frequency_train = np.zeros(211441)
    for i in range(211441):
        count = 0
        for email in X_train:
            if lexicon_train[i] in email:
                count = count + 1
        frequency_train[i] = count
In the same cell, I am also doing something similar with the testing data X_test. I've been running this in a Jupyter notebook, and the process takes a while: a previous, very similar task took about 90 minutes. I suspect that I'm doing this task the slowest possible way. Is there a faster way of doing this? I would also welcome answers that explain why this process should take a long time.
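For a rough sense of scale, the nested loop performs one substring search per (term, email) pair. The following back-of-the-envelope sketch uses only the sizes quoted in the question; the per-search cost is an assumption (each search scanning up to the whole email), so the second figure is a loose upper bound rather than a measurement.

```python
# Sizes quoted in the question.
n_terms = 211_441    # entries in lexicon_train
n_emails = 11_314    # entries in X_train
avg_len = 500        # approximate characters per email

# The nested loops run one `term in email` substring search per pair.
substring_searches = n_terms * n_emails

# Assumption: a search may scan on the order of the whole email,
# so this is only a rough upper-bound estimate of the work done.
chars_scanned = substring_searches * avg_len

print(f"{substring_searches:,} substring searches")
print(f"up to ~{chars_scanned:,} characters scanned")
```

That is roughly 2.4 billion substring searches for the training data alone, which is consistent with a runtime measured in tens of minutes in pure Python.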
[...] `X_train` and `lexicon_train`? Do you only need the total count, what are the bounds? It's almost like you're trying to impede us from helping you.

If "mark" is in `lexicon_train`, then the `in email` will count "Denmark" and "marker", but not "Mark". Should only complete words be matched? What glyphs can exist in the words? Hyphens or apostrophes?

`frequency_train` will contain incorrect counts. If an email contains "i'm going to denmark", and `lexicon_train` contains "mark" and "denmark", the email will be counted as containing both those words, because `"mark" in "i'm going to denmark"` is `True`. It would also be counted as including the words den, go, in, and ark if those words also appear in `lexicon_train`, because `str in str` checks if the needle appears anywhere in the haystack, without regard for word boundaries.
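The word-boundary issue raised in the comments can be illustrated with a small sketch. It compares the substring test with a whole-word count that tokenizes each email once and tallies terms with a `Counter`. The `tokenize` rule (lowercase runs of letters, digits, and apostrophes), the helper names, and the toy data are all assumptions made for illustration; the lexicon would need to be built with the same tokenization for the counts to line up.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase word tokens in text.

    Assumption: a "word" is a run of letters, digits, or apostrophes.
    """
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def document_frequency(emails, lexicon):
    """For each lexicon term, count the emails containing it as a whole word."""
    counts = Counter()
    for email in emails:
        # Each email is tokenized once; using a set means a term is
        # counted at most once per email.
        counts.update(tokenize(email))
    # A Counter returns 0 for terms that never appeared.
    return [counts[term] for term in lexicon]


# Toy data: hypothetical stand-ins for X_train and lexicon_train.
emails = ["i'm going to denmark", "mark my words", "the marker is dry"]
lexicon = ["mark", "denmark", "marker", "go"]

# Substring test, as in the question: "mark" also matches inside
# "denmark" and "marker", and "go" matches inside "going".
substring_counts = [sum(term in email for email in emails) for term in lexicon]

# Whole-word test.
word_counts = document_frequency(emails, lexicon)

print(substring_counts)  # [3, 1, 1, 1]
print(word_counts)       # [1, 1, 1, 0]
```

Besides respecting word boundaries, this version makes a single pass over the emails followed by one lookup per lexicon term, instead of one substring search per (term, email) pair. If a NumPy array is needed, the returned list can be wrapped with `np.array`.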