Constructing a lexicon and frequency array from a long string

Question

I have a dataset X_train, which is an array where each entry is an email (a string of characters). There are 11,314 emails, each of which is about 500 characters long. (X_train is a processed version of the training data in the newsgroups dataset.)

Ultimately, my goal is to build from scratch a tf-idf function (knowledge of which is probably not necessary for answering my question). To get there, I have constructed a lexicon which contains each unique word in X_train once and only once. My lexicon has 211441 elements. I also need an array where each entry frequency_train[i] is the number of emails in which a given term lexicon_train[i] appears.

I construct the frequency array as follows:

frequency_train = np.zeros(211441)
for i in range(211441):
    count = 0
    for email in X_train:
        if lexicon_train[i] in email:
            count = count + 1
    frequency_train[i] = count

In the same cell, I am also doing something similar with the testing data X_test. I've been running this in a Jupyter notebook, and the process takes a while. A previous and very similar task took about 90 minutes. I suspect that I'm doing this task the slowest possible way. Is there a faster way of doing this? I would also welcome answers that explain why this process should take a long time.

This question is missing the information needed to give a helpful answer. What are X_train and lexicon_train? Do you only need the total count? What are the bounds? It's almost like you're trying to impede us from helping you.
– Peilonrayz ♦, Commented Mar 26, 2020 at 1:39
@Peilonrayz I think my question explains what X_train is: it's an array where each entry is an email. X_train is a processed version of the training data from the very well-known newsgroups dataset. Specifically, X_train is what results after 1) converting all characters in the newsgroups dataset to lower case letters, 2) removing "stopwords", which are the 200 or so most common words of the English language, and 3) converting all remaining words to their stems (via the ps.stem() function in the nltk library). I have also explained what lexicon_train is: a lexicon, obtained from X_train.
– co-contravariant, Commented Mar 26, 2020 at 1:47
What constitutes a unique word? If "mark" is in lexicon_train, then the check lexicon_train[i] in email will count "Denmark" and "marker", but not "Mark". Should only complete words be matched? What glyphs can exist in the words? Hyphens or apostrophes?
– AJNeufeld, Commented Mar 26, 2020 at 3:42
Why did you separate constructing the dictionary from establishing the counts?
– greybeard, Commented Mar 26, 2020 at 5:03
I will be clearer: Your existing frequency_train will contain incorrect counts. If an email contains "i'm going to denmark", and lexicon_train contains "mark" and "denmark", the email will be counted as containing both those words, because "mark" in "i'm going to denmark" is True. It would also be counted as including the words den, go, in, and ark if those words also appear in lexicon_train, because str in str checks if the needle appears anywhere in the haystack, without regard for word boundaries.
– AJNeufeld, Commented Mar 26, 2020 at 5:40
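
To make the point in the comment above concrete, here is a minimal sketch comparing the substring test with a whole-word test. It assumes words are separated by whitespace (str.split()); the real tokenizer used to build lexicon_train may differ.

```python
email = "i'm going to denmark"

# Substring test: matches anywhere in the string, ignoring word boundaries.
substring_hit = "mark" in email

# Whole-word test: split into tokens first, then test set membership.
# Assumption: whitespace splitting is how the emails are tokenized.
words = set(email.split())
word_hit = "mark" in words
denmark_hit = "denmark" in words

print(substring_hit, word_hit, denmark_hit)
```

With the substring test, "mark" is reported as present even though the email only contains "denmark"; the set-based test reports it only for the complete word.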

RootTwo · Accepted Answer · 2020-03-26 03:44:11Z

10

For each word in the lexicon you are searching through each email: (11,314 emails) * (60 words/email) * (211441 word lexicon) = lots of comparisons.

Flip it around. Use collections.Counter. Get the unique words in each email (use a set()) and then update the counter.

from collections import Counter

counts = Counter()

for email in X_train:
    words = set(email.split())   # <= or whatever you use to parse the words
    counts.update(words)

This will give you a dict mapping each word in the emails to the number of emails it appears in. (11,314 emails) * (60 words/email) = a lot fewer loops. This probably also recreates the lexicon (e.g. counts.keys() should be the lexicon).

On my computer, it takes 7 seconds to generate 115000 random 60-word emails and collect the counts.
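
If you still need a frequency array aligned with an existing lexicon_train, one way to get it from the counter is sketched below. The toy X_train and lexicon_train here are made-up stand-ins, and whitespace splitting is an assumption about how the words were parsed.

```python
from collections import Counter

# Hypothetical toy data standing in for the real X_train and lexicon_train.
X_train = ["go denmark go", "mark go", "go home"]
lexicon_train = ["denmark", "go", "mark", "unseen"]

counts = Counter()
for email in X_train:
    # set() ensures each word is counted at most once per email.
    counts.update(set(email.split()))

# A Counter returns 0 for keys it has never seen, so no KeyError here.
frequency_train = [counts[word] for word in lexicon_train]

print(frequency_train)
```

The result is a plain list in lexicon order; wrapping it in np.array(...) should give the same kind of array as in the question.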

edited Mar 26, 2020 at 3:44

answered Mar 26, 2020 at 3:30

RootTwo


Yes, I would go with this answer. Improves performance and reduces code required. Nice job.
– Ben A, Commented Mar 26, 2020 at 6:19


Ben A · Accepted Answer · 2020-03-26 00:22:43Z

1

Your for loop can be reduced to one line, utilizing sum:

frequency_train = [
    sum(1 if lexicon_train[i] in email else 0 for email in X_train) for i in range(211441)
]

It removes the need to create the initial list of zeros. For performance, I'm guessing the size of the lexicon and the number of iterations are slowing it down.
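
As a small aside, the same expression can also be written by iterating over the words directly and summing the booleans, since True counts as 1. This is only a sketch on made-up toy data; it performs the same substring checks as the version above, so it is shorter but not faster.

```python
# Hypothetical toy data standing in for the real X_train and lexicon_train.
X_train = ["go denmark", "mark go", "go home"]
lexicon_train = ["go", "mark", "home"]

# Index-based version, as in the answer above.
by_index = [
    sum(1 if lexicon_train[i] in email else 0 for email in X_train)
    for i in range(len(lexicon_train))
]

# Iterating over the words directly; True is summed as 1, False as 0.
by_word = [sum(word in email for email in X_train) for word in lexicon_train]

print(by_index, by_word)
```

Note that "mark" is counted for "go denmark" here as well, because in on strings is a substring test.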

edited Mar 26, 2020 at 0:22

answered Mar 26, 2020 at 0:16

Ben A


Thank you! This does indeed simplify my code. However, about 15 minutes later, the cell is still running. There may very well be no way around this: just running through the for loops requires about 2.2 billion steps, not to mention the other computations that the entire cell requires. I'm pretty new here, so I'll defer to the community as to whether or not I should accept this as an answer.
– co-contravariant, Commented Mar 26, 2020 at 0:34
@co-contravariant This answer is really just about reducing the lines in your program and utilizing a built-in function. If an answer comes along that improves your performance, definitely go with that one.
– Ben A, Commented Mar 26, 2020 at 0:47
For anyone in the audience who's curious: the process has finally terminated. It took about an hour. Now onto the testing data...
– co-contravariant, Commented Mar 26, 2020 at 1:34
Cf. Histogram word counter in Python
– greybeard, Commented Mar 26, 2020 at 6:10

