Using sklearn's toarray method results in the use of all RAM

Question

When I run the following code on Google Colab, it uses all the available RAM as soon as it reaches the toarray call. While looking for a fix, I saw HashingVectorizer suggested. How can I use it in the following code?

The shape of cv.fit_transform(data_list) is (324430, 351550)

import re
import pandas as pd

# Loading the dataset
data = pd.read_csv("Language Detection.csv")
# value count for each language
data["Language"].value_counts()
# separating the independent and dependent features
X = data["Text"]
y = data["Language"]
# converting categorical variables to numerical
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
# creating a list for appending the preprocessed text
data_list = []
# iterating through all the text
for text in X:
    # removing the symbols and numbers
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    # removing square brackets
    text = re.sub(r'[\[\]]', ' ', text)
    # converting the text to lower case
    text = text.lower()
    # appending to data_list
    data_list.append(text)
# creating bag of words using countvectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(data_list).toarray()
#train test splitting
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
#model creation and prediction
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)

Ben Reiniger · Accepted Answer · 2022-07-04 02:02:25Z

Just don't use toarray. CountVectorizer returns a SciPy sparse matrix, and both train_test_split and MultinomialNB accept sparse input directly, so the dense conversion is unnecessary.
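A minimal sketch of the sparse workflow follows. The texts and labels lists are hypothetical toy stand-ins for data_list and y, and random_state is only there to make the split repeatable. The last lines estimate what a dense int64 array of the shape given in the question would need, which comes to roughly 900 GB, far beyond what Colab provides.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for data_list and y
texts = [
    "hello world", "good morning world", "hello good friend", "good night world",
    "hola mundo", "buenos dias mundo", "hola buen amigo", "buenas noches mundo",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

cv = CountVectorizer()
X = cv.fit_transform(texts)  # stays a SciPy sparse matrix; no .toarray()

# train_test_split slices sparse matrices row-wise without densifying them
x_train, x_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)

model = MultinomialNB()
model.fit(x_train, y_train)          # accepts sparse input
predictions = model.predict(x_test)  # so does predict

# Rough size of the dense array from the question, at 8 bytes per int64 entry
rows, cols = 324430, 351550
dense_bytes = rows * cols * 8
print(f"dense array would need about {dense_bytes / 1e9:.0f} GB")
```

Fitting is fast because Naive Bayes only accumulates per-class feature counts, and on sparse input it touches just the non-zero entries.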

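If you really want to use hashing, you should be able to replace CountVectorizer with HashingVectorizer. Pass alternate_sign=False when you do, because the default signed hashing can produce negative feature values, which MultinomialNB rejects.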
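Here is a hedged sketch of the HashingVectorizer variant. The toy texts and labels are hypothetical, and the n_features=2**18 and norm=None settings are assumptions chosen for illustration (norm=None keeps raw counts, closer to what CountVectorizer produces), not values from the question.

```python
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for data_list and y
texts = [
    "hello world", "good morning world", "hello good friend", "good night world",
    "hola mundo", "buenos dias mundo", "hola buen amigo", "buenas noches mundo",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# alternate_sign=False keeps every feature non-negative, as MultinomialNB requires.
# n_features fixes the number of columns up front, whatever the vocabulary size.
hv = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)

# HashingVectorizer is stateless, so transform works without a prior fit
X_hashed = hv.transform(texts)

model = MultinomialNB()
model.fit(X_hashed, labels)

# New text goes through the same stateless transform
new_prediction = model.predict(hv.transform(["hello world"]))
```

The output is still a sparse matrix, so the same rule applies: don't call toarray on it. One trade-off is that hashing is one-way, so you cannot map a column back to its token as you can with cv.get_feature_names_out().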

1 Comment

Asdoost Over a year ago

I had seen this answer in other threads and followed it. When I ran model.fit(x_train, y_train), it finished very quickly and returned MultinomialNB(). I thought, "That was fast, something must be wrong," and didn't run the rest of the code. Now that you have suggested the same thing, I ran it again and got the same result, but this time I executed the rest of the code. It turns out this is the answer, and the fast execution of model.fit(x_train, y_train) fooled me. Thank you.
