When I run the following code on Google Colab, it uses all the available RAM as soon as it reaches the toarray method. I searched for a solution and found suggestions to use HashingVectorizer instead. How can I implement it in the following code?

The shape of cv.fit_transform(data_list) is (324430, 351550)

import re
import pandas as pd
# Loading the dataset
data = pd.read_csv("Language Detection.csv")
# value count for each language
data["Language"].value_counts()
# separating the independent and dependant features
X = data["Text"]
y = data["Language"]
# converting categorical variables to numerical
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
# creating a list for appending the preprocessed text
data_list = []
# iterating through all the text
for text in X:
    # removing the symbols, newlines and numbers
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    # removing square brackets
    text = re.sub(r'[\[\]]', ' ', text)
    # converting the text to lower case
    text = text.lower()
    # appending to data_list
    data_list.append(text)
# creating bag of words using countvectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(data_list).toarray()
#train test splitting
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
#model creation and prediction
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)

1 Answer

Just don't call toarray. CountVectorizer returns a SciPy sparse matrix, and MultinomialNB accepts sparse input directly, so the dense conversion is unnecessary.
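To illustrate, here is a minimal sketch of the same pipeline without the dense conversion. The `texts` and `labels` lists are a hypothetical toy stand-in for `data_list` and `y`, and the size estimate assumes 8-byte integers (CountVectorizer's default dtype is `np.int64`).

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Rough size of the dense array for the shape reported in the question,
# assuming 8 bytes per entry: far more than Colab's RAM.
rows, cols = 324430, 351550
dense_bytes = rows * cols * 8
print(f"dense array would need about {dense_bytes / 1e9:.0f} GB")

# Hypothetical toy data standing in for data_list and y.
texts = ["hello world", "good morning", "bonjour le monde", "bonne nuit",
         "hello there", "bonjour tout le monde", "good night world", "bonne journee"]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

cv = CountVectorizer()
X_sparse = cv.fit_transform(texts)  # stays sparse: no .toarray()

# train_test_split and MultinomialNB both accept sparse matrices.
x_train, x_test, y_train, y_test = train_test_split(
    X_sparse, labels, test_size=0.25, random_state=0)

model = MultinomialNB()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
```

Only the non-zero counts are stored, which is why fitting finishes quickly even on a very wide vocabulary.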

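If you really want to use hashing, you should be able to replace CountVectorizer with HashingVectorizer. Note that its default alternate_sign=True produces negative feature values, which MultinomialNB rejects, so pass alternate_sign=False.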
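A sketch of the hashing variant might look like the following. The toy `texts` and `labels` are hypothetical, and the choices of `n_features=2**18` and `norm=None` (raw counts rather than L2-normalised rows) are assumptions for illustration, not settings taken from the question.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for data_list and y.
texts = ["hello world", "good morning", "bonjour le monde", "bonne nuit",
         "hello there", "bonjour tout le monde", "good night world", "bonne journee"]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

# alternate_sign=False keeps every feature non-negative, as MultinomialNB requires.
# norm=None keeps raw term counts, similar to CountVectorizer's output.
hv = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)

# HashingVectorizer is stateless, so transform works without a prior fit.
X_hashed = hv.transform(texts)  # sparse matrix with a fixed number of columns

model = MultinomialNB()
model.fit(X_hashed, labels)

# New text must go through the same vectorizer settings before prediction.
new_prediction = model.predict(hv.transform(["bonjour"]))
```

The trade-off is that the hashed columns cannot be mapped back to words, and distinct words may collide in the same column.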

1 Comment

I saw this answer in other threads and followed it. When I ran model.fit(x_train, y_train), it executed very quickly and returned MultinomialNB(). I thought, "That was fast, something must be wrong," and I didn't run the rest of the code. Now that you have suggested the same thing, I ran it again and got the same result, but this time I executed the rest of the code. It turns out this is the answer, and the fast execution of model.fit(x_train, y_train) fooled me. Thank you.