I have created a sparse representation of my data and want to convert it into a NumPy array.
Let's say, I have the following data (in practice data contains many more lists and each list is much longer):
data = [['this','is','my','first','dataset','here'],['but','here', 'is', 'another','one'],['and','yet', 'another', 'one']]
I also have two dicts that map each word to a unique integer and vice versa:
w2i = {'this':0, 'is':1, 'my':2, 'first':3, 'dataset':4, 'here':5, 'but':6, 'another':7, 'one':8, 'and':9, 'yet':10}
Furthermore, I have a dict that stores the co-occurrence count for each word combination:
comb_dict = dict()
for text in data:
    sorted_set_text = sorted(list(set(text)))
    for i in range(len(sorted_set_text)-1):
        for j in range(i+1, len(sorted_set_text)):
            if (sorted_set_text[i],sorted_set_text[j]) in comb_dict:
                comb_dict[(sorted_set_text[i],sorted_set_text[j])] += 1
            else:
                comb_dict[(sorted_set_text[i],sorted_set_text[j])] = 1
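As an aside, the same pair counts can probably be built more compactly with collections.Counter and itertools.combinations. This is only a minimal sketch, assuming the same data as above; the name comb_counter is made up for illustration:

```python
from collections import Counter
from itertools import combinations

data = [['this', 'is', 'my', 'first', 'dataset', 'here'],
        ['but', 'here', 'is', 'another', 'one'],
        ['and', 'yet', 'another', 'one']]

# comb_counter is a hypothetical name, not part of the original code.
# Each unordered word pair is counted once per list; sorting the unique
# words keeps the order inside every pair deterministic.
comb_counter = Counter()
for text in data:
    comb_counter.update(combinations(sorted(set(text)), 2))
```

Since Counter is a dict subclass, comb_counter could stand in for comb_dict in the steps that follow.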
From this dict, I create a sparse representation as follows:
sparse = [(w2i[k[0]],w2i[k[1]],v) for k,v in comb_dict.items()]
This list consists of tuples in which the first value is the row index, the second value is the column index, and the third value is the number of co-occurrences:
[(4, 3, 1),
(4, 5, 1),
(4, 1, 1),
(4, 2, 1),
(4, 0, 1),
(3, 5, 1),
(3, 1, 1),
(3, 2, 1),
(3, 0, 1),
(5, 1, 2),
(5, 2, 1),
(5, 0, 1),
(1, 2, 1),
(1, 0, 1),
(2, 0, 1),
(7, 6, 1),
(7, 5, 1),
(7, 1, 1),
(7, 8, 2),
(6, 5, 1),
(6, 1, 1),
(6, 8, 1),
(5, 8, 1),
(1, 8, 1),
(9, 7, 1),
(9, 8, 1),
(9, 10, 1),
(7, 10, 1),
(8, 10, 1)]
Now, I want to get a NumPy array (11 x 11) in which each row i and each column j represents a word and each cell (i, j) indicates how often words i and j co-occur. A start would be:
cooc = np.zeros((len(w2i),len(w2i)), dtype=np.int16)
Then, I want to update cooc so that the row/column indices associated with the word combinations in sparse will be assigned the associated value. How can I do this?
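One possible way to do the assignment without a Python-level loop is NumPy integer-array indexing. The following is a minimal sketch, not a definitive solution: it uses only three of the triples from the list above, hard-codes the size 11 (len(w2i) in the example), and assumes that every (row, column) pair occurs at most once:

```python
import numpy as np

# Three of the (row, column, count) triples from the list above, for illustration.
sparse = [(5, 1, 2), (7, 8, 2), (9, 10, 1)]
n_words = 11  # assumed to equal len(w2i) in the example

cooc = np.zeros((n_words, n_words), dtype=np.int16)

# Split the triples into three 1-D arrays: row indices, column indices, counts.
rows, cols, vals = np.array(sparse).T

# Assign all cells at once; this relies on each (row, col) pair being unique.
cooc[rows, cols] = vals
```

If a symmetric matrix is wanted, the same assignment could be repeated with the indices swapped, i.e. cooc[cols, rows] = vals.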
EDIT: I am aware that I can loop through cooc and assign each cell one by one. However, my dataset is large and this will be time-intensive. Instead, I would like to convert sparse into a SciPy sparse matrix and use its toarray() method to obtain cooc. How can I do this?
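To make the idea concrete, here is a rough sketch of the SciPy route using scipy.sparse.coo_matrix, which accepts the data in (values, (rows, columns)) form. It again uses only three triples from the list above and a hard-coded size of 11, so treat it as an illustration under those assumptions rather than a tested solution for the full dataset:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Three of the (row, column, count) triples from the list above, for illustration.
sparse = [(5, 1, 2), (7, 8, 2), (9, 10, 1)]
n_words = 11  # assumed to equal len(w2i) in the example

# Split the triples into row indices, column indices, and counts.
rows, cols, vals = np.array(sparse).T

# Build a COO sparse matrix, then densify it into a regular NumPy array.
cooc = coo_matrix((vals, (rows, cols)),
                  shape=(n_words, n_words),
                  dtype=np.int16).toarray()
```

Note that when a COO matrix is converted with toarray(), entries sharing the same (row, column) position are summed rather than overwritten.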