
I have created a sparse representation of data and want to convert this into a NumPy array.

Let's say I have the following data (in practice, the data contains many more lists and each list is much longer):

data = [['this','is','my','first','dataset','here'],['but','here', 'is', 'another','one'],['and','yet', 'another', 'one']]

And I have two dicts that map each word to a unique integer value and vice versa:

w2i = {'this':0, 'is':1, 'my':2, 'first':3, 'dataset':4, 'here':5, 'but':6, 'another':7, 'one':8, 'and':9, 'yet':10}

Furthermore, I have a dict that stores the count for each word combination:

comb_dict = dict()
for text in data:
    sorted_set_text = sorted(set(text))
    for i in range(len(sorted_set_text)-1):
        for j in range(i+1, len(sorted_set_text)):
            if (sorted_set_text[i],sorted_set_text[j]) in comb_dict:
                comb_dict[(sorted_set_text[i],sorted_set_text[j])] += 1
            else:
                comb_dict[(sorted_set_text[i],sorted_set_text[j])] = 1
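
For reference, the same counts can be built more compactly with the standard library (a minimal sketch, equivalent to the nested loop above):

from collections import Counter
from itertools import combinations

# Counter is a dict subclass, so comb_dict can be used exactly as before.
# combinations over the sorted, de-duplicated words yields the same
# (earlier, later) pairs as the nested index loop.
comb_dict = Counter(
    pair
    for text in data
    for pair in combinations(sorted(set(text)), 2)
)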

From this dict, I create a sparse representation as follows:

sparse = [(w2i[k[0]],w2i[k[1]],v) for k,v in comb_dict.items()]

This list consists of tuples in which the first value is the row index, the second value the column index, and the third value the number of co-occurrences:

[(4, 3, 1),
 (4, 5, 1),
 (4, 1, 1),
 (4, 2, 1),
 (4, 0, 1),
 (3, 5, 1),
 (3, 1, 1),
 (3, 2, 1),
 (3, 0, 1),
 (5, 1, 2),
 (5, 2, 1),
 (5, 0, 1),
 (1, 2, 1),
 (1, 0, 1),
 (2, 0, 1),
 (7, 6, 1),
 (7, 5, 1),
 (7, 1, 1),
 (7, 8, 2),
 (6, 5, 1),
 (6, 1, 1),
 (6, 8, 1),
 (5, 8, 1),
 (1, 8, 1),
 (9, 7, 1),
 (9, 8, 1),
 (9, 10, 1),
 (7, 10, 1),
 (8, 10, 1)]

Now, I want to get an 11 x 11 NumPy array in which row i and column j represent words and the cells indicate how often words i and j co-occur. Thus, a start would be

cooc = np.zeros((len(w2i),len(w2i)), dtype=np.int16)

Then, I want to update cooc so that the cells at the row/column indices associated with the word combinations in sparse are assigned the corresponding count. How can I do this?

EDIT: I am aware that I can loop over the tuples in sparse and assign each cell one by one. However, my dataset is large and this would be time-intensive. Instead, I would like to convert sparse into a Scipy sparse matrix and use its toarray() method. How can I do this?

3 Answers


I think these other answers are kinda reinventing a wheel that already exists.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer 

data = [['this','is','my','first','dataset','here'],['but','here', 'is', 'another','one'],['and','yet', 'another', 'one']]

I'm going to put these back together and just use sklearn's CountVectorizer:

data = [" ".join(x) for x in data]
encoder = CountVectorizer()
occurrence = encoder.fit_transform(data)

This occurrence matrix is a sparse matrix, and turning it into a co-occurrence matrix is just a simple multiplication (the diagonal is the total number of times each token appears).

co_occurrence = occurrence.T @ occurrence

>>> co_occurrence.A

array([[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [1, 2, 1, 0, 0, 1, 1, 0, 2, 0, 1],
       [0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
       [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
       [0, 1, 1, 1, 1, 2, 2, 1, 1, 1, 0],
       [0, 1, 1, 1, 1, 2, 2, 1, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
       [1, 2, 1, 0, 0, 1, 1, 0, 2, 0, 1],
       [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1]])
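
Note that the diagonal holds each token's total count rather than a co-occurrence. If pure co-occurrence counts are wanted, one option (a minimal sketch) is to clear the diagonal after densifying:

import numpy as np

cooc = co_occurrence.toarray()  # dense (11 x 11) array
np.fill_diagonal(cooc, 0)       # drop the per-token totals on the diagonal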

And the row/column labels can be recovered from the encoder:

encoder.vocabulary_

{'this': 9,
 'is': 6,
 'my': 7,
 'first': 4,
 'dataset': 3,
 'here': 5,
 'but': 2,
 'another': 1,
 'one': 8,
 'and': 0,
 'yet': 10}
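
Note that CountVectorizer assigns its own alphabetical indices, which differ from the question's w2i. If the rows and columns should follow w2i instead, here is a minimal sketch for reordering (assuming the w2i dict from the question is in scope):

import numpy as np

# Position i holds the CountVectorizer column of the word whose w2i index is i
order = np.array(
    [encoder.vocabulary_[w] for w, _ in sorted(w2i.items(), key=lambda kv: kv[1])]
)
cooc = co_occurrence.toarray()[np.ix_(order, order)]  # rows/cols in w2i order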

2 Comments

Thank you @CJR. Is there a reason why you import vstack? You don't seem to use it.
Oh, no, that's just a copy and paste error. You don't need it.
In [268]: alist = [(4, 3, 1),
     ...:  (4, 5, 1),
     ...:  (4, 1, 1),
     ...:  (4, 2, 1),
...
     ...:  (9, 10, 1),
     ...:  (7, 10, 1),
     ...:  (8, 10, 1)]

Make an array from the list:

In [269]: arr = np.array(alist)
In [270]: arr.shape
Out[270]: (29, 3)

and use the columns of the array to fill slots in a defined (11,11) array:

In [271]: res = np.zeros((11,11),int)
In [272]: res[arr[:,0],arr[:,1]]=arr[:,2]
In [273]: res
Out[273]: 
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
       [1, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 2, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

You can use the same columns to create a scipy.sparse matrix.

In [274]: from scipy import sparse
In [276]: M = sparse.coo_matrix((arr[:,2],(arr[:,0],arr[:,1])), shape=(11,11))
In [277]: M
Out[277]: 
<11x11 sparse matrix of type '<class 'numpy.int64'>'
    with 29 stored elements in COOrdinate format>
In [278]: M.A
Out[278]: 
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
       [1, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 2, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
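
Since co-occurrence is symmetric, you may want each count stored in both (i, j) and (j, i). One way (a minimal sketch) is to add the transpose before converting, which is safe here because the diagonal of M is zero:

cooc = (M + M.T).toarray()  # symmetric dense matrix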

1 Comment

I like this way of doing it - less code and may be faster than my way. However, your output is currently not right, e.g. row 0 is all 0s, even though word 0 neighbours [1,2,3,4,5]. I think you need the line res[arr[:,1],arr[:,0]]=arr[:,2] to add the associations in both directions.

Edit after the additional constraint about not using a naive loop

You can use numba with the JIT decorator to compile the loop before execution. This will be a lot faster than a naive loop. If you use dtype=np.uint16 you will use 1/4 of the RAM of the default int64 (max value is 65535, so you should be OK with 60k words).

I tried this with 3,600,000,000 (num_words²) combinations and it ran in 43 seconds on my circa-2016 laptop.

Full code:

import numpy as np
from numba import jit

# One array per column of the sparse list, with dtype uint16 (max value 65535)
n_unique_words = 11
start_words = np.array([row[0] for row in sparse], dtype=np.uint16)
neighbour_words = np.array([row[1] for row in sparse], dtype=np.uint16)
frequencies = np.array([row[2] for row in sparse], dtype=np.uint16)


# Numba loop method
@jit(nopython=True)  # "nopython" mode for best performance, equivalent to @njit
def make_cooc_fast(start_words, neighbour_words, frequencies, n_unique_words):
    # The function is compiled to machine code when called the first time
    big_cooc = np.zeros((n_unique_words, n_unique_words), dtype=np.uint16)

    # Fill both (i, j) and (j, i), since co-occurrence is symmetric
    for i in range(len(start_words)):
        big_cooc[start_words[i], neighbour_words[i]] = frequencies[i]
        big_cooc[neighbour_words[i], start_words[i]] = frequencies[i]

    return big_cooc

cooc = make_cooc_fast(
    start_words=start_words,
    neighbour_words=neighbour_words,
    frequencies=frequencies,
    n_unique_words=n_unique_words)
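
Note that the first call to make_cooc_fast includes numba's compilation time; subsequent calls with the same argument types reuse the compiled machine code and run at full speed.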

Original response

If I understand the question correctly, you have done all the hard work. From here, it should just be:

for row in sparse:
    cooc[row[0], row[1]] = row[2]
    cooc[row[1], row[0]] = row[2]

It is two lines because if word a appears next to word b n times, then word b also appears next to word a n times.

This is an area of interest for me - feel free to PM me if you want to discuss further.

3 Comments

Thank you SamR! My original dataset is much larger and looping through it will take a lot of time. I was thinking about a conversion to Scipy and then using the toarray() method, which I expect to be more efficient. Do you have experience with this?
How large is it? I have not used scipy so can't comment on that. The operation is very simple - if scipy has a builtin, great; if not, I would consider using the Cython numpy integration: cython.readthedocs.io/en/latest/src/tutorial/numpy.html . Hard to say what the speedup would be without trying it.
The dense matrix will be approximately 60k x 60k. I can't send you a dm on SO btw. Please look for my contact information on my page if you want to discuss this topic further.
