
I have created a sparse representation of data and want to convert this into a NumPy array.

Let's say I have the following data (in practice, the data contains many more lists and each list is much longer):

data = [['this','is','my','first','dataset','here'],['but','here', 'is', 'another','one'],['and','yet', 'another', 'one']]

And I have two dicts that map each word to a unique integer value and vice versa:

w2i = {'this':0, 'is':1, 'my':2, 'first':3, 'dataset':4, 'here':5, 'but':6, 'another':7, 'one':8, 'and':9, 'yet':10}

Furthermore, I have a dict that stores the count for each word combination:

comb_dict = dict()
for text in data:
    sorted_set_text = sorted(set(text))
    for i in range(len(sorted_set_text)-1):
        for j in range(i+1, len(sorted_set_text)):
            if (sorted_set_text[i],sorted_set_text[j]) in comb_dict:
                comb_dict[(sorted_set_text[i],sorted_set_text[j])] += 1
            else:
                comb_dict[(sorted_set_text[i],sorted_set_text[j])] = 1
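
For reference, the same counts can be built more compactly with the standard library (a minimal sketch, equivalent to the nested loop above):

from collections import Counter
from itertools import combinations

# Counter is a dict subclass, so comb_dict can be used exactly as before.
# combinations over the sorted, de-duplicated words yields the same
# (earlier, later) pairs as the nested index loop.
comb_dict = Counter(
    pair
    for text in data
    for pair in combinations(sorted(set(text)), 2)
)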

From this dict, I create a sparse representation as follows:

sparse = [(w2i[k[0]],w2i[k[1]],v) for k,v in comb_dict.items()]

This list consists of tuples in which the first value is the row index, the second value the column index, and the third value the number of co-occurrences:

[(4, 3, 1),
 (4, 5, 1),
 (4, 1, 1),
 (4, 2, 1),
 (4, 0, 1),
 (3, 5, 1),
 (3, 1, 1),
 (3, 2, 1),
 (3, 0, 1),
 (5, 1, 2),
 (5, 2, 1),
 (5, 0, 1),
 (1, 2, 1),
 (1, 0, 1),
 (2, 0, 1),
 (7, 6, 1),
 (7, 5, 1),
 (7, 1, 1),
 (7, 8, 2),
 (6, 5, 1),
 (6, 1, 1),
 (6, 8, 1),
 (5, 8, 1),
 (1, 8, 1),
 (9, 7, 1),
 (9, 8, 1),
 (9, 10, 1),
 (7, 10, 1),
 (8, 10, 1)]

Now, I want to get an 11 x 11 NumPy array in which row i and column j represent words and the cells indicate how often words i and j co-occur. Thus, a start would be

cooc = np.zeros((len(w2i),len(w2i)), dtype=np.int16)

Then, I want to update cooc so that the cells at the row/column indices associated with the word combinations in sparse are assigned the corresponding count. How can I do this?

EDIT: I am aware that I can loop over the tuples in sparse and assign each cell one by one. However, my dataset is large and this would be time-intensive. Instead, I would like to convert sparse into a Scipy sparse matrix and use its toarray() method. How can I do this?

3 Answers


I think these other answers are kinda reinventing a wheel that already exists.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer 

data = [['this','is','my','first','dataset','here'],['but','here', 'is', 'another','one'],['and','yet', 'another', 'one']]

I'm going to put these back together and just use sklearn's CountVectorizer:

data = [" ".join(x) for x in data]
encoder = CountVectorizer()
occurrence = encoder.fit_transform(data)

This occurrence matrix is a sparse matrix, and turning it into a co-occurrence matrix is just a simple multiplication (the diagonal is the total number of times each token appears).

co_occurrence = occurrence.T @ occurrence

>>> co_occurrence.A

array([[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [1, 2, 1, 0, 0, 1, 1, 0, 2, 0, 1],
       [0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0],
       [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
       [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
       [0, 1, 1, 1, 1, 2, 2, 1, 1, 1, 0],
       [0, 1, 1, 1, 1, 2, 2, 1, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
       [1, 2, 1, 0, 0, 1, 1, 0, 2, 0, 1],
       [0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1]])
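
Note that the diagonal holds each token's total count rather than a co-occurrence. If pure co-occurrence counts are wanted, one option (a minimal sketch) is to clear the diagonal after densifying:

import numpy as np

cooc = co_occurrence.toarray()  # dense (11 x 11) array
np.fill_diagonal(cooc, 0)       # drop the per-token totals on the diagonal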

And the row/column labels can be recovered from the encoder:

encoder.vocabulary_

{'this': 9,
 'is': 6,
 'my': 7,
 'first': 4,
 'dataset': 3,
 'here': 5,
 'but': 2,
 'another': 1,
 'one': 8,
 'and': 0,
 'yet': 10}
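
Note that CountVectorizer assigns its own alphabetical indices, which differ from the question's w2i. If the rows and columns should follow w2i instead, here is a minimal sketch for reordering (assuming the w2i dict from the question is in scope):

import numpy as np

# Position i holds the CountVectorizer column of the word whose w2i index is i
order = np.array(
    [encoder.vocabulary_[w] for w, _ in sorted(w2i.items(), key=lambda kv: kv[1])]
)
cooc = co_occurrence.toarray()[np.ix_(order, order)]  # rows/cols in w2i order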

2 Comments

Thank you @CJR. Is there a reason why you import vstack? You don't seem to use it.
Oh, no, that's just a copy and paste error. You don't need it.
In [268]: alist = [(4, 3, 1),
     ...:  (4, 5, 1),
     ...:  (4, 1, 1),
     ...:  (4, 2, 1),
...
     ...:  (9, 10, 1),
     ...:  (7, 10, 1),
     ...:  (8, 10, 1)]

Make an array from the list:

In [269]: arr = np.array(alist)
In [270]: arr.shape
Out[270]: (29, 3)

and use the columns of the array to fill slots in a defined (11,11) array:

In [271]: res = np.zeros((11,11),int)
In [272]: res[arr[:,0],arr[:,1]]=arr[:,2]
In [273]: res
Out[273]: 
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
       [1, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 2, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

You can use the same columns to create a scipy.sparse matrix.

In [274]: from scipy import sparse
In [276]: M = sparse.coo_matrix((arr[:,2],(arr[:,0],arr[:,1])), shape=(11,11))
In [277]: M
Out[277]: 
<11x11 sparse matrix of type '<class 'numpy.int64'>'
    with 29 stored elements in COOrdinate format>
In [278]: M.A
Out[278]: 
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
       [1, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 1, 1, 0, 2, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
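
Since co-occurrence is symmetric, you may want each count stored in both (i, j) and (j, i). One way (a minimal sketch) is to add the transpose before converting, which is safe here because the diagonal of M is zero:

cooc = (M + M.T).toarray()  # symmetric dense matrix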

1 Comment

I like this way of doing it - less code and may be faster than my way. However, your output is currently not right, e.g. row 0 is all 0s, even though word 0 neighbours [1,2,3,4,5]. I think you need the line res[arr[:,1],arr[:,0]]=arr[:,2] to add the associations in both directions.

Edit after the additional constraint about not using a naive loop

You can use numba with the JIT decorator to compile the loop before execution. This will be a lot faster than a naive loop. If you use dtype=np.uint16 you will use 1/4 of the RAM of the default int64 (max value is 65535, so you should be OK with 60k words).

I tried this with 3,600,000,000 (num_words²) combinations and it ran in 43 seconds on my circa-2016 laptop.

Full code:

import numpy as np
from numba import jit

# One array per column of the sparse list, with dtype uint16 (max value 65535)
n_unique_words = 11
start_words = np.array([row[0] for row in sparse], dtype=np.uint16)
neighbour_words = np.array([row[1] for row in sparse], dtype=np.uint16)
frequencies = np.array([row[2] for row in sparse], dtype=np.uint16)


# Numba loop method
@jit(nopython=True)  # "nopython" mode for best performance, equivalent to @njit
def make_cooc_fast(start_words, neighbour_words, frequencies, n_unique_words):
    # The function is compiled to machine code when called the first time
    big_cooc = np.zeros((n_unique_words, n_unique_words), dtype=np.uint16)

    # Fill both (i, j) and (j, i), since co-occurrence is symmetric
    for i in range(len(start_words)):
        big_cooc[start_words[i], neighbour_words[i]] = frequencies[i]
        big_cooc[neighbour_words[i], start_words[i]] = frequencies[i]

    return big_cooc

cooc = make_cooc_fast(
    start_words=start_words,
    neighbour_words=neighbour_words,
    frequencies=frequencies,
    n_unique_words=n_unique_words)
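
Note that the first call to make_cooc_fast includes numba's compilation time; subsequent calls with the same argument types reuse the compiled machine code and run at full speed.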

Original response

If I understand the question correctly, you have done all the hard work. From here, it should just be:

for row in sparse:
    cooc[row[0], row[1]] = row[2]
    cooc[row[1], row[0]] = row[2]

It is two lines because if word a appears next to word b n times, then word b also appears next to word a n times.

This is an area of interest for me - feel free to PM me if you want to discuss further.

3 Comments

Thank you SamR! My original dataset is much larger and looping through it will take a lot of time. I was thinking about a conversion to Scipy and then using the toarray() method, which I expect to be more efficient. Do you have experience with this?
How large is it? I have not used scipy so can't comment on that. The operation is very simple - if scipy has a builtin, great; if not, I would consider using the Cython numpy integration: cython.readthedocs.io/en/latest/src/tutorial/numpy.html . Hard to say what the speedup would be without trying it.
The dense matrix will be approximately 60k x 60k. I can't send you a dm on SO btw. Please look for my contact information on my page if you want to discuss this topic further.
