explain in detail

edited Aug 16, 2016 at 16:15

Gareth Rees


The documentation for the CSR format says:

Disadvantages of the CSR format: slow column slicing operations (consider CSC)

And the documentation for the sparse array module says:

All conversions among the CSR, CSC, and COO formats are efficient, linear-time operations.

So why not convert CSR to CSC and then carry out your column filtering operation?
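As a minimal sketch of the conversion (the small matrix here is made up for illustration and is not the dataset from the post), `tocsc()` changes only the storage layout and leaves the contents alone:

```python
import numpy as np
import scipy.sparse

# Small made-up matrix, built in CSR format.
dense = np.array([[1, 0, 5],
                  [0, 2, 0],
                  [3, 0, 6]])
m_csr = scipy.sparse.csr_matrix(dense)

# Convert to CSC; the contents are unchanged, only the layout differs.
m_csc = m_csr.tocsc()

format_name = m_csc.format
same_contents = bool((m_csc.toarray() == dense).all())
```

After this, column-oriented work can be done on `m_csc` while `m_csr` stays available for row-oriented work.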

Update: to take advantage of CSC format, obviously you'd have to rewrite your column-filtering operation. The idea is to operate on the CSC representation of the sparse matrix (and not just on its abstract representation as an array), which is documented as follows:

the row indices for column i are stored in indices[indptr[i]:indptr[i+1]] and their corresponding values are stored in data[indptr[i]:indptr[i+1]].
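To make that layout concrete, here is a small sketch on a made-up matrix. The helper `column_entries` is a hypothetical name introduced for illustration, not part of SciPy:

```python
import numpy as np
import scipy.sparse

dense = np.array([[1, 0, 5],
                  [0, 2, 0],
                  [3, 0, 6]])
m = scipy.sparse.csc_matrix(dense)

def column_entries(m, i):
    """Return (row indices, values) of the stored entries in column i."""
    start, end = m.indptr[i], m.indptr[i + 1]
    return m.indices[start:end], m.data[start:end]

# Column 2 of the matrix above holds 5 (row 0) and 6 (row 2).
rows, values = column_entries(m, 2)
```

Because the values for all columns sit end to end in `m.data`, a single vectorised comparison on `m.data` covers every column at once.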

So instead of filtering each column separately (as you do in the code in the post), you should filter the whole data array in one go, like this:

import scipy.sparse

# m is your dataset in CSC format -- flag the stored values that reach the threshold
filtered_data = m.data >= v["threshold"]

# construct a new sparse array like m but using the filtered data
f = scipy.sparse.csc_matrix((filtered_data, m.indices, m.indptr), shape=m.shape)

# dense boolean mask: True for columns whose stored values are all below the threshold
cols = f.max(axis=0).toarray().ravel() == 0

(Maybe this wasn't obvious. But digging into the representation is often the way that you have to work with sparse compressed matrices.)
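Here is a self-contained sketch of the whole approach on made-up data. It assumes the goal is to find the columns whose stored values are all below the threshold, so it flags values at or above the threshold and looks for columns with no flags. The variable `threshold` is a hypothetical stand-in for `v["threshold"]` in the post:

```python
import numpy as np
import scipy.sparse

# Made-up data: columns 0 and 1 stay below the threshold, column 2 does not.
dense = np.array([[1, 0, 5],
                  [0, 2, 0],
                  [3, 0, 6]])
m = scipy.sparse.csc_matrix(dense)
threshold = 4  # hypothetical stand-in for v["threshold"]

# Flag every stored value that reaches the threshold, in one vectorised pass.
flags = m.data >= threshold

# Same sparsity structure as m, with the flags in place of the values.
f = scipy.sparse.csc_matrix((flags, m.indices, m.indptr), shape=m.shape)

# A column whose maximum flag is 0 has no stored value at or above the threshold.
cols = f.max(axis=0).toarray().ravel() == 0

# Keep only those columns.
kept = m[:, np.flatnonzero(cols)]
```

Implicit zeros never appear in `m.data`, so they are never flagged. That matches the intent only if the threshold is positive.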

answered Aug 16, 2016 at 15:09

Gareth Rees
