Assuming that the threshold is positive, you can use the >= operator to construct a sparse Boolean array indicating which entries are greater than or equal to the threshold:
# m is your dataset in sparse matrix representation
above_threshold = m >= v["threshold"]
and then you can use the max method to get the maximum entry in each column:
cols = above_threshold.max(axis=0)
This will be 1 for columns that have any value greater than or equal to the threshold, and 0 for columns where all values are below the threshold. So cols is a mask for the columns you want to keep. (If you need a Boolean array, then use cols == 1.)
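Here is a minimal end-to-end sketch of the comparison-then-max approach, in case it helps to see it run. The small matrix m and the dict v are made-up stand-ins for your data, and the final steps (converting the mask to a dense NumPy array and using it to select columns) are my assumption about what you want to do with the result:

```python
import numpy as np
import scipy.sparse

# Hypothetical stand-ins for the question's data: a small CSR matrix `m`
# and a dict `v` holding a positive threshold.
m = scipy.sparse.csr_matrix(np.array([
    [0.0, 0.2, 0.0, 0.9],
    [0.6, 0.0, 0.0, 0.1],
    [0.0, 0.3, 0.0, 0.0],
]))
v = {"threshold": 0.5}

# Sparse Boolean matrix: True where a stored entry is >= the threshold.
above_threshold = m >= v["threshold"]

# Per-column maximum, returned as a 1 x n sparse result.
cols = above_threshold.max(axis=0)

# Assumed follow-up: flatten to a dense Boolean mask and keep those columns.
keep = cols.toarray().ravel().astype(bool)
filtered = m[:, np.flatnonzero(keep)]
```

With this toy data only the first and last columns contain an entry at or above 0.5, so filtered has two columns. With a threshold of zero or below, the comparison no longer produces a sparse result, which is why the positive-threshold assumption matters.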
(Updated after discussion in comments. I had some more complicated suggestions, but simpler is better.)