deleted 350 characters in body
Source Link
Gareth Rees


(Updated after discussion in comments.) To take advantage of the sparse matrix representation, you need to rewrite your column-filtering operation. The idea is to operate on the sparse representation directly (and not just on its abstract representation as an array), which for CSR is documented as follows:

the column indices for row i are stored in indices[indptr[i]:indptr[i+1]] and their corresponding values are stored in data[indptr[i]:indptr[i+1]].
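As a concrete illustration of that layout (a toy matrix invented here, not data from the post), here is how `data`, `indices` and `indptr` fit together for a small CSR matrix:

```python
import numpy as np
import scipy.sparse

# A 2x3 matrix with three nonzero entries.
m = scipy.sparse.csr_matrix(np.array([[0, 5, 0],
                                      [7, 0, 9]]))

# Row i's column indices are indices[indptr[i]:indptr[i+1]],
# and its values are data[indptr[i]:indptr[i+1]].
print(m.indptr)   # [0 1 3]
print(m.indices)  # [1 0 2]
print(m.data)     # [5 7 9]
```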

So instead of filtering each column separately (as you do in the code in the post), you can filter the whole data array in one go, like this:

# m is your dataset in sparse array format
filtered_data = m.data < v["threshold"]

and then construct a new sparse array which is the same shape as the original but which contains the filtered data:

f = scipy.sparse.csr_matrix((filtered_data, m.indices, m.indptr), shape=m.shape)

and now you can apply the max method to get a mask indicating which columns have no stored value below the threshold:

cols = f.max(axis=0) == 0

Digging into the representation like this is often how you have to work with compressed sparse matrices.

(I originally suggested transforming the matrix into CSC format, based on a guess that the f.max(axis=0) step would be more efficient in CSC format. But looking at the implementation, I don't think that's likely to be the case, so I've dropped that part of the answer. It's an easy thing to try to see if you get a small speedup, though.)
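Putting those steps together on a toy matrix (the values and the threshold are invented for illustration; `v["threshold"]` is from the original post):

```python
import numpy as np
import scipy.sparse

m = scipy.sparse.csr_matrix(np.array([[0.0, 0.2, 0.9],
                                      [0.0, 0.1, 0.0],
                                      [0.3, 0.0, 0.5]]))
threshold = 0.25  # stands in for v["threshold"]

# Boolean array over the stored entries only.
filtered_data = m.data < threshold

# Same sparsity structure as m, but with the Boolean data.
f = scipy.sparse.csr_matrix((filtered_data, m.indices, m.indptr), shape=m.shape)

# True for columns with no stored value below the threshold.
cols = f.max(axis=0).toarray().ravel() == 0
print(cols)  # [ True False  True]
```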

Assuming that the threshold is positive, you can use the >= operator to construct a sparse Boolean array indicating which points are at or above the threshold:

# m is your dataset in sparse matrix representation
above_threshold = m >= v["threshold"]

and then you can use the max method to get the maximum entry in each column:

cols = above_threshold.max(axis=0)

This will be 1 for columns that have any value greater than or equal to the threshold, and 0 for columns where all values are below the threshold. So cols is a mask for the columns you want to keep. (If you need a Boolean array, then use cols == 1.)

(Updated after discussion in comments. I had some more complicated suggestions, but simpler is better.)
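On a toy matrix (values and threshold invented for illustration), the whole computation is just:

```python
import numpy as np
import scipy.sparse

m = scipy.sparse.csr_matrix(np.array([[0.0, 0.2, 0.9],
                                      [0.0, 0.1, 0.0],
                                      [0.3, 0.0, 0.5]]))
threshold = 0.25  # positive, so the comparison result stays sparse

above_threshold = m >= threshold    # sparse Boolean matrix
cols = above_threshold.max(axis=0)  # 1 where some entry is >= threshold

keep = cols.toarray().ravel() == 1  # Boolean mask of columns to keep
print(keep)  # [ True False  True]
```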



The documentation for the CSR format says:

Disadvantages of the CSR format: slow column slicing operations (consider CSC)

And the documentation for the sparse array module says:

All conversions among the CSR, CSC, and COO formats are efficient, linear-time operations.

So why not convert CSR to CSC and then carry out your column filtering operation?

Update: to take advantage of the CSC format, you'd have to rewrite your column-filtering operation. The idea is to operate on the CSC representation of the sparse matrix (and not just on its abstract representation as an array), which is documented as follows:

the row indices for column i are stored in indices[indptr[i]:indptr[i+1]] and their corresponding values are stored in data[indptr[i]:indptr[i+1]].

So instead of filtering each column separately (as you do in the code in the post), you should filter the whole data array in one go, like this:

# m is your dataset in CSC format -- filter the data values
filtered_data = m.data < v["threshold"]
 
# construct a new sparse array like m but using the filtered data
f = scipy.sparse.csc_matrix((filtered_data, m.indices, m.indptr), shape=m.shape)

# mask indicating which columns have no stored value below the threshold
cols = f.max(axis=0) == 0

(Maybe this wasn't obvious, but digging into the representation is often how you have to work with compressed sparse matrices.)
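A sketch of the conversion step on toy data (the conversion itself is a linear-time operation, per the documentation quoted above):

```python
import numpy as np
import scipy.sparse

m_csr = scipy.sparse.csr_matrix(np.array([[0.0, 0.2, 0.9],
                                          [0.3, 0.0, 0.5]]))

m_csc = m_csr.tocsc()  # linear-time format conversion

# In CSC, indptr/indices are per-column, so column slicing is cheap.
col1 = m_csc[:, 1]
print(col1.toarray().ravel())  # [0.2 0. ]
```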

