What's a good Python library for manipulating very large matrices (e.g. millions of rows/columns), including the ability to add rows or columns at any stage of the matrix's life?

I have looked at pytables and h5py, but neither supports adding or removing rows or columns once the matrix is created.

The only other thing I could find was the sparse matrix functionality in numpy/scipy noted in these questions. However, adding or removing rows and columns there seems possible but officially unsupported and a bit hacky, so I fear the performance would be horrible with a real dataset. It also includes several different sparse matrix implementations, so I'm unsure which one would be best (e.g. lil_matrix vs csc_matrix vs csr_matrix).

1 Answer

If your matrix is sparse, you can add or remove rows or columns with scipy.sparse without resorting to hacks. If you want to remove columns (column slicing), use csc_matrix; for efficient row slicing, use csr_matrix. It is usually convenient to create the sparse matrix as a coo_matrix, where you specify the row, col and data for each non-zero entry, and then convert it:

from scipy.sparse import coo_matrix

m = coo_matrix((data, (row, col)), shape=(nrow, ncol))
m = m.tocsr()[rows_to_keep, :]  # CSR: efficient row slicing
m = m.tocsc()[:, cols_to_keep]  # CSC: efficient column slicing

where rows_to_keep and cols_to_keep can each be a list or a 1-D array with the indices to keep.
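As a minimal sketch of the approach above, the following builds a small sparse matrix, drops a row and a column by slicing, and then appends a new row and column. The toy shape and values are hypothetical, and the use of scipy.sparse.vstack and scipy.sparse.hstack for appending is an addition not mentioned in the answer itself:

```python
import numpy as np
from scipy.sparse import coo_matrix, vstack, hstack

# Hypothetical 4x5 matrix with five non-zero entries.
row = np.array([0, 1, 2, 3, 3])
col = np.array([0, 2, 4, 1, 3])
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
m = coo_matrix((data, (row, col)), shape=(4, 5))

# Remove row 1 by keeping the others (CSR suits row slicing).
rows_to_keep = [0, 2, 3]
m = m.tocsr()[rows_to_keep, :]

# Remove column 4 by keeping the others (CSC suits column slicing).
cols_to_keep = [0, 1, 2, 3]
m = m.tocsc()[:, cols_to_keep]

# Append one new row, then one new column, by stacking sparse blocks.
new_row = coo_matrix(([9.0], ([0], [2])), shape=(1, 4))
m = vstack([m, new_row], format="csr")
new_col = coo_matrix(([7.0], ([3], [0])), shape=(4, 1))
m = hstack([m, new_col], format="csc")

print(m.shape)
```

Each slice or stack builds a new matrix rather than modifying the old one in place, so for frequent additions it is probably cheaper to batch several rows or columns into one stacking call.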

If you need a dense matrix, you could use a numpy.memmap array, which keeps the data in a file on disk. To create one you can do:

import numpy as np

a = np.memmap('a.memmap', dtype='float64', mode='w+', shape=(1000, 1000))
a.fill(100.)

To read one you can do:

a = np.memmap('a.memmap', dtype='float64', mode='r+', shape=(1000, 1000))

If you want to remove or add rows or columns, you have to create a second memmap array with the new shape and then copy into it the rows and columns that you want from the original one:

b = np.memmap('b.memmap', dtype='float64', mode='w+', shape=(3, 1000))
b[:] = a[[0, 99, 199], :]  # write into the memmap; plain "b = ..." would only rebind the name

This stores in b the first, 100th and 200th rows of a, with all the columns.
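A self-contained sketch of this copy-to-a-new-memmap pattern follows. The temporary directory, the fill pattern (row i holds the value i) and the final read-only reopen are assumptions added purely so the result can be checked:

```python
import os
import tempfile
import numpy as np

# Hypothetical scratch location for the two backing files.
tmpdir = tempfile.mkdtemp()
a_path = os.path.join(tmpdir, "a.memmap")
b_path = os.path.join(tmpdir, "b.memmap")

# Source array: row i is filled with the value i so rows are easy to identify.
a = np.memmap(a_path, dtype="float64", mode="w+", shape=(1000, 1000))
a[:] = np.arange(1000, dtype="float64")[:, None]
a.flush()

# Destination array sized for the rows being kept.
rows_to_keep = [0, 99, 199]
b = np.memmap(b_path, dtype="float64", mode="w+",
              shape=(len(rows_to_keep), 1000))
b[:] = a[rows_to_keep, :]  # slice assignment writes into the file behind b
b.flush()

# Reopen read-only to confirm the data reached disk.
b_check = np.memmap(b_path, dtype="float64", mode="r", shape=(3, 1000))
print(b_check[:, 0])
```

Note that indexing with a list, as in a[rows_to_keep, :], first materialises the selected rows in memory, so for a very large selection it may be necessary to copy in chunks instead.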

3 Comments

Thanks, but I'm getting TypeError: 'coo_matrix' object does not support indexing. It seems strange to me that any matrix type couldn't be indexed, since that's the whole purpose of a matrix... I'll assume that is explained in the scipy docs, but docs.scipy.org has been offline the last couple days.
@Cerin yes, you have to convert it first using tocsr() or tocsc(), then the indexing should work...
@Cerin I believe the purpose of the coo_matrix is to provide one type of sparse matrix which is easy to populate and fast to convert to the other types (csr_matrix or csc_matrix for example)
