4

I am developing a simple recommendation system and trying to run computations such as SVD, RBM, etc.

To make the evaluation more convincing, I am going to use the MovieLens or Netflix dataset to evaluate the system's performance. However, both datasets have more than 1 million users and more than 10 thousand items, so it's impossible to put all the data into memory. I have to use some specific module that can handle such a large matrix.

I know there are some tools in SciPy that can handle this, and divisi2, used by python-recsys, also seems like a good choice. Or maybe there are better tools I don't know about?

Which module should I use? Any suggestion?
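
For concreteness, this is roughly the kind of loading step I have in mind. It is only a sketch, and the `ratings.dat` file name and `::` separator are assumptions based on the MovieLens format:

```python
from scipy.sparse import coo_matrix

# Sketch: read MovieLens-style lines "user::item::rating::timestamp"
# and store only the observed entries instead of a full user x item grid.
users, items, vals = [], [], []
with open("ratings.dat") as f:          # file name is an assumption
    for line in f:
        u, i, r, _ = line.strip().split("::")
        users.append(int(u))
        items.append(int(i))
        vals.append(float(r))

R = coo_matrix((vals, (users, items)))  # shape inferred from the indices
print(R.shape, R.nnz)
```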

2
  • scipy.sparse is the standard implementation, used by many third-party libraries. I don't know divisi2 well enough to compare features, though. Commented Aug 29, 2012 at 3:57
  • It seems the choice would be scipy.sparse... Commented Aug 31, 2012 at 4:32

3 Answers

6

I would suggest SciPy, specifically scipy.sparse. As Dougal pointed out, a dense NumPy array is not suited to this situation.
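
A minimal sketch of the kind of thing scipy.sparse makes possible; the shape, density, and k=20 below are arbitrary stand-ins, not values taken from the question:

```python
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Random sparse stand-in for the user x item rating matrix
# (smaller than the real data so the example runs quickly).
R = sparse_random(100000, 10000, density=0.001, format="csr", random_state=0)

# Truncated SVD: only k singular triplets are computed, so the dense
# factors stay small even though the full matrix would not fit in RAM.
k = 20
U, s, Vt = svds(R, k=k)

# Predicted "rating" for user u and item i from the low-rank model.
u, i = 42, 7
print((U[u] * s) @ Vt[:, i])
```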

2 Comments

These matrices are extremely sparse and quite large; storing them in a dense format with even a 1-byte dtype would be about 9GB in RAM. OP definitely wants a sparse matrix.
Dougal pointed out something in the question that I had overlooked, so I edited my answer. He was right to downvote the original answer.
2

I found another solution named crab; I am trying out and comparing several of these options.

Comments

-1

If your concern is just fitting the data in memory, use 64-bit Python with 64-bit NumPy. If you don't have enough physical memory, you can simply increase virtual memory at the OS level; the size of virtual memory is limited only by your HDD size. Speed of computation, however, is a different beast!
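
A rough sketch of that idea using a disk-backed dense array (np.memmap); the shape, float32 dtype, and file name are assumptions, and note that the backing file is on the order of 40 GB:

```python
import numpy as np

n_users, n_items = 1000000, 10000

# A dense float32 matrix of this shape needs about 40 GB of address
# space, which a 64-bit process can map even with less physical RAM.
print(n_users * n_items * 4 / 1e9, "GB")

# Disk-backed dense array; the OS pages chunks in and out on demand.
# Warning: this creates a ~40 GB file on disk.
R = np.memmap("dense_ratings.bin", dtype=np.float32, mode="w+",
              shape=(n_users, n_items))
R[12345, 678] = 4.0   # writes go through to the backing file
R.flush()
```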

3 Comments

It's not an addressing problem in this situation, so the x64 suggestion is irrelevant.
"However, the two datasets both have more than 1 million of users and more than 10 thousand of items, it's impossible to put all the data into memory" What am I to read from that. So what exactly is the problem about.
I'd like to know which module is best for computation on very large matrices. In fact, a 1M x 10,000 matrix is not "large" at all compared to the real-world recommendation systems running at Netflix or YouTube. So it's not about how much memory there is; it's about sparse matrix computation.
