4

I am developing a simple recommendation system and trying to run computations such as SVD, RBM, etc.

To make the evaluation more convincing, I am going to use the MovieLens or Netflix dataset to evaluate the system's performance. However, both datasets have more than 1 million users and more than 10 thousand items, so it's impossible to put all the data into memory. I have to use some specific module that can handle such a large matrix.

I know there are some tools in SciPy that can handle this, and divisi2, used by python-recsys, also seems like a good choice. Or maybe there are better tools I don't know about?

Which module should I use? Any suggestion?
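
For concreteness, this is roughly the kind of loading step I have in mind. It is only a sketch, and the `ratings.dat` file name and `::` separator are assumptions based on the MovieLens format:

```python
from scipy.sparse import coo_matrix

# Sketch: read MovieLens-style lines "user::item::rating::timestamp"
# and store only the observed entries instead of a full user x item grid.
users, items, vals = [], [], []
with open("ratings.dat") as f:          # file name is an assumption
    for line in f:
        u, i, r, _ = line.strip().split("::")
        users.append(int(u))
        items.append(int(i))
        vals.append(float(r))

R = coo_matrix((vals, (users, items)))  # shape inferred from the indices
print(R.shape, R.nnz)
```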

2
  • scipy.sparse is the standard implementation, used by many third-party libraries. I don't know divisi2 well enough to compare features, though. Commented Aug 29, 2012 at 3:57
  • It seems the choice would be scipy.sparse... Commented Aug 31, 2012 at 4:32

3 Answers

6

I would suggest SciPy, specifically scipy.sparse. As Dougal pointed out, a dense NumPy array is not suited to this situation.
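
A minimal sketch of the kind of thing scipy.sparse makes possible; the shape, density, and k=20 below are arbitrary stand-ins, not values taken from the question:

```python
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Random sparse stand-in for the user x item rating matrix
# (smaller than the real data so the example runs quickly).
R = sparse_random(100000, 10000, density=0.001, format="csr", random_state=0)

# Truncated SVD: only k singular triplets are computed, so the dense
# factors stay small even though the full matrix would not fit in RAM.
k = 20
U, s, Vt = svds(R, k=k)

# Predicted "rating" for user u and item i from the low-rank model.
u, i = 42, 7
print((U[u] * s) @ Vt[:, i])
```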

2 Comments

These matrices are extremely sparse and quite large; storing them in a dense format with even a 1-byte dtype would be about 9GB in RAM. OP definitely wants a sparse matrix.
Dougal pointed out something in the question that I had overlooked, so I edited my answer. He was right to downvote the original answer.
2

I found another solution named crab; I am trying out and comparing several of these options.

Comments

-1

If your concern is just fitting the data in memory, use 64-bit Python with 64-bit NumPy. If you don't have enough physical memory, you can simply increase virtual memory at the OS level; the size of virtual memory is limited only by your HDD size. Speed of computation, however, is a different beast!
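
A rough sketch of that idea using a disk-backed dense array (np.memmap); the shape, float32 dtype, and file name are assumptions, and note that the backing file is on the order of 40 GB:

```python
import numpy as np

n_users, n_items = 1000000, 10000

# A dense float32 matrix of this shape needs about 40 GB of address
# space, which a 64-bit process can map even with less physical RAM.
print(n_users * n_items * 4 / 1e9, "GB")

# Disk-backed dense array; the OS pages chunks in and out on demand.
# Warning: this creates a ~40 GB file on disk.
R = np.memmap("dense_ratings.bin", dtype=np.float32, mode="w+",
              shape=(n_users, n_items))
R[12345, 678] = 4.0   # writes go through to the backing file
R.flush()
```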

3 Comments

It's not an addressing problem in this situation, so the x64 suggestion is irrelevant.
"However, the two datasets both have more than 1 million of users and more than 10 thousand of items, it's impossible to put all the data into memory" What am I to read from that. So what exactly is the problem about.
I'd like to know which module is best for computation on very large matrices. In fact, a 1M x 10,000 matrix is not "large" at all compared to the real-world recommendation systems running at Netflix or YouTube. So it's not about how much memory there is; it's about sparse matrix computation.
