
Optimize Scipy Sparse Matrix Factorization code for SGD

I'm working on implementing the stochastic gradient descent algorithm for recommender systems (see Funk) using sparse matrices with Scipy.

This is what a first basic implementation looks like:

N = self.model.shape[0]  # number of users
M = self.model.shape[1]  # number of items
self.p = np.random.rand(N, K)  # user factor matrix (N x K)
self.q = np.random.rand(M, K)  # item factor matrix (M x K)
rows, cols = self.model.nonzero()  # indices of the observed ratings
for step in xrange(steps):
    for u, i in zip(rows, cols):
        # calculate error for gradient (full error matrix, recomputed for every single rating)
        e = self.model - np.dot(self.p, self.q.T)
        p_temp = learning_rate * (e[u, i] * self.q[i, :] - regularization * self.p[u, :])
        self.q[i, :] += learning_rate * (e[u, i] * self.p[u, :] - regularization * self.q[i, :])
        self.p[u, :] += p_temp

I have two questions, and I'm hoping some seasoned Python coders / recommender-systems experts here can give me some insight:

  1. How often should the error e be recomputed? It isn't clear from Funk's and Koren's papers whether it should happen once per rating or once per factor. In other words, should I move it out of the inner loop, leave it where it is, or push it even deeper (inside an iteration over the K factors)? The first sketch after this list shows the three placements I mean.

  2. Unfortunately, my code is pretty slow, taking about 20 minutes for ONE iteration over a 100k-ratings dataset. This means that computing the factorization over the whole dataset would take about 10 hours. I suspect this is due to the explicit loop over the sparse matrix. I've tried expressing the q and p updates using fancy indexing, but since I'm still pretty new to scipy and numpy, I couldn't figure out a better way to do it. Do you have any pointers on how I could avoid iterating over the rows and columns of the sparse matrix explicitly? (The second sketch below shows the direction I was trying.)
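
To make question 1 concrete, here is a rough sketch of the three placements I mean (same variables as in the snippet above; the per-factor variant assumes self.model supports element access, e.g. a lil_matrix or csr_matrix):

# Placement A: compute e once per epoch and reuse it for the whole pass
for step in xrange(steps):
    e = self.model - np.dot(self.p, self.q.T)
    for u, i in zip(rows, cols):
        p_temp = learning_rate * (e[u, i] * self.q[i, :] - regularization * self.p[u, :])
        self.q[i, :] += learning_rate * (e[u, i] * self.p[u, :] - regularization * self.q[i, :])
        self.p[u, :] += p_temp

# Placement B: recompute e once per rating -- this is what my snippet above does

# Placement C: recompute the error once per factor, inside a loop over the K dimensions
for step in xrange(steps):
    for u, i in zip(rows, cols):
        for k in xrange(K):
            # scalar error for this rating, refreshed after every factor update
            e_ui = self.model[u, i] - np.dot(self.p[u, :], self.q[i, :])
            grad_p = e_ui * self.q[i, k] - regularization * self.p[u, k]
            grad_q = e_ui * self.p[u, k] - regularization * self.q[i, k]
            self.p[u, k] += learning_rate * grad_p
            self.q[i, k] += learning_rate * grad_q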
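And for question 2, this is the direction I was trying to go: computing only the scalar error for the current (u, i) pair instead of the full N x M error matrix (again just a sketch, assuming element access on self.model is reasonably fast). It still loops over every rating in Python, though, which is why I'm wondering whether the loop itself can be avoided:

for step in xrange(steps):
    for u, i in zip(rows, cols):
        # scalar error for this rating only -- avoids forming the dense N x M error matrix
        e_ui = self.model[u, i] - np.dot(self.p[u, :], self.q[i, :])
        p_temp = learning_rate * (e_ui * self.q[i, :] - regularization * self.p[u, :])
        self.q[i, :] += learning_rate * (e_ui * self.p[u, :] - regularization * self.q[i, :])
        self.p[u, :] += p_temp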