
I have a text file containing data representing a sparse matrix with the following format:

0 234 345
0 236 
0 345 365 465
0 12 35 379

The data is used for a classification task and each row can be considered a feature vector. The first value in each row represents a label, the values following it represent the presence of individual features.

I'm trying to create a sparse matrix with these values (to use in a machine learning task with scikit-learn). I've found and read the scipy.sparse documentation, but I'm failing to understand how to incrementally build up a sparse matrix from source data like this.

The examples I've found so far show how to take a dense matrix and convert it, or how to create a native sparse matrix with contrived data, but no examples that have helped me here. I did find this related SO question (Building and updating a sparse matrix in python using scipy), but the example assumes you know the max COL, ROW sizes, which I don't, so that data type doesn't seem appropriate.

So far I have the following code to read the document and parse the values into something that seems reasonable:

def get_sparse_matrix():
    matrix = []
    with open("data.dat", 'r') as f:
        for i, line in enumerate(f):
            row = line.strip().split()
            label = row[0]
            features = [int(x) for x in row[1:]]
            matrix.append([(i, col) for col in features])

    sparse_matrix = #magic happens here

    return sparse_matrix

So my questions are:

  • What is the appropriate sparse matrix type to use here?
  • Am I heading in the right direction with the code I have?

Any help is greatly appreciated.

  • I don't understand the format. For every element in the matrix you need row, col, and value. Where is the value information? To create the sparse matrix incrementally, you can use: docs.scipy.org/doc/scipy-0.14.0/reference/generated/… Commented Nov 15, 2014 at 1:52
  • If it needs to have a value, then it could be 1 or True. Does that clarify it? Commented Nov 15, 2014 at 2:13
  • @HYRY Thanks for the tip on dok_matrix, but don't I still need to know the total number of columns when I initialize the dok_matrix? Part of my problem is that I don't reliably know what the max COL value will be. I could write a script that finds the max value for a given data file, but I thought there might be an existing scipy sparse matrix datatype that doesn't require me to specify that. Commented Nov 15, 2014 at 2:55
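(Regarding the shape question in this thread: when no `shape` argument is given, `coo_matrix` infers the shape from the largest row and column indices it sees, so no pre-scan of the file is needed. A minimal sketch with made-up indices:)

```python
import numpy as np
from scipy import sparse

# hypothetical (row, col) coordinates; no shape is passed below
rows = np.array([0, 1, 2])
cols = np.array([5, 2, 9])
data = np.ones_like(rows)

# shape is inferred as (max row + 1, max col + 1)
m = sparse.coo_matrix((data, (rows, cols)))
print(m.shape)  # (3, 10)
```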

1 Answer


You can use coo_matrix():

import numpy as np
from scipy import sparse
data = """0 234 345
0 236 
0 345 365 465
0 12 35 379"""

column_list = []
for line in data.split("\n"):
    # skip the label (first value) and keep the feature indices
    values = [int(x) for x in line.strip().split()[1:]]
    column_list.append(values)

# build the row/column coordinate arrays expected by coo_matrix:
# each row index is repeated once per feature in that row
lengths = [len(row) for row in column_list]
cols = np.concatenate(column_list)
rows = np.repeat(np.arange(len(column_list)), lengths)
m = sparse.coo_matrix((np.ones_like(rows), (rows, cols)))

Here is the code to check the result:

np.where(m.toarray())

The output:

(array([0, 0, 1, 2, 2, 2, 3, 3, 3]),
 array([234, 345, 236, 345, 365, 465,  12,  35, 379]))
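Since the goal is a scikit-learn task: most estimators handle CSR input efficiently (row-wise slicing), so a common final step is converting the COO result with `tocsr()`. A minimal sketch using a small subset of the data above:

```python
import numpy as np
from scipy import sparse

# two rows of the example data: (row, col) pairs for the set features
rows = np.array([0, 0, 1])
cols = np.array([234, 345, 236])
m = sparse.coo_matrix((np.ones_like(rows), (rows, cols)))

# CSR is the format scikit-learn estimators generally work best with
X = m.tocsr()
print(X.shape, X.nnz)  # (2, 346) 3
```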