
I have a text file containing data representing a sparse matrix with the following format:

0 234 345
0 236 
0 345 365 465
0 12 35 379

The data is used for a classification task and each row can be considered a feature vector. The first value in each row represents a label, the values following it represent the presence of individual features.

I'm trying to create a sparse matrix with these values (to use in a machine learning task with scikit-learn). I've found and read the scipy.sparse documentation, but I'm failing to understand how to incrementally build up a sparse matrix from source data like this.

The examples I've found so far show how to take a dense matrix and convert it, or how to create a native sparse matrix with contrived data, but no examples that have helped me here. I did find this related SO question (Building and updating a sparse matrix in python using scipy), but the example assumes you know the max COL, ROW sizes, which I don't, so that data type doesn't seem appropriate.

So far I have the following code to read the document and parse the values into something that seems reasonable:

def get_sparse_matrix():
    matrix = []
    with open("data.dat", 'r') as f:
        for i, line in enumerate(f):
            row = line.strip().split()
            label = row[0]
            features = [int(x) for x in row[1:]]
            matrix.append([(i, col) for col in features])

    sparse_matrix = #magic happens here

    return sparse_matrix

So my questions are:

  • What is the appropriate sparse matrix type to use here?
  • Am I heading in the right direction with the code I have?

Any help is greatly appreciated.

  • I don't understand the format. For every element in the matrix you need row, col, and value. Where is the value information? To create the sparse matrix incrementally, you can use: docs.scipy.org/doc/scipy-0.14.0/reference/generated/… Commented Nov 15, 2014 at 1:52
  • If it needs to have a value, then it could be 1 or True. Does that clarify it? Commented Nov 15, 2014 at 2:13
  • @HYRY Thanks for the tip on dok_matrix, but don't I still need to know the total number of columns when I initialize the dok_matrix? Part of my problem is that I don't reliably know what the max COL value will be. I could write a script that finds the max value for a given data file, but I thought there might be an existing scipy sparse matrix datatype that doesn't require me to specify that. Commented Nov 15, 2014 at 2:55
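(Regarding the shape question in this thread: when no `shape` argument is given, `coo_matrix` infers the shape from the largest row and column indices it sees, so no pre-scan of the file is needed. A minimal sketch with made-up indices:)

```python
import numpy as np
from scipy import sparse

# hypothetical (row, col) coordinates; no shape is passed below
rows = np.array([0, 1, 2])
cols = np.array([5, 2, 9])
data = np.ones_like(rows)

# shape is inferred as (max row + 1, max col + 1)
m = sparse.coo_matrix((data, (rows, cols)))
print(m.shape)  # (3, 10)
```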

1 Answer


You can use coo_matrix():

import numpy as np
from scipy import sparse
data = """0 234 345
0 236 
0 345 365 465
0 12 35 379"""

column_list = []
for line in data.split("\n"):
    # skip the label (first value) and keep the feature indices
    values = [int(x) for x in line.strip().split()[1:]]
    column_list.append(values)

# build the row/column coordinate arrays expected by coo_matrix:
# each row index is repeated once per feature in that row
lengths = [len(row) for row in column_list]
cols = np.concatenate(column_list)
rows = np.repeat(np.arange(len(column_list)), lengths)
m = sparse.coo_matrix((np.ones_like(rows), (rows, cols)))

Here is the code to check the result:

np.where(m.toarray())

The output:

(array([0, 0, 1, 2, 2, 2, 3, 3, 3]),
 array([234, 345, 236, 345, 365, 465,  12,  35, 379]))
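Since the goal is a scikit-learn task: most estimators handle CSR input efficiently (row-wise slicing), so a common final step is converting the COO result with `tocsr()`. A minimal sketch using a small subset of the data above:

```python
import numpy as np
from scipy import sparse

# two rows of the example data: (row, col) pairs for the set features
rows = np.array([0, 0, 1])
cols = np.array([234, 345, 236])
m = sparse.coo_matrix((np.ones_like(rows), (rows, cols)))

# CSR is the format scikit-learn estimators generally work best with
X = m.tocsr()
print(X.shape, X.nnz)  # (2, 346) 3
```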