Creating a matrix from CSV file

Question

I've been working with Python for around 2 months now, so I have an OK understanding of it.

My goal is to create a matrix from CSV data, then populate that matrix with the data in the 3rd column of that CSV file.

I came up with this code thus far:

import csv

def readcsv(csvfile_name):
    with open(csvfile_name) as csvfile:
        file = csv.reader(csvfile, delimiter=",")

        # remove rubbish data in first few rows
        skiprows = int(input('Number of rows to skip? '))
        for i in range(skiprows):
            _ = next(file)

        # change strings into integers/floats
        for z in file:
            z[:2] = map(int, z[:2])
            z[2:] = map(float, z[2:])
            print(z[:2])
    return

After removing the rubbish data with the above code, the data in the CSV file looks like this:

   Input:
   1  1  51 9 3 
   1  2  39 4 4
   1  3  40 3 9
   1  4  60 2 . 
   1  5  80 2 .
   2  1  40 6 .
   2  2  28 4 .
   2  3  40 2 .
   2  4  39 3 . 
   3  1  10 . .
   3  2  20 . .
   3  3  30 . .
   3  4  40 . .
   .  .   . . .

The output should look like this:

      1   2   3   4  .  .
   1  51  39  40  60
   2  40  28  40  39
   3  10  20  30  40
   .
   .

The CSV file has a few thousand rows and columns, but I'm only interested in the first 3 columns. The first and second columns are essentially the coordinates for the matrix, and the matrix is populated with the data in the 3rd column.

After lots of trial and error, I realised that numpy was the way to go with matrices. This is what I tried thus far with example data:

  left_column =   [1, 2, 1, 2, 1, 2, 1, 2]
  middle_column = [1, 1, 3, 3, 2, 2, 4, 4]
  right_column =  [1., 5., 3., 7., 2., 6., 4., 8.]

  import numpy as np
  m = np.zeros((max(left_column), max(middle_column)), dtype=float)  # np.float was removed in NumPy 1.24
  for x, y, z in zip(left_column, middle_column, right_column):
      x -= 1 # Because the indices are 1-based
      y -= 1 # Need to be 0-based
      m[x, y] = z
  print(m)

  #: array([[ 1., 2., 3., 4.],
  #:        [ 5., 6., 7., 8.]])

However, it is unrealistic for me to specify all of my data in my script to generate the matrix. I tried using generators to pull the data out of my CSV file, but it didn't work well for me.

I learnt as much numpy as I could, but it appears to require my data to already be in matrix form, which it isn't.

I don't understand the meaning of the last two columns. The first three are clear: (row, column, value).
– Nikaido, Commented Nov 7, 2016 at 12:05

Scott Griffiths · Accepted Answer · 2016-11-07 13:22:52Z

4

You should seriously consider using pandas. It is really ideal for this sort of work. I can't give you an actual solution because I don't have your data, but I would try something like the following:

import pandas as pd
df = pd.read_csv('test.csv', usecols=[0,1,2], names=['A', 'B', 'C'])
pd.pivot_table(df, index='A', columns='B', values='C')

The second line imports the data into a pandas DataFrame object (change the names into something more useful for your application). The pivot table creates the matrix you are looking for, and gracefully handles any missing data.
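As a minimal sketch of the approach above, the following runs the same two calls on an in-memory copy of the sample rows from the question. The comma delimiter, the zero placeholders in the unused 4th and 5th columns, and the column names `row`, `col`, `value` are assumptions made for illustration, not details of the real file.

```python
import io

import pandas as pd

# Assumed stand-in for the real file: comma-separated, no header row.
sample_csv = io.StringIO(
    "1,1,51,9,3\n"
    "1,2,39,4,4\n"
    "1,3,40,3,9\n"
    "1,4,60,2,0\n"
    "1,5,80,2,0\n"
    "2,1,40,6,0\n"
    "2,2,28,4,0\n"
    "2,3,40,2,0\n"
    "2,4,39,3,0\n"
    "3,1,10,0,0\n"
    "3,2,20,0,0\n"
    "3,3,30,0,0\n"
    "3,4,40,0,0\n"
)

# Keep only the first three columns and give them illustrative names.
df = pd.read_csv(sample_csv, header=None, usecols=[0, 1, 2],
                 names=["row", "col", "value"])

# Rows of the matrix come from "row", columns from "col", cells from "value".
matrix = pd.pivot_table(df, index="row", columns="col", values="value")

# Missing (row, col) pairs show up as NaN; fill them if a plain array is needed.
dense = matrix.fillna(0).to_numpy()
print(matrix)
```

Note that `pivot_table` averages the values if the same (row, col) pair occurs more than once, which may or may not be what you want for your data.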


6 Comments

dizzyLife Over a year ago

Thanks for the comment. Would it be helpful if I sent the CSV file? I couldn't get it to work after numerous tries. The error is: AttributeError: 'module' object has no attribute 'read_csv'

Scott Griffiths Over a year ago

@dizzyLife: Which version of Pandas are you using? If you have imported pandas as pd, type "pd.__version__" into python. I am using pandas 0.18.0, so maybe you need a newer pandas version?

Scott Griffiths Over a year ago

@dizzyLife: Ignore earlier comment, read_csv has been in pandas since the beginning. Your error probably means pandas is not installed correctly. Did running "import pandas as pd" generate any exceptions?

dizzyLife Over a year ago

Is there a way where I'd be able to use that code, and then skip a few rows? A lot of rubbish data I don't want.

Scott Griffiths Over a year ago

If you mean at the beginning of the file, you can use the skiprows option in read_csv. Otherwise, pivot_table returns a Pandas DataFrame, which has many options for slicing and selecting data. See here: pandas.pydata.org/pandas-docs/stable/indexing.html
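A small sketch of the `skiprows` option mentioned here, assuming two junk lines at the top of a comma-separated file. The sample text and the column names are made up for illustration.

```python
import io

import pandas as pd

# Assumed layout: two lines of rubbish, then (row, column, value, ...) records.
raw = io.StringIO(
    "exported by some tool\n"
    "rubbish,rubbish\n"
    "1,1,51,9,3\n"
    "1,2,39,4,4\n"
    "2,1,40,6,1\n"
    "2,2,28,4,2\n"
)

# skiprows=2 drops the first two lines before parsing starts.
df = pd.read_csv(raw, skiprows=2, header=None, usecols=[0, 1, 2],
                 names=["row", "col", "value"])
matrix = pd.pivot_table(df, index="row", columns="col", values="value")

# The result is an ordinary DataFrame, so label-based slicing works on it.
first_row = matrix.loc[1]
print(first_row)
```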


Saullo G. P. Castro · Accepted Answer · 2016-11-07 11:48:15Z

3

You can use scipy.sparse.coo_matrix to load this data very conveniently.

Working with your input:

 Input:
   1  1  51 9 3 
   1  2  39 4 4
   1  3  40 3 9
   1  4  60 2 . 
   1  5  80 2 .
   2  1  40 6 .
   2  2  28 4 .
   2  3  40 2 .
   2  4  39 3 . 
   3  1  10 . .
   3  2  20 . .
   3  3  30 . .
   3  4  40 . .
   .  .   . . .

You could do:

import numpy as np
from scipy.sparse import coo_matrix
l, c, v = np.loadtxt('test.txt', skiprows=1).T
m = coo_matrix((v, (l.astype(int)-1, c.astype(int)-1)), shape=(int(l.max()), int(c.max())))

Then you can convert the coo_matrix to a np.ndarray:

In [9]: m.toarray()
Out[9]:
array([[ 51.,  39.,  40.,  60.,  80.],
       [ 40.,  28.,  40.,  39.,   0.],
       [ 10.,  20.,  30.,  40.,   0.]])
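For a comma-separated file with extra columns, a hedged variant of the same idea is sketched below. It uses the `delimiter` and `usecols` parameters of `np.loadtxt` and runs on an in-memory sample. The sample text, including the zero placeholders in the last two columns, is an assumption for illustration.

```python
import io

import numpy as np
from scipy.sparse import coo_matrix

# Assumed stand-in for the real file: one junk line, then comma-separated records.
sample = io.StringIO(
    "Input:\n"
    "1,1,51,9,3\n"
    "1,2,39,4,4\n"
    "1,3,40,3,9\n"
    "1,4,60,2,0\n"
    "1,5,80,2,0\n"
    "2,1,40,6,0\n"
    "2,2,28,4,0\n"
    "2,3,40,2,0\n"
    "2,4,39,3,0\n"
    "3,1,10,0,0\n"
    "3,2,20,0,0\n"
    "3,3,30,0,0\n"
    "3,4,40,0,0\n"
)

# Read only the first three columns; unpack=True returns one array per column.
rows, cols, vals = np.loadtxt(sample, delimiter=",", skiprows=1,
                              usecols=(0, 1, 2), unpack=True)

# Convert the 1-based coordinates to 0-based integer indices.
rows = rows.astype(int) - 1
cols = cols.astype(int) - 1

m = coo_matrix((vals, (rows, cols)), shape=(rows.max() + 1, cols.max() + 1))
dense = m.toarray()  # cells that never appear in the file stay 0
print(dense)
```

If the same (row, column) pair appears more than once, converting a `coo_matrix` to a dense array sums the duplicate entries.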


7 Comments

dizzyLife Over a year ago

Hey, firstly thanks for the help. I tried to run the code by replacing test.txt with a csv file called file.csv, but I got the error: IDLE's subprocess didn't make connection. Either IDLE can't start a subprocess or personal software is blocking the connection. Does this mean I just have to put all the data in a notepad?

Saullo G. P. Castro Over a year ago

@dizzyLife sure, but make sure you have kept only the valid data. In this case I kept only up to the third column; otherwise you would have to do: l, c, v = np.loadtxt("file.csv", skiprows=1).T[:3, :] to limit reading to the first three columns (which become the first three rows once transposed)

Saullo G. P. Castro Over a year ago

@dizzyLife also, check if your delimiter in the csv file is something different from blank spaces. If yes, you have to pass delimiter="," to the loadtxt function (or another delimiter character that you have there)

dizzyLife Over a year ago

The file is too big for me to copy and paste the data into a separate file, hence the use of a csv. Is there any way I could contact you?

Saullo G. P. Castro Over a year ago

@dizzyLife no need to copy all the data, just load using l, c, v = np.loadtxt("file.csv", skiprows=1).T[:3, :], or pass delimiter if necessary


Nikaido · Accepted Answer · 2016-11-08 13:44:52Z

This is my solution using only the csv library, working with the index/position in the csv (using an offset to keep track of the current row):

import csv

with open('test.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    list_of_list = []
    j=0
    lines = [line for line in spamreader]
    for i in range(len(lines)):
        list_ = []
        if(len(lines)<=i+j):
            break;
        first = lines[i+j][0]
        while(first == lines[i+j][0]):
            list_.append(lines[i+j][2])
            j+=1
            if(len(lines)<=i+j):
                break;
        j-=1
        list_of_list.append(list(map(float,list_)))

maxlen = max(len(row) for row in list_of_list)  # longest row, not the lexicographically largest list
print("\t"+"\t".join([str(el) for el in range(1,maxlen+1)])+"\n")
for i in range(len(list_of_list)):
    print(str(i+1)+"\t"+"\t".join([str(el) for el in list_of_list[i]])+"\n")

Anyway the solution posted by Saullo is more elegant

This is my output:

        1       2       3       4       5

1       51.0    39.0    40.0    60.0    80.0

2       40.0    28.0    40.0    39.0

3       10.0    20.0    30.0    40.0

I wrote a new version of the code with an iterator, because the csv is too big to fit in memory

import csv

with open('test.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    list_of_list = []

    line1 = next(spamreader)
    first = line1[0]
    list_ = [line1[2]]
    for line in spamreader:
        while(line[0] == first):
            list_.append(line[2])
            try:
                line = next(spamreader)
            except StopIteration:
                break
        list_of_list.append(list(map(float,list_)))
        list_ = [line[2]]
        first = line[0]

maxlen = max(len(row) for row in list_of_list)  # longest row, not the lexicographically largest list
print("\t"+"\t".join([str(el) for el in range(1,maxlen+1)])+"\n")
for i in range(len(list_of_list)):
    print(str(i+1)+"\t"+"\t".join([str(el) for el in list_of_list[i]])+"\n")
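The iterator version above can also be expressed with `itertools.groupby`, which groups consecutive records sharing the same first column. The sketch below is only an illustration: the helper name `rows_from_csv` is hypothetical, and it assumes the file is already ordered by its first column, contains no blank lines, and lists each row's values in column order.

```python
import csv
import io
from itertools import groupby


def rows_from_csv(lines, skiprows=0):
    """Yield (row_label, values) one matrix row at a time without loading the file."""
    reader = csv.reader(lines, delimiter=",")
    for _ in range(skiprows):
        next(reader, None)  # discard junk lines at the top
    # Consecutive records with the same first field form one matrix row.
    for label, group in groupby(reader, key=lambda rec: rec[0]):
        yield int(label), [float(rec[2]) for rec in group]


# In-memory stand-in for open('test.csv'); any iterable of lines works.
sample = io.StringIO(
    "1,1,51,9,3\n"
    "1,2,39,4,4\n"
    "2,1,40,6,1\n"
    "3,1,10,0,0\n"
    "3,2,20,0,0\n"
)
result = list(rows_from_csv(sample))
for label, values in result:
    print(str(label) + "\t" + "\t".join(str(v) for v in values))
```

Because the generator yields one row at a time, only the current group is held in memory, and a final group with a single record is still emitted.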

Anyway, you will probably need to work on the matrix in chunks (swapping them in and out), because the data probably won't fit in a 2D array.
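Whether the dense matrix fits depends on its dimensions rather than on the file size: a 5000 x 5000 array of float64 takes about 200 MB, for example. If it does fit, the file itself can still be read in pieces. The sketch below uses the `chunksize` option of `pandas.read_csv`. The helper name `fill_matrix`, the column names, and the requirement that the matrix shape is known in advance are all assumptions for illustration.

```python
import io

import numpy as np
import pandas as pd


def fill_matrix(source, shape, chunksize=100_000, skiprows=0):
    """Fill a preallocated dense matrix from (row, col, value) records, chunk by chunk."""
    matrix = np.zeros(shape, dtype=float)
    chunks = pd.read_csv(source, header=None, usecols=[0, 1, 2],
                         names=["row", "col", "value"],
                         skiprows=skiprows, chunksize=chunksize)
    for chunk in chunks:
        # Coordinates in the file are 1-based; NumPy indices are 0-based.
        r = chunk["row"].to_numpy() - 1
        c = chunk["col"].to_numpy() - 1
        matrix[r, c] = chunk["value"].to_numpy()
    return matrix


# Tiny in-memory example; chunksize=2 forces more than one chunk.
sample = io.StringIO("1,1,51,9,3\n1,2,39,4,4\n2,1,40,6,1\n")
m = fill_matrix(sample, shape=(2, 2), chunksize=2)
print(m)
```

Only one chunk of the CSV is held in memory at a time, alongside the matrix being filled.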

Hey, I tried running the code and there was an error on line 7 with "MemoryError". Any thoughts?
Did you use the csv you posted before as input, or another, bigger one? I haven't tested it on a bigger example.
Your csv is probably too big to fit in memory, so you need to use an iterator.
Yeah, the CSV file is around 800 MB, so it's pretty large. I tried using generators/iterators in the past, but was unsuccessful in doing so. I also resorted to just using a list comprehension, but was unable to convert it :( So I thought numpy was the way to go. I could send the CSV file if needed.
I don't need it; later I'll try to write something with an iterator.
