I find myself parsing lots of data files (usually .csv or similar) using the csv reader and a for loop to iterate over every line. The data is usually a table of floats, so for example:
import csv

reader = csv.reader(open('somefile.csv'))
header = next(reader)
res_list = [list() for _ in header]
for line in reader:
    for i in range(len(line)):
        res_list[i].append(float(line[i]))
result_dict = dict(zip(header, res_list))  # so we can refer to each column by its title
This is an OK way to populate the data, and I get each column as a separate list. However, I would prefer that the default container for lists of items (and nested lists) be numpy arrays, since 99 times out of 100 the numbers get pumped into various processing scripts/functions, and having the power of numpy arrays makes my life easier.
numpy's append(arr, item) doesn't append in place and therefore would require re-creating the array for every point in the table (which is slow and unnecessary). I could also iterate over the list of data columns and wrap each one into an array after I'm done (which is what I've been doing), but sometimes it isn't so clear-cut when I'm done parsing the file, and I may need to append stuff to the lists later down the line anyway.
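To make the copying issue concrete, and to show what I mean by wrapping at the end, here is a quick sketch (it reuses header and res_list from the snippet above):

import numpy as np

# np.append returns a brand-new array each call; nothing is modified in place,
# so growing an array one value at a time re-copies the whole thing every append.
a = np.array([1.0, 2.0])
b = np.append(a, 3.0)   # b is array([1., 2., 3.]); a is still array([1., 2.])

# What I've been doing instead: wrap each finished column once at the end.
result_dict = {name: np.asarray(col) for name, col in zip(header, res_list)}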
I was wondering if there is some less boilerplate-heavy (to use the overused word, more "pythonic") way to process tables of data like this, or to populate arrays (where the underlying container is a list) dynamically and without copying arrays all the time.
(On another note: it's kind of annoying that people generally use columns to organize data, but csv reads in rows. If the reader incorporated a read_column argument (yes, I know it wouldn't be super efficient), I think many people would avoid having boilerplate code like the above to parse a csv data file.)
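For what it's worth, the closest I've come to "reading by column" is transposing the rows after the fact; a rough sketch, again assuming the same somefile.csv with a header row and all-float columns:

import csv
import numpy as np

with open('somefile.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    columns = list(zip(*reader))   # transpose the remaining rows into per-column tuples

# Each column comes out as a tuple of strings; convert once per column.
result_dict = {name: np.array(col, dtype=float)
               for name, col in zip(header, columns)}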