I have a bunch of files (almost 100) which contain data of the format: (number of people) \t (average age)
These files were generated from a random walk conducted on a population of a certain demographic. Each file has 100,000 lines, corresponding to the average age of populations of sizes from 1 to 100,000. Each file corresponds to a different locality in a third-world country. We will be comparing these values to the average ages of similarly sized localities in a developed country.
What I want to do is:

for each i (where i ranges from 1 to 100,000):
    read in the first i values of average age
    perform some statistics on these values

That is, for each run i, read in the first i values of average age, add them to a list, and run a few tests (such as Kolmogorov-Smirnov or chi-square).
In order to open all these files in parallel, I figured the best way would be a dictionary of file objects, but I am stuck on how to perform the operations above.
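A minimal sketch of the dictionary-of-file-objects idea. The file name pattern `locality_*.txt` is an assumption for illustration; the sketch creates two tiny sample files so it runs standalone, whereas in practice the ~100 real data files would already exist:

```python
import glob
import os
import tempfile

# Build two tiny sample files so the sketch is runnable on its own;
# each line has the format "(population size)\t(average age)".
tmpdir = tempfile.mkdtemp()
for name, rows in [("locality_a.txt", [(1, 24.0), (2, 25.5)]),
                   ("locality_b.txt", [(1, 30.0), (2, 29.5)])]:
    with open(os.path.join(tmpdir, name), "w") as f:
        for size, avg in rows:
            f.write(f"{size}\t{avg}\n")

# Map each file path to its open file object so every file can be
# advanced in lockstep, one line per iteration.
files = {path: open(path)
         for path in glob.glob(os.path.join(tmpdir, "locality_*.txt"))}

# Read one line from every file in parallel: here, the first
# average-age value of each locality.
first_ages = {path: float(f.readline().split("\t")[1])
              for path, f in files.items()}

for f in files.values():
    f.close()
```

With roughly 100 files this stays well under typical open-file limits, so holding them all open at once is reasonable.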
Is my method the best possible one (complexity-wise)?
Is there a better method?
Is the idea simply `for i in range(100): read i lines from the file`? If so, please suggest how to improve this algorithm.
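One complexity note: re-reading the first i lines for every i costs O(n^2) line reads per file. Since run i's data is just run i-1's data plus one new value, each file can be read exactly once, growing a running list, which is O(n) reads overall. A sketch of that incremental approach, using a running mean as a placeholder where the real test (e.g. `scipy.stats.ks_2samp` against the developed-country data) would go:

```python
import os
import tempfile

def incremental_stats(path, n_runs):
    """Read `path` once, growing the list of average ages by one
    value per run instead of re-reading the first i lines each time."""
    ages = []
    results = []
    with open(path) as f:
        for i in range(1, n_runs + 1):
            size, avg = f.readline().split("\t")
            ages.append(float(avg))
            # Placeholder statistic: the running mean. In the real
            # analysis, a Kolmogorov-Smirnov or chi-square test on
            # `ages` would run here instead.
            results.append(sum(ages) / len(ages))
    return results

# Tiny demo file with three runs of "(size)\t(average age)".
demo = os.path.join(tempfile.mkdtemp(), "locality_demo.txt")
with open(demo, "w") as f:
    f.write("1\t20.0\n2\t22.0\n3\t24.0\n")

means = incremental_stats(demo, 3)
```

The same loop extends naturally to the dictionary of open files: advance every file by one line per run and update each locality's list in place.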