
As part of a bigger problem I am working on, I have to read in a set of .csv files, manipulate them, and generate a new set of .csv files. Everything is smooth EXCEPT for one file: voltvalues.csv. The content of the file looks like this:

...
13986513,6,6/1/2014 12:00:00 AM,248.7
13986513,6,6/1/2014 12:00:05 AM,248.4
13986513,6,6/1/2014 12:00:10 AM,249
13986513,6,6/1/2014 12:00:15 AM,249.3
13986513,6,6/1/2014 12:00:20 AM,249.3
13986513,6,6/1/2014 12:00:25 AM,249.3
...
13986513,6,6/30/2014 11:55:00 PM,249.3
13986534,6,6/1/2014 12:00:00 AM,249
13986534,6,6/1/2014 12:00:05 AM,249
13986534,6,6/1/2014 12:00:10 AM,249.3
13986534,6,6/1/2014 12:00:15 AM,249.6
...
13986534,6,6/30/2014 11:55:00 PM,249.7
...

I am trying to spit out another .csv file, newvolt.csv, that has the data in the following format:

timestamp,13986513,13986534,...
2014-06-01 12:00:00 PDT,248.7,249.3,...
2014-06-01 12:00:05 PDT,248.4,249,...
...
2014-06-30 23:55:00 PDT,249.3,249.7,...

The problem is the size of voltvalues.csv: 6 GB (about 1 billion rows and 4 columns). So the way I am reading it is something like this:

import csv

#meters = []
real_recorder = open("newvolt.csv", 'w')
with open("voltvalues.csv",'rb') as voltfile:
    voltread = csv.reader(voltfile)
    next(voltread)#skip header
    for line in voltread:
        #convert the data of voltvalues.csv into the format I desire
        #BEST WAY to do it?
        real_recorder.writelines([...])
        #meters.append(line[0])
#print len(meters)
#print len(set(meters))

I know Python's datetime module has methods to change one datetime format to another, but done naively for every row it is very expensive. Any suggestions on the best way to make the whole conversion?

  • Why are you using if i>0? If you want to skip the header, just use next(voltread) Commented Mar 13, 2015 at 20:34
  • Do you need i and enumerate()... those are costing you some time. Commented Mar 13, 2015 at 20:35
  • @PadraicCunningham the very first line in voltvalues.csv is a header I DON'T want to write to newvolt.csv Commented Mar 13, 2015 at 20:36
  • @NikhilGupta, next(voltread), then forget the if Commented Mar 13, 2015 at 20:37
  • @Jkdc I hope my above comment answers your question? Commented Mar 13, 2015 at 20:37

2 Answers


You could scan the file and record the starting offset for each sensor. To read the next value for a given sensor, seek to that offset, read a line and update the offset. With this approach you don't need to keep as much data in local memory, but you are depending on the operating system RAM cache for performance. This may be a good place to use a memory mapped file instead.

It gets more complicated if sensors don't all have the same time value, but this is a start:

open('data.csv','w').write(
"""\
13986513,6,6/1/2014 12:00:00 AM,248.7
13986513,6,6/1/2014 12:00:05 AM,248.4
13986513,6,6/1/2014 12:00:10 AM,249
13986513,6,6/1/2014 12:00:15 AM,249.3
13986513,6,6/1/2014 12:00:20 AM,249.3
13986513,6,6/1/2014 12:00:25 AM,249.3
13986513,6,6/30/2014 11:55:00 PM,249.3
13986534,6,6/1/2014 12:00:00 AM,249
13986534,6,6/1/2014 12:00:05 AM,249
13986534,6,6/1/2014 12:00:10 AM,249.3
13986534,6,6/1/2014 12:00:15 AM,249.6
13986534,6,6/30/2014 11:55:00 PM,249.7\
""")

import datetime


class ReadSensorLines(object):

    def __init__(self, filename):

        sensor_offsets = {}
        sensors = []

        readfp = open(filename, "rb")
        readfp.readline()  # skip header

        # Find the start of each sensor's block.
        # Use readline, not iteration, so that the tell() offset is right.

        offset = readfp.tell()
        sensor = b''

        while True:
            line = readfp.readline()
            if not line:
                break
            next_sensor = line.split(b',', 1)[0]
            if next_sensor != sensor:
                # The line just read already belongs to the new sensor,
                # so the previous block ends where this line starts.
                next_offset = readfp.tell() - len(line)
                if sensor:
                    sensors.append(sensor)
                    sensor_offsets[sensor] = [offset, next_offset - offset]
                sensor = next_sensor
                offset = next_offset
        if sensor:
            # record the final sensor's block
            sensors.append(sensor)
            sensor_offsets[sensor] = [offset, readfp.tell() - offset]

        self.readfp = readfp
        self.sensor_offsets = sensor_offsets
        self.sensors = sensors

    def read_sensor(self, sensorname):
        # pos_data is [offset, bytes_remaining] for this sensor's block.
        # readline's size argument keeps us inside the block and returns
        # b'' once the block is exhausted.
        pos_data = self.sensor_offsets[sensorname]
        self.readfp.seek(pos_data[0])
        line = self.readfp.readline(pos_data[1])
        pos_data[0] += len(line)
        pos_data[1] -= len(line)
        return line

    @property
    def data_remains(self):
        return any(pos_data[1] for pos_data in self.sensor_offsets.values())

    def close(self):
        self.readfp.close()


sensor_lines = ReadSensorLines("data.csv")
while sensor_lines.data_remains:
    row = []
    for sensor in sensor_lines.sensors:
        sensor_line = sensor_lines.read_sensor(sensor)
        if sensor_line:
            _, _, date, volts = sensor_line.decode().strip().split(',')
            row.append(volts)
        else:
            row.append('')
    timestamp = datetime.datetime.strptime(date, '%m/%d/%Y %I:%M:%S %p')
    row.insert(0, str(timestamp))
    print(','.join(row))
sensor_lines.close()
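As mentioned above, a memory-mapped file can replace the plain file handle here. An mmap object supports readline(), seek() and tell() just like a binary file object, so it can be dropped in where the file is opened; the helper name below is mine:

```python
import mmap


def open_sensor_map(filename):
    """Map the whole file into memory read-only.

    The OS pages the data in and out on demand, so even a 6 GB file
    does not have to fit in RAM; repeated seeks into already-touched
    pages are then served from the page cache.
    """
    with open(filename, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

Since the scanning and reading code only uses readline(), seek() and tell(), swapping `open(filename, "rb")` for `open_sensor_map(filename)` is the only change needed.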



It seems like the information in the file is already sorted in the right order. What you can do is:

Write the header "timestamp,13986513,13986534,..." to a file "timestamp.txt". Then keep the key "13986513,6,6/1/2014 12:00:00" in a global string, and write each processed line away to a file "volt.csv".

Every time the timestamp part of "13986513,6,6/1/2014 12:00:00" matches the previous one, you can append the value to build "2014-06-01 12:00:00 PDT,248.7,249.3,...".

But every time, read one line and write that one line away. If you keep it all in memory, the program can't handle it anymore.

Have a look at the flush() function. I think you might need that one.

EDIT:

Example code:

class Ice():
    def __init__(self):
        self.Fire()

    def Fire(self):
        # Open both files once; the 'with' block closes them
        # automatically, so no explicit close() calls are needed.
        with open('File1.txt', 'r') as file1, \
             open('File2.txt', 'a') as file2:
            for line in file1:
                # Do something.
                print('Do something....')
                # Save to file; flush so the line leaves the buffer.
                file2.write(line)
                file2.flush()

# Run class 
Ice()

The thing with big files is that they use a lot of memory if you read them whole. So what you want is to read a line from the file, process it, write it away (out of memory), and take the next line. This way you can process huge files.

What .flush() does is flush the output buffer. In my example you write a line away, but Python doesn't write it to disk at the moment of write(); it stores it in a buffer. By using .flush() the buffered output is written to the file, and the line is not kept in memory.

By writing to a temporary file this way you can process all lines without Python exhausting its memory.
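The read-one-line, write-one-line idea can be turned into a complete sketch for this pivot: because the input is grouped by meter id and each block lists the same timestamps in the same order, a first pass can split it into one temporary file per meter (still holding only one line in memory at a time), and a second pass zips those files row by row. Everything below (the pivot function name, the .tmp naming) is illustrative, and it assumes the input has no header row and all meters share the same timestamps:

```python
import csv
import os


def pivot(infile, outfile):
    """Two passes, one line in memory at a time.

    Pass 1 splits the sensor-grouped input into one temp file per
    sensor; pass 2 reads all temp files in parallel and emits one
    wide row per timestamp.
    """
    # Pass 1: split by sensor (input is already grouped by sensor id).
    sensors, out_fh, current = [], None, None
    with open(infile, "r", newline="") as f:
        for sensor_id, _, ts, volts in csv.reader(f):
            if sensor_id != current:
                if out_fh:
                    out_fh.close()
                current = sensor_id
                sensors.append(sensor_id)
                out_fh = open(sensor_id + ".tmp", "w", newline="")
                writer = csv.writer(out_fh)
            writer.writerow([ts, volts])
        if out_fh:
            out_fh.close()

    # Pass 2: zip the per-sensor files; the timestamp column comes
    # from the first sensor's file.
    handles = [open(s + ".tmp", "r", newline="") for s in sensors]
    readers = [csv.reader(h) for h in handles]
    with open(outfile, "w", newline="") as out:
        w = csv.writer(out)
        w.writerow(["timestamp"] + sensors)
        for rows in zip(*readers):
            w.writerow([rows[0][0]] + [r[1] for r in rows])
    for h in handles:
        h.close()
    for s in sensors:
        os.remove(s + ".tmp")
```

Pass 2 needs one open file handle per meter, which is fine for hundreds of meters but would hit OS limits with tens of thousands.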

2 Comments

I am actually a little lost here. What do you mean by "the information in the file is already sorted in the right order"? (Do you mean vertically?) Also, I am equally lost on your explanation.
Adjusted my post with an example. Hope it helps.
