
As part of a bigger problem I am working on, I have to read in a set of .csv files, manipulate them, and generate a new set of .csv files. Everything is smooth EXCEPT for one file: voltvalues.csv. The content of the file looks like this:

...
13986513,6,6/1/2014 12:00:00 AM,248.7
13986513,6,6/1/2014 12:00:05 AM,248.4
13986513,6,6/1/2014 12:00:10 AM,249
13986513,6,6/1/2014 12:00:15 AM,249.3
13986513,6,6/1/2014 12:00:20 AM,249.3
13986513,6,6/1/2014 12:00:25 AM,249.3
...
13986513,6,6/30/2014 11:55:00 PM,249.3
13986534,6,6/1/2014 12:00:00 AM,249
13986534,6,6/1/2014 12:00:05 AM,249
13986534,6,6/1/2014 12:00:10 AM,249.3
13986534,6,6/1/2014 12:00:15 AM,249.6
...
13986534,6,6/30/2014 11:55:00 PM,249.7
...

I am trying to spit out another .csv file, newvolt.csv, that has the data in the following format:

timestamp,13986513,13986534,...
2014-06-01 12:00:00 PDT,248.7,249.3,...
2014-06-01 12:00:05 PDT,248.4,249,...
...
2014-06-30 23:55:00 PDT,249.3,249.7,...

The problem is the size of voltvalues.csv: 6 GB (about 1 billion rows and 4 columns). So the way I am reading it is something like this:

import csv

#meters = []
real_recorder = open("newvolt.csv", 'w')
with open("voltvalues.csv",'rb') as voltfile:
    voltread = csv.reader(voltfile)
    next(voltread)#skip header
    for line in voltread:
        #convert the data of voltvalues.csv into the format I desire
        #BEST WAY to do it?
        real_recorder.writelines([...])
        #meters.append(line[0])
#print len(meters)
#print len(set(meters))

I know Python's datetime module has methods to change one datetime format to another, but done naively for every row it is very expensive. Any suggestions on the best way to make the whole conversion?

  • Why are you using if i>0? If you want to skip the header, just use next(voltread) Commented Mar 13, 2015 at 20:34
  • Do you need i and enumerate()... those are costing you some time. Commented Mar 13, 2015 at 20:35
  • @PadraicCunningham the very first line in voltvalues.csv is a header I DON'T want to write to newvolt.csv Commented Mar 13, 2015 at 20:36
  • @NikhilGupta, next(voltread), then forget the if Commented Mar 13, 2015 at 20:37
  • @Jkdc I hope my above comment answers your question? Commented Mar 13, 2015 at 20:37

2 Answers


You could scan the file and record the starting offset for each sensor. To read the next value for a given sensor, seek to that offset, read a line and update the offset. With this approach you don't need to keep as much data in local memory, but you are depending on the operating system RAM cache for performance. This may be a good place to use a memory mapped file instead.

It gets more complicated if sensors don't all have the same time value, but this is a start:

open('data.csv','w').write(
"""\
13986513,6,6/1/2014 12:00:00 AM,248.7
13986513,6,6/1/2014 12:00:05 AM,248.4
13986513,6,6/1/2014 12:00:10 AM,249
13986513,6,6/1/2014 12:00:15 AM,249.3
13986513,6,6/1/2014 12:00:20 AM,249.3
13986513,6,6/1/2014 12:00:25 AM,249.3
13986513,6,6/30/2014 11:55:00 PM,249.3
13986534,6,6/1/2014 12:00:00 AM,249
13986534,6,6/1/2014 12:00:05 AM,249
13986534,6,6/1/2014 12:00:10 AM,249.3
13986534,6,6/1/2014 12:00:15 AM,249.6
13986534,6,6/30/2014 11:55:00 PM,249.7\
""")

import datetime


class ReadSensorLines(object):

    def __init__(self, filename):

        sensor_offsets = {}
        sensors = []

        readfp = open(filename, "rb")
        readfp.readline()  # skip header

        # Find the start of each sensor's block.
        # Use readline, not iteration, so that the tell() offset is right.

        offset = readfp.tell()
        sensor = b''

        while True:
            line = readfp.readline()
            if not line:
                break
            next_sensor = line.split(b',', 1)[0]
            if next_sensor != sensor:
                # The line just read already belongs to the new sensor,
                # so the previous block ends where this line starts.
                next_offset = readfp.tell() - len(line)
                if sensor:
                    sensors.append(sensor)
                    sensor_offsets[sensor] = [offset, next_offset - offset]
                sensor = next_sensor
                offset = next_offset
        if sensor:
            # record the final sensor's block
            sensors.append(sensor)
            sensor_offsets[sensor] = [offset, readfp.tell() - offset]

        self.readfp = readfp
        self.sensor_offsets = sensor_offsets
        self.sensors = sensors

    def read_sensor(self, sensorname):
        # pos_data is [offset, bytes_remaining] for this sensor's block.
        # readline's size argument keeps us inside the block and returns
        # b'' once the block is exhausted.
        pos_data = self.sensor_offsets[sensorname]
        self.readfp.seek(pos_data[0])
        line = self.readfp.readline(pos_data[1])
        pos_data[0] += len(line)
        pos_data[1] -= len(line)
        return line

    @property
    def data_remains(self):
        return any(pos_data[1] for pos_data in self.sensor_offsets.values())

    def close(self):
        self.readfp.close()


sensor_lines = ReadSensorLines("data.csv")
while sensor_lines.data_remains:
    row = []
    for sensor in sensor_lines.sensors:
        sensor_line = sensor_lines.read_sensor(sensor)
        if sensor_line:
            _, _, date, volts = sensor_line.decode().strip().split(',')
            row.append(volts)
        else:
            row.append('')
    timestamp = datetime.datetime.strptime(date, '%m/%d/%Y %I:%M:%S %p')
    row.insert(0, str(timestamp))
    print(','.join(row))
sensor_lines.close()
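As mentioned above, a memory-mapped file can replace the plain file handle here. An mmap object supports readline(), seek() and tell() just like a binary file object, so it can be dropped in where the file is opened; the helper name below is mine:

```python
import mmap


def open_sensor_map(filename):
    """Map the whole file into memory read-only.

    The OS pages the data in and out on demand, so even a 6 GB file
    does not have to fit in RAM; repeated seeks into already-touched
    pages are then served from the page cache.
    """
    with open(filename, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

Since the scanning and reading code only uses readline(), seek() and tell(), swapping `open(filename, "rb")` for `open_sensor_map(filename)` is the only change needed.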



It seems like the information in the file is already sorted in the right order. What you can do is:

Write the header "timestamp,13986513,13986534,..." to a file "timestamp.txt". Then keep the key "13986513,6,6/1/2014 12:00:00" in a global string, and write each processed line away to a file "volt.csv".

Every time the timestamp part of "13986513,6,6/1/2014 12:00:00" matches the previous one, you can append the value to build "2014-06-01 12:00:00 PDT,248.7,249.3,...".

But every time, read one line and write that one line away. If you keep it all in memory, the program can't handle it anymore.

Have a look at the flush() function. I think you might need that one.

EDIT:

Example code:

class Ice():
    def __init__(self):
        self.Fire()

    def Fire(self):
        # Open both files once; the 'with' block closes them
        # automatically, so no explicit close() calls are needed.
        with open('File1.txt', 'r') as file1, \
             open('File2.txt', 'a') as file2:
            for line in file1:
                # Do something.
                print('Do something....')
                # Save to file; flush so the line leaves the buffer.
                file2.write(line)
                file2.flush()

# Run class 
Ice()

The thing with big files is that they use a lot of memory if you read them whole. So what you want is to read a line from the file, process it, write it away (out of memory), and take the next line. This way you can process huge files.

What .flush() does is flush the output buffer. In my example you write a line away, but Python doesn't write it to disk at the moment of write(); it stores it in a buffer. By using .flush() the buffered output is written to the file, and the line is not kept in memory.

By writing to a temporary file this way you can process all lines without Python exhausting its memory.
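The read-one-line, write-one-line idea can be turned into a complete sketch for this pivot: because the input is grouped by meter id and each block lists the same timestamps in the same order, a first pass can split it into one temporary file per meter (still holding only one line in memory at a time), and a second pass zips those files row by row. Everything below (the pivot function name, the .tmp naming) is illustrative, and it assumes the input has no header row and all meters share the same timestamps:

```python
import csv
import os


def pivot(infile, outfile):
    """Two passes, one line in memory at a time.

    Pass 1 splits the sensor-grouped input into one temp file per
    sensor; pass 2 reads all temp files in parallel and emits one
    wide row per timestamp.
    """
    # Pass 1: split by sensor (input is already grouped by sensor id).
    sensors, out_fh, current = [], None, None
    with open(infile, "r", newline="") as f:
        for sensor_id, _, ts, volts in csv.reader(f):
            if sensor_id != current:
                if out_fh:
                    out_fh.close()
                current = sensor_id
                sensors.append(sensor_id)
                out_fh = open(sensor_id + ".tmp", "w", newline="")
                writer = csv.writer(out_fh)
            writer.writerow([ts, volts])
        if out_fh:
            out_fh.close()

    # Pass 2: zip the per-sensor files; the timestamp column comes
    # from the first sensor's file.
    handles = [open(s + ".tmp", "r", newline="") for s in sensors]
    readers = [csv.reader(h) for h in handles]
    with open(outfile, "w", newline="") as out:
        w = csv.writer(out)
        w.writerow(["timestamp"] + sensors)
        for rows in zip(*readers):
            w.writerow([rows[0][0]] + [r[1] for r in rows])
    for h in handles:
        h.close()
    for s in sensors:
        os.remove(s + ".tmp")
```

Pass 2 needs one open file handle per meter, which is fine for hundreds of meters but would hit OS limits with tens of thousands.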

2 Comments

I am actually a little lost here. What do you mean by "the information in the file is already sorted in the right order"? (Do you mean vertically?) Also, I am equally lost on your explanation.
Adjusted my post with an example. Hope it helps.
