Reading 100,000 data files in Python

I need to manipulate data spread across multiple data files (~100,000 files). A single data file has ~60,000 rows and looks something like this:

 ITEM: TIMESTEP
300
ITEM: NUMBER OF ATOMS
64000
ITEM: BOX BOUNDS xy xz yz pp pp pp
7.1651861306114756e+02 7.6548138693885244e+02 0.0000000000000000e+00
7.1701550555416179e+02 7.6498449444583821e+02 0.0000000000000000e+00
7.1700670287454318e+02 7.6499329712545682e+02 0.0000000000000000e+00
ITEM: ATOMS id mol mass xu yu zu 
1 1 1 731.836 714.006 689.252 
5 1 1 714.228 705.453 732.638 
6 2 1 736.756 704.069 693.386 
10 2 1 744.066 716.174 708.793 
11 3 1 715.253 679.036 717.336 
.  . .  .       .       .
.  . .  .       .       .
.  . .  .       .       .

I need to extract the x coordinates (the xu column) of the first 20,000 atom lines and group them together with the x coordinates from the other data files.

Here is the working code:

import numpy as np
import glob
import natsort
import pandas as pd

data = []

# Collect the data files in natural (numeric) order
filenames = natsort.natsorted(glob.glob("CoordTestCode/ParticleCoordU*"))
for f in filenames:
    # Skip the header block and keep only the column holding the x coordinate
    files = pd.read_csv(f, delimiter=' ', dtype=float, skiprows=8, usecols=[3]).values
    data.append(files)

lines = 20000

# One row per file, one column per atom
x_pos = np.zeros((len(data), lines))

# Copy the first 20,000 x coordinates of each file into the output array
for i in range(0, len(data)):
    for j in range(0, lines):
        x_pos[i][j] = data[i][j]

np.savetxt('x_position.txt', x_pos, delimiter=' ')

The problem, of course, is the time it will take to do this for all 100,000 files. I was able to reduce the time significantly by switching from np.loadtxt to pandas.read_csv, but it is still too slow. Is there a better approach? I have read that using I/O streams might reduce the time, but I am not familiar with that technique. Any suggestions?
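
For reference, here is a minimal sketch of one more direct variant, assuming every file has the same 9 header lines as the sample above and at least 20,000 atom rows (the names LINES and HEADER_ROWS are only for this sketch). It reads just the rows it needs via read_csv's nrows argument and writes them straight into a preallocated array, so the per-element copy loop disappears. I have not verified that it is actually faster:

import numpy as np
import glob
import natsort
import pandas as pd

LINES = 20000        # atom rows to keep per file
HEADER_ROWS = 9      # 9 (not 8) because header=None no longer consumes the "ITEM: ATOMS ..." line

filenames = natsort.natsorted(glob.glob("CoordTestCode/ParticleCoordU*"))

# Preallocate the result: one row per file, one column per atom
x_pos = np.zeros((len(filenames), LINES))

for i, f in enumerate(filenames):
    # Read only the xu column of the first 20,000 atom rows of this file
    col = pd.read_csv(f, sep=r'\s+', header=None, skiprows=HEADER_ROWS,
                      nrows=LINES, usecols=[3], dtype=float)
    x_pos[i, :] = col.to_numpy().ravel()

np.savetxt('x_position.txt', x_pos, delimiter=' ')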