
I have to put the information from a large txt file into a pandas dataframe. The text file is formatted like this (and I cannot change it in any way):

o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o
Z_0  S_1  C_1 
  foo     bar
     foo_1  foo_2  foo_3   foo_4
      0.5    1.2    3.5     2.4 
        X[m]            Y[m]            Z[m]            alfa[-]        beta[-]
 -2.17142783E-04  3.12000068E-03  3.20351664E-01  3.20366857E+01  3.20366857E+01
 -7.18630964E-04  2.99634764E-03  3.20343560E-01  3.20357573E+01  3.20357573E+01
 -2.85056979E-03 -4.51947006E-03  3.20079900E-01  3.20111805E+01  3.20111805E+01
o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o
Z_0  S_2  C_1 
  foo     bar
     foo_1  foo_2  foo_3   foo_4
      0.5    1.2    3.5     2.4 
        X[m]            Y[m]            Z[m]            alfa[-]        beta[-]
 -2.17142783E-04  3.12000068E-03  3.20351664E-01  3.20366857E+01  3.20366857E+01
 -7.18630964E-04  2.99634764E-03  3.20343560E-01  3.20357573E+01  3.20357573E+01
 -2.85056979E-03 -4.51947006E-03  3.20079900E-01  3.20111805E+01  3.20111805E+01
o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o
Z_1  S_3  C_1 
  foo     bar
     foo_1  foo_2  foo_3   foo_4
      0.5    1.2    3.5     2.4 
        X[m]            Y[m]            Z[m]            alfa[-]        beta[-]
 -2.17142783E-04  3.12000068E-03  3.20351664E-01  3.20366857E+01  3.20366857E+01
 -7.18630964E-04  2.99634764E-03  3.20343560E-01  3.20357573E+01  3.20357573E+01
 -2.85056979E-03 -4.51947006E-03  3.20079900E-01  3.20111805E+01  3.20111805E+01

The original file has more than 65K lines.

I would like to create a single dataframe containing the info from that file, including the values in the first line after each separator. I wrote working code:

import os
import pandas as pd

my_path = r"C:\Users\212744206\Desktop\COSO"
my_file = os.path.join(my_path, 'my_file.dat')

istart = False
with open(my_file) as fp:
    for i, line in enumerate(fp):
        if (line[0] != 'o'):
            if line.split()[0][0] == 'Z':
                iZ  = int((line.split()[0]).split('_')[1])
                iS  = int((line.split()[1]).split('_')[1])
                iC  = int((line.split()[2]).split('_')[1])
            elif (line.split()[0] == 'X[m]') or (len(line.split()) == 2) or (len(line.split()) == 4):
                continue
            else:
                dfline = pd.DataFrame(line.split())
                dfline = dfline.transpose()
                dfline.insert(0, column='C' , value=iC)
                dfline.insert(0, column='S' , value=iS)
                dfline.insert(0, column='Z' , value=iZ)

                if istart == False:
                    df_zone = dfline.copy()
                    istart = True
                else:
                    df_zone = df_zone.append(dfline, ignore_index=True, sort=False)

                print(df_zone)

...but it is very slow for my application (the print at the end is obviously there for debugging, and I am not going to use it with the large file). How can I write this in a more Pythonic and efficient way? All suggestions are welcome! Thank you

EDIT: Unfortunately my "useful" data can have 3, 4, 5 or any other number of lines... Moreover, I need to parse the "Z_0 S_1 C_1" lines, since I need an output like this:

   Z  S  C                0                1               2               3               4  
0  0  1  1  -2.17142783E-04   3.12000068E-03  3.20351664E-01  3.20366857E+01  3.20366857E+01  
1  0  1  1  -7.18630964E-04   2.99634764E-03  3.20343560E-01  3.20357573E+01  3.20357573E+01  
2  0  1  1  -2.85056979E-03  -4.51947006E-03  3.20079900E-01  3.20111805E+01  3.20111805E+01  
3  0  2  1  -2.17142783E-04   3.12000068E-03  3.20351664E-01  3.20366857E+01  3.20366857E+01  
4  0  2  1  -7.18630964E-04   2.99634764E-03  3.20343560E-01  3.20357573E+01  3.20357573E+01  
5  0  2  1  -2.85056979E-03  -4.51947006E-03  3.20079900E-01  3.20111805E+01  3.20111805E+01  
6  1  3  1  -2.17142783E-04   3.12000068E-03  3.20351664E-01  3.20366857E+01  3.20366857E+01  
7  1  3  1  -7.18630964E-04   2.99634764E-03  3.20343560E-01  3.20357573E+01  3.20357573E+01  
8  1  3  1  -2.85056979E-03  -4.51947006E-03  3.20079900E-01  3.20111805E+01  3.20111805E+01  
Comments

  • Would you be able to include a small sample of what you need the output to look like? I'm struggling to work out how the input maps to the rows and columns that you want. Commented Feb 7, 2019 at 11:10
  • Your examples are very uniform; is your entire file this uniform (same amount of data/headers/etc.)? If it is, you could save a lot of time just by skipping the uninteresting lines. Commented Feb 7, 2019 at 11:10
  • This might be a similar problem. Commented Feb 7, 2019 at 11:11
  • I updated my question with a sample and a detail about the uniformity. Commented Feb 7, 2019 at 11:54
  • @L.Winchler I have updated my answer in correspondence with your new description of the problem. Commented Feb 8, 2019 at 13:18

2 Answers


The main performance bottleneck is appending to a dataframe all the time. Instead, you could create a data buffer and expand that buffer once it overflows. The code below generates a synthetic dataset of about 100,000 lines of data and then parses the corresponding data file:

import pandas as pd
import numpy as np
from itertools import combinations_with_replacement
from scipy.special import comb  # scipy.misc.comb was removed in newer SciPy versions
from time import time

np.random.seed(0)

# Array buffer increment size
array_size = 1000
# Data file (output and input)
filename = "stack_output.dat"


def generate_data(m):
    """Generate synthetic (dummy) data to test performance"""

    # Weird string appearing in the example data
    sep_string = "".join(["o--"]*26)
    sep_string += "o\n"

    # Generate ZSC data, which seem to be combinatoric in nature
    x = np.arange(m)
    Ngroups = comb(m, 3, exact=True, repetition=True)

    # For each group of ZSC, generate a random number of lines of data
    # (between 2 and 8 lines)
    Nrows = np.random.randint(low=2, high=8, size=Ngroups)

    # Open file and write data
    with open(filename, "w") as f:
        # Loop over all values of ZSC (000, 001, 010, 011, etc.)
        for n, ZSC in enumerate(combinations_with_replacement(x, 3)):
            # Generate random data
            rand_data = np.random.rand(Nrows[n], 5)
            # Write (meta) data to file
            f.write(sep_string)
            f.write("Z_%d  S_%d  C_%d\n" % ZSC)
            f.write("foo    bar\n")
            f.write("X[m]   Y[m]   Z[m]   alpha[-]   beta[-]\n")
            for data in rand_data:
                f.write("%.8e  %.8e  %.8e  %.8e  %.8e\n" % tuple(data))

    return True


def grow_array(x):
    """Helper function to expand an array"""
    buf = np.zeros((array_size, x.shape[1])) * np.nan
    return np.vstack([x, buf])


def parse_data():
    """Parse the data using a growing buffer"""

    # Number of lines of meta data (i.e. lines that don't
    # contain the X, Y, Z, alpha, beta values)
    Nmeta = 3

    # Some counters
    Ndata = 0
    group_index = 0

    # Data buffer
    all_data = np.zeros((array_size, 8)) * np.nan

    # Read filename
    with open(filename, "r") as f:
        # Iterate over all lines
        for i, line in enumerate(f):
            # If we're at that weird separating line, we know we're at the
            # start of a new group of data, defined by Z, S, C
            if line[0] == "o":
                group_index = i
            # If we're one line below the separator, get the Z, S, C values
            elif i - group_index == 1:
                ZSC = line.split()
                # Extract the number from the string
                Z = ZSC[0][2:]
                S = ZSC[1][2:]
                C = ZSC[2][2:]
                ZSC_clean = np.array([Z, S, C])
            # If we're in a line below the meta data, extract the XYZ values
            elif i - group_index > Nmeta:
                # Split the numbers in the line
                data = np.array(line.split(), dtype=float)
                # Check if the data still fits in buffer.
                # If not: expand the buffer
                if Ndata == len(all_data)-1:
                    all_data = grow_array(all_data)
                # Populate the buffer
                all_data[Ndata] = np.hstack([ZSC_clean, data])
                Ndata += 1
    # Convert the buffer to a pandas dataframe (and clip the unpopulated
    # bits of the buffer, which are still NaN)
    df = pd.DataFrame(all_data, columns=("Z", "S", "C", "X", "Y", "Z", "alpha", "beta")).dropna(how="all")
    return df

t0 = time()
generate_data(50)
t1 = time()
data = parse_data()
t2 = time()

print("Data size: \t\t\t %i" % len(data))
print("Rendering data: \t %.3e s" % (t1 - t0))
print("Parsing data: \t\t %.3e s" % (t2 - t1))

Result:

Data size:           99627
Rendering data:      3.360e-01 s
Parsing data:        1.356e+00 s

Is this good enough for your purposes?


Previous answer for reference (which assumed a certain fixed structure of the data file):

You can use the skiprows feature of pandas.read_csv. In your example, only the last 3 lines of each block of 9 contain useful data. skiprows accepts a callable that is evaluated against each row index and skips the row when it returns True, so pass a function that returns True for the first 6 lines of each block of 9 (indices 0–5, counting from 0) and False for the data lines:

import pandas as pd
filename = "data.dat"

data = pd.read_csv(
    filename, names=("X", "Y", "Z", "alpha", "beta"), delim_whitespace=True,
    skiprows=lambda x: x % 9 < 6,
)
print(data)


Do not append dataframes. It is a very slow operation. Ideally, I'd do this in two passes: go through the file once to count the data lines, then rewind the file, create a dataframe of the appropriate size, and fill it in the second pass by direct indexing.

As a micro-optimisation, notice that you're calling line.split() many times; split each line once and reuse the result.
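
A minimal sketch of that two-pass idea, assuming the file layout shown in the question (my_file.dat is a placeholder path; column names follow the desired output):

import numpy as np
import pandas as pd

filename = 'my_file.dat'  # placeholder path


def is_data_row(parts):
    """A data row in this format has exactly 5 whitespace-separated floats."""
    if len(parts) != 5:
        return False
    try:
        float(parts[0])
        return True
    except ValueError:
        return False  # e.g. the "X[m]  Y[m] ..." header also has 5 fields


# First pass: count the data rows so the array can be preallocated
with open(filename) as f:
    n_rows = sum(1 for line in f if is_data_row(line.split()))

# Second pass: fill the preallocated array by direct indexing
data = np.empty((n_rows, 8))
row = 0
with open(filename) as f:
    for line in f:
        parts = line.split()  # split once per line and reuse
        if not parts or line[0] == 'o':
            continue
        if parts[0].startswith('Z_'):
            # "Z_0  S_1  C_1" -> [0, 1, 1]
            zsc = [int(p.split('_')[1]) for p in parts[:3]]
        elif is_data_row(parts):
            data[row, :3] = zsc
            data[row, 3:] = [float(p) for p in parts]
            row += 1

df = pd.DataFrame(data, columns=['Z', 'S', 'C', 0, 1, 2, 3, 4])
df[['Z', 'S', 'C']] = df[['Z', 'S', 'C']].astype(int)  # match the integer labels in the desired output

Because the row count is known before the second pass, nothing is ever reallocated; each data line costs one split and two slice assignments.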

Comments

  • Alternatively, one can aggregate into a list (appending is O(1)) and create the DataFrame at the end.
  • @JanChristophTerasa Yup, that's another alternative, though if the file is really big you might not want to do that (native Python objects take much more memory than numpy's numeric arrays).
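
For reference, a minimal sketch of that list-based alternative (same placeholder path and the same assumptions about the file layout as above):

import pandas as pd

filename = 'my_file.dat'  # placeholder path

rows = []
zsc = None
with open(filename) as f:
    for line in f:
        parts = line.split()
        if not parts or line[0] == 'o':
            continue
        if parts[0].startswith('Z_'):
            # Cache the group labels once per block: "Z_0  S_1  C_1" -> [0, 1, 1]
            zsc = [int(p.split('_')[1]) for p in parts[:3]]
        elif len(parts) == 5:
            try:
                values = [float(p) for p in parts]
            except ValueError:
                continue  # the "X[m]  Y[m] ..." header also has 5 fields
            rows.append(zsc + values)  # appending to a list is cheap (amortised O(1))

# Build the DataFrame once, at the very end
df = pd.DataFrame(rows, columns=['Z', 'S', 'C', 0, 1, 2, 3, 4])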
