I need to load the contents of a large text file into a pandas DataFrame. The text file is formatted like this (and I cannot change it in any way):
o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o
Z_0 S_1 C_1
foo bar
foo_1 foo_2 foo_3 foo_4
0.5 1.2 3.5 2.4
X[m] Y[m] Z[m] alfa[-] beta[-]
-2.17142783E-04 3.12000068E-03 3.20351664E-01 3.20366857E+01 3.20366857E+01
-7.18630964E-04 2.99634764E-03 3.20343560E-01 3.20357573E+01 3.20357573E+01
-2.85056979E-03 -4.51947006E-03 3.20079900E-01 3.20111805E+01 3.20111805E+01
o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o
Z_0 S_2 C_1
foo bar
foo_1 foo_2 foo_3 foo_4
0.5 1.2 3.5 2.4
X[m] Y[m] Z[m] alfa[-] beta[-]
-2.17142783E-04 3.12000068E-03 3.20351664E-01 3.20366857E+01 3.20366857E+01
-7.18630964E-04 2.99634764E-03 3.20343560E-01 3.20357573E+01 3.20357573E+01
-2.85056979E-03 -4.51947006E-03 3.20079900E-01 3.20111805E+01 3.20111805E+01
o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o--o
Z_1 S_3 C_1
foo bar
foo_1 foo_2 foo_3 foo_4
0.5 1.2 3.5 2.4
X[m] Y[m] Z[m] alfa[-] beta[-]
-2.17142783E-04 3.12000068E-03 3.20351664E-01 3.20366857E+01 3.20366857E+01
-7.18630964E-04 2.99634764E-03 3.20343560E-01 3.20357573E+01 3.20357573E+01
-2.85056979E-03 -4.51947006E-03 3.20079900E-01 3.20111805E+01 3.20111805E+01
The original file has more than 65K lines.
I would like to create a single DataFrame containing the data from that file, including the values on the first line after each separator. I wrote this working code:
import os
import pandas as pd
my_path = r"C:\Users\212744206\Desktop\COSO"
my_file = os.path.join(my_path, 'my_file.dat')
istart = False
with open(my_file) as fp:
    for i, line in enumerate(fp):
        if (line[0] != 'o'):
            if line.split()[0][0] == 'Z':
                iZ = int((line.split()[0]).split('_')[1])
                iS = int((line.split()[1]).split('_')[1])
                iC = int((line.split()[2]).split('_')[1])
            elif (line.split()[0] == 'X[m]') or (len(line.split()) == 2) or (len(line.split()) == 4):
                continue
            else:
                dfline = pd.DataFrame(line.split())
                dfline = dfline.transpose()
                dfline.insert(0, column='C' , value=iC)
                dfline.insert(0, column='S' , value=iS)
                dfline.insert(0, column='Z' , value=iZ)
                if istart == False:
                    df_zone = dfline.copy()
                    istart = True
                else:
                    df_zone = df_zone.append(dfline, ignore_index=True, sort=False)
print(df_zone)
...but it is very slow for my application (the print at the end is only for debugging, and I will not use it with the large file). How can I write this in a more "pythonic" and efficient way? All suggestions are welcome. Thank you!
EDIT: Unfortunately, each block of "useful" data can have 3, 4, 5, or any other number of lines. Moreover, I need to parse the "Z_0 S_1 C_1" lines, because I need output like this:
   Z  S  C                0                1               2               3               4
0  0  1  1  -2.17142783E-04   3.12000068E-03  3.20351664E-01  3.20366857E+01  3.20366857E+01
1  0  1  1  -7.18630964E-04   2.99634764E-03  3.20343560E-01  3.20357573E+01  3.20357573E+01
2  0  1  1  -2.85056979E-03  -4.51947006E-03  3.20079900E-01  3.20111805E+01  3.20111805E+01
3  0  2  1  -2.17142783E-04   3.12000068E-03  3.20351664E-01  3.20366857E+01  3.20366857E+01
4  0  2  1  -7.18630964E-04   2.99634764E-03  3.20343560E-01  3.20357573E+01  3.20357573E+01
5  0  2  1  -2.85056979E-03  -4.51947006E-03  3.20079900E-01  3.20111805E+01  3.20111805E+01
6  1  3  1  -2.17142783E-04   3.12000068E-03  3.20351664E-01  3.20366857E+01  3.20366857E+01
7  1  3  1  -7.18630964E-04   2.99634764E-03  3.20343560E-01  3.20357573E+01  3.20357573E+01
8  1  3  1  -2.85056979E-03  -4.51947006E-03  3.20079900E-01  3.20111805E+01  3.20111805E+01
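For reference, here is a minimal sketch of one direction that might be faster: collect every data row in a plain Python list and build the DataFrame once at the end, instead of creating and appending a one-row DataFrame per line. The function name `parse_blocks`, the shortened `SAMPLE` text, and the conversion of the values to `float` are assumptions made for illustration (the output above keeps them as strings). The sketch also assumes that every block starts with a separator line, that its first line is the "Z_i S_j C_k" line, and that the data rows follow the "X[m] ..." header.

```python
import io

import pandas as pd

# Shortened sample in the same layout as the file above (hypothetical data:
# the second block has only two data rows, to show variable block lengths).
SAMPLE = """\
o--o--o--o--o--o--o--o
Z_0 S_1 C_1
foo bar
foo_1 foo_2 foo_3 foo_4
0.5 1.2 3.5 2.4
X[m] Y[m] Z[m] alfa[-] beta[-]
-2.17142783E-04 3.12000068E-03 3.20351664E-01 3.20366857E+01 3.20366857E+01
-7.18630964E-04 2.99634764E-03 3.20343560E-01 3.20357573E+01 3.20357573E+01
-2.85056979E-03 -4.51947006E-03 3.20079900E-01 3.20111805E+01 3.20111805E+01
o--o--o--o--o--o--o--o
Z_1 S_2 C_1
foo bar
foo_1 foo_2 foo_3 foo_4
0.5 1.2 3.5 2.4
X[m] Y[m] Z[m] alfa[-] beta[-]
-2.17142783E-04 3.12000068E-03 3.20351664E-01 3.20366857E+01 3.20366857E+01
-7.18630964E-04 2.99634764E-03 3.20343560E-01 3.20357573E+01 3.20357573E+01
"""


def parse_blocks(lines):
    """Collect all data rows in a list, then build the DataFrame once."""
    rows = []
    ids = None        # [Z, S, C] of the current block
    in_data = False   # True once the "X[m] ..." header has been seen
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # ignore blank lines
        first = tokens[0]
        if first.startswith("o--"):
            # Separator: a new block starts, forget the previous state.
            ids, in_data = None, False
        elif first == "X[m]":
            # Column header: everything until the next separator is data.
            in_data = True
        elif in_data:
            rows.append(ids + [float(t) for t in tokens])
        elif ids is None:
            # First line after a separator, e.g. "Z_0 S_1 C_1".
            ids = [int(t.split("_")[1]) for t in tokens[:3]]
        # Any other line (the "foo" lines) is skipped.
    n_values = len(rows[0]) - 3 if rows else 0
    columns = ["Z", "S", "C"] + list(range(n_values))
    return pd.DataFrame(rows, columns=columns)


df = parse_blocks(io.StringIO(SAMPLE))
print(df)
```

With the real file, the same function could presumably be called as `with open(my_file) as fp: df_zone = parse_blocks(fp)`, since it only iterates over lines.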