Python DataFrame Data Analysis of Large Amount of Data from a Text File

Question

I have the following code:

datadicts = [ ]
with open("input.txt") as f:
    for line in f:
        datadicts.append({'col1': line[':'], 'col2': line[':'], 'col3': line[':'], 'col4': line[':']})

df = pd.DataFrame(datadicts)
df = df.drop([0])
print(df)

I am pulling chunks of data from a text file that is not formatted. When the file is opened, it looks something like this, only on a much bigger scale:

00 2381    1.3 3.4 1.8 265879 Name 
34 7879    7.6 4.2 2.1 254789 Name 
45 65824   2.3 3.4 1.8 265879 Name 
58 3450    1.3 3.4 1.8 183713 Name 
69 37495   1.3 3.4 1.8 137632 Name 
73 458913  1.3 3.4 1.8 138024 Name

Here are the things I'm having trouble doing with this data:

1. I only need the second, third, sixth, and seventh columns of data. I believe my code above solves this by reading the individual lines and creating a dataframe with only the necessary columns. I am open to suggestions if anyone has a better way of doing this.
2. I need to skip the first row of data. open() doesn't have a skiprows option, so I drop the first row afterwards, but then my index no longer starts at 0. Is there any way around this?
3. I need the resulting dataframe to look like a nice, clean dataframe. As of right now, it looks something like this:

Col1   Col2   Col3 Col4
2381    3.4 265879 Name 
7879    4.2 254789 Name 
65824   3.4 265879 Name 
3450    3.4 183713 Name 
37495   3.4 137632 Name 
458913  3.4 138024 Name

Everything is right-aligned under the column and it looks strange. Any ideas how to solve this?

I also need to be able to perform statistical analysis on the columns of data, and to find the Name with the highest value and the Name with the lowest value. For some reason I always get errors. I think that, even though all the data is set up as a dataframe, the values inside it are being read as objects instead of integers, strings, floats, etc.

So, if my data can't be analyzed with Python functions as it stands, does anyone know how I can fix it so the analysis runs correctly?

Any help would be greatly appreciated. I hope I've laid out all of my needs clearly. I am new to Python, and I'm not sure if I'm using all the proper terminology.

Have you tried out the pd.read_csv function? Might be very useful here.
– s3dev, Commented May 3, 2020 at 19:27
Just as @S3DEV mentioned, read_csv should do a lot of good here. You can try using it as: df = pd.read_csv("input.txt", sep=' ', skipinitialspace=True, names=['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7'])
– Cavin Dsouza, Commented May 3, 2020 at 20:15
I did attempt using pd.read_csv at first, but I couldn't get the data I needed to import, since the file is not formatted into individual columns. It's just lines and lines of data, so I had to specify which sections I needed from each line using a loop.
– Rose16, Commented May 4, 2020 at 2:05
@Rose16 So basically, there were no column names provided that you could easily filter on? Have you tried the code I suggested above?
– Cavin Dsouza, Commented May 4, 2020 at 8:10

s3dev · Accepted Answer · 2020-05-04 08:40:05Z

You can use the pandas.read_csv() function to accomplish this very easily.

- txt2pd.txt is a text file containing a copy/paste of your source data above
- sep uses a regex pattern to split on one or more consecutive whitespace characters
- names takes a list that supplies your column names
- skiprows skips the first row, per your requirements

Example:

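import pandas as pd

# Columns to keep once the file has been loaded
keep = ['col1', 'col3', 'col5', 'col6']
df = pd.read_csv('txt2pd.txt',
                 sep=r'\s+',  # raw string, so the backslash reaches the regex engine intact
                 names=['col0', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
                 skiprows=1)  # skip the first row of data
df = df[keep]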

Output:

     col1  col3    col5  col6
0    7879   4.2  254789  Name
1   65824   3.4  265879  Name
2    3450   3.4  183713  Name
3   37495   3.4  137632  Name
4  458913   3.4  138024  Name
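
As a possible variation on the example above, the unwanted columns could be skipped at read time with the usecols parameter of read_csv, instead of selecting with df[keep] afterwards. This is only a sketch: it reads the sample from an in-memory string (io.StringIO) as a stand-in for txt2pd.txt, and the variable names sample and all_names are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for txt2pd.txt: the sample rows from the question.
sample = """\
00 2381    1.3 3.4 1.8 265879 Name
34 7879    7.6 4.2 2.1 254789 Name
45 65824   2.3 3.4 1.8 265879 Name
58 3450    1.3 3.4 1.8 183713 Name
69 37495   1.3 3.4 1.8 137632 Name
73 458913  1.3 3.4 1.8 138024 Name
"""

# Names for every field in a line; usecols then picks a subset of them.
all_names = ['col0', 'col1', 'col2', 'col3', 'col4', 'col5', 'col6']

df = pd.read_csv(io.StringIO(sample),
                 sep=r'\s+',                                # split on runs of whitespace
                 names=all_names,
                 usecols=['col1', 'col3', 'col5', 'col6'],  # parse only these columns
                 skiprows=1)                                # skip the first row of data

print(df)
print(df.dtypes)  # shows the dtype pandas inferred for each column
```

Because the first line is skipped while parsing, the index starts at 0 with no extra step. If a row is instead dropped afterwards with df.drop, as in the question, df.reset_index(drop=True) renumbers the index from 0.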

Sample Analysis:

Using df.describe() you can output a simple, high-level analysis. (Anything further should be the subject of a new question.)

                col1      col3           col5
count       5.000000  5.000000       5.000000
mean   114712.200000  3.560000  196007.400000
std    194048.545838  0.357771   61762.106621
min      3450.000000  3.400000  137632.000000
25%      7879.000000  3.400000  138024.000000
50%     37495.000000  3.400000  183713.000000
75%     65824.000000  3.400000  254789.000000
max    458913.000000  4.200000  265879.000000
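
For the case raised in the question, where the frame was built by slicing lines by hand and every value ended up as text, a minimal sketch of one way to convert the columns and then look up the Name with the highest and lowest value might look like this. The rows mirror the output above, but the distinct names (Name_A to Name_E) are made up so the lookup has something to distinguish, and using col5 as the column to rank by is an assumption.

```python
import pandas as pd

# Values held as strings, as they would be after slicing raw text lines.
# The distinct names are hypothetical placeholders.
rows = [
    {'col1': '7879',   'col3': '4.2', 'col5': '254789', 'col6': 'Name_A'},
    {'col1': '65824',  'col3': '3.4', 'col5': '265879', 'col6': 'Name_B'},
    {'col1': '3450',   'col3': '3.4', 'col5': '183713', 'col6': 'Name_C'},
    {'col1': '37495',  'col3': '3.4', 'col5': '137632', 'col6': 'Name_D'},
    {'col1': '458913', 'col3': '3.4', 'col5': '138024', 'col6': 'Name_E'},
]
df = pd.DataFrame(rows)
print(df.dtypes)  # text dtypes here (reported as object in many pandas versions)

# Convert the numeric columns; values that cannot be parsed become NaN.
numeric_cols = ['col1', 'col3', 'col5']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')
print(df.dtypes)  # numeric columns are now integer or float

# idxmax / idxmin return the index label of the largest / smallest value,
# which .loc then uses to fetch the matching name.
highest = df.loc[df['col5'].idxmax(), 'col6']
lowest = df.loc[df['col5'].idxmin(), 'col6']
print(highest, lowest)
```

When the file is loaded with read_csv as in the example, the numeric dtypes are inferred during parsing, so the conversion step should only be needed for frames assembled from raw strings.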

I've got my code working smoothly finally! I upvoted your answer! Sorry, I'm new to this platform.
I will accept the answer. By any chance, can you help me in my other question, too? stackoverflow.com/questions/61687406/…
Thank you. Yes, of course. I’ll have a more in-depth look on Monday, if that’s ok.
That's okay. It's due in a few hours, and this is the last thing I'm missing. No worries! Thank you for all of your help!
