
I want to read the files present in this folder, uwyo, as a data frame while skipping the rows in between the observation data. I want to read every observation block, each of which starts at the keyword 'pressure'.

For that I thought of using pandas and then searching for the word 'pressure', but I got the following error.

import pandas as pd
import glob
import numpy as np
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

dfs = []
for fname in glob.glob('*.txt'):
    df = pd.read_csv(fname, sep=r'\s+', header=None)
    dfs.append(df)

ParserError: Error tokenizing data. C error: Expected 9 fields in line 5, saw 11

Is there an efficient way to do this? I want to skip the station information and all the text present in between the observations.
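A minimal sketch of one way to do this, assuming each file contains a header line naming the columns (including PRES) followed by the observation rows; the sample text and column names below are made up to stand in for a real uwyo file:

```python
import io
import pandas as pd

# Hypothetical sample mimicking a uwyo sounding file: station metadata,
# a header line naming the columns (including PRES), a units line, then
# whitespace-separated observation rows.
sample = """72456 TOP Topeka Observations
Station latitude: 39.07
-----------------------------------------------------------
   PRES   HGHT   TEMP   DWPT   RELH
    hPa     m      C      C      %
-----------------------------------------------------------
 1000.0    111   10.2    5.1     71
  925.0    780    6.4    2.0     73
"""

def read_sounding(text):
    """Skip everything above the PRES header line, then parse the rest."""
    lines = text.splitlines()
    # Locate the header row by the keyword the question mentions.
    header_idx = next(i for i, line in enumerate(lines) if 'PRES' in line)
    names = lines[header_idx].split()
    body = '\n'.join(lines[header_idx + 1:])
    df = pd.read_csv(io.StringIO(body), sep=r'\s+', names=names,
                     on_bad_lines='skip')
    # Drop the units line and separator rows: keep rows where PRES is numeric.
    return df[pd.to_numeric(df['PRES'], errors='coerce').notnull()]

df = read_sounding(sample)
print(df)
```

For real files you would open each path from glob and pass its contents to the helper; the same header search works per observation block if a file contains several.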

1 Answer

Try it as: pd.read_csv(fname, sep=r'\s+', on_bad_lines='skip', skiprows=4)

This will read the file with a lot of trash, though. Also, missing values in the txt file will end up in the wrong column.

I would recommend identifying the timestamps you have available and adding a column for them, as well as identifying and removing the metadata present between each period of observations.

This will require some pre-work on the data :D

Edit:

Sorry, forgot to add this as a second line above (it will filter out most of the trash): df[pd.to_numeric(df['PRES'], errors='coerce').notnull()]
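The suggested read and the filter line together can be sketched as follows, on made-up file contents (the skiprows=4 count happens to match this sample; real files may need a different offset):

```python
import io
import pandas as pd

# Made-up file contents: four metadata lines, then the PRES header,
# a units line, and two observation rows.
text = """72456 TOP Topeka Observations
Station latitude: 39.07
Station longitude: -95.62
-----------------------------------------------------------
   PRES   HGHT   TEMP   DWPT   RELH
    hPa     m      C      C      %
 1000.0    111   10.2    5.1     71
  925.0    780    6.4    2.0     73
"""

# skiprows=4 jumps past the metadata so the PRES line becomes the header.
df = pd.read_csv(io.StringIO(text), sep=r'\s+',
                 on_bad_lines='skip', skiprows=4)
# Second line: keep only rows whose PRES value is actually numeric,
# which drops the units row.
df = df[pd.to_numeric(df['PRES'], errors='coerce').notnull()]
print(df)
```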


3 Comments

It worked, but as you said, it's unable to filter the metadata between the files.
on_bad_lines='skip' is causing issues. On encountering a column with no data, it takes values from the adjacent columns. It just shifts the columns, messing up the whole data.
The column shifting is most probably caused by the regex delimiter; you need to find a better expression to identify columns with missing data. Maybe something like '\s{2,5}' could help ('\s+' is greedy and will match as many spaces as it can in one go, while '\s{2,5}' matches from 2 to 5). To clean out the trash rows, I would try applying the following line to different columns, other than the suggested 'PRES'; it will remove rows without numerical data: df[pd.to_numeric(df['PRES'], errors='coerce').notnull()]
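To illustrate the shifting the comments describe: with a made-up fixed-width row where TEMP is blank, sep=r'\s+' collapses the empty field, while a fixed-width read (pandas' read_fwf, not mentioned in the thread but suited to column-aligned files like these) keeps the alignment. The 7-character column width is an assumption for this example:

```python
import io
import pandas as pd

names = ['PRES', 'HGHT', 'TEMP', 'DWPT', 'RELH']

# A made-up observation row with the TEMP field left blank
# (7-character columns are an assumption for this illustration).
row = " 1000.0" + "    111" + " " * 7 + "    5.1" + "     71"

# With sep=r'\s+' the blank field disappears and later values shift left:
shifted = pd.read_csv(io.StringIO(row), sep=r'\s+', names=names)
# shifted['TEMP'] now holds DWPT's 5.1, shifted['DWPT'] holds RELH's 71,
# and shifted['RELH'] is NaN -- the shifting described above.

# Reading the same row as fixed-width columns preserves the alignment,
# so the blank TEMP stays NaN and the other values land where they belong.
fixed = pd.read_fwf(io.StringIO(row), widths=[7] * 5, names=names)
```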
