
I want to read the files present in this folder, uwyo, as a data frame while skipping the rows in between the observation data. I want to read every observation block, each of which starts at the keyword 'pressure'.

For that I thought of using pandas and then searching for the word 'pressure', but I got the following error.

import pandas as pd
import glob
import numpy as np
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

dfs = []
for fname in glob.glob('*.txt'):
    df = pd.read_csv(fname, sep=r'\s+', header=None)
    dfs.append(df)

ParserError: Error tokenizing data. C error: Expected 9 fields in line 5, saw 11

Is there an efficient way to do this? I want to skip the station information and all the text present in between the observations.
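A minimal sketch of one way to do this, assuming each file contains a header line naming the columns (including PRES) followed by the observation rows; the sample text and column names below are made up to stand in for a real uwyo file:

```python
import io
import pandas as pd

# Hypothetical sample mimicking a uwyo sounding file: station metadata,
# a header line naming the columns (including PRES), a units line, then
# whitespace-separated observation rows.
sample = """72456 TOP Topeka Observations
Station latitude: 39.07
-----------------------------------------------------------
   PRES   HGHT   TEMP   DWPT   RELH
    hPa     m      C      C      %
-----------------------------------------------------------
 1000.0    111   10.2    5.1     71
  925.0    780    6.4    2.0     73
"""

def read_sounding(text):
    """Skip everything above the PRES header line, then parse the rest."""
    lines = text.splitlines()
    # Locate the header row by the keyword the question mentions.
    header_idx = next(i for i, line in enumerate(lines) if 'PRES' in line)
    names = lines[header_idx].split()
    body = '\n'.join(lines[header_idx + 1:])
    df = pd.read_csv(io.StringIO(body), sep=r'\s+', names=names,
                     on_bad_lines='skip')
    # Drop the units line and separator rows: keep rows where PRES is numeric.
    return df[pd.to_numeric(df['PRES'], errors='coerce').notnull()]

df = read_sounding(sample)
print(df)
```

For real files you would open each path from glob and pass its contents to the helper; the same header search works per observation block if a file contains several.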

1 Answer

Try it as: pd.read_csv(fname, sep=r'\s+', on_bad_lines='skip', skiprows=4)

This will read the file with a lot of trash, though. Also, missing values in the txt file will end up in the wrong column.

I would recommend identifying the timestamps you have available and adding a column for them, as well as identifying and removing the metadata present between each period of observations.

This will require some pre-work on the data :D

Edit:

Sorry, forgot to add this as a second line above (it will filter out most of the trash): df[pd.to_numeric(df['PRES'], errors='coerce').notnull()]
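The suggested read and the filter line together can be sketched as follows, on made-up file contents (the skiprows=4 count happens to match this sample; real files may need a different offset):

```python
import io
import pandas as pd

# Made-up file contents: four metadata lines, then the PRES header,
# a units line, and two observation rows.
text = """72456 TOP Topeka Observations
Station latitude: 39.07
Station longitude: -95.62
-----------------------------------------------------------
   PRES   HGHT   TEMP   DWPT   RELH
    hPa     m      C      C      %
 1000.0    111   10.2    5.1     71
  925.0    780    6.4    2.0     73
"""

# skiprows=4 jumps past the metadata so the PRES line becomes the header.
df = pd.read_csv(io.StringIO(text), sep=r'\s+',
                 on_bad_lines='skip', skiprows=4)
# Second line: keep only rows whose PRES value is actually numeric,
# which drops the units row.
df = df[pd.to_numeric(df['PRES'], errors='coerce').notnull()]
print(df)
```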


3 Comments

It worked, but as you said, it's unable to filter the metadata between the files.
on_bad_lines='skip' is causing issues. On encountering a column with no data, it takes values from the adjacent columns. It just shifts the columns, messing up the whole data.
The column shifting is most probably caused by the regex delimiter; you need to find a better expression to identify columns with missing data. Maybe something like '\s{2,5}' could help ('\s+' is greedy and will match as many spaces as it can in one go, while '\s{2,5}' matches from 2 to 5). To clean out the trash rows, I would try applying the following line to different columns, other than the suggested 'PRES'; it will remove rows without numerical data: df[pd.to_numeric(df['PRES'], errors='coerce').notnull()]
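To illustrate the shifting the comments describe: with a made-up fixed-width row where TEMP is blank, sep=r'\s+' collapses the empty field, while a fixed-width read (pandas' read_fwf, not mentioned in the thread but suited to column-aligned files like these) keeps the alignment. The 7-character column width is an assumption for this example:

```python
import io
import pandas as pd

names = ['PRES', 'HGHT', 'TEMP', 'DWPT', 'RELH']

# A made-up observation row with the TEMP field left blank
# (7-character columns are an assumption for this illustration).
row = " 1000.0" + "    111" + " " * 7 + "    5.1" + "     71"

# With sep=r'\s+' the blank field disappears and later values shift left:
shifted = pd.read_csv(io.StringIO(row), sep=r'\s+', names=names)
# shifted['TEMP'] now holds DWPT's 5.1, shifted['DWPT'] holds RELH's 71,
# and shifted['RELH'] is NaN -- the shifting described above.

# Reading the same row as fixed-width columns preserves the alignment,
# so the blank TEMP stays NaN and the other values land where they belong.
fixed = pd.read_fwf(io.StringIO(row), widths=[7] * 5, names=names)
```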
