
I was wondering if there's a way to iterate through each row in a CSV file using Pandas to identify whether a word is found in that row (similar to using grep on Linux systems). It doesn't matter which column the word is found in; as long as the word is found, the entire row will be parsed. I discovered the iterrows() function, but I read that it's very inefficient if the file contains over 1,000 rows, and my program could be reading over 100,000 rows. Any suggestions are greatly appreciated!

#Code was tested using Python v3.9.5
import os
import pandas as pd

def parse_row(grep_value):

    global import_file_path
    global export_file_path

    #Initializes loop counter for folder name
    folder_counter = 0
    path = os.path.join(export_file_path, "File Parser Exports")

    #Creates extra directory if current directory exists
    while os.path.isdir(path): 

        #Appends a number to the name of the folder
        folder_counter += 1
        path = os.path.join(export_file_path, "File Parser Exports" + " (" + str(folder_counter) + ")")

    #Creates folder for exports after finding a folder name that is available
    os.mkdir(path)

    #Export file path for parsed file
    full_export_path = os.path.join(path, "Export.csv")

    file_count = 0    #Initializer for file number of exported files
    tmp_export_path = full_export_path    #Temporary place holder for slicing export path

    #Reads file with headers
    file_data = pd.read_csv(import_file_path, lineterminator='\n')

    #Iterate through file
    for index, row in file_data.iterrows():
        print(index)
        print(row)

    #Checks if export file exists in the newly created directory
    while os.path.isfile(full_export_path):
        
        #Appends a number to the file name
        file_count += 1
        tmp_export_path = tmp_export_path.rsplit('.', 1)[0]
        file_name = "-" + str(file_count) + ".csv"
        full_export_path = tmp_export_path + file_name

    #Exports file after finding a file name that is available
    file_data.to_csv(full_export_path, index=False)

    print()
    print("File(s) exported to \"" + path + "\"")
    print("Successfully completed!")

export_file_path = "C:\\Users\\exportpath"
import_file_path = "C:\\Users\\importpath"
grep_value = "The"

parse_row(grep_value)

2 Answers


I made a sample dataframe:

dd = pd.DataFrame({'name':['pete','reuben','michelle'],
                   'number':[1,2,3],"lunch":['pizza','hamburger','reuben']})

and I suggest doing this to obtain the matching rows:

dd[dd[dd.columns[dd.dtypes == 'object']]\
    .apply(lambda x: ' '.join(x), axis=1).str.contains('reuben')]

From left to right, the code: 1) pulls out the columns that are objects (strings), 2) joins them into one long string per row, and 3) checks that string for the keyword.

to get valid indices:

matches = dd.index[dd[dd.columns[dd.dtypes =='object']]\
    .apply(lambda x: ' '.join(x),axis=1).str.contains('reuben')]
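
To tie this back to the question's export step, here's a minimal runnable sketch of the same idea; the keyword and output path are placeholders, not values taken from the question:

import pandas as pd

#Sample data from above
dd = pd.DataFrame({'name': ['pete', 'reuben', 'michelle'],
                   'number': [1, 2, 3],
                   'lunch': ['pizza', 'hamburger', 'reuben']})

#Boolean mask: True for every row whose string columns contain the keyword
mask = dd[dd.columns[dd.dtypes == 'object']]\
    .apply(lambda x: ' '.join(x), axis=1).str.contains('reuben')

matches = dd.index[mask]      #row numbers of the matching rows
matching_rows = dd[mask]      #the full matching rows

#Export only the matching rows (placeholder path)
matching_rows.to_csv("Export.csv", index=False)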

2 Comments

I think this answer works for me. If you don't mind, how would I extract the row number from the rows that return True?
@AndyGarcia I hope I've addressed your question with the last code block, which I just added.

try something like this:

cols = df.columns.tolist()
df['flag'] = False
# iterate by column, which is faster than iterating over rows
for col in cols:
    df['flag'] |= df[col].str.contains('your_str')

2 Comments

I'm using the string "The" and the code returns an error: "raise AttributeError ("Can only use .str accessor with string values!")"
What are the dtypes of your df? You might want to do df = df.astype(str) before everything.
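
Putting that astype(str) suggestion together with the column-wise loop above, a rough runnable sketch; the input path, search term, and output path are placeholders:

import pandas as pd

df = pd.read_csv("input.csv")    #placeholder input file
search_term = "The"

#Cast every column to string so .str.contains works on numeric columns too
str_df = df.astype(str)

df['flag'] = False
for col in str_df.columns:
    df['flag'] |= str_df[col].str.contains(search_term, na=False)

#Keep only the rows where the term was found (placeholder output path)
df[df['flag']].drop(columns='flag').to_csv("Export.csv", index=False)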
