First off, what you currently have is perfectly fine. The only thing I would suggest is to use pathlib.Path instead of manually using format and consistently follow the PEP8 naming scheme by using lower_case and having spaces after commas in argument lists:

from pathlib import Path

file_name = Path(f).with_suffix(".txt")
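
For example, continuing from the import above (f is a purely illustrative input path):

# Hypothetical input path; with_suffix swaps the extension in one call
f = "folder/data_2019.csv"
file_name = Path(f).with_suffix(".txt")
str(file_name)  # 'folder/data_2019.txt'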

I see two ways you could take this code. One direction is making it simpler and more memory efficient by using normal Python to parse the files. This would allow you to make the code fully streaming and process all files in one go:

import csv

def filter_column(files, column, value):
    out_file_name = ...
    # newline="" is what the csv module recommends when opening files for it
    with open(out_file_name, "w", newline="") as out_file:
        writer = csv.writer(out_file, delimiter=";")
        for file_name in files:
            with open(file_name, newline="") as in_file:
                reader = csv.reader(in_file, delimiter=";")
                # the header row gives the index of the column to filter on
                col = next(reader).index(column)
                writer.writerows(row for row in reader if row[col] != value)

This has almost no memory consumption, because writerows should consume the generator expression lazily, one row at a time. If you want to be explicit about it (so that only a single row ever needs to fit into memory), replace the writerows call with a loop:

                for row in reader:
                    if row[col] != value:
                        writer.writerow(row)
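
A minimal usage sketch (the file names are placeholders; the column and value mirror the dask example below):

# Hypothetical call: keep only the rows whose LIB_SEGM_SR_ET column is not
# "PARTICULIERS", across both input files, in a single pass
filter_column(["file1.csv", "file2.csv"], "LIB_SEGM_SR_ET", "PARTICULIERS")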

The other possibility is to go parallel and distributed, using something like dask:

import dask.dataframe as dd

files = "file1.csv", "file2.csv"
df = dd.read_csv(files, sep=";")
df_out = df[df.LIB_SEGM_SR_ET != "PARTICULIERS"]
file_name = ...
df_out.compute().to_csv(file_name, index=False, sep=";")

This gives you the full ease of using pandas and splits the task into batches that fit into memory behind the scenes.
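
Note that compute() pulls the entire filtered result into memory as a pandas dataframe before writing. If even that could be too large, you can let dask write the output itself; a sketch, assuming one output file per partition is acceptable (the file name pattern is just an example):

# dask expands the "*" to one file per partition, so the filtered data never
# has to be collected in memory before writing
df_out.to_csv("filtered-*.csv", sep=";", index=False)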
