
I am trying to merge around 5000 CSV files into one CSV. The structure of the individual files is the same, so the code should be simple; however, I keep getting a "file not found" error.

here is the code:

import glob
import os

import pandas as pd

csv_paths = set(glob.glob("folder_containing_csvs/*.csv"))
full_csv_path = "folder_containing_csvs/full_df.csv"
csv_paths -= {full_csv_path}
for csv_path in csv_paths:
    print("csv_path", csv_path)
    df = pd.read_csv(csv_path, sep="\t")
    # sort columns so every file appends in the same order
    df[sorted(df.columns)].to_csv(full_csv_path, mode="a",
                                  header=not os.path.isfile(full_csv_path),
                                  sep="\t", index=False)
full_df = pd.read_csv(full_csv_path, sep="\t", encoding='utf-8')
full_df

The code produced the following error message:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-47-11ffadd03e3e> in <module>
----> 1 full_df = pd.read_csv(full_csv_path, sep="\t", encoding='utf-8')
      2 full_df

~/.local/lib/python3.6/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer,
sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, 
engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, 
nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, 
infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, 
chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, 
escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, 
low_memory, memory_map, float_precision)
    686     )
    687 
--> 688     return _read(filepath_or_buffer, kwds)
    689 
    690 

~/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    452 
    453     # Create the parser.
--> 454     parser = TextFileReader(fp_or_buf, **kwds)
    455 
    456     if chunksize or iterator:

~/.local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    946             self.options["has_index_names"] = kwds["has_index_names"]
    947 
--> 948         self._make_engine(self.engine)
    949 
    950     def close(self):

~/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1178     def _make_engine(self, engine="c"):
   1179         if engine == "c":
--> 1180             self._engine = CParserWrapper(self.f, **self.options)
   1181         else:
   1182             if engine == "python":

~/.local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1991         if kwds.get("compression") is None and encoding:
   1992             if isinstance(src, str):
-> 1993                 src = open(src, "rb")
   1994                 self.handles.append(src)
   1995 

FileNotFoundError: [Errno 2] No such file or directory: 'folder_containing_csvs/full_df.csv'
  • If they are csv files, why don't you just open('merge.csv','w').write(open('file1.csv').read() + open('file2.csv').read())? If there is a header, then remove the header first; see the sketch below.
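A minimal sketch of that idea, assuming every file shares an identical single header line (merged.csv and the glob pattern are placeholders, not names from the question):

import glob

csv_paths = sorted(glob.glob("folder_containing_csvs/*.csv"))

with open("merged.csv", "w") as out:
    for i, path in enumerate(csv_paths):
        with open(path) as f:
            lines = f.readlines()
        # keep the header line only from the first file
        out.writelines(lines if i == 0 else lines[1:])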

2 Answers


The paths returned by glob are relative to the current working directory (the directory the script is executed from), not to the script's own location.

If you have a file structure like this:

~/code/ |
       | merge.py
       | folder_containing_csvs/  |
                                  | file1.csv
                                  | file2.csv

merge.py must then be executed from the ~/code folder.

e.g.

~/code$ python merge.py

Doing something like

~/$ python ./code/merge.py

will result in

FileNotFoundError: [Errno 2] No such file or directory: 'folder_containing_csvs/full_df.csv'
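Alternatively, you can make the script independent of the working directory by resolving paths relative to the script file itself. A minimal sketch using pathlib, assuming the folder layout shown above:

from pathlib import Path

import pandas as pd

# resolve the data folder relative to this script, not the shell's CWD
base_dir = Path(__file__).resolve().parent
csv_dir = base_dir / "folder_containing_csvs"

full_csv_path = csv_dir / "full_df.csv"
csv_paths = set(csv_dir.glob("*.csv")) - {full_csv_path}

for csv_path in csv_paths:
    df = pd.read_csv(csv_path, sep="\t")
    df[sorted(df.columns)].to_csv(full_csv_path, mode="a",
                                  header=not full_csv_path.is_file(),
                                  sep="\t", index=False)

With this, ~/$ python ./code/merge.py and ~/code$ python merge.py behave the same.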


1 Comment

After moving the data to the /code folder, the code works perfectly. Thank you for explaining the necessary file structure for using glob.

Try this:

import os

import pandas as pd

loc_path = "/path/to/folder/of/csvs/"
files = [file for file in os.listdir(loc_path) if file.endswith('.csv')]

# now load them into a list of DataFrames
dfs = []
for file in files:
    dfs.append(pd.read_csv(os.path.join(loc_path, file), sep='\t'))

# concat the dfs list into a single DataFrame
df = pd.concat(dfs)
# then write it out with df.to_csv, as shown below
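A hedged example of that final write (full_df.csv is a hypothetical output name; if it is written inside loc_path, exclude it from files first so a re-run does not read it back in as input):

out_path = os.path.join(loc_path, 'full_df.csv')  # hypothetical output name
df.to_csv(out_path, sep='\t', index=False)  # index=False avoids an extra index column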

Just read the 5000 csv sheets part. How many rows are you expecting?

