Construct dataframe from multiple files where each file contains column data

Question

I have a folder that contains multiple Excel files:

column B.xlsx
column A.xlsx
column C.xlsx
...

**These aren't the actual file names. The actual file names are more specific than this.

Each Excel file contains the data for a single column of a larger dataframe I want to create. The files are formatted like so:

column A.xlsx:

Date | ID | Mass | Units
1/21    A   5.10     g
2/21    B   5.12     g
3/21    C   5.11     g

column B.xlsx:

Date | ID | Mass | Units
1/21    A   6.10     g
2/21    B   6.12     g
3/21    C   6.11     g

The large dataframe I'd like to create would look like this:

ID | Column A | Column B | Column C|....
A     5.10        6.10
B     5.12        6.12    
C     5.11        6.11

It's important that the data is assigned to the correct columns, but the only indication of which column the data belongs to is the file name.

I wrote this code, which does the job, but there has to be a better way:

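import glob
import pandas as pd

files = glob.glob(r"C:\my\directory/*.xlsx")

bigDF = pd.DataFrame(columns=["ID", "A", "B", "C"])
# take the ID column from the first file
temp = pd.read_excel(files[0])
bigDF["ID"] = temp["ID"]
for f in files:
    temp = pd.read_excel(f)
    # pick the target column from the letter in the file name
    if "A" in f:
        bigDF["A"] = temp["Mass"]
    elif "B" in f:
        bigDF["B"] = temp["Mass"]
    elif "C" in f:
        bigDF["C"] = temp["Mass"]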

It_is_Chris · Accepted Answer · 2021-04-27 19:12:39Z


import glob
import pandas as pd

# get your files
files = glob.glob('*.xlsx')
# read each file, set ID as the index and select the Mass column,
# rename that column after the file, then concatenate everything column-wise
df = pd.concat([pd.read_excel(file).set_index('ID')['Mass'].rename(file.split('.')[0]) for file in files], axis=1)

The list comprehension above is essentially doing:

# iterate through your files
for file in files:
    # read each file into memory, set the index, select the Mass column,
    # then rename the column to the file name (without the extension)
    pd.read_excel(file).set_index('ID')['Mass'].rename(file.split('.')[0])
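One caveat: `file.split('.')[0]` keeps any directory prefix and cuts the name at the first dot. A minimal sketch of the same idea using `pathlib.Path.stem` for the column name is shown here. The in-memory frames stand in for what `pd.read_excel` would return, and the paths, `frames_by_path`, and `mass_column` are hypothetical names used only for illustration.

```python
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for pd.read_excel(path); the paths are made up.
frames_by_path = {
    "data/column A.xlsx": pd.DataFrame({
        "Date": ["1/21", "2/21", "3/21"],
        "ID": ["A", "B", "C"],
        "Mass": [5.10, 5.12, 5.11],
        "Units": ["g", "g", "g"],
    }),
    "data/column B.xlsx": pd.DataFrame({
        "Date": ["1/21", "2/21", "3/21"],
        "ID": ["A", "B", "C"],
        "Mass": [6.10, 6.12, 6.11],
        "Units": ["g", "g", "g"],
    }),
}


def mass_column(path, frame):
    """Return the Mass column indexed by ID and named after the file stem."""
    return frame.set_index("ID")["Mass"].rename(Path(path).stem)


# One Series per file, aligned on ID and placed side by side.
df = pd.concat(
    [mass_column(path, frame) for path, frame in frames_by_path.items()],
    axis=1,
)
print(df)
```

With real files, the dictionary would be replaced by a loop over the paths returned by `glob.glob`, calling `pd.read_excel` on each one.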

edited Apr 27, 2021 at 19:12

answered Apr 27, 2021 at 19:05


1 Comment

It_is_Chris Over a year ago

FYI: this assumes that there is one date for each ID and that the dates are the same in each file. If that is not the case, you will need to add Date to the index as well: set_index(['ID', 'Date'])
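To illustrate that comment, here is a small sketch in which the same ID appears on more than one date, so the rows are aligned on both ID and Date. The data and the names `frames` and `wide` are made up for the example; the frames stand in for the result of `pd.read_excel`.

```python
import pandas as pd

# Hypothetical data: ID "A" is measured on two different dates.
frames = {
    "column A": pd.DataFrame({
        "Date": ["1/21", "2/21", "1/21"],
        "ID": ["A", "A", "B"],
        "Mass": [5.10, 5.12, 5.11],
    }),
    "column B": pd.DataFrame({
        "Date": ["1/21", "2/21", "1/21"],
        "ID": ["A", "A", "B"],
        "Mass": [6.10, 6.12, 6.11],
    }),
}

# Index on (ID, Date) so each row is identified by both values.
wide = pd.concat(
    [frame.set_index(["ID", "Date"])["Mass"].rename(name)
     for name, frame in frames.items()],
    axis=1,
)

# Turn ID and Date back into ordinary columns if that is more convenient.
flat = wide.reset_index()
print(flat)
```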

Nk03 · Accepted Answer · 2021-04-27 19:46:01Z

Using merge and reduce: the idea is to take the ['ID', 'Mass'] subset of each dataframe and then merge them all on the ID column.

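from functools import reduce
import pandas as pd

use_cols = ['ID', 'Mass']
data_frames = [df1, df2, df3, df4]  # one dataframe per file
# keep only ID and Mass; give each Mass column a unique name so that
# repeated merges do not collide on the shared 'Mass' label
data_frames = [df[use_cols].rename(columns={'Mass': f'Mass_{i}'})
               for i, df in enumerate(data_frames)]
final_df = reduce(lambda left, right: pd.merge(left, right, on=['ID'], how='outer'),
                  data_frames)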

To load the dataframes directly into the data_frames list, use the following (pass the required path to the Path constructor; Path('.') means the current directory):

from pathlib import Path
import pandas as pd

# the read_excel keyword is usecols (no underscore)
# you can convert this list comprehension to a generator if required
data_frames = [pd.read_excel(xlsx_file, usecols=['ID', 'Mass']) for xlsx_file in Path('.').glob('**/*.xlsx')]

Finally, to rename the columns you can use:

# produces 'Column 0', 'Column 1', ... in the order the dataframes were merged
new_cols = [f'Column {i}' for i in range(len(final_df.columns) - 1)]
new_cols.insert(0, 'ID')
final_df.columns = new_cols
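Since the question ties each column to its file name, one possible variation is to rename Mass to the intended column name before merging, so that no positional renaming is needed afterwards. This is only a sketch: `named_frames` and `renamed` are hypothetical names, the data is made up, and the frames stand in for the output of `pd.read_excel`. One frame deliberately lacks an ID to show what the outer merge does.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames keyed by the column name taken from each file name.
named_frames = {
    "Column A": pd.DataFrame({"ID": ["A", "B", "C"], "Mass": [5.10, 5.12, 5.11]}),
    "Column B": pd.DataFrame({"ID": ["A", "B", "C"], "Mass": [6.10, 6.12, 6.11]}),
    "Column C": pd.DataFrame({"ID": ["A", "B"], "Mass": [7.10, 7.12]}),  # no "C"
}

# Rename Mass up front so every frame contributes a uniquely named column.
renamed = [
    frame[["ID", "Mass"]].rename(columns={"Mass": name})
    for name, frame in named_frames.items()
]

# The outer merge keeps every ID; IDs missing from a file become NaN.
final_df = reduce(
    lambda left, right: pd.merge(left, right, on="ID", how="outer"),
    renamed,
)
print(final_df)
```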
