I have a folder that contains multiple Excel files:

column B.xlsx
column A.xlsx
column C.xlsx
...

**These aren't the actual file names; the real ones are more specific than this.

Each Excel file contains the data for a single column of a larger dataframe I want to create. The files are formatted like so:

column A.xlsx:

Date | ID | Mass | Units
1/21    A   5.10     g
2/21    B   5.12     g
3/21    C   5.11     g

column B.xlsx:

Date | ID | Mass | Units
1/21    A   6.10     g
2/21    B   6.12     g
3/21    C   6.11     g

The large dataframe I'd like to create would look like this:

ID | Column A | Column B | Column C|....
A     5.10        6.10
B     5.12        6.12    
C     5.11        6.11     

It's important that the data is assigned to the correct columns, but the only indication of which column the data belongs to is the file name.

I wrote this code, which does the job, but there has to be a better way:

import glob
import pandas as pd

files=glob.glob(r"C:\my\directory/*.xlsx")

bigDF=pd.DataFrame(columns=["ID","A","B","C"])
temp=pd.read_excel(files[0])
bigDF["ID"]=temp["ID"]
for f in files:
    temp=pd.read_excel(f)
    if "A" in f:
        bigDF["A"]=temp["Mass"]
    elif "B" in f:
        bigDF["B"]=temp["Mass"]
    elif "C" in f:
        bigDF["C"]=temp["Mass"]

2 Answers

import glob
import pandas as pd

# get your files
files = glob.glob('*.xlsx')
# read each file, set ID as the index, select the Mass column and name it after the file;
# the list comprehension builds one Series per file and concat joins them side by side
df = pd.concat([pd.read_excel(file).set_index('ID')['Mass'].rename(file.split('.')[0]) for file in files], axis=1)

The list comprehension above is essentially doing:

# iterate through your files
for file in files:
    # read each file into memory, set the index, select the Mass column,
    # then rename the column to the file name (without its extension)
    pd.read_excel(file).set_index('ID')['Mass'].rename(file.split('.')[0])
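
One caveat on the naming step: file.split('.')[0] works for the bare names that glob.glob('*.xlsx') returns, but with a directory pattern like the one in the question the result keeps the directory prefix, and a dot anywhere earlier in the path cuts the name short. A minimal sketch of an alternative using pathlib's stem follows. column_name is a hypothetical helper, the paths are the question's placeholder names, and PureWindowsPath is used only so the Windows-style paths parse on any OS (on Windows itself, Path would do).

```python
from pathlib import PureWindowsPath


def column_name(file_path: str) -> str:
    """Return the file name without its directory or final extension."""
    return PureWindowsPath(file_path).stem


# Placeholder paths modelled on the question; the real names differ.
paths = [
    r"C:\my\directory\column A.xlsx",
    "C:/my/directory/column B.xlsx",
]
names = [column_name(p) for p in paths]
print(names)
```

The returned name can be passed straight to .rename(...) in place of file.split('.')[0].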

1 Comment

FYI - this assumes that there is one date for each ID and that the dates are the same in each file. If that is not the case, you will need to add the date to the index as well: set_index(['ID', 'Date'])
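
To illustrate the two-level index, here is a small sketch with in-memory frames standing in for the Excel files. The frames dict and its values are hypothetical (an ID repeated across dates, which the question's sample data does not show); in practice each frame would come from pd.read_excel.

```python
import pandas as pd

# Hypothetical stand-ins for pd.read_excel("column A.xlsx") and so on.
# ID "A" appears on two dates, so ID alone is not a unique key.
frames = {
    "column A": pd.DataFrame({
        "Date": ["1/21", "2/21", "2/21"],
        "ID": ["A", "A", "B"],
        "Mass": [5.10, 5.12, 5.11],
        "Units": ["g", "g", "g"],
    }),
    "column B": pd.DataFrame({
        "Date": ["1/21", "2/21", "2/21"],
        "ID": ["A", "A", "B"],
        "Mass": [6.10, 6.12, 6.11],
        "Units": ["g", "g", "g"],
    }),
}

# Index on (ID, Date) so each row is identified by both values,
# then place the Mass series side by side, one column per source.
combined = pd.concat(
    [df.set_index(["ID", "Date"])["Mass"].rename(name) for name, df in frames.items()],
    axis=1,
)
print(combined)
```

Calling combined.reset_index() afterwards turns ID and Date back into ordinary columns if a flat table is preferred.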

Using merge and reduce: the idea is to take the ID and Mass columns from each dataframe and then merge all of them on the ID column.

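from functools import reduce
import pandas as pd

use_cols = ['ID', 'Mass']
# df1..df4 are the dataframes already read from the Excel files
data_frames = [df1, df2, df3, df4]
# keep only ID and Mass, giving each Mass column a unique name so repeated merges don't collide
data_frames = [df[use_cols].rename(columns={'Mass': f'Mass_{i}'}) for i, df in enumerate(data_frames)]
final_df = reduce(lambda left, right: pd.merge(left, right, on=['ID'],
                                               how='outer'), data_frames)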
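
Every frame carries a column named Mass, so the merged result needs a distinct name per source. The sketch below names each Mass column after its source before merging, which also ties the data to the right column from the start. The frames dict is a hypothetical stand-in for the loaded files, and the Column C and Column D values are made up for illustration.

```python
from functools import reduce

import pandas as pd

# Hypothetical in-memory frames; in practice the keys would be derived
# from the file names and the values read with pd.read_excel.
frames = {
    "Column A": pd.DataFrame({"ID": ["A", "B", "C"], "Mass": [5.10, 5.12, 5.11]}),
    "Column B": pd.DataFrame({"ID": ["A", "B", "C"], "Mass": [6.10, 6.12, 6.11]}),
    "Column C": pd.DataFrame({"ID": ["A", "B", "C"], "Mass": [7.10, 7.12, 7.11]}),
    "Column D": pd.DataFrame({"ID": ["A", "B", "C"], "Mass": [8.10, 8.12, 8.11]}),
}

# Rename Mass to the source name first, so no two frames share a
# non-key column when they are merged.
renamed = [
    df[["ID", "Mass"]].rename(columns={"Mass": name})
    for name, df in frames.items()
]

# Fold the list into one frame with successive outer merges on ID.
final_df = reduce(
    lambda left, right: pd.merge(left, right, on="ID", how="outer"),
    renamed,
)
print(final_df)
```

With the names assigned up front, a positional rename after the merge is no longer needed.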

To load the dataframes directly into the data_frames list, use the following (pass the required path to the Path constructor; Path('.') means the current directory):

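from pathlib import Path
import pandas as pd

# usecols limits the read to the ID and Mass columns; '**/*.xlsx' also searches subdirectories
data_frames = [pd.read_excel(xlsx_file, usecols=['ID', 'Mass']) for xlsx_file in Path('.').glob('**/*.xlsx')]  # can be turned into a generator expression if required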

Finally, to rename the columns you can use:

new_cols  = [f'Column {i}' for i in range(len(final_df.columns.values[1:]))]
new_cols.insert(0,'ID')
final_df.columns = new_cols
