1

How do I concatenate multiple xlsx files with the same sheet_names. For example, I have 3 xlsx files, Rob_schedule.xlsx, Mike_schdule.xlsx and Jerome_schedule.xlsx.

Each file has the following sheet/tab names : home, office & school.

The code below generates the 3 xlsx files ( you can copy + paste and run to generate the excel files)

##############################Generating the data for Rob_schedule.xlsx########################
import pandas as pd
import numpy as np
df= {
    'Date':[10232020,10242020,10252020,10262020],
    'Class':['AP_Bio','AP_Chem','Physics','History'],
    'Period':[3,1,2,4]}
school = pd.DataFrame(df,columns = ['Date','Class','Period'])
school

df2= {
    'Date':[10232020,10242020,10252020,10262020],
    'Meeting':['MQ1','MQ6','MQ2','MQ8'],
    'Lunch':[1,1,1,3],
    'code':['java','python','C','C++']}
office = pd.DataFrame(df2,columns = ['Date','Meeting','Lunch','code'])
office

df3= {
    'cooking':['C','B','D','B'],
    'Laundry':['color','white','White','color'],
    'cleaning':['balcony','garage','restroom','bathroom']}
home = pd.DataFrame(df3,columns = ['cooking','Laundry','cleaning'])
home

import pandas as pd
#initialze the excel writer
writer = pd.ExcelWriter('Rob_schedule.xlsx', engine='xlsxwriter')

#store your dataframes in a  dict, where the key is the sheet name you want
frames = {'home':home, 'office':office,
        'school':school}

#now loop thru and put each on a specific sheet
for sheet, frame in  frames.items(): 
    frame.to_excel(writer, sheet_name = sheet,index = False)

#critical last step
writer.save()

################################ generating Mike_schedule.xlsx###################################
import pandas as pd
import numpy as np
df= {
    'Date':[10232020,10242020,10252020,10262020],
    'Class':['AP_Bio','AP_Chem','Physics','History'],
    'Period':[3,1,2,4]}
school = pd.DataFrame(df,columns = ['Date','Class','Period'])
school

df2= {
    'Date':[10232020,10242020,10252020,10262020],
    'Meeting':['MQ1','MQ2','MQ4','MQ5'],
    'Lunch':[1,1,1,3],
    'code':['javascript','R','C','C++']}
office = pd.DataFrame(df2,columns = ['Date','Meeting','Lunch','code'])
office

df3= {
    
    'cooking':['A','B','D','B'],
    'Laundry':['color','white','white','color'],
    'cleaning':['patio','garage','living_room','bathroom']}
home = pd.DataFrame(df3,columns = ['cooking','Laundry','cleaning'])
home


#initialze the excel writer
writer = pd.ExcelWriter('Mike_schedule.xlsx', engine='xlsxwriter')

#store your dataframes in a  dict, where the key is the sheet name you want
frames = {'home':home, 'office':office,
        'school':school}

#now loop thru and put each on a specific sheet
for sheet, frame in  frames.items(): # .use .items for python 3.X
    frame.to_excel(writer, sheet_name = sheet,index = False)

#critical last step
writer.save()

######################### Generate Jerome schedule###########################################
df= {
    'Date':[10232020,10242020,10252020,10262020],
    'Class':['French','Math','Physics','History'],
    'Period':[3,1,2,4]}
school = pd.DataFrame(df,columns = ['Date','Class','Period'])
school

df2= {
    'Date':[10232020,10242020,10252020,10262020],
    'Meeting':['MQ1','MQ2','MQ4','MQ5'],
    'Lunch':[1,1,1,3],
    'code':['javascript','python','R','C++']}
office = pd.DataFrame(df2,columns = ['Date','Meeting','Lunch','code'])
office

df3= {
    
    'cooking':['X','B','D','C'],
    'Laundry':['color','white','white','color'],
    'cleaning':['patio','garage','living_room','bathroom']}
home = pd.DataFrame(df3,columns = ['cooking','Laundry','cleaning'])
home

import pandas as pd
#initialze the excel writer
writer = pd.ExcelWriter('Jerome_schedule.xlsx', engine='xlsxwriter')

#store your dataframes in a  dict, where the key is the sheet name you want
frames = {'home':home, 'office':office,
        'school':school}

#now loop thru and put each on a specific sheet
for sheet, frame in  frames.items(): # .use .items for python 3.X
    frame.to_excel(writer, sheet_name = sheet,index = False)

#critical last step
writer.save()

I want to

  1. concatenate the corresponding sheets/tabs :home, office, and school for Rob_schedule.xlsx,Mike_schedule.xlsx & Jerome_schedule.xlsx

  2. export the concatenated dataframes as family_schedule.xlsx with home, office and school tabs

My attempt:

# This code concatenates all the tabs into one tab, but what I want is to concatenate all by their corresponding sheet/tab names

import pandas as pd

path = os.chdir(r'mypath\\')
files = os.listdir(path)
files
# pull files with `.xlsx` extension
excel_files = [file for file in files if '.xlsx' in file]
excel_files

def create_df_from_excel(file_name):
    file = pd.ExcelFile(file_name)

    names = file.sheet_names

    return pd.concat([file.parse(name) for name in names])

df = pd.concat(
    [create_df_from_excel(xl) for xl in excel_files]
)

# save the data frame
writer = pd.ExcelWriter('family_reschedule.xlsx')
df.to_excel(writer, '')
writer.save()

2
  • Will each workbook always (guaranteed) have the same sheet names, column headings, etc? Do you want the new sheets in the new workbook to have an additional column to say which workbook the rows originally came from? Commented Nov 20, 2021 at 13:17
  • @MatBailie, Yes, they all have the same sheet names, column headings. I want the new sheets in the new workbook, you can add a column to say with workbook the rows originally came from Commented Nov 20, 2021 at 21:40

2 Answers 2

4

I would iterate over each file, and then over each worksheet, adding each sheet to a different list based on the sheet name.

Then you'll have a structure like...

{
  'sheet1': [df_file1_sheet1, df_file2_sheet1, df_file3_sheet1],
  'sheet2': [df_file1_sheet2, df_file2_sheet2, df_file3_sheet2],
  'sheet3': [df_file1_sheet3, df_file2_sheet3, df_file3_sheet3],
}

Then concatenate each list in to a single dataframe, them write the three dataframes to an excel file.

# This part is just your own code, I've added it here because you
# couldn't figure out where `excel_files` came from
#################################################################

import os
import pandas as pd

path = os.chdir(r'mypath\\')
files = os.listdir(path)
files
# pull files with `.xlsx` extension
excel_files = [file for file in files if '.xlsx' in file]
excel_files 

# This part is my actual answer
###############################

from collections import defaultdict

worksheet_lists = defaultdict(list)
for file_name in excel_files:
    workbook = pd.ExcelFile(file_name)
    for sheet_name in workbook.sheet_names:
        worksheet = workbook.parse(sheet_name)
        worksheet['source'] = file_name
        worksheet_lists[sheet_name].append(worksheet)

worksheets = {
    sheet_name: pd.concat(sheet_list)
        for (sheet_name, sheet_list)
        in worksheet_lists.items()
}

writer = pd.ExcelWriter('family_reschedule.xlsx')

for sheet_name, df in worksheets.items():
    df.to_excel(writer, sheet_name=sheet_name, index=False)

writer.save()
Sign up to request clarification or add additional context in comments.

4 Comments

can you explain your code, sheet1,sheet2 and sheet3 in the first block are not saved/assigned as a variable. What's the link between the first block of code and the second block?
can you explain your code, sheet1,sheet2 and sheet3 in the first block are not saved/assigned as a variable. In the for file_name in excel_files:, where is excel_files coming from. Also, What's the link between the first block of code and the second block?
@nasa313 The first block isn't code, it's an example of the data structure you'll get if you follow the logic in the words immediately before it. As for where excel_files comes from, that's in your own code. The answer I was limited to your question; how to process the files. I didn't see the point in repeating the code to identify the files, as you always had wirtten that. Run your code up to the part where you set excel_files, then carry on with my answer.
@parfait Data has to come from somewhere before it reaches where it belongs ;)
2

Consider building a list of concatenated data frames with list/dict comprehensions by running an outer iteration across sheet names and inner iteration across workbooks:

import pandas as pd

path = "/path/to/workbooks"
workbooks = [f for f in os.listdir(path) if f.endswith(".xlsx")]
sheets = ["home", "office", "school"]

df_dicts = {
    sh: pd.concat(
        [pd.read_excel(os.path.join(path, wb), sheet_name=sh) 
         for wb in workbooks]
    )
    for sh in sheets
}

Then, export to single file:

with pd.ExcelWriter('family_reschedule.xlsx') as writer:
   for sh, df in df_dict.items(): 
       df.to_excel(writer, sheet_name=sh, index=False)

   writer.save()

3 Comments

Presumes knowledge of which sheets exist, which was only confirmed by the op after I answered ;)
@Parfait. I like your answer, however, does workbooks = ["Rob_schedule.xlsx", "Mike_schdule.xlsx", "Jerome_schedule.xlsx"] sheets = ["home", "office", "school"] come after this part of my code import pandas as pd path = os.chdir(r'mypath\\') files = os.listdir(path) files # pull files with .xlsx` extension excel_files = [file for file in files if '.xlsx' in file] excel_files` ?
@nasa313 do you even know what your own code was doing? This answer parses the directory specified in path and stores the matching files list in workbooks, your code also parses the directory stored in path and stores the matching files list in excel_files. That's pretty basic stuff.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.