202

I have several dataframes and need to merge them based on the date column. If I only had two dataframes, I could use df1.merge(df2, on='date'); with three, I can use df1.merge(df2.merge(df3, on='date'), on='date'). However, it becomes really complex and unreadable with more dataframes.

All dataframes have one column in common, date, but they don't have the same number of rows or columns, and I only need the rows in which each date is common to every dataframe.

So I'm trying to write a recursive function that returns a dataframe with all the data, but it doesn't work. How should I merge multiple dataframes then?

I tried different ways and got errors like out of range, KeyError: 0/1/2/3, and can not merge DataFrame with instance of type <class 'NoneType'>.

This is the script I wrote:

dfs = [df1, df2, df3] # list of dataframes

def mergefiles(dfs, countfiles, i=0):
    if i == (countfiles - 2): # it gets to the second to last and merges it with the last
        return
    
    dfm = dfs[i].merge(mergefiles(dfs[i+1], countfiles, i=i+1), on='date')
    return dfm

print(mergefiles(dfs, len(dfs)))

An example: df_1:

May 19, 2017;1,200.00;0.1%
May 18, 2017;1,100.00;0.1%
May 17, 2017;1,000.00;0.1%
May 15, 2017;1,901.00;0.1%

df_2:

May 20, 2017;2,200.00;1000000;0.2%
May 18, 2017;2,100.00;1590000;0.2%
May 16, 2017;2,000.00;1230000;0.2%
May 15, 2017;2,902.00;1000000;0.2%

df_3:

May 21, 2017;3,200.00;2000000;0.3%
May 17, 2017;3,100.00;2590000;0.3%
May 16, 2017;3,000.00;2230000;0.3%
May 15, 2017;3,903.00;2000000;0.3%

Expected merge result:

May 15, 2017;  1,901.00;0.1%;  2,902.00;1000000;0.2%;   3,903.00;2000000;0.3%   

13 Answers

341

Short answer

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames)

Long answer

Below is the cleanest, most comprehensible way of merging multiple dataframes if complex queries aren't involved.

Simply merge on the DATE column and use an outer merge (how='outer') to get all the data.

import pandas as pd
from functools import reduce

df1 = pd.read_table('file1.csv', sep=',')
df2 = pd.read_table('file2.csv', sep=',')
df3 = pd.read_table('file3.csv', sep=',')

Now load all the files you have as dataframes into a list, and then merge them using the reduce function.

# compile the list of dataframes you want to merge
data_frames = [df1, df2, df3]

Note: you can add as many dataframes to the list above as you like. That is the good part of this method: no complex queries involved.

To keep the values that belong to the same date, merge on the DATE column:

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames)

# if you want to fill the values that don't exist in the rows of the merged dataframe, simply fill them with a required string, e.g.

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames).fillna('void')
  • Now the output will have the values from the same date on the same lines.
  • You can fill the non-existing data from different frames for different columns using fillna().

Then write the merged data to the csv file if desired.

df_merged.to_csv('merged.txt', sep=',', na_rep='.', index=False)

This should give you

DATE VALUE1 VALUE2 VALUE3 ....


3 Comments

What if the join columns are different, does this work? Should we go with pd.merge in case the join columns are different? (See the sketch after these comments.)
Just a little note: If you're on python3 you need to import reduce from functools
In addition to what @NicolasMartinez mentioned: from functools import reduce # only in Python 3
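Regarding the first comment above: this is not from the original answer, but a minimal sketch of how pd.merge handles differently named join columns. The frames and the column names date_a and date_b here are hypothetical, not from the question.

import pandas as pd

# hypothetical frames whose date columns have different names
left = pd.DataFrame({'date_a': ['May 15, 2017', 'May 18, 2017'], 'x': [1, 2]})
right = pd.DataFrame({'date_b': ['May 15, 2017', 'May 16, 2017'], 'y': [3, 4]})

# left_on/right_on join on columns with different names; how='inner' keeps only common dates
merged = pd.merge(left, right, left_on='date_a', right_on='date_b', how='inner')
print(merged)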
64

Looks like the data has the same columns, so you can:

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

merged_df = pd.concat([df1, df2])

3 Comments

But what if you don't have the same columns?
Nice. If we have the same column to merge on, we can use it.
concat can auto-join by index, so if you have the same columns, set them to the index @Gerard (a sketch of that idea follows below)
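Picking up the last comment: a minimal sketch of setting the common column as the index and concatenating column-wise. It assumes each frame has a 'date' column; a later answer does essentially the same thing with keys.

import pandas as pd

dfs = [df1, df2, df3]  # assumes these dataframes already exist and share a 'date' column

# move the common column into the index, then concatenate column-wise;
# join='inner' keeps only the dates present in every frame
merged = pd.concat([df.set_index('date') for df in dfs], axis=1, join='inner')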
44

functools.reduce and pd.concat are good solutions, but in terms of execution time pd.concat is the best.

from functools import reduce
import pandas as pd

dfs = [df1, df2, df3, ...]
nan_value = 0

# solution 1 (fast)
result_1 = pd.concat(dfs, join='outer', axis=1).fillna(nan_value)

# solution 2
result_2 = reduce(lambda df_left,df_right: pd.merge(df_left, df_right, 
                                              left_index=True, right_index=True, 
                                              how='outer'), 
                  dfs).fillna(nan_value)

3 Comments

do you use on=...?
@Ismail Hachimi But pd.concat cannot left-merge. So for people who want to left-merge multiple dataframes, functools.reduce is the best way to go (see the sketch after these comments).
result_1 is the fastest and joins on the index
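For the left-merge case raised in the comments, a minimal sketch, assuming (as elsewhere in this thread) that every frame has a 'date' column:

from functools import reduce
import pandas as pd

dfs = [df1, df2, df3]  # assumes these dataframes already exist

# keep every row of the first frame and attach matching rows from the rest
df_left_merged = reduce(lambda left, right: pd.merge(left, right, on='date', how='left'), dfs)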
19

There are 2 solutions for this, but they return all columns separately:

import functools
import numpy as np
import pandas as pd

dfs = [df1, df2, df3]

df_final = functools.reduce(lambda left,right: pd.merge(left,right,on='date'), dfs)
print (df_final)
          date     a_x   b_x       a_y      b_y   c_x         a        b   c_y
0  May 15,2017  900.00  0.2%  1,900.00  1000000  0.2%  2,900.00  2000000  0.2%

k = np.arange(len(dfs)).astype(str)
df = pd.concat([x.set_index('date') for x in dfs], axis=1, join='inner', keys=k)
df.columns = df.columns.map('_'.join)
print (df)
                0_a   0_b       1_a      1_b   1_c       2_a      2_b   2_c
date                                                                       
May 15,2017  900.00  0.2%  1,900.00  1000000  0.2%  2,900.00  2000000  0.2%


19

Another way to combine: functools.reduce

From documentation:

For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates ((((1+2)+3)+4)+5). The left argument, x, is the accumulated value and the right argument, y, is the update value from the iterable.

So:

from functools import reduce
dfs = [df1, df2, df3, df4, df5, df6]
df_final = reduce(lambda left,right: pd.merge(left,right,on='some_common_column_name'), dfs)


10

You could also use DataFrame.merge like this:

df = df1.merge(df2).merge(df3)

UPDATE

Comparing performance of this method to the currently accepted answer

import timeit

setup = '''import pandas as pd
from functools import reduce
df_1 = pd.DataFrame({'date': {0: 'May 19, 2017', 1: 'May 18, 2017', 2: 'May 17, 2017', 3: 'May 15, 2017'}, 'a': {0: '1,200.00', 1: '1,100.00', 2: '1,000.00', 3: '1,901.00'}, 'b': {0: '0.1%', 1: '0.1%', 2: '0.1%', 3: '0.1%'}})
df_2 = pd.DataFrame({'date': {0: 'May 20, 2017', 1: 'May 18, 2017', 2: 'May 16, 2017', 3: 'May 15, 2017'}, 'a': {0: '2,200.00', 1: '2,100.00', 2: '2,000.00', 3: '2,902.00'}, 'b': {0: 1000000, 1: 1590000, 2: 1230000, 3: 1000000}, 'c': {0: '0.2%', 1: '0.2%', 2: '0.2%', 3: '0.2%'}})
df_3 = pd.DataFrame({'date': {0: 'May 21, 2017', 1: 'May 17, 2017', 2: 'May 16, 2017', 3: 'May 15, 2017'}, 'a': {0: '3,200.00', 1: '3,100.00', 2: '3,000.00', 3: '3,903.00'}, 'b': {0: 2000000, 1: 2590000, 2: 2230000, 3: 2000000}, 'c': {0: '0.3%', 1: '0.3%', 2: '0.3%', 3: '0.3%'}})
dfs = [df_1, df_2, df_3]'''


#methods from currently accepted answer
>>> timeit.timeit(setup=setup, stmt="reduce(lambda  left,right: pd.merge(left,right,on=['date'], how='outer'), dfs)", number=1000)
3.3471919000148773
>>> timeit.timeit(setup=setup, stmt="df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['date'], how='outer'), dfs).fillna('void')", number=1000)
4.079146400094032

#method demonstrated in this answer
>>> timeit.timeit(setup=setup, stmt="df = df_1.merge(df_2, on='date').merge(df_3, on='date')", number=1000)
2.7787032001651824

3 Comments

It looks almost too simple to work. But it does. How does it compare, performance-wise to the accepted answer?
@Harm just checked the performance comparison and updated my answer with the results
Oh wow, it's simpler AND faster
5

@dannyeuu's answer is correct. pd.concat naturally does a join on index columns if you set the axis option to 1. The default is an outer join, but you can specify an inner join too. Here is an example:

import pandas as pd

x = pd.DataFrame({'a': [2, 4, 3, 4, 5, 2, 3, 4, 2, 5],
                  'b': [2, 3, 4, 1, 6, 6, 5, 2, 4, 2],
                  'val': [1, 4, 4, 3, 6, 4, 3, 6, 5, 7],
                  'val2': [2, 4, 1, 6, 4, 2, 8, 6, 3, 9]})
x.set_index(['a', 'b'], inplace=True)
x.sort_index(inplace=True)

y = x.copy(deep=True)
y.loc[(14, 14), :] = [3, 1]
y['other'] = range(0, 11)
y.sort_values('val', inplace=True)

z = x.copy(deep=True)
z.loc[(15, 15), :] = [3, 4]
z['another'] = range(0, 22, 2)
z.sort_values('val2', inplace=True)

pd.concat([x, y, z], axis=1)


5

Look at this pandas three-way joining multiple dataframes on columns

filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])


1

Thank you for your help @jezrael, @zipa and @everestial007; your answers are what I need. If I wanted to make it recursive, this would also work as intended:

def mergefiles(dfs=[], on=''):
    """Merge a list of dataframes based on one column."""
    if len(dfs) == 1:
        return "List only has one element."

    elif len(dfs) == 2:
        # base case: merge the two remaining dataframes
        return dfs[0].merge(dfs[1], on=on)

    # Merge the first and second dataframes into a new dataframe
    df = dfs[0].merge(dfs[1], on=on)

    # Create a new list with the merged dataframe followed by the remaining ones
    dfl = [df] + dfs[2:]

    # Recurse on the shorter list
    return mergefiles(dfl, on)


1

@everestial007's solution worked for me. This is how I improved it for my use case, which is to give the columns of each df a different suffix so I can more easily differentiate between the dfs in the final merged dataframe.

from functools import reduce
import pandas as pd
dfs = [df1, df2, df3, df4]
suffixes = [f"_{i}" for i in range(len(dfs))]
# add suffixes to each df
dfs = [dfs[i].add_suffix(suffixes[i]) for i in range(len(dfs))]
# remove suffix from the merging column
dfs = [dfs[i].rename(columns={f"date{suffixes[i]}":"date"}) for i in range(len(dfs))]
# merge
dfs = reduce(lambda left,right: pd.merge(left,right,how='outer', on='date'), dfs)


1

I had a similar use case and solved it with the code below. Basically I captured the first df in the list, then looped through the remainder and merged them, where the result of each merge replaces the previous one.

Edit: I was dealing with pretty small dataframes - unsure how this approach would scale to larger datasets. #caveatemptor

import pandas as pd
df_list = [df1,df2,df3, ...dfn]
# grab first dataframe
all_merged = df_list[0]
# loop through all but first data frame
for to_merge in df_list[1:]:
    # result of merge replaces first or previously
    # merged data frame w/ all previous fields
    all_merged = pd.merge(
        left=all_merged
        ,right=to_merge
        ,how='inner'
        ,on=['some_fld_across_all']
        )

# can easily have this logic live in a function
def merge_mult_dfs(df_list):
    all_merged = df_list[0]
    for to_merge in df_list[1:]:
        all_merged = pd.merge(
            left=all_merged
            ,right=to_merge
            ,how='inner'
            ,on=['some_fld_across_all']
            )
    return all_merged


0

If you are filtering by a common date, this will return it:

dfs = [df1, df2, df3]
checker = dfs[-1]
# start with the set of dates (column labelled 0, the date column here) from the last dataframe
check = set(checker.loc[:, 0])

# intersect with the dates of every other dataframe
for df in dfs[:-1]:
    check = check.intersection(set(df.loc[:, 0]))

# keep only the rows of the last dataframe whose date appears in all of them
print(checker[checker.loc[:, 0].isin(check)])

2 Comments

but in this way it can only get the result for 3 files. What if I try with 4 files? Do I need to do: set(df1.loc[:, 0].intersection(set(df3.loc[:, 0]).intersection(set(df2.loc[:, 0])).intersection(set(df1.loc[:, 0])))?
@VascoFerreira I edited the code to match that situation as well.
0

For me the index is ignored without explicit instruction. Example:

    > x = pandas.DataFrame({'a': [1,2,2], 'b':[4,5,5]})
    > x
        a   b
    0   1   4
    1   2   5
    2   2   5

    > x.drop_duplicates()
        a   b
    0   1   4
    1   2   5

(duplicated rows are removed despite having different indexes)

