Conditional merge of other multiple Pandas DataFrames

Question

I have four pandas DataFrames (A, B, C, and D). A has a series of timestamps and a single column which refers to one of the other DataFrames:

A

Timestamp    Source
-----------  ------
2012-4-3     B
2013-12-20   C
2012-3-5     C
2014-12-7    D
2012-7-10    B
...

The other DataFrames hold more data:

B

Timestamp   Foo  Bar
----------- ---- ----
2012-1-1    1.5  1.3
2012-1-2    2.3  5.6
2012-1-3    3.4  3.3
...
2014-3-31   0.8  2.1

C

Timestamp   Foo  Bar
----------- ---- ----
2012-1-1    9.2  5.6
2012-1-2    4.8  7.6
2012-1-3    2.7  6.4
...
2014-3-31   7.0  6.5

D

Timestamp   Foo  Bar
----------- ---- ----
2012-1-1    6.8  4.2
2012-1-2    4.2  9.3
2012-1-3    5.5  0.7
...
2014-3-31   6.3  2.0

I want to construct a single DataFrame from A, B, C, and D that has three columns (Timestamp, Foo, and Bar) where the values of Foo and Bar come from the corresponding Timestamp in the DataFrame listed as the Source in A.

Not all Timestamps in A appear in the other three DataFrames, in which case, I'd like the values of Foo and Bar to be np.nan. Not all timestamps in B, C, and D appear in A, and simply won't appear in the final DataFrame.

My current approach is to loop through each row in A and return the values from the corresponding Source DataFrame:

srcs = {'B': B, 'C': C, 'D': D}
A['Foo'] = np.nan
A['Bar'] = np.nan

for i in range(len(A)):
    ts = A.iloc[i].Timestamp
    src = A.iloc[i].Source
    A.iloc[i].Foo = srcs[src][srcs[src].Timestamp == ts].Foo
    A.iloc[i].Bar = srcs[src][srcs[src].Timestamp == ts].Bar

There has to be a more efficient, more Pandithic(?) way to perform this action?

Hmm, one method would be to add a Source column to each df with B, C, D set to B,C,D respectively and then merge all of them on TimeStamp and Source, not sure how messy it would be though — EdChum
– EdChum, Commented May 13, 2016 at 14:52
Wouldn't this lead to a df with 6 separate columns (e.g., 'Foo_x', 'Bar_x', 'Foo_y', 'Bar_y', 'Foo', 'Bar')? How would I combine them into two columns ('Foo', and 'Bar') based on the source? — Neal Kruis
– Neal Kruis, Commented May 13, 2016 at 16:11

Brad Campbell · Accepted Answer · 2016-05-13 15:59:36Z

It looks like you could use a multi-index to do this. Your index would be consist of the Timestamp and Source. You can so this with the set_index method on a DataFrame.

Here is some code to create some fake DataFrames, each with the MultiIndex.

# Imports for creating fake data
from random import random
from random import choice

# Setup the sample data
A = pd.DataFrame({'TimeStamp':range(20), 'Source':[choice(others) for i in range(20)]})
# Create the MultiIndex on A
A.set_index(['TimeStamp', 'Source'], inplace=True)
A['Bar'] = [np.nan] * len(A)
A['Foo'] = [np.nan] * len(A)

B = pd.DataFrame({'TimeStamp':range(5), 
                  'Foo':[random()*5+5 for i in range(5)], 
                  'Bar':[random()*5+5 for i in range(5)]})
C = pd.DataFrame({'TimeStamp':range(5,10), 
                  'Foo':[random()*5+5 for i in range(5)], 
                  'Bar':[random()*5+5 for i in range(5)]})
D = pd.DataFrame({'TimeStamp':range(10,15), 
                  'Foo':[random()*5+5 for i in range(5)], 
                  'Bar':[random()*5+5 for i in range(5)]})

sources = {'B':B, 'C':C, 'D':D}

# create the MultiIndex on the Source data sets
for s, df in sources.items():
    df['Source'] = [s]*len(df)
    df.set_index(['TimeStamp', 'Source'], inplace=True)

Now you can index the Source data sets (B, C, and D) using the index on A.

for s, df in sources.items():    

    temp = df.loc[A.index]  # the source data set indexed by A's index
                            # this will contain NaN's where df does not
                            # have corresponding index entries
    temp.dropna(inplace=True) # dropping the NaN values leaves you with 
                             # only the values in df matching the index in A
    if len(temp) > 0:
        A.loc[temp.index] = temp  # now assign the data to A

print(A)

The result looks like:

                       Bar       Foo
TimeStamp Source                    
0         D            NaN       NaN
1         C            NaN       NaN
2         D            NaN       NaN
3         B       7.927154  8.581380
4         B       7.638422  5.970348
5         D            NaN       NaN
6         C       6.938001  6.417248
7         B            NaN       NaN
8         C       5.131940  9.144621
9         B            NaN       NaN
10        D       9.186963  5.991877
11        D       8.070543  7.735040
12        C            NaN       NaN
13        B            NaN       NaN
14        C            NaN       NaN
15        D            NaN       NaN
16        C            NaN       NaN
17        C            NaN       NaN
18        C            NaN       NaN
19        B            NaN       NaN

piRSquared · Accepted Answer · 2016-05-13 16:09:15Z

Setup

import pandas as pd
from StringIO import StringIO

texta = """Timestamp    Source
2012-4-3     B
2012-4-2     B
2013-12-20   C
2012-3-5     C
2014-12-7    D
2012-7-10    B"""

A = pd.read_csv(StringIO(texta), delim_whitespace=1, parse_dates=[0])

textb = """Timestamp   Foo  Bar
2012-1-1    1.5  1.3
2012-4-3    3.1  4.1
2012-1-2    2.3  5.6
2012-1-3    3.4  3.3
2014-3-31   0.8  2.1"""

B = pd.read_csv(StringIO(textb), delim_whitespace=1, parse_dates=[0])

textc = """Timestamp   Foo  Bar
2012-1-1    9.2  5.6
2012-3-5    4.8  7.6
2012-1-2    4.8  7.6
2012-1-3    2.7  6.4
2014-3-31   7.0  6.5"""

C = pd.read_csv(StringIO(textc), delim_whitespace=1, parse_dates=[0])

textd = """Timestamp   Foo  Bar
2012-1-1    6.8  4.2
2012-1-2    4.2  9.3
2012-1-3    5.5  0.7
2014-3-31   6.3  2.0"""

D = pd.read_csv(StringIO(textd), delim_whitespace=1, parse_dates=[0])

Then I combine with pd.concat just B C and D

bdf = pd.concat([B, C, D], keys=['B', 'C', 'D'])
bdf.reset_index(level=1, inplace=1, drop=1)
bdf.index.name = 'Source'
bdf.reset_index(inplace=1)

print bdf

It looks like this:

   Source  Timestamp  Foo  Bar
0       B 2012-01-01  1.5  1.3
1       B 2012-04-03  3.1  4.1
2       B 2012-01-02  2.3  5.6
3       B 2012-01-03  3.4  3.3
4       B 2014-03-31  0.8  2.1
5       C 2012-01-01  9.2  5.6
6       C 2012-03-05  4.8  7.6
7       C 2012-01-02  4.8  7.6
8       C 2012-01-03  2.7  6.4
9       C 2014-03-31  7.0  6.5
10      D 2012-01-01  6.8  4.2
11      D 2012-01-02  4.2  9.3
12      D 2012-01-03  5.5  0.7
13      D 2014-03-31  6.3  2.0

Finally

a simple merge

A.merge(bdf, how='left')

Looks like:

   Timestamp Source  Foo  Bar
0 2012-04-03      B  3.1  4.1
1 2012-04-02      B  NaN  NaN
2 2013-12-20      C  NaN  NaN
3 2012-03-05      C  4.8  7.6
4 2014-12-07      D  NaN  NaN
5 2012-07-10      B  NaN  NaN

Collectives™ on Stack Overflow

Conditional merge of other multiple Pandas DataFrames

2 Answers 2

Comments

Setup

Finally

Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Setup

Finally

Comments

Related