I am working with some advertising data, such as email data. I have two data sets:

  1. One at the mail level, which states, for each person, which days they were mailed.

    import pandas as pd
    
    df_emailed=pd.DataFrame()
    df_emailed['person']=['A','A','A','A','B','B','B']
    df_emailed['day']=[2,4,8,9,1,2,5]
    print(df_emailed)
    
      person  day
    0      A    2
    1      A    4
    2      A    8
    3      A    9
    4      B    1
    5      B    2
    6      B    5
    
  2. A summary dataframe that says whether each person converted (convert) and their last day in the data (days_max).

    df_summary=pd.DataFrame()
    df_summary['person']=['A','B']
    df_summary['days_max']=[10,5]
    df_summary['convert']=[1,0]
    print(df_summary)
    
      person  days_max  convert
    0      A        10        1
    1      B         5        0
    

I would like to combine these into a final dataframe that says, for each person:

  • one row per day, from day 1 to their max day,
  • whether they were emailed on that day (0 or 1),
  • whether they converted on that day (0 or 1).

We assume that anyone who converts does so on their last day in the dataframe.

I know how to do this using a nested for loop, but I think that is incredibly inefficient. Does anyone know an efficient way of getting this done?

Desired result

df_final=pd.DataFrame()
df_final['person']=['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B']
df_final['day']=[1,2,3,4,5,6,7,8,9,10,1,2,3,4,5]
df_final['emailed']=[0,1,0,1,0,0,0,1,1,0,1,1,0,0,1]
df_final['convert']=[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]
print(df_final)

   person  day  emailed  convert
0       A    1        0        0
1       A    2        1        0
2       A    3        0        0
3       A    4        1        0
4       A    5        0        0
5       A    6        0        0
6       A    7        0        0
7       A    8        1        0
8       A    9        1        0
9       A   10        0        1
10      B    1        1        0
11      B    2        1        0
12      B    3        0        0
13      B    4        0        0
14      B    5        1        0

Thank you and happy holidays!

  • Great catch, thank you. Made edits. (Commented Dec 26, 2017 at 17:04)

1 Answer

A high-level approach is to reshape df_summary (aliased as df2) into the output. We'll need to:

  • set_index on the days_max column of df2, and rename the index to day (which will help later on)
  • groupby to group on person
  • apply a reindex operation on the index (day), so we get a row for each day leading up to the last day
  • fillna to fill the NaNs that the reindex generates in the convert column
  • assign to create a placeholder emailed column that we'll set later.

Next, index into the result of the previous operation using df_emailed. We'll use those values to set the corresponding emailed cells to 1. This is done by MultiIndexing with loc.

Finally, use reset_index to bring the index out as columns.
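Before the full pipeline, here is a minimal sketch of the two key steps in isolation: reindex on a single person's row, then setting cells through a (person, day) MultiIndex with loc. The names s, full and demo are made up for illustration, and the single row mirrors person A from the question.

```python
import numpy as np
import pandas as pd

# One person's summary row, indexed by their last day (mirrors person A).
s = pd.Series([1], index=pd.Index([10], name='day'), name='convert')

# reindex adds the missing days 1..9 as NaN; fillna/astype turn them into 0.
full = s.reindex(np.arange(1, s.index.max() + 1)).fillna(0).astype(int)

# Put the result under a (person, day) MultiIndex, similar to what
# groupby + apply produces for all persons at once.
demo = full.to_frame().assign(emailed=0)
demo.index = pd.MultiIndex.from_product([['A'], full.index],
                                        names=['person', 'day'])

# A list of (person, day) tuples passed to loc selects exactly those rows.
demo.loc[[('A', 2), ('A', 4)], 'emailed'] = 1
print(demo)
```

Only the last day keeps convert = 1, and only the listed (person, day) pairs get emailed = 1.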

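import numpy as np

# df1 and df2 are aliases for df_emailed and df_summary (see below).

def f(x):
    # Expand one person's index so it covers every day from 1 to their last day.
    return x.reindex(np.arange(1, x.index.max() + 1))

df = df2.set_index('days_max')\
        .rename_axis('day')\
        .groupby('person')['convert']\
        .apply(f)\
        .fillna(0)\
        .astype(int)\
        .to_frame()\
        .assign(emailed=0)

# Mark every (person, day) pair that appears in df1 as emailed.
df.loc[df1[['person', 'day']].apply(tuple, axis=1).tolist(), 'emailed'] = 1
df.reset_index()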

   person  day  convert  emailed
0       A    1        0        0
1       A    2        0        1
2       A    3        0        0
3       A    4        0        1
4       A    5        0        0
5       A    6        0        0
6       A    7        0        0
7       A    8        0        1
8       A    9        0        1
9       A   10        1        0
10      B    1        0        1
11      B    2        0        1
12      B    3        0        0
13      B    4        0        0
14      B    5        0        1

Where

df1 = df_emailed

and,

df2 = df_summary 
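
As a possible alternative that avoids groupby + apply, the same table can probably be built by repeating each summary row and merging in the mailing days. This is only a sketch under assumptions: the name out and the use of merge's indicator flag are my own choices, not part of the approach above, and it assumes each (person, day) pair appears at most once in df_emailed.

```python
import pandas as pd

df_emailed = pd.DataFrame({'person': list('AAAABBB'),
                           'day': [2, 4, 8, 9, 1, 2, 5]})
df_summary = pd.DataFrame({'person': ['A', 'B'],
                           'days_max': [10, 5],
                           'convert': [1, 0]})

# One row per person per day: repeat each summary row days_max times,
# then number the rows within each person starting from 1.
out = df_summary.loc[df_summary.index.repeat(df_summary['days_max'])]
out = out.reset_index(drop=True)
out['day'] = out.groupby('person').cumcount() + 1

# convert is 1 only on the last day of a person who converted.
out['convert'] = ((out['day'] == out['days_max'])
                  & (out['convert'] == 1)).astype(int)

# Left-merge against the mailing days; the indicator column marks
# rows that were found in both frames, i.e. days with an email.
out = out.merge(df_emailed, on=['person', 'day'], how='left', indicator=True)
out['emailed'] = (out['_merge'] == 'both').astype(int)

out = out[['person', 'day', 'emailed', 'convert']]
print(out)
```

The column order here follows the desired result in the question (emailed before convert).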

5 Comments

I AM NOT WORTHY, I AM NOT WORTHY, I AM NOT WORTHY.
@TrexionKameha I'll take that as the answer being useful to you. Happy holidays :-)
Yes you did! Thank you. How would I add another column to the key, for example, campaign? I.e., for person A and person B, I had campaigns X, Y, and Z, and I would like convert at each step. Is that simple? I tried myself and ran into an issue with the duplicate key.
@TrexionKameha okay, that may or may not need a different answer. I'm finding it hard to visualise. Is there any possibility of opening a new question? Would help.
Sorry please disregard, there were duplicates with my data. I changed all of the indexes and it worked. thank you!
