
The dataset looks like this (the original contains duplicate rows):

Code:

import pandas as pd

df_in = pd.DataFrame({'email_ID': {0: 'mail_1',
  1: 'mail_1',
  2: 'mail_1',
  3: 'mail_1',
  4: 'mail_1',
  5: 'mail_1',
  6: 'mail_2',
  7: 'mail_2',
  8: 'mail_2',
  9: 'mail_2',
  10: 'mail_2',
  11: 'mail_2'},
 'time_stamp': {0: '2021-09-10 09:01:56.340259',
  1: '2021-09-10 09:01:56.672814',
  2: '2021-09-10 09:01:57.471423',
  3: '2021-09-10 09:01:57.480891',
  4: '2021-09-10 09:01:57.484644',
  5: '2021-09-10 09:01:57.984644',
  6: '2021-09-10 09:01:56.340259',
  7: '2021-09-10 09:01:56.672814',
  8: '2021-09-10 09:01:57.471423',
  9: '2021-09-10 09:01:57.480891',
  10: '2021-09-10 09:01:57.484644',
  11: '2021-09-10 09:01:57.984644'},
 'screen': {0: 'a',
  1: 'b',
  2: 'c',
  3: 'd',
  4: 'c',
  5: 'b',
  6: 'a',
  7: 'b',
  8: 'c',
  9: 'b',
  10: 'c',
  11: 'd'}})

df_in['time_stamp'] = df_in['time_stamp'].astype('datetime64[ns]')

df_in

Output should be this:

Code:

import pandas as pd

df_out = pd.DataFrame({'email_ID': {0: 'mail_1',
  1: 'mail_1',
  2: 'mail_1',
  3: 'mail_1',
  4: 'mail_1',
  5: 'mail_1',
  6: 'mail_2',
  7: 'mail_2',
  8: 'mail_2',
  9: 'mail_2',
  10: 'mail_2',
  11: 'mail_2'},
 'time_stamp': {0: '2021-09-10 09:01:56.340259',
  1: '2021-09-10 09:01:56.672814',
  2: '2021-09-10 09:01:57.471423',
  3: '2021-09-10 09:01:57.480891',
  4: '2021-09-10 09:01:57.484644',
  5: '2021-09-10 09:01:57.984644',
  6: '2021-09-10 09:01:56.340259',
  7: '2021-09-10 09:01:56.672814',
  8: '2021-09-10 09:01:57.471423',
  9: '2021-09-10 09:01:57.480891',
  10: '2021-09-10 09:01:57.484644',
  11: '2021-09-10 09:01:57.984644'},
 'screen': {0: 'a',
  1: 'b',
  2: 'c',
  3: 'd',
  4: 'c',
  5: 'b',
  6: 'a',
  7: 'b',
  8: 'c',
  9: 'b',
  10: 'c',
  11: 'd'},
 'series1': {0: 0,
  1: 1,
  2: 2,
  3: 3,
  4: 0,
  5: 1,
  6: 0,
  7: 1,
  8: 2,
  9: 3,
  10: 4,
  11: 5},
 'series2': {0: 0,
  1: 0,
  2: 0,
  3: 0,
  4: 1,
  5: 1,
  6: 2,
  7: 2,
  8: 2,
  9: 2,
  10: 2,
  11: 2}})

df_out['time_stamp'] = df_out['time_stamp'].astype('datetime64[ns]')

df_out

The 'series1' column counts up row by row (0, 1, 2, and so on) but resets to 0 when:

  1. the 'email_ID' value changes from the previous row, or
  2. the previous row's 'screen' value is 'd'.

The 'series2' column starts at 0 and increments by 1 whenever 'series1' resets.
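The two reset rules can also be expressed without an explicit loop: flag the rows where a new run starts, then take a cumulative sum of that flag to get a group key that drives both columns. A minimal sketch on the sample data (not from the original post):

```python
import pandas as pd

df = pd.DataFrame({
    'email_ID': ['mail_1'] * 6 + ['mail_2'] * 6,
    'screen':   list('abcdcb') + list('abcbcd'),
})

# A run restarts when the email changes or the previous row's screen was 'd'
reset = (df['email_ID'] != df['email_ID'].shift()) | (df['screen'].shift() == 'd')
group = reset.cumsum()                      # unique id per run: 1, 2, 3, ...
df['series1'] = df.groupby(group).cumcount()
df['series2'] = group - 1                   # renumber the runs starting from 0

print(df)
```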

My progress:

series1 = [0]
x = 0

for index in df_in.index[1:]:
    # keep counting while the email is unchanged and the previous screen isn't 'd'
    if (df_in.at[index - 1, 'email_ID'] == df_in.at[index, 'email_ID']) and (df_in.at[index - 1, 'screen'] != 'd'):
        x += 1
    else:
        x = 0
    series1.append(x)

df_in['series1'] = series1
df_in

series2 = [0]
x = 0

for index in df_in.index[1:]:
    # any step other than +1 means series1 just reset
    if df_in.at[index, 'series1'] - df_in.at[index - 1, 'series1'] == 1:
        series2.append(x)
    else:
        x += 1
        series2.append(x)

df_in['series2'] = series2
df_in

I think the code above works. I'll test the posted answers and accept the best one in a few hours, thank you.

1 Answer

You can use:

import numpy as np

m = df_in['screen'] == 'd'
df_in['pf'] = np.where(m, 1, np.nan)
df_in.loc[m, 'pf'] = df_in.loc[m, 'pf'].cumsum()
grouper = df_in.groupby('email_ID')['pf'].bfill()
df_in['series1'] = df_in.groupby(grouper.fillna(0)).cumcount()
df_in['series2'] = df_in.groupby(grouper.fillna(0), sort=False).ngroup()
df_in.drop('pf', axis=1, inplace=True)

Explanation:

  • First locate the rows where 'screen' is 'd', then use cumsum() to number them sequentially (1, 2, ...). This is accomplished by:
  df_in['pf'] = np.where(m, 1, np.nan)
  df_in.loc[m, 'pf'] = df_in.loc[m, 'pf'].cumsum()
  • Then use bfill to backfill the NaN values from the positions where 'screen' is 'd', so every row up to and including a 'd' shares the same number. Doing it per email_ID keeps groups from spilling across emails. This is accomplished by:
  grouper = df_in.groupby('email_ID')['pf'].bfill()
  • Once the grouper is defined, cumcount gives the series1 column; fillna(0) keeps the rows with no later 'd' from being dropped as NaN group keys. This is done by:
  df_in['series1'] = df_in.groupby(grouper.fillna(0)).cumcount()
  • Then get the series2 column with ngroup(). Make sure the groupby is done with sort=False so groups are numbered in order of appearance. Done by:
  df_in['series2'] = df_in.groupby(grouper.fillna(0), sort=False).ngroup()
  • Finally drop the helper column pf:
  df_in.drop('pf', axis=1, inplace=True)
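As a quick sanity check, the grouper idea above can be verified end to end on the sample data. A self-contained sketch (note the cumsum is taken on the masked 'pf' column only, and the NaN group keys are filled before grouping so the trailing rows after a 'd' are still counted):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'email_ID': ['mail_1'] * 6 + ['mail_2'] * 6,
    'screen':   list('abcdcb') + list('abcbcd'),
})

m = df['screen'] == 'd'
df['pf'] = np.where(m, 1, np.nan)
df.loc[m, 'pf'] = df.loc[m, 'pf'].cumsum()      # number each 'd' row: 1, 2, ...
grouper = df.groupby('email_ID')['pf'].bfill()  # backfill per email_ID

# fillna(0) puts rows with no later 'd' in their own group instead of
# being dropped as NaN keys
df['series1'] = df.groupby(grouper.fillna(0)).cumcount()
df['series2'] = df.groupby(grouper.fillna(0), sort=False).ngroup()

print(df.drop(columns='pf'))
```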