I'd like a new column in my dataset that shows the preceding actions when the identifier is the same, combined with the action in the current row.

So far I've tried looping through the DataFrame, but that only picks up the immediately preceding row, not all earlier rows in each group.

Starting with the data like this:

requestTime     identifier  aggregation
38:00.5         123         abc
38:02.2         123         def
38:03.9         123         ghi
38:04.9         456         abc

This is the code I've tried so far:

trial["newAgg"] = trial["aggregation"].shift(1)
trial["newId"] = trial["identifier"].shift(1)

for index, row in trial.iterrows():
    if row.identifier == row.newId:
        trial["newAgg"] + " - " + trial["aggregation"]
    else:
        trial["newAgg"] = trial["aggregation"]

which outputs:

requestTime identifier  aggregation newAgg              newId
38:00.5     123         abc         abc 
38:02.2     123         def         abc - def           123
38:03.9     123         ghi         def - ghi           123
38:04.9     456         abc         abc                 456

But I'd like the output to be as follows:

requestTime identifier  aggregation newAgg              newId
38:00.5     123         abc         abc 
38:02.2     123         def         abc - def           123
38:03.9     123         ghi         abc - def - ghi     123
38:04.9     456         abc         abc                 456
  • Have you tried trial["newAgg"] = trial["newAgg"].shift(1) + " - " + trial["aggregation"]? (Commented Jul 29, 2019 at 13:00)

3 Answers

From what I can tell, the else branch is triggered on the very first row, before newId has a value, which causes it to equal "def" for the following row.

If you want the first value (abc) to be set up front and then added onto, it may be better to keep the accumulated string in a variable declared above the loop, with changes along these lines:

trial["newId"] = trial["identifier"].shift(1)
aggHold = ""  # actions accumulated so far for the current identifier
newAgg = []

for index, row in trial.iterrows():
    if row.identifier == row.newId:
        # same identifier as the previous row: add this action on
        aggHold = aggHold + " - " + row.aggregation
    else:
        # first row of a new identifier: start again from this action
        aggHold = row.aggregation
    newAgg.append(aggHold)

trial["newAgg"] = newAgg

Or something along those lines. Take my advice with a grain of salt; I haven't played around with pandas and Python that much.

Best of luck!

Rather than looping, you can use pandas groupby with apply and let a custom function do the job.
In this case, I've used a lambda function.

outcol = df.groupby('identifier').apply(lambda x : pd.Series([' - '.join(x['aggregation'].iloc[0:i]) for i in range(1,len(x)+1)]))
outcol.reset_index(drop=True, inplace=True)
df['newAgg'] = outcol

groupby automatically splits the dataframe into subsets that share the same 'identifier' value, and the custom function is applied to each subset.
Here I use a list comprehension to select the strings to be joined.
The reset_index is needed to get rid of the MultiIndex so the column can be assigned back to the original dataframe. Note that this relies on the rows already being sorted by identifier and on df having the default 0..n-1 index.

Final result is:

  requestTime  identifier aggregation           newAgg
0     38:00.5         123         abc              abc
1     38:02.2         123         def        abc - def
2     38:03.9         123         ghi  abc - def - ghi
3     38:04.9         456         abc              abc
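
As a variation on the groupby idea above, here is a minimal sketch that uses transform instead of apply. transform returns a result indexed like its input, so no reset_index step should be needed. The helper name running_join and the use of itertools.accumulate are my own choices for illustration, not something from the question or the answer.

```python
from itertools import accumulate

import pandas as pd

# Sample data copied from the question.
trial = pd.DataFrame({
    "requestTime": ["38:00.5", "38:02.2", "38:03.9", "38:04.9"],
    "identifier": [123, 123, 123, 456],
    "aggregation": ["abc", "def", "ghi", "abc"],
})


def running_join(actions, sep=" - "):
    """For each row of one group, join all actions seen so far with sep.

    running_join is a hypothetical helper name used only in this sketch.
    """
    joined = accumulate(actions, lambda so_far, current: so_far + sep + current)
    # Reuse the group's own index so the result lines up with the original rows.
    return pd.Series(list(joined), index=actions.index)


trial["newAgg"] = trial.groupby("identifier")["aggregation"].transform(running_join)
print(trial)
```

Because each group's result carries the original row labels, this should also work when the rows for one identifier are not next to each other.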

Assuming you have a pandas DataFrame, something like this should work.

trial['newAgg'] = trial.groupby(['identifier'])['aggregation'].transform(lambda x: (x + ' - ').cumsum().str[:-3])

*EDIT:* Applied to your snippet, this sets trial["newAgg"] directly, so neither the loop nor the else branch is needed. The same thing in two steps:

# build the running string per identifier; every value ends with a trailing ' - '
trial["newAgg"] = trial.groupby(['identifier'])['aggregation'].transform(lambda x: (x + ' - ').cumsum())
# drop the trailing separator (3 characters)
trial["newAgg"] = trial["newAgg"].str[:-3]
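
To see how the cumulative-string idea behaves when the rows for one identifier are not adjacent, here is a small sketch. The data is the question's identifier and aggregation columns with the last two rows swapped, which is my own change for illustration. It uses transform rather than apply so that the result keeps the original row index, and it assumes a pandas version in which cumsum on a string column concatenates the values.

```python
import pandas as pd

# Question's data with the last two rows swapped (an assumption for this sketch),
# so identifier 123 is interrupted by 456.
trial = pd.DataFrame({
    "identifier": [123, 123, 456, 123],
    "aggregation": ["abc", "def", "abc", "ghi"],
})

sep = " - "

# Append the separator to every action, then concatenate cumulatively per identifier.
cumulative = trial.groupby("identifier")["aggregation"].transform(
    lambda x: (x + sep).cumsum()
)

# Every cumulative value ends with the separator, so slice it off.
trial["newAgg"] = cumulative.str[: -len(sep)]
print(trial)
```

Slicing by the separator's length is used here because str.strip() with no arguments only removes whitespace, so it would leave a trailing hyphen behind.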
