Create DataFrame column from a groupby

Question

Suppose I have a DataFrame like

import pandas as pd
df = pd.DataFrame({
    'Id'    : [1,2,3,4,5,6,7,8,9],
    'Group' : [1,1,2,2,2,2,3,3,3],
    'Value_to_compare' : [2,1,5,8,2,3,10,23,17],
    'Other_value' : [0,3,2,6,3,4,2,7,1]
})

I would like to create a new column, say Value_of_Highest, displaying for each row Other_value of element having the highest Value_to_compare of its Group. For example, here:

Group 1 has 2 elements, its highest Value_to_compare is 2, for Id = 1, for which Other_value is 0
Group 2 has 4 elements, highest Value_to_compare is 8, for Id = 4, for which Other_value is 6
Group 3 has 3 elements, highest Value_to_compare is 23, for Id = 8, for which Other_value is 7

So I would like to add a column so that df becomes

This is the best way I know to do this:

def my_func(x):
    x = x.sort_values('Value_to_compare',ascending = False)
    Value_of_Highest = x.head(1)['Other_value'].values[0]
    return pd.Series([Value_of_Highest], index=['Value_of_Highest']) 

grouped = df.groupby('Group').apply(my_func).reset_index()

df = df.merge(grouped)

I am pretty sure there is a far more elegant and efficient way to do this in Python/Pandas.

Edit: after first answer from @CameronRiddell, I realized my example was flawed. I corrected it and @CameronRiddell edited his answer, which works well.

Cameron Riddell · Accepted Answer · 2020-10-15 16:16:17Z

This is a great application of groupby and transform essentially, using transform from a groupby object applies a function to each group, but then returns a Series or DataFrame of the same size as that group. This results in DataFrame/Series whose shape is the same as the original shape along the groupby axis. (e.g. in your case, the result of a groupby/transform will have the same number of rows as your original dataframe).

df["Value_of_Highest"] = df.groupby("Group")["Value_to_compare"].transform("max")

print(df)

   Id  Group  Value_to_compare  Other_value  Value_of_Highest
0   1      1                 2            0                 2
1   2      1                 1            3                 2
2   3      2                 5            2                 8
3   4      2                 8            6                 8
4   5      2                 2            3                 8
5   6      2                 3            4                 8
6   7      3                10            2                23
7   8      3                23            7                23
8   9      3                17            1                23

df.groupby("Group"): group the dataframe by our "Group" column
["Value_to_compare"]: From each of those groups, select the "Value_to_compare" column
.transform("max"): Get the maximum value of our selected column for each group. Then return a Series who has the same number of rows as the group

Edit: To obtain a value based on the location of a maximum, we use idxmax(). This returns the position where a max/min occurs. So for our use case:

we'll groupby "Group" and select the "Value_to_compare" column
Then we get the index of the maximum values of "Value_to_compare"

highest_vtc_indices = df.groupby("Group")["Value_to_compare"].idxmax()

print(highest_vtc_indices)
Group
1    0
2    3
3    7
Name: Value_to_compare, dtype: int64

0, 3, 7 corresponds to the row ID where the maximum values of "Value_to_compare" occur in each group.

Now that we know the row these maximums occurred in:

We'll need to subset our original dataframe to get the "Other_value" in each of these rows
We'll also need to get the "Group" column associated with these rows, so that we can align these values properly later

# Obtain "Group" & "Other_value" at rows 0, 3, 7
  # which were the rows for each highest Value_to_compare per group
highest_other_values = df.loc[highest_vtc_indices, ["Group", "Other_value"]]

# Rename Other_value to Value_of_Highest
highest_other_values = highest_other_values.rename(columns={"Other_value": "Value_of_Highest"})

print(highest_other_values)
   Group  Value_of_Highest
0      1                 0
3      2                 6
7      3                 7

Now that we have the corresponding "Other_value" to the maximum "Value_to_compare" per group, we'll use a merge to align our new smaller dataframe back to our original dataframe. This will broadcast "Value_of_Highest" across the "Group" column from the original dataframe.

# Merge this new dataframe back to our old one to broadcast "Value_of_Highest" across each group
final = df.merge(highest_other_values, on="Group")

print(final)
   Id  Group  Value_to_compare  Other_value  Value_of_Highest
0   1      1                 2            0                 0
1   2      1                 1            3                 0
2   3      2                 5            2                 6
3   4      2                 8            6                 6
4   5      2                 2            3                 6
5   6      2                 3            4                 6
6   7      3                10            2                 7
7   8      3                23            7                 7
8   9      3                17            1                 7

it is not exactly that, though, since you duplicated Value_to_compare instead of Other_value.
CameronRiddell I realize that my example didn't exactly reflect what I was asking. I have edited so it shows values from Other_value instead of Value_to_compare. I'm still lost, could you kindly help?
I've updated my answer to address this edit! (kept my original answer and put the new one below it)
Wonderful, thank you very much indeed! I will think about idmax() next time I need this kind of thing. :)

Collectives™ on Stack Overflow

Create DataFrame column from a groupby

1 Answer 1

6 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Related