
I need to select all columns from a DataFrame while grouping on 'id', but when I do that I only get 'id' and the aggregated value. I need all columns.

a = df.groupby(df['id']).agg({"date": "max"})
a.show()

This only selects the 'id' and 'date' columns, but there are other columns. How do I select all columns for the row with the max value of 'date'?

1 Answer

In Spark there are two ways. Either join the aggregated result back to the original DataFrame, like this:

# aggregate the max date per id, then join it back onto the original DataFrame
a = df.groupby(df['id']).agg({"date": "max"})
df = df.join(
    a,
    on = "id",
    how = "inner"
)
df.show()
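If the goal is to keep only the rows where 'date' equals the per-id maximum, here is a minimal sketch building on the join approach (the alias "max_date" is just an illustrative name; by default Spark names the aggregate "max(date)"):

import pyspark.sql.functions as F

# rename the aggregate so it does not collide with the original "date" column
a = df.groupby(df['id']).agg(F.max("date").alias("max_date"))

# keep only the rows whose date matches the per-id maximum, with all original columns
result = df.join(a, on="id", how="inner").filter(F.col("date") == F.col("max_date")).drop("max_date")
result.show()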

or use a window partitioned by 'id', like this:

from pyspark.sql import Window
import pyspark.sql.functions as F

# compute the max date per id as a new column, keeping all other columns
window = Window.partitionBy("id")
a = df.withColumn(
    "max",
    F.max(F.col("date")).over(window)
)
a.show()
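To then select only the rows that actually hold the maximum date (a small sketch reusing the "max" column produced above), filter on equality and drop the helper column:

result = a.filter(F.col("date") == F.col("max")).drop("max")
result.show()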

I would prefer the first one, as it is usually less costly even with the join.


2 Comments

For the second method I get: TypeError: withColumn() missing 1 required positional argument: 'col'
Thanks for pointing it out, I have edited the solution. withColumn takes two positional arguments: the first is the name of the new column, and the second is the expression that defines it.
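For reference, a minimal sketch of the corrected two-argument form, reusing the window and column names from the answer above:

# first positional argument: the new column's name; second: a Column expression
a = df.withColumn("max", F.max(F.col("date")).over(window))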
