
I need to select all columns from a DataFrame while grouping on 'id', but when I do that I only get 'id' and the aggregated value. I need all columns.

a = df.groupby(df['id']).agg({"date": "max"})
a.show()

This only selects the 'id' and 'date' columns, but there are other columns. How do I select all columns for the row with the max value of 'date'?

1 Answer

In Spark there are two ways. Either join the aggregated result back to the original DataFrame, like this:

# aggregate the max date per id, then join it back onto the original DataFrame
a = df.groupby(df['id']).agg({"date": "max"})
df = df.join(
    a,
    on = "id",
    how = "inner"
)
df.show()
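If the goal is to keep only the rows where 'date' equals the per-id maximum, here is a minimal sketch building on the join approach (the alias "max_date" is just an illustrative name; by default Spark names the aggregate "max(date)"):

import pyspark.sql.functions as F

# rename the aggregate so it does not collide with the original "date" column
a = df.groupby(df['id']).agg(F.max("date").alias("max_date"))

# keep only the rows whose date matches the per-id maximum, with all original columns
result = df.join(a, on="id", how="inner").filter(F.col("date") == F.col("max_date")).drop("max_date")
result.show()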

or use a window partitioned by 'id', like this:

from pyspark.sql import Window
import pyspark.sql.functions as F

# compute the max date per id as a new column, keeping all other columns
window = Window.partitionBy("id")
a = df.withColumn(
    "max",
    F.max(F.col("date")).over(window)
)
a.show()
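To then select only the rows that actually hold the maximum date (a small sketch reusing the "max" column produced above), filter on equality and drop the helper column:

result = a.filter(F.col("date") == F.col("max")).drop("max")
result.show()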

I would prefer the first one, as it is usually less costly even with the join.


2 Comments

For the second method I get: TypeError: withColumn() missing 1 required positional argument: 'col'
Thanks for pointing it out, I have edited the solution. withColumn takes two positional arguments: the first is the name of the new column, and the second is the expression that defines it.
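For reference, a minimal sketch of the corrected two-argument form, reusing the window and column names from the answer above:

# first positional argument: the new column's name; second: a Column expression
a = df.withColumn("max", F.max(F.col("date")).over(window))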
