I'm trying to filter a dataset by order status. This is my code:

 df1=all_in_all_df.groupBy("productName") \
 .agg(F.max('orderItemSubTotal')) \
 .filter(col("orderStatus") == "CLOSED") \
 .show()

But when I run the code, I get the following error:

AnalysisException: cannot resolve 'orderStatus' given input columns: [max(orderItemSubTotal), productName]; 'Filter ('orderStatus = CLOSED)

Removing the .filter() call lets the result display, but I need to filter the data.

1 Answer

An aggregation keeps only the columns used for grouping (those in the groupBy clause) plus the results of the aggregate functions.
After groupBy("productName").agg(F.max('orderItemSubTotal')), the orderStatus column no longer exists, so the filter cannot resolve it.

If you want to filter on it, you have two options. Either filter before the aggregation, in which case only the rows that pass the filter are aggregated, or add orderStatus to the groupBy clause, in which case the aggregation is computed per status rather than globally, and every status remains available with its own aggregate for later filtering.
