Get distinct values of specific column with max of different columns

Question

I have the following DataFrame:

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   A|   6|null|null|
|   B|null|   5|null|
|   C|null|null|   7|
|   B|null|null|   4|
|   B|null|   2|null|
|   B|null|   1|null|
|   A|   4|null|null|
+----+----+----+----+

What I would like to do in Spark is return every entry in col1 whose row holds the maximum value of one of the columns col2, col3, or col4.

This snippet won't do what I want:

df.groupBy("col1").max("col2","col3","col4").show()

And this one only gives the max for a single column (1):

df.groupBy("col1").max("col2").show()

I even tried merging the individual outputs like this:

//merge rows
val rows = test1.rdd.zip(test2.rdd).map{
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}
//merge schemas
val schema = StructType(test1.schema.fields ++ test2.schema.fields)
// create new df
val test3: DataFrame = sqlContext.createDataFrame(rows, schema)

where test1 and test2 are DataFrames produced by queries like (1).

So how do I achieve this nicely?

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   A|   6|null|null|
|   B|null|   5|null|
|   C|null|null|   7|
+----+----+----+----+

Or even only the distinct values:

+----+
|col1|
+----+
|   A|
|   B|
|   C|
+----+

Thanks in advance! Best

Prasad Khode · Accepted Answer · 2017-03-02 12:09:55Z

2

You can use something like the following:

// A triple-quoted string lets the query span several lines
sqlContext.sql("""select x.* from table_name x,
(select max(col2) as a, max(col3) as b, max(col4) as c from table_name) temp
where a = x.col2 or b = x.col3 or c = x.col4""")

This gives the desired result.
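
The same idea can also be sketched with the DataFrame API instead of a SQL string. This is only a minimal sketch, assuming Spark 2.1 or later (for `crossJoin`) and a local `SparkSession`; the object name `MaxPerColumn`, the helper `rowsWithColumnMax`, and the column aliases `max2`, `max3`, `max4` are made up for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max}

object MaxPerColumn {

  // Hypothetical helper: keeps the rows where col2, col3 or col4
  // equals the global maximum of that column.
  def rowsWithColumnMax(df: DataFrame): DataFrame = {
    // One-row DataFrame with the maximum of each column (nulls are ignored by max).
    val maxes = df.agg(
      max("col2").as("max2"),
      max("col3").as("max3"),
      max("col4").as("max4"))

    // Attach the maxima to every row, then filter. A comparison with null
    // evaluates to null, so rows with no matching non-null value are dropped.
    df.crossJoin(maxes)
      .where(col("col2") === col("max2") ||
             col("col3") === col("max3") ||
             col("col4") === col("max4"))
      .select("col1", "col2", "col3", "col4")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("max-per-column")
      .getOrCreate()
    import spark.implicits._

    // The sample data from the question; None becomes null.
    val df = Seq[(String, Option[Int], Option[Int], Option[Int])](
      ("A", Some(6), None, None),
      ("B", None, Some(5), None),
      ("C", None, None, Some(7)),
      ("B", None, None, Some(4)),
      ("B", None, Some(2), None),
      ("B", None, Some(1), None),
      ("A", Some(4), None, None)
    ).toDF("col1", "col2", "col3", "col4")

    val result = rowsWithColumnMax(df)
    result.show()                           // full rows holding a column maximum
    result.select("col1").distinct().show() // only the distinct col1 values

    spark.stop()
  }
}
```

If several rows share a column's maximum, all of them are kept, so `distinct()` on col1 is still needed for the second desired output.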

edited Mar 2, 2017 at 12:09

Prasad Khode


answered Mar 2, 2017 at 10:55

Ashish Singh



Alex Karpov · Accepted Answer · 2017-03-02 11:21:27Z

1

It can be solved like this:

df.registerTempTable("temp")

spark.sql("SELECT max(col2) AS max2, max(col3) AS max3, max(col4) AS max4 FROM temp").registerTempTable("max_temp")

spark.sql("SELECT col1 FROM temp, max_temp WHERE col2 = max2 OR col3 = max3 OR col4 = max4").show

answered Mar 2, 2017 at 11:21

Alex Karpov


1 Comment

Ken Jiiii Over a year ago

Thanks! I just changed spark to sqlContext in my case and it worked. Is there a best practice for when to use UDFs on DataFrames and when to work with SQL on a temp table via sqlContext?
