2

I have the following DataFrame

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   A|   6|null|null|
|   B|null|   5|null|
|   C|null|null|   7|
|   B|null|null|   4|
|   B|null|   2|null|
|   B|null|   1|null|
|   A|   4|null|null|
+----+----+----+----+

What I would like to do in Spark is to return all entries in col1 in the case it has a maximum value for one of the columns col2, col3 or col4.

This snippet won't do what I want:

df.groupBy("col1").max("col2","col3","col4").show()

And this one just gives the max only for one column (1):

df.groupBy("col1").max("col2").show()

I even tried to merge the single outputs by this:

//merge rows
val rows = test1.rdd.zip(test2.rdd).map{
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}
//merge schemas
val schema = StructType(test1.schema.fields ++ test2.schema.fields)
// create new df
val test3: DataFrame = sqlContext.createDataFrame(rows, schema)

where test1 and test2 are DataFramesdone with queries as (1).

So how do I achive this nicely??

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   A|   6|null|null|
|   B|null|   5|null|
|   C|null|null|   7|
+----+----+----+----+

Or even only the distinct values:

+----+
|col1|
+----+
|   A|
|   B|
|   C|
+----+

Thanks in advance! Best

0

2 Answers 2

2

You can use some thing like below :-

sqlcontext.sql("select x.* from table_name x ,
(select max(col2) as a,max(col3) as b, max(col4) as c from table_name ) temp 
where a=x.col2 or b= x.col3 or c=x.col4")

Will give the desired result.

Sign up to request clarification or add additional context in comments.

Comments

1

It can be solved like this:

df.registerTempTable("temp")

spark.sql("SELECT max(col2) AS max2, max(col3) AS max3, max(col4) AS max4 FROM temp").registerTempTable("max_temp")

spark.sql("SELECT col1 FROM temp, max_temp WHERE col2 = max2 OR col3 = max3 OR col4 = max4").show

1 Comment

Thanks! I just changed sparkto sqlContext in my case and it worked. Is there any best practice when to use udfs for DataFrames and when working with SQL in sqlContext of tempTable?

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.