0

How to calculate the join of two Dataframes using multiple columns as key? For example DF1 , DF2 are the two dataFrame.

This is the way by which we can calculate the join,

JoinDF = DF1.join(DF2, DF1("column1") === DF2("column11") && DF1("column2") === DF2("column22"), "outer") 

But my problem is how to access the multiple columns if they are stored in an arrays like :

DF1KeyArray=Array{column1,column2}
DF2KeyArray=Array{column11,column22}

then It is not possible to calculate the join by this method

JoinDF = DF1.join(DF2, DF1(DF1KeyArray)=== DF2(DF2KeyArray), "outer")

In this case error was :

<console>:128: error: type mismatch;
found   : Array[String]
required: String

Is there any way to access multiple columns as keys stored in an Array for calculation of join?

2
  • 1
    Please format you question ! This is not readable. Add the programming language tag too ! Commented Feb 2, 2016 at 10:14
  • @eliasah Scala is the programming language . Commented Feb 2, 2016 at 10:28

1 Answer 1

10

You can simply create joinExprs programmatically:

val df1KeyArray: Array[String] = ???
val df2KeyArray: Array[String] = ???

val df1: DataFrame = ???
val df2: DataFrame = ???

val joinExprs = df1KeyArray
  .zip(df2KeyArray)
  .map{case (c1, c2) => df1(c1) === df2(c2)}
  .reduce(_ && _)

df1.join(df2, joinExprs, "outer")

See also Including null values in an Apache Spark Join

Sign up to request clarification or add additional context in comments.

2 Comments

nice use of .reduce(_ && _)
Best way to join using multiple columns. @zero323 Please add that if someone wants to return true (rather than NULL) if both inputs are NULL, then === must be replaced with <=> (EqualNullSafe)

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.