I am running into a problem when executing the following code:
from pyspark import SparkContext
from pyspark.sql import functions as F
from pyspark.sql import Row, HiveContext

sc = SparkContext()
hc = HiveContext(sc)

rows1 = [Row(id1='2', id2='1', id3='a'),
         Row(id1='3', id2='2', id3='a'),
         Row(id1='4', id2='3', id3='b')]
df1 = hc.createDataFrame(rows1)

# keep only the rows tagged 'a', then join the result back onto df1
df2 = df1.filter(F.col("id3") == "a")
df3 = df1.join(df2, df1.id2 == df2.id1, "inner")
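When I run this, df3 comes back completely empty. A quick check (count() and show() are just the standard DataFrame actions I use to inspect it):

df3.count()    # returns 0
df3.show()     # prints only the header row, no data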
However, if I change the code to the following, it gives the correct result (a DataFrame of 2 rows):
from pyspark import SparkContext
from pyspark.sql import functions as F
from pyspark.sql import Row, HiveContext

sc = SparkContext()
hc = HiveContext(sc)

rows1 = [Row(id1='2', id2='1', id3='a'),
         Row(id1='3', id2='2', id3='a'),
         Row(id1='4', id2='3', id3='b')]
df1 = hc.createDataFrame(rows1)

# build a second DataFrame from identical rows instead of reusing df1
rows2 = [Row(id1='2', id2='1', id3='a'),
         Row(id1='3', id2='2', id3='a'),
         Row(id1='4', id2='3', id3='b')]
df1_temp = hc.createDataFrame(rows2)

# filter the copy, then join it against the original
df2 = df1_temp.filter(F.col("id3") == "a")
df3 = df1.join(df2, df1.id2 == df2.id1, "inner")
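This time df3 contains the two row pairs I expect (tracing the join by hand: df1's (3, 2, a) matches df2's (2, 1, a) on id2 == id1, and df1's (4, 3, b) matches df2's (3, 2, a)):

df3.show()
# +---+---+---+---+---+---+
# |id1|id2|id3|id1|id2|id3|
# +---+---+---+---+---+---+
# |  3|  2|  a|  2|  1|  a|
# |  4|  3|  b|  3|  2|  a|
# +---+---+---+---+---+---+
# (the exact column layout may differ, but these are the two matches)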
So my question is: why do I have to create a temporary DataFrame here? And if I can't get hold of the HiveContext in my part of the project, how can I make a duplicate of an existing DataFrame?
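For the second question, the kind of thing I have in mind is sketched below. alias() is my guess at a way to get an independent handle on each side of the self-join without needing the HiveContext, but I don't know whether it actually avoids whatever makes the first version return nothing:

# hypothetical sketch: alias each side of the self-join
# instead of rebuilding the DataFrame from scratch
df2 = df1.filter(F.col("id3") == "a").alias("b")
df3 = df1.alias("a").join(df2, F.col("a.id2") == F.col("b.id1"), "inner")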