How can a deep-copy of a DataFrame be requested - without resorting to a full re-computation of the original DataFrame contents?
The purpose is to perform a self-join on a Spark stream.
DataFrames are immutable, so you don't need a deep copy. You can reuse the same DataFrame as many times as you like: every operation creates a new DataFrame and leaves the original unmodified.
For example:
import spark.implicits._ // already in scope in spark-shell; needed for toDF and $"..."
val df = List(1, 2, 3).toDF("id")
val df1 = df.as("df1") // second DataFrame
val df2 = df.as("df2") // third DataFrame
df1.join(df2, $"df1.id" === $"df2.id") // fourth DataFrame; df is still unmodified
This may look like a waste of resources, but because the data inside a DataFrame is immutable too, all four DataFrames can share references to the same underlying objects.
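To see the immutability for yourself, here is a minimal sketch. It assumes a local SparkSession created just for the experiment (in spark-shell, `spark` already exists), and the `doubled` column is only an illustrative name:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; skip this in spark-shell.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("immutability-sketch")
  .getOrCreate()
import spark.implicits._

val df = List(1, 2, 3).toDF("id")

// withColumn returns a new DataFrame; df itself is left as it was.
val withDoubled = df.withColumn("doubled", $"id" * 2)

println(df.columns.mkString(", "))          // id
println(withDoubled.columns.mkString(", ")) // id, doubled
```

The original `df` still has only its `id` column, so it can be reused on either side of a join without copying.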
Common approach:
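import java.util.UUID
import org.apache.spark.sql.functions.col

val asdfDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "text")
val columns = asdfDF.schema.fields.map(_.name).toSeq
val column0 = columns.head
val columnsRest = columns.tail
// temporary column name that will not clash with an existing column
val columnTemp = column0 + UUID.randomUUID().toString
// copy the first column under the temporary name, drop the original,
// then rename the copy back, so asdfDF2 carries its own "id" column
val asdfDF2 = asdfDF
  .withColumn(columnTemp, col(column0))
  .drop(column0)
  .withColumnRenamed(columnTemp, column0)
  .select(column0, columnsRest: _*) // restore the original column order
asdfDF
  .join(asdfDF2, asdfDF("id") === (asdfDF2("id") - 1), "inner")
  .show()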
Short alternative:
val asdfDF = Seq(1, 2, 3).toDF("id")
val asdfDF2 = asdfDF
  .withColumn("id2", $"id")
  .select($"id2".as("id"))
asdfDF
  .join(asdfDF2, asdfDF("id") === (asdfDF2("id") - 1), "inner")
  .show()
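The common approach produces the following output (the short alternative has no text column, so its result contains only the two id columns):
+---+----+---+----+
| id|text| id|text|
+---+----+---+----+
|  1|   a|  2|   b|
|  2|   b|  3|   c|
+---+----+---+----+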
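The same join can probably be written with aliases alone, as in the first answer, without rebuilding the column. This is only a sketch: it assumes a local SparkSession, and the alias names `l` and `r` are arbitrary:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; skip this in spark-shell.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("self-join-sketch")
  .getOrCreate()
import spark.implicits._

val asdfDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "text")

// Each alias gives one side of the join its own qualified column names.
val joined = asdfDF.as("l")
  .join(asdfDF.as("r"), $"l.id" === ($"r.id" - 1), "inner")

joined.show()
```

The result should again have the columns id, text, id, text and two rows. After the join, refer to the columns as `$"l.id"` and `$"r.id"`, because a bare `id` is ambiguous.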