9

How can a deep-copy of a DataFrame be requested - without resorting to a full re-computation of the original DataFrame contents?

The goal is to perform a self-join on a Spark Stream.

2 Answers

14

DataFrames are immutable. That means you don't need deep copies: you can reuse a DataFrame as many times as you like, and every operation creates a new DataFrame while the original stays unmodified.

For example:

    import spark.implicits._ // assumes a SparkSession named `spark`

    val df = List(1, 2, 3).toDF("id")

    val df1 = df.as("df1") // second DataFrame
    val df2 = df.as("df2") // third DataFrame

    // fourth DataFrame; df is still unmodified
    df1.join(df2, $"df1.id" === $"df2.id")

This may look like a waste of resources, but because the data inside a DataFrame is immutable too, all four DataFrames can share references to the same underlying objects instead of copying them.
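A minimal sketch of how one might check this, assuming a local SparkSession. The object name `AliasSelfJoin`, the app name, the `local[*]` master, and the alias names `l` and `r` are illustrative choices, not part of the original answer. It joins two aliases of the same DataFrame and then confirms that the original still has its single column and three rows:

```scala
import org.apache.spark.sql.SparkSession

object AliasSelfJoin {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only
    val spark = SparkSession.builder()
      .appName("alias-self-join")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("id")

    // Aliasing only attaches a qualifier to each side of the join
    val left = df.as("l")
    val right = df.as("r")

    // Qualified names disambiguate the two "id" columns
    val joined = left.join(right, $"l.id" === $"r.id")
    joined.show()

    // The original DataFrame is untouched by the operations above
    assert(df.columns.sameElements(Array("id")))
    assert(df.count() == 3)
    assert(joined.count() == 3)

    spark.stop()
  }
}
```

The assertions are only there to make the point explicit; in real code you would simply keep using `df` after the join.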


1

Common approach:

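    import java.util.UUID
    import org.apache.spark.sql.functions.col
    // assumes `spark.implicits._` is in scope for toDF

    val asdfDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "text")
    val columns = asdfDF.schema.fields.map(_.name).toSeq
    val column0 = columns.head
    val columnsRest = columns.tail
    // temporary, collision-free name for the first column
    val columnTemp = column0 + UUID.randomUUID().toString
    // re-create the first column under the temporary name, drop the
    // original, rename it back, and restore the original column order
    val asdfDF2 = asdfDF
      .withColumn(columnTemp, col(column0))
      .drop(column0)
      .withColumnRenamed(columnTemp, column0)
      .select(column0, columnsRest: _*)
    // self-join: each row is matched with the row whose id is one higher
    asdfDF
      .join(asdfDF2,
        asdfDF("id") === (asdfDF2("id") - 1)
        , "inner")
      .show()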

Short alternative:

    val asdfDF = Seq(1, 2, 3).toDF("id")
    val asdfDF2 = asdfDF
      .withColumn("id2", $"id")
      .select($"id2".as("id"))
    asdfDF
      .join(asdfDF2,
        asdfDF("id") === (asdfDF2("id") - 1)
        , "inner")
      .show()

The common approach produces the following output (the short alternative has no text column, so it prints only the two id columns):

    +---+----+---+----+
    | id|text| id|text|
    +---+----+---+----+
    |  1|   a|  2|   b|
    |  2|   b|  3|   c|
    +---+----+---+----+
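For comparison, the same offset join can probably be written with aliases instead of rebuilding the column. This is a sketch only, assuming a local SparkSession. The object name `AliasOffsetJoin` and the alias names `a` and `b` are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object AliasOffsetJoin {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only
    val spark = SparkSession.builder()
      .appName("alias-offset-join")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val asdfDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "text")

    // Two aliases of the same DataFrame stand in for the "copy"
    val a = asdfDF.as("a")
    val b = asdfDF.as("b")

    // Match each row with the row whose id is one higher
    val joined = a.join(b, $"a.id" === ($"b.id" - 1), "inner")

    // Read values by position, since both sides share column names
    val rows = joined
      .select($"a.id", $"a.text", $"b.id", $"b.text")
      .collect()
      .map(r => (r.getInt(0), r.getString(1), r.getInt(2), r.getString(3)))
      .sortBy(_._1)

    rows.foreach(println)
    assert(rows.sameElements(Array((1, "a", 2, "b"), (2, "b", 3, "c"))))

    spark.stop()
  }
}
```

Whether this is preferable to the column-rebuilding approach depends on how the joined columns are referenced later: with aliases, the qualified names `a.id` and `b.id` stay available for selecting either side.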
