Clone/Deep-Copy a Spark DataFrame

Question

How can I request a deep copy of a DataFrame without fully recomputing the contents of the original DataFrame?

The goal is to perform a self-join on a Spark stream.

Krzysztof Atłasik · Accepted Answer · 2019-07-15 21:19:54Z

DataFrames are immutable. That means you don't need to make deep copies: you can reuse a DataFrame as many times as you like, and every operation creates a new DataFrame while the original stays unmodified.

For example:

    import spark.implicits._ // for toDF and $"..."; already in scope in spark-shell

    val df = List(1, 2, 3).toDF("id")

    val df1 = df.as("df1") // second DataFrame
    val df2 = df.as("df2") // third DataFrame

    df1.join(df2, $"df1.id" === $"df2.id") // fourth DataFrame; df is still unmodified

This may look like a waste of resources, but because the data inside a DataFrame is immutable as well, all four DataFrames can reuse references to the same underlying objects.
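The immutability claim can be checked directly. The following is a minimal sketch, assuming a local `SparkSession`; the object name `ImmutabilitySketch`, the method name, and the `flag` column are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object ImmutabilitySketch {
  // Returns the column names of the original DataFrame and of a derived one.
  def columnsBeforeAndAfter(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._

    val df = List(1, 2, 3).toDF("id")

    // withColumn returns a new DataFrame; df itself is left untouched.
    val withFlag = df.withColumn("flag", lit(true))

    (df.columns.toSeq, withFlag.columns.toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("immutability-sketch")
      .getOrCreate()

    val (before, after) = columnsBeforeAndAfter(spark)
    println(before.mkString(",")) // columns of the original DataFrame
    println(after.mkString(","))  // columns of the derived DataFrame

    spark.stop()
  }
}
```

The original `df` still reports only the `id` column after `withColumn` has been called, which is why no explicit copy is needed before deriving further DataFrames from it.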

edited Jul 15, 2019 at 21:19

answered Jul 15, 2019 at 21:07

StanislavKo · Accepted Answer · 2020-01-28 11:43:43Z

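Common approach:

    import java.util.UUID
    import org.apache.spark.sql.functions.col
    import spark.implicits._ // for toDF; already in scope in spark-shell

    val asdfDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "text")
    val columns = asdfDF.schema.fields.map(_.name).toSeq
    val column0 = columns.head
    val columnsRest = columns.tail
    // Temporary, unique name for the copy of the first column
    val columnTemp = column0 + UUID.randomUUID().toString
    // Rebuild the first column: copy it, drop the original, rename the copy back,
    // then restore the original column order
    val asdfDF2 = asdfDF
      .withColumn(columnTemp, col(column0))
      .drop(column0)
      .withColumnRenamed(columnTemp, column0)
      .select(column0, columnsRest: _*)
    // Self-join: pair each row with the row whose id is one greater
    asdfDF
      .join(asdfDF2,
        asdfDF("id") === (asdfDF2("id") - 1),
        "inner")
      .show()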

Short alternative:

    val asdfDF = Seq(1, 2, 3).toDF("id")
    val asdfDF2 = asdfDF
      .withColumn("id2", $"id")
      .select($"id2".as("id"))
    asdfDF
      .join(asdfDF2,
        asdfDF("id") === (asdfDF2("id") - 1)
        , "inner")
      .show()

The common approach above produces the following output:

+---+----+---+----+
| id|text| id|text|
+---+----+---+----+
|  1|   a|  2|   b|
|  2|   b|  3|   c|
+---+----+---+----+
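
For comparison, the same join condition can also be written with the `as` aliases from the first answer instead of rebuilding a column. This is only a sketch, assuming a local `SparkSession`; the object name `AliasSelfJoinSketch`, the method name, and the aliases `l` and `r` are illustrative and do not come from either answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AliasSelfJoinSketch {
  // Joins the DataFrame to itself so that each row is paired with
  // the row whose id is one greater.
  def selfJoinNext(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "text")

    // Two aliases of the same DataFrame; no copy is made.
    val left = df.as("l")
    val right = df.as("r")

    left.join(right, $"l.id" === ($"r.id" - 1), "inner")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("alias-self-join-sketch")
      .getOrCreate()

    selfJoinNext(spark).show()

    spark.stop()
  }
}
```

The aliases let the join condition refer to each side unambiguously by qualified name (`l.id`, `r.id`), which is the part the column-rebuilding approach achieves through `asdfDF("id")` and `asdfDF2("id")`.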
