I have a DataFrame from which I need to create a new DataFrame with a small change to the schema, using the following operation:
>>> from pyspark.sql.types import LongType
>>> X = spark.createDataFrame([[1, 2], [3, 4]], ['a', 'b'])
>>> schema_new = X.schema.add('id_col', LongType(), False)
>>> _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)
The problem is that this operation changes the schema of X in place. So when I print X.columns I get
>>> X.columns
['a', 'b', 'id_col']
but the values in X are still the same
>>> X.show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
To avoid changing the schema of X, I tried creating a copy of X in the following ways:
- using the copy and deepcopy methods from the copy module
- simply assigning _X = X
The copy methods failed and returned a
RecursionError: maximum recursion depth exceeded
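For concreteness, these are the copy-module calls I mean (sketch of the failing attempts):
>>> import copy
>>> _X = copy.copy(X)        # raises RecursionError: maximum recursion depth exceeded
>>> _X = copy.deepcopy(X)    # raises RecursionError: maximum recursion depth exceeded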
The assignment method also doesn't work
>>> _X = X
>>> id(_X) == id(X)
True
Since their ids are the same, this doesn't really create a duplicate DataFrame, and operations done on _X are reflected in X.
So my question is really twofold:
- How do I change the schema out of place (that is, without making any changes to X)?
- More importantly, how do I create a duplicate of a PySpark DataFrame?
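For reference, building a fresh StructType instead of calling .add() on X.schema seems to avoid the in-place change, but I don't know whether it is the idiomatic approach (a sketch, using the same names as the example above):
>>> from pyspark.sql.types import StructType, StructField, LongType
>>> X = spark.createDataFrame([[1, 2], [3, 4]], ['a', 'b'])
>>> # build a new StructType object rather than mutating X.schema in place
>>> schema_new = StructType(X.schema.fields + [StructField('id_col', LongType(), False)])
>>> _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)
>>> X.columns   # X's schema is untouched this time
['a', 'b']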
Note:
This question is a follow-up to this post.
One suggestion I've seen is df_copy = original_df.select("*"), maybe with a .cache() added on top. Does that make sense here?
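Applied to the example above, that suggestion would look roughly like this (a sketch; X_copy is just an illustrative name, and whether the .cache() helps depends on the workload):
>>> from pyspark.sql.types import LongType
>>> X = spark.createDataFrame([[1, 2], [3, 4]], ['a', 'b'])
>>> X_copy = X.select("*").cache()   # a new DataFrame object with its own schema object
>>> id(X_copy) == id(X)
False
>>> schema_new = X_copy.schema.add('id_col', LongType(), False)
>>> X.columns                        # the original is no longer affected
['a', 'b']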