I have a DataFrame from which I need to create a new DataFrame with a small change to the schema, using the following operation:
>>> from pyspark.sql.types import LongType
>>> X = spark.createDataFrame([[1, 2], [3, 4]], ['a', 'b'])
>>> schema_new = X.schema.add('id_col', LongType(), False)
>>> _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)
The problem is that this operation changes the schema of X in place. So when I print X.columns I get
>>> X.columns
['a', 'b', 'id_col']
but the values in X are still the same
>>> X.show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
To avoid changing the schema of X, I tried creating a copy of X in the following ways:
- using the copy and deepcopy methods from the copy module
- simply assigning _X = X
The copy methods failed and returned a
RecursionError: maximum recursion depth exceeded
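For concreteness, these are the copy-module calls I mean (sketch of the failing attempts):
>>> import copy
>>> _X = copy.copy(X)        # raises RecursionError: maximum recursion depth exceeded
>>> _X = copy.deepcopy(X)    # raises RecursionError: maximum recursion depth exceeded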
The assignment method also doesn't work
>>> _X = X
>>> id(_X) == id(X)
True
Since their ids are the same, this doesn't really create a duplicate DataFrame, and operations done on _X are reflected in X.
So my question is really twofold:
- How do I change the schema out of place (that is, without making any changes to X)?
- More importantly, how do I create a duplicate of a PySpark DataFrame?
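For reference, building a fresh StructType instead of calling .add() on X.schema seems to avoid the in-place change, but I don't know whether it is the idiomatic approach (a sketch, using the same names as the example above):
>>> from pyspark.sql.types import StructType, StructField, LongType
>>> X = spark.createDataFrame([[1, 2], [3, 4]], ['a', 'b'])
>>> # build a new StructType object rather than mutating X.schema in place
>>> schema_new = StructType(X.schema.fields + [StructField('id_col', LongType(), False)])
>>> _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)
>>> X.columns   # X's schema is untouched this time
['a', 'b']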
Note:
This question is a follow-up to this post.
One suggestion I've seen is df_copy = original_df.select("*"), maybe with a .cache() added on top. Does that make sense here?
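Applied to the example above, that suggestion would look roughly like this (a sketch; X_copy is just an illustrative name, and whether the .cache() helps depends on the workload):
>>> from pyspark.sql.types import LongType
>>> X = spark.createDataFrame([[1, 2], [3, 4]], ['a', 'b'])
>>> X_copy = X.select("*").cache()   # a new DataFrame object with its own schema object
>>> id(X_copy) == id(X)
False
>>> schema_new = X_copy.schema.add('id_col', LongType(), False)
>>> X.columns                        # the original is no longer affected
['a', 'b']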