
I need to "clone" or "duplicate"/"triplicate" every row from my dataframe.

I couldn't find anything about it; I just know that I need to use explode.

Example:

ID - Name
1     John
2     Maria
3     Charles

Output:

ID - Name
1     John
1     John
2     Maria
2     Maria
3     Charles
3     Charles

Thanks

Comment: Why don't you union the dataframe with itself?
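For reference, that suggestion would look roughly like this (a sketch, not from the thread; union keeps duplicate rows, so chaining it replicates them; works on Spark 2.0+, use unionAll on older versions):

df.union(df)             # duplicate every row
df.union(df).union(df)   # triplicate every row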

1 Answer

You could use array_repeat with explode (Spark 2.4+).

For duplicate:

from pyspark.sql import functions as F

# Make Name a 2-element array of itself, then explode one row per element
df.withColumn("Name", F.explode(F.array_repeat("Name", 2)))

For triplicate:

# Same idea with a 3-element array
df.withColumn("Name", F.explode(F.array_repeat("Name", 3)))
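Putting it together on the question's data (a minimal runnable sketch, assuming an existing SparkSession named spark):

from pyspark.sql import functions as F

# Recreate the question's example DataFrame
df = spark.createDataFrame(
    [(1, "John"), (2, "Maria"), (3, "Charles")],
    ["ID", "Name"],
)

# Duplicate every row
df.withColumn("Name", F.explode(F.array_repeat("Name", 2))).show()

#+---+-------+
#| ID|   Name|
#+---+-------+
#|  1|   John|
#|  1|   John|
#|  2|  Maria|
#|  2|  Maria|
#|  3|Charles|
#|  3|Charles|
#+---+-------+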

For Spark < 2.4, where array_repeat is unavailable, build the array explicitly:

#duplicate: an array holding the same column twice, then explode
df.withColumn("Name", F.explode(F.array(*["Name"] * 2)))

#triplicate
df.withColumn("Name", F.explode(F.array(*["Name"] * 3)))

UPDATE:

To replicate each row a number of times driven by another column, Support, you could use this (Spark 2.4+).

df.show()

#+---+-------+-------+
#| ID|   Name|Support|
#+---+-------+-------+
#|  1|   John|      2|
#|  2|  Maria|      4|
#|  3|Charles|      6|
#+---+-------+-------+

from pyspark.sql import functions as F

# array_repeat needs an integer count, so cast Support to int inside the expression
df.withColumn("Name", F.explode(F.expr("""array_repeat(Name,int(Support))"""))).show()

#+---+-------+-------+
#| ID|   Name|Support|
#+---+-------+-------+
#|  1|   John|      2|
#|  1|   John|      2|
#|  2|  Maria|      4|
#|  2|  Maria|      4|
#|  2|  Maria|      4|
#|  2|  Maria|      4|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#+---+-------+-------+

For Spark 1.5+, combine repeat, concat, substring, split and explode: append a comma to Name, repeat that string Support times, trim the trailing comma, split back into an array, and explode it.

from pyspark.sql import functions as F
df.withColumn("Name", F.expr("""repeat(concat(Name,','),Support)"""))\
  .withColumn("Name", F.explode(F.expr("""split(substring(Name,1,length(Name)-1),',')"""))).show()
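
Step by step, for Name = 'John' and Support = 2, the expression evaluates as:

#concat(Name, ',')                  -> 'John,'
#repeat(..., Support)               -> 'John,John,'
#substring(..., 1, length(...) - 1) -> 'John,John'
#split(..., ',')                    -> ['John', 'John']
#explode(...)                       -> two 'John' rows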

5 Comments

Hey @Mohammad, do you know if it's possible to multiply the number of rows based on a condition? For example, there is a Support column with the numbers 2, 4, 6, and I'd like to explode according to those numbers.
What's your Spark version? And a Support column with 2, 4, 6 means replicate 2 times, 4 times, 6 times, right?
Meaning that I can't fix this parameter to 2, 4, 6; it should be something that reads the column, like df.withColumn("Name", F.explode(F.array_repeat("Name", F.col('parameter'))))
@thalesthales did you check my update? The only way to keep it dynamic like that is to use an expression and pass an int value of parameter, like df.withColumn("Name", F.explode(F.expr("""array_repeat(Name,int(parameter))""")))
Hi @murtihash - thank you for sharing this neat solution. What if I'd like to simply duplicate, triplicate, 4x, etc. ALL of the columns in a given dataframe? Seeing your 'UPDATE' example, I can think of adding another column with all of its values set to 2, 3 or 4 to duplicate/triplicate/quadruple all rows. But I'm wondering if there's a more elegant way (without having to add that new column). Alternatively, I can loop X times and do something like df_old = df.union(df_old), but that's also not that clean. Thank you in advance for your answer!
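One way to replicate all columns N times without permanently adding a column (not from the thread; a minimal sketch, assuming an existing SparkSession named spark and Spark 2.1+ for crossJoin) is to cross join with a small N-row DataFrame and drop its column:

# Triplicate every row: cross join with a 3-row DataFrame, then drop the helper column
df.crossJoin(spark.range(3).toDF("rep")).drop("rep")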
