
I need to "clone" or "duplicate"/"triplicate" every row from my dataframe.

I couldn't find anything about it; I just know that I need to use explode.

Example:

ID - Name
1     John
2     Maria
3     Charles

Output:

ID - Name
1     John
1     John
2     Maria
2     Maria
3     Charles
3     Charles

Thanks

Comment: Why don't you union the dataframe with itself?
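For reference, that suggestion would look roughly like this (a sketch, not from the thread; union keeps duplicate rows, so chaining it replicates them; works on Spark 2.0+, use unionAll on older versions):

df.union(df)             # duplicate every row
df.union(df).union(df)   # triplicate every row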

1 Answer

You could use array_repeat with explode (Spark 2.4+).

For duplicate:

from pyspark.sql import functions as F

# Make Name a 2-element array of itself, then explode one row per element
df.withColumn("Name", F.explode(F.array_repeat("Name", 2)))

For triplicate:

# Same idea with a 3-element array
df.withColumn("Name", F.explode(F.array_repeat("Name", 3)))
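Putting it together on the question's data (a minimal runnable sketch, assuming an existing SparkSession named spark):

from pyspark.sql import functions as F

# Recreate the question's example DataFrame
df = spark.createDataFrame(
    [(1, "John"), (2, "Maria"), (3, "Charles")],
    ["ID", "Name"],
)

# Duplicate every row
df.withColumn("Name", F.explode(F.array_repeat("Name", 2))).show()

#+---+-------+
#| ID|   Name|
#+---+-------+
#|  1|   John|
#|  1|   John|
#|  2|  Maria|
#|  2|  Maria|
#|  3|Charles|
#|  3|Charles|
#+---+-------+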

For Spark < 2.4, where array_repeat is unavailable, build the array explicitly:

#duplicate: an array holding the same column twice, then explode
df.withColumn("Name", F.explode(F.array(*["Name"] * 2)))

#triplicate
df.withColumn("Name", F.explode(F.array(*["Name"] * 3)))

UPDATE:

To replicate each row a number of times driven by another column, Support, you could use this (Spark 2.4+).

df.show()

#+---+-------+-------+
#| ID|   Name|Support|
#+---+-------+-------+
#|  1|   John|      2|
#|  2|  Maria|      4|
#|  3|Charles|      6|
#+---+-------+-------+

from pyspark.sql import functions as F

# array_repeat needs an integer count, so cast Support to int inside the expression
df.withColumn("Name", F.explode(F.expr("""array_repeat(Name,int(Support))"""))).show()

#+---+-------+-------+
#| ID|   Name|Support|
#+---+-------+-------+
#|  1|   John|      2|
#|  1|   John|      2|
#|  2|  Maria|      4|
#|  2|  Maria|      4|
#|  2|  Maria|      4|
#|  2|  Maria|      4|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#|  3|Charles|      6|
#+---+-------+-------+

For Spark 1.5+, combine repeat, concat, substring, split and explode: append a comma to Name, repeat that string Support times, trim the trailing comma, split back into an array, and explode it.

from pyspark.sql import functions as F
df.withColumn("Name", F.expr("""repeat(concat(Name,','),Support)"""))\
  .withColumn("Name", F.explode(F.expr("""split(substring(Name,1,length(Name)-1),',')"""))).show()
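
Step by step, for Name = 'John' and Support = 2, the expression evaluates as:

#concat(Name, ',')                  -> 'John,'
#repeat(..., Support)               -> 'John,John,'
#substring(..., 1, length(...) - 1) -> 'John,John'
#split(..., ',')                    -> ['John', 'John']
#explode(...)                       -> two 'John' rows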

5 Comments

Hey @Mohammad, do you know if it's possible to multiply the number of rows based on a condition? For example, there is a Support column with the numbers 2, 4, 6, and I'd like to explode according to those numbers.
What's your Spark version? And a Support column with 2, 4, 6 means replicate 2 times, 4 times, 6 times, right?
Meaning that I can't fix this parameter to 2, 4, 6; it should be something that reads the column, like df.withColumn("Name", F.explode(F.array_repeat("Name", F.col('parameter'))))
@thalesthales did you check my update? The only way to keep it dynamic like that is to use an expression and pass an int value of parameter, like df.withColumn("Name", F.explode(F.expr("""array_repeat(Name,int(parameter))""")))
Hi @murtihash - thank you for sharing this neat solution. What if I'd like to simply duplicate, triplicate, 4x, etc. ALL of the columns in a given dataframe? Seeing your 'UPDATE' example, I can think of adding another column with all of its values set to 2, 3 or 4 to duplicate/triplicate/quadruple all rows. But I'm wondering if there's a more elegant way (without having to add that new column). Alternatively, I can loop X times and do something like df_old = df.union(df_old), but that's also not that clean. Thank you in advance for your answer!
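One way to replicate all columns N times without permanently adding a column (not from the thread; a minimal sketch, assuming an existing SparkSession named spark and Spark 2.1+ for crossJoin) is to cross join with a small N-row DataFrame and drop its column:

# Triplicate every row: cross join with a 3-row DataFrame, then drop the helper column
df.crossJoin(spark.range(3).toDF("rep")).drop("rep")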
