I have a dataframe like this:

data = [(('ID1', "[apples, mangos, eggs, milk, oranges]")),
   (('ID1', "[eggs, milk, cereals, mangos, apples]"))]
df = spark.createDataFrame(data, ['ID', "colval"])
df.show(truncate=False)
df.printSchema()

+---+-------------------------------------+
|ID |colval                               |
+---+-------------------------------------+
|ID1|[apples, mangos, eggs, milk, oranges]|
|ID1|[eggs, milk, cereals, mangos, apples]|
+---+-------------------------------------+

root
 |-- ID: string (nullable = true)
 |-- colval: string (nullable = true)

I want to convert colval to an array of strings, so that the schema looks like this:

root
 |-- ID: string (nullable = true)
 |-- colval: array (nullable = true)
 |    |-- element: string (containsNull = true)

I tried using split, but ended up with the result below. When I take the first element after the split, I get back the same full string as before. Any help?

df = df.withColumn('colval', split('colval', "', ?'"))
df.show(truncate = False)
df.printSchema()

+---+---------------------------------------+
|ID |colval                                 |
+---+---------------------------------------+
|ID1|[[apples, mangos, eggs, milk, oranges]]|
|ID1|[[eggs, milk, cereals, mangos, apples]]|
+---+---------------------------------------+

root
 |-- ID: string (nullable = true)
 |-- colval: array (nullable = true)
 |    |-- element: string (containsNull = true)

1 Answer

You can replace the [ and ] and then split:

from pyspark.sql import functions as F
df.withColumn("colval",F.split(F.regexp_replace("colval",r"\[|\]",""),",")).show(truncate=False)

+---+-----------------------------------------+
|ID |colval                                   |
+---+-----------------------------------------+
|ID1|[apples,  mangos,  eggs,  milk,  oranges]|
|ID1|[eggs,  milk,  cereals,  mangos,  apples]|
+---+-----------------------------------------+


root
 |-- ID: string (nullable = true)
 |-- colval: array (nullable = true)
 |    |-- element: string (containsNull = true)
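To see why the pattern from the question left the string unsplit, the same regexes can be tried outside Spark. This is only a sketch that uses Python's `re` module as a stand-in: Spark evaluates patterns with Java's regex engine, but these simple patterns should behave the same in both. The variable names are illustrative.

```python
import re

raw = "[apples, mangos, eggs, milk, oranges]"

# The question's pattern expects quote characters around the comma.
# The string contains no quotes, so nothing matches and nothing is split.
unsplit = re.split(r"', ?'", raw)

# Stripping the brackets and splitting on the bare comma works,
# but each element keeps the space that followed its comma.
stripped = re.sub(r"\[|\]", "", raw)
parts = re.split(r",", stripped)

print(unsplit)  # ['[apples, mangos, eggs, milk, oranges]']
print(parts)    # ['apples', ' mangos', ' eggs', ' milk', ' oranges']
```

The single-element list in the first case is what shows up as the doubled brackets in the question's output.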

In case you want to trim the elements after splitting, you can apply a higher-order function to the resulting array:

(df.withColumn("colval",F.split(F.regexp_replace("colval",r"\[|\]",""),","))
.withColumn("colval",F.expr("transform(colval,x-> trim(x))")))

Verification of the difference between approaches 1 and 2 (note the extra leading spaces in the first result):

df.withColumn("colval",F.split(F.regexp_replace("colval",r"\[|\]",""),",")).collect()
[Row(ID='ID1', colval=['apples', ' mangos', ' eggs', ' milk', ' oranges']),
 Row(ID='ID1', colval=['eggs', ' milk', ' cereals', ' mangos', ' apples'])]


(df.withColumn("colval",F.split(F.regexp_replace("colval",r"\[|\]",""),","))
 .withColumn("colval",F.expr("transform(colval,x-> trim(x))"))).collect()

[Row(ID='ID1', colval=['apples', 'mangos', 'eggs', 'milk', 'oranges']),
 Row(ID='ID1', colval=['eggs', 'milk', 'cereals', 'mangos', 'apples'])]