I have an input PySpark DataFrame that looks like the following:
df = spark.createDataFrame(
    [("1111", "[clark kent, john, silvie]"),
     ("2222", "[bob, charles, seth rog]"),
     ("3333", "[jane, luke max, adam]"),
    ],
    ["column1", "column2"]
)
| column1 | column2 |
| ------- | ------- |
| 1111 | [clark kent, john, silvie] |
| 2222 | [bob, charles, seth rog] |
| 3333 | [jane, luke max, adam] |
My goal is to convert column2, which is a StringType(), into an ArrayType() of StringType().
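(For reference, a quick schema check on the DataFrame above confirms that column2 starts out as a plain string:)

df.printSchema()
# root
#  |-- column1: string (nullable = true)
#  |-- column2: string (nullable = true)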
So far I have only managed a partial result: the column does get converted to an ArrayType, but entries that contain more than one word are split into separate elements, like the following:
from pyspark.sql.functions import expr

# (\w+) matches each separate run of word characters,
# so multi-word names like "clark kent" get split apart
df_out = df.withColumn('column2', expr(r"regexp_extract_all(column2, '(\\w+)', 1)"))
That gives me something like this (my regex skills aren't that good):
| column1 | column2 |
| ------- | ------- |
| 1111 | ["clark", "kent", "john", "silvie"] |
| 2222 | ["bob", "charles", "seth", "rog"] |
| 3333 | ["jane", "luke", "max", "adam"] |
But I'm actually looking to get something like:
| column1 | column2 |
| ------- | ------- |
| 1111 | ["clark kent", "john", "silvie"] |
| 2222 | ["bob", "charles", "seth rog"] |
| 3333 | ["jane", "luke max", "adam"] |
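In other words, I think the split needs to happen on the commas rather than on whitespace. One direction I've sketched out (assuming every value is wrapped in square brackets and the entries are always separated by a comma, which may not hold for all my data) is to strip the brackets and then split on the delimiter:

from pyspark.sql.functions import expr

# Remove the leading '[' and trailing ']', then split on each comma
# (plus any following whitespace) so multi-word names stay intact
df_try = df.withColumn(
    'column2',
    expr(r"split(regexp_replace(column2, '^\\[|\\]$', ''), ',\\s*')")
)

Is this the right approach, or is there a cleaner way to do this conversion?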