I have an input PySpark DataFrame that looks like the following:
df = spark.createDataFrame(
    [("1111", "[clark kent, john, silvie]"),
     ("2222", "[bob, charles, seth rog]"),
     ("3333", "[jane, luke max, adam]"),
    ],
    ["column1", "column2"]
)
| column1 | column2 |
| ------- | ------- |
| 1111 | [clark kent, john, silvie] |
| 2222 | [bob, charles, seth rog] |
| 3333 | [jane, luke max, adam] |
My goal is to convert column2, which is a StringType(), into an ArrayType() of StringType().
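(For reference, a quick schema check on the DataFrame above confirms that column2 starts out as a plain string:)

df.printSchema()
# root
#  |-- column1: string (nullable = true)
#  |-- column2: string (nullable = true)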
So far I have only managed a partial result: the column does get converted to an ArrayType, but entries that contain more than one word are split into separate elements, like the following:
from pyspark.sql.functions import expr

# (\w+) matches each separate run of word characters,
# so multi-word names like "clark kent" get split apart
df_out = df.withColumn('column2', expr(r"regexp_extract_all(column2, '(\\w+)', 1)"))
That gives me something like this (my regex skills aren't that good):
| column1 | column2 |
| ------- | ------- |
| 1111 | ["clark", "kent", "john", "silvie"] |
| 2222 | ["bob", "charles", "seth", "rog"] |
| 3333 | ["jane", "luke", "max", "adam"] |
But I'm actually looking to get something like:
| column1 | column2 |
| ------- | ------- |
| 1111 | ["clark kent", "john", "silvie"] |
| 2222 | ["bob", "charles", "seth rog"] |
| 3333 | ["jane", "luke max", "adam"] |
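In other words, I think the split needs to happen on the commas rather than on whitespace. One direction I've sketched out (assuming every value is wrapped in square brackets and the entries are always separated by a comma, which may not hold for all my data) is to strip the brackets and then split on the delimiter:

from pyspark.sql.functions import expr

# Remove the leading '[' and trailing ']', then split on each comma
# (plus any following whitespace) so multi-word names stay intact
df_try = df.withColumn(
    'column2',
    expr(r"split(regexp_replace(column2, '^\\[|\\]$', ''), ',\\s*')")
)

Is this the right approach, or is there a cleaner way to do this conversion?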