I have a table as below:

ID   String
1    a,b,c
2    b,c,a
3    c,a,b

I want to sort the values inside String so that every row reads a,b,c. I can then group by the sorted string, and IDs 1, 2, and 3 will end up in the same group.

Is there any way to sort the multiple values within one string, like below?

   ID     String     String2
    1      a,b,c      a,b,c
    2      b,c,a      a,b,c
    3      c,a,b      a,b,c

I tried the following, but it raises an error. Where did it go wrong?

df2 = df.withColumn('String2', ','.join(sorted(df.String.split(','))))
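For context, the attempt above applies Python string methods to df.String, which is a Spark Column object rather than a per-row string, so the expression is evaluated once on the driver instead of once per row. The per-row logic it is reaching for can be sketched in plain Python as follows. This is only an illustration of the intent: the normalize helper and the rows list are hypothetical names, not part of the original post.

```python
from collections import defaultdict


def normalize(value, sep=","):
    """Split a delimited string, sort the parts, and join them back together."""
    return sep.join(sorted(value.split(sep)))


# Hypothetical in-memory stand-in for the (ID, String) table in the question.
rows = [(1, "a,b,c"), (2, "b,c,a"), (3, "c,a,b")]

# Group the IDs by their normalized string.
groups = defaultdict(list)
for row_id, value in rows:
    groups[normalize(value)].append(row_id)
groups = dict(groups)

print(groups)  # {'a,b,c': [1, 2, 3]}
```

In Spark, the same split, sort, and join steps have to be expressed with column functions (or a UDF) so that they run against each row's value.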

Thanks to everyone who contributed to this post. The correct code is posted below:

import pyspark.sql.functions as F
array_sort_udf = F.udf(sorted, 'array<string>')

df2 = df\
.withColumn("String2", F.concat_ws(",", array_sort_udf(F.split("String", ","))))

2 Answers

You can use a combination of native Spark SQL functions to achieve this. The split function creates an array of the elements, which can then be sorted with array_sort. Finally, you can concatenate the values back together with concat_ws.

import pyspark.sql.functions as F
df = spark.createDataFrame([(1, "a,b,c"), (2, "b,c,a"), (3, "c,a,b")], ["ID", "String"])

df.withColumn("String2", F.concat_ws(",", F.array_sort(F.split("String", ",")))).show()

+---+------+-------+
| ID|String|String2|
+---+------+-------+
|  1| a,b,c|  a,b,c|
|  2| b,c,a|  a,b,c|
|  3| c,a,b|  a,b,c|
+---+------+-------+

Check out the pySpark API reference for more details.

2 Comments

Thank you. I edited it a bit, since my Spark version did not have array_sort, but it works!
I see you used a UDF as a workaround for the missing array_sort function. UDFs should only be used as a last resort. If you have performance issues, consider @mck's answer, which shows a workaround using only native functions.

Another way, without a UDF:

from pyspark.sql import functions as F, Window

result = df.select(
    'ID',
    F.explode(F.split('String',',')).alias('String')
).withColumn(
    'String_list', 
    F.collect_list('String').over(
        Window.partitionBy('ID')
              .orderBy('String')
              .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
    )
).selectExpr(
    'ID',
    'String_list as String'
).distinct()

result.show()

+---+---------+
| ID|   String|
+---+---------+
|  3|[a, b, c]|
|  1|[a, b, c]|
|  2|[a, b, c]|
+---+---------+
