
I have created a PySpark DataFrame as below:

df = spark.createDataFrame([([0.1,0.2], 2), ([0.1], 3), ([0.3,0.3,0.4], 2)], ("a", "b"))

df.show()

+---------------+---+
|              a|  b|
+---------------+---+
|     [0.1, 0.2]|  2|
|          [0.1]|  3|
|[0.3, 0.3, 0.4]|  2|
+---------------+---+

Now, I am trying to parse column 'a' one row at a time, as below:

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import ArrayType, FloatType

parse_col = udf(lambda row: [ x for x in row.a], ArrayType(FloatType()))

new_df = df.withColumn("a_new", parse_col(struct([df[x] for x in df.columns if x == 'a'])))

new_df.show()

This works fine.

+---------------+---+---------------+
|              a|  b|          a_new|
+---------------+---+---------------+
|     [0.1, 0.2]|  2|     [0.1, 0.2]|
|          [0.1]|  3|          [0.1]|
|[0.3, 0.3, 0.4]|  2|[0.3, 0.3, 0.4]|
+---------------+---+---------------+

But when I try to format the values, as below:

count_empty_columns = udf(lambda row: ["{:.2f}".format(x) for x in row.a], ArrayType(FloatType()))

new_df = df.withColumn("a_new", count_empty_columns(struct([df[x] for x in df.columns if x == 'a'])))

new_df.show()

it does not work; the values are missing:

+---------------+---+-----+
|              a|  b|a_new|
+---------------+---+-----+
|     [0.1, 0.2]|  2|  [,]|
|          [0.1]|  3|   []|
|[0.3, 0.3, 0.4]|  2| [,,]|
+---------------+---+-----+

I am using Spark v2.3.1.

Any idea what I am doing wrong here?

Thanks

1 Answer

It is simple: types matter. You declare the output as array<float>, but a formatted string is not a float, so the result is undefined. In other words, a value cannot be both a string and a float.
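A quick check in plain Python, with no Spark involved, shows why the UDF in the question cannot satisfy its declared ArrayType(FloatType()): str.format always returns a string. The variable names below are illustrative only.

```python
# Sample values taken from the first row of column "a" in the question.
values = [0.1, 0.2]

# The same expression the question's UDF applies to each element.
formatted = ["{:.2f}".format(x) for x in values]

# Collect the Python type name of every formatted element.
kinds = [type(v).__name__ for v in formatted]

print(formatted)  # ['0.10', '0.20']
print(kinds)      # ['str', 'str']
```

When the Python value does not match the declared return type, Spark tends to produce null rather than raise an error, which would explain the empty entries in a_new.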

If you want strings, declare the column as such:

udf(lambda row: ["{:.2f}".format(x) for x in row.a], "array<string>")

Otherwise, consider rounding or using fixed-precision numbers:

df.select(df["a"].cast("array<decimal(38, 2)>")).show()
+------------------+
|                 a|
+------------------+
|      [0.10, 0.20]|
|            [0.10]|
|[0.30, 0.30, 0.40]|
+------------------+

Note, though, that these are completely different operations: the UDF produces strings, whereas the cast produces fixed-precision decimal numbers.
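The options can be put side by side in a minimal sketch. The helper names (format_two_decimals, round_two_decimals, to_fixed_precision, add_columns) and the new column names a_str and a_rounded are hypothetical, chosen here for illustration. The sketch assumes a DataFrame with an array column named "a", as in the question, and passes the column directly instead of wrapping it in a struct.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_two_decimals(values):
    """Return strings such as '0.10'; pair with a declared type of array<string>."""
    return ["{:.2f}".format(x) for x in values]


def round_two_decimals(values):
    """Return Python floats; pair with a declared type of array<double>."""
    return [round(x, 2) for x in values]


def to_fixed_precision(values):
    """Return Decimal values with exactly two fractional digits."""
    cents = Decimal("0.01")
    return [Decimal(str(x)).quantize(cents, rounding=ROUND_HALF_UP) for x in values]


def add_columns(df):
    """Attach a string variant and a rounded variant of column 'a' to df."""
    # Imported here so the helpers above can be tried without a Spark installation.
    from pyspark.sql.functions import udf

    # The declared return type must match what the Python function returns.
    as_strings = udf(format_two_decimals, "array<string>")
    as_doubles = udf(round_two_decimals, "array<double>")
    return (df
            .withColumn("a_str", as_strings(df["a"]))
            .withColumn("a_rounded", as_doubles(df["a"])))
```

With the DataFrame from the question, add_columns(df).show() should display both new columns. a_str holds text meant for display, while a_rounded remains numeric and can still be used in arithmetic.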
