Pyspark sort dataframe by expression

Question

I am currently reading Spark: The Definitive Guide, which has an example that sorts a DataFrame with orderBy and an expr, but it does not work:

from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import Row

schema = StructType([
  StructField("origin", StringType(), True),
  StructField("destination", StringType(), True),
  StructField("count", LongType(), True)
])

rows = [
  Row("US", "Germany", 5),
  Row("US", "France", 1),
  Row("US", "UK", 10)
]

parallelizedRows = spark.sparkContext.parallelize(rows)
df = spark.createDataFrame(parallelizedRows, schema)

Now, to sort the DataFrame in descending order using expr:

df.orderBy(expr("count desc")).show(3)

The output is still in ascending order, but it works with the Column class:

df.orderBy(col("count").desc()).show(3)

Any idea why expr isn't working?

Dharman · Accepted Answer · 2020-07-27 14:11:08Z

If you're working in a sandbox environment, such as a notebook, try the following:

import pyspark.sql.functions as f

f.expr("count desc")

This will give you

Column<b'count AS `desc`'>

This means that you're ordering by the column count aliased as desc, which is equivalent to f.col("count").alias("desc"). An alias does not change the sort direction, so the rows stay in the default ascending order. I am not sure why expr() doesn't support this, but I believe it's because you have several other options to do this anyway, such as:

df.orderBy(f.col("count").desc())
df.orderBy(f.col("count"), ascending=False)
df.orderBy(f.desc("count"))

Each of these sorts by a column expression along the lines of:

>>> f.desc("count")
Column<b'count DESC NULLS LAST'>

That being said, if you register your DataFrame as a table and run a sqlContext.sql(...) (or spark.sql(...)) query against it, you can write an ANSI SQL query that ends with ORDER BY count DESC, and it will sort in descending order.
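The SQL route can be sketched as follows. This is a minimal example, assuming a working local Spark installation; the helper name order_by_sql, the view name flights, and the demo function are made up for illustration and are not part of PySpark:

```python
def order_by_sql(spark, df, view_name="flights", column="count"):
    """Sort df by `column` in descending order using a Spark SQL query.

    `order_by_sql` and the default view name are hypothetical names
    chosen for this sketch.
    """
    # Register the DataFrame under a temporary view name so SQL can see it.
    df.createOrReplaceTempView(view_name)
    # Backticks quote the column name, so it is read as an identifier
    # and not confused with the count() aggregate.
    query = f"SELECT * FROM {view_name} ORDER BY `{column}` DESC"
    return spark.sql(query)


def demo():
    # Imported here so the helper above can be read without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("order-by-demo")
        .getOrCreate()
    )
    # Same three rows as in the question.
    df = spark.createDataFrame(
        [("US", "Germany", 5), ("US", "France", 1), ("US", "UK", 10)],
        ["origin", "destination", "count"],
    )
    order_by_sql(spark, df).show(3)
    spark.stop()
```

With PySpark available, calling demo() should list the UK row first, then Germany, then France.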

As a side note, please try not to use from pyspark.sql.functions import *, for two reasons:

1. It's never a good idea to import everything from a module if you can import the module under an alias.
2. In this specific case, you're importing names like pyspark.sql.functions.sum as sum, which shadows Python's built-in functions and leads to annoying, hard-to-debug errors in your code later down the line.
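The shadowing problem can be reproduced without Spark. The sketch below uses a made-up module, fake_functions, as a stand-in for pyspark.sql.functions (it is not a real library), and runs each import style in its own namespace so the two can be compared side by side:

```python
import sys
import types

# Hypothetical stand-in for pyspark.sql.functions: it exports its own `sum`,
# which expects a column name rather than an iterable of numbers.
fake_functions = types.ModuleType("fake_functions")


def _sum(col_name):
    return f"sum({col_name})"


fake_functions.sum = _sum
sys.modules["fake_functions"] = fake_functions

# Star import: the name `sum` now refers to the module's function,
# hiding the built-in for all code that runs in this namespace.
star_ns = {}
exec("from fake_functions import *", star_ns)
shadowed_result = star_ns["sum"]([1, 2, 3])

# Aliased import: the built-in stays reachable as `sum`,
# and the module's version is spelled out as `f.sum`.
alias_ns = {}
exec("import fake_functions as f", alias_ns)
builtin_result = eval("sum([1, 2, 3])", alias_ns)
module_result = eval("f.sum('count')", alias_ns)

print(shadowed_result)  # not a number: the module's sum was called
print(builtin_result)   # the built-in sum still adds the numbers
print(module_result)
```

With the alias, it is always clear at the call site which sum is meant, which is the main argument for import pyspark.sql.functions as f.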
