Best way to get the max value in a Spark dataframe column

Question

I'm trying to figure out the best way to get the largest value in a Spark dataframe column.

Consider the following example:

df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
df.show()

Which creates:

+---+---+
|  A|  B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+

My goal is to find the largest value in column A (by inspection, this is 3.0). Using PySpark, here are four approaches I can think of:

# Method 1: Use describe()
float(df.describe("A").filter("summary = 'max'").select("A").first().asDict()['A'])

# Method 2: Use SQL
df.createOrReplaceTempView("df_table")  # registerTempTable() is deprecated since Spark 2.0
spark.sql("SELECT MAX(A) as maxval FROM df_table").first().asDict()['maxval']

# Method 3: Use groupby()
df.groupby().max('A').first().asDict()['max(A)']

# Method 4: Convert to RDD
df.select("A").rdd.max()[0]

Each of the above gives the right answer, but in the absence of a Spark profiling tool I can't tell which is best.

Any ideas from either intuition or empiricism on which of the above methods is most efficient in terms of Spark runtime or resource usage, or whether there is a more direct method than the ones above?

Methods 2 and 3 are equivalent and use identical physical and optimized logical plans. Method 4 applies reduce with max on the RDD, which can be slower than operating directly on a DataFrame. Method 1 is more or less equivalent to 2 and 3.
– zero323, Commented Oct 19, 2015 at 22:33
@zero323 What about df.select(max("A")).collect()[0].asDict()['max(A)']? It looks equivalent to Method 2 while being more compact, and also more intuitive than Method 3.
– desertnaut, Commented Nov 2, 2015 at 10:02
Method 4 is the slowest, because it converts the whole column from a DataFrame to an RDD and then extracts the max value.
– Danylo Zherebetskyy, Commented Feb 13, 2018 at 20:00

Burt · Accepted Answer · 2016-07-12 14:17:54Z

131

>df1.show()
+-----+--------------------+--------+----------+-----------+
|floor|           timestamp|     uid|         x|          y|
+-----+--------------------+--------+----------+-----------+
|    1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
|    1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
|    1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
|    1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|

>row1 = df1.agg({"x": "max"}).collect()[0]
>print(row1)
Row(max(x)=110.33613)
>print(row1["max(x)"])
110.33613

This is almost the same as Method 3, but the asDict() call in Method 3 appears to be unnecessary: a Row can be indexed by column name directly.

edited Jul 12, 2016 at 14:17

answered Jul 12, 2016 at 13:08

Burt

5 Comments

jibiel Over a year ago

Can someone explain why collect()[0] is needed?

Jason Wolosonovich Over a year ago

@jibiel collect() returns a list (in this case with a single item), so you need to access the first (only) item in the list

Aliaxander Over a year ago

@Burt head() can be used instead of collect()[0].

Burt Over a year ago

@Aliaxander It's been a while. I don't have the code or Spark installed anymore.

Chris Koester Over a year ago

While .collect()[0] works, it's probably safer to use .first()[0]. By definition, collect() will "Return all the elements of the dataset as an array at the driver program.", which is a single machine. If you get the syntax wrong you could end up using an excessive amount of memory.
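
The points raised in these comments can be folded into a small helper. This is a minimal sketch, not code from the thread: the name column_max is hypothetical, and df is assumed to be a PySpark DataFrame. It relies only on DataFrame.agg() with a dict and on first(), so it needs no imports.

```python
def column_max(df, column):
    """Return the maximum of `column` as a plain Python value.

    `column_max` is a hypothetical helper name; `df` is assumed to be a
    PySpark DataFrame. A global aggregation yields a single Row, so
    first() fetches just that Row instead of collect()-ing a list.
    """
    row = df.agg({column: "max"}).first()
    # Index by position: the Row has one field, named "max(<column>)".
    return row[0]
```

With the DataFrame from the question, column_max(df, "A") would return 3.0.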

Hadij · Accepted Answer · 2022-11-17 16:21:40Z

101

The max value of a particular column of a dataframe can be obtained with:

your_max_value = df.agg({"your-column": "max"}).collect()[0][0]

edited Nov 17, 2022 at 16:21

Hadij

answered Sep 11, 2017 at 14:09

Rudra Prasad Samal

1 Comment

omnisius Over a year ago

I prefer your solution to the accepted solution. Indexing with [0] twice returns just the value.

ZygD · Accepted Answer · 2021-09-14 14:06:29Z

Remark: Spark is intended to work on big data with distributed computing. The example DataFrame is very small, so the ranking on real-life data may differ from the ranking on this small example.

Slowest: Method_1, because .describe("A") calculates min, max, mean, stddev, and count (5 calculations over the whole column).

Medium: Method_4, because .rdd (the DataFrame-to-RDD conversion) slows down the process.

Fastest: Method_3 ~ Method_2 ~ Method_5 (agg(), defined in the code below), because the logic is very similar, so Spark's Catalyst optimizer follows very similar logic with a minimal number of operations: get the max of a particular column, then collect a single-value DataFrame. .asDict() adds a little extra time to Methods 2 and 3 compared with Method 5.

import time

import numpy as np   # only needed for the bigger DataFrame below
import pandas as pd  # only needed for the bigger DataFrame below

# Assumes an existing SparkSession named `spark`.
time_dict = {}

dfff = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
#--  For bigger/realistic dataframe just uncomment the following 3 lines
#lst = list(np.random.normal(0.0, 100.0, 100000))
#pdf = pd.DataFrame({'A': lst, 'B': lst, 'C': lst, 'D': lst})
#dfff = spark.createDataFrame(pdf)

tic1 = int(round(time.time() * 1000))
# Method 1: Use describe()
max_val = float(dfff.describe("A").filter("summary = 'max'").select("A").collect()[0].asDict()['A'])
tac1 = int(round(time.time() * 1000))
time_dict['m1'] = tac1 - tic1
print(max_val)

tic2 = int(round(time.time() * 1000))
# Method 2: Use SQL
dfff.createOrReplaceTempView("df_table")
max_val = spark.sql("SELECT MAX(A) as maxval FROM df_table").collect()[0].asDict()['maxval']
tac2 = int(round(time.time() * 1000))
time_dict['m2'] = tac2 - tic2
print(max_val)

tic3 = int(round(time.time() * 1000))
# Method 3: Use groupby()
max_val = dfff.groupby().max('A').collect()[0].asDict()['max(A)']
tac3 = int(round(time.time() * 1000))
time_dict['m3'] = tac3 - tic3
print(max_val)

tic4 = int(round(time.time() * 1000))
# Method 4: Convert to RDD
max_val = dfff.select("A").rdd.max()[0]
tac4 = int(round(time.time() * 1000))
time_dict['m4'] = tac4 - tic4
print(max_val)

tic5 = int(round(time.time() * 1000))
# Method 5: Use agg()
max_val = dfff.agg({"A": "max"}).collect()[0][0]
tac5 = int(round(time.time() * 1000))
time_dict['m5'] = tac5 - tic5
print(max_val)

print(time_dict)

Result on an edge-node of a cluster in milliseconds (ms):

small DF (ms): {'m1': 7096, 'm2': 205, 'm3': 165, 'm4': 211, 'm5': 180}

bigger DF (ms): {'m1': 10260, 'm2': 452, 'm3': 465, 'm4': 916, 'm5': 373}
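
The tic/tac bookkeeping in the benchmark can be wrapped in a small helper so that each method is timed the same way. This is a minimal sketch under stated assumptions: time_call and its parameters are hypothetical names, it uses time.perf_counter() rather than time.time(), and a plain Python workload stands in for the Spark calls so that it runs without a cluster.

```python
import time

def time_call(label, func, results):
    """Run func(), store the elapsed wall-clock milliseconds in
    results[label], and return func()'s value.
    Hypothetical helper, not part of the original benchmark."""
    start = time.perf_counter()
    value = func()
    results[label] = int(round((time.perf_counter() - start) * 1000))
    return value

# Stand-in workload so the sketch runs without Spark.
time_dict = {}
largest = time_call("builtin", lambda: max([1.0, 2.0, 3.0]), time_dict)
print(largest, time_dict)
```

In the benchmark, a call would look like time_call('m5', lambda: dfff.agg({"A": "max"}).collect()[0][0], time_dict). The first Spark action in a session may include start-up cost, so running each method more than once is likely to give a fairer comparison.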

luminousmen · Accepted Answer · 2018-09-03 16:30:39Z

28

Another way of doing it:

from pyspark.sql import functions as f

df.select(f.max(f.col("A")).alias("MAX")).limit(1).collect()[0].MAX

On my data, I got these benchmarks:

df.select(f.max(f.col("A")).alias("MAX")).limit(1).collect()[0].MAX
CPU times: user 2.31 ms, sys: 3.31 ms, total: 5.62 ms
Wall time: 3.7 s

df.select("A").rdd.max()[0]
CPU times: user 23.2 ms, sys: 13.9 ms, total: 37.1 ms
Wall time: 10.3 s

df.agg({"A": "max"}).collect()[0][0]
CPU times: user 0 ns, sys: 4.77 ms, total: 4.77 ms
Wall time: 3.75 s

All of them give the same answer.

answered Sep 3, 2018 at 16:30

luminousmen

1 Comment

Chris H. Over a year ago

"df.limit(1).collect()[0]" can be replaced by "df.first()"

Nandeesh · Accepted Answer · 2019-05-16 13:51:13Z

The example below shows how to get the max value in a Spark dataframe column.

from pyspark.sql.functions import max

df = sql_context.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
df.show()
+---+---+
|  A|  B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+

result = df.select([max("A")])  # keep the DataFrame; show() returns None
result.show()
+------+
|max(A)|
+------+
|   3.0|
+------+

print(result.collect()[0]['max(A)'])
3.0

Similarly min, mean, etc. can be calculated as shown below:

from pyspark.sql.functions import mean, min, max

result = df.select([mean("A"), min("A"), max("A")])
result.show()
+------+------+------+
|avg(A)|min(A)|max(A)|
+------+------+------+
|   2.0|   1.0|   3.0|
+------+------+------+

2 Comments

Agree. I'm new to PySpark (though not to Python) and this is more intuitive.

Extending this answer: if the column contains NaNs, the following will work: df.select('A').dropna().select([max('A')])

tardis · Accepted Answer · 2020-02-06 15:20:55Z

9

First add the import line:

from pyspark.sql.functions import min, max

To find the min value of age in the dataframe:

df.agg(min("age")).show()

+--------+
|min(age)|
+--------+
|      29|
+--------+

To find the max value of age in the dataframe:

df.agg(max("age")).show()

+--------+
|max(age)|
+--------+
|      77|
+--------+

edited Feb 6, 2020 at 15:20

tardis

answered Sep 30, 2019 at 5:48

satprem rath

proutray · Accepted Answer · 2020-06-11 12:49:07Z

8

I used another solution (by @satprem rath) already present in this thread.

To find the min value of age in the dataframe:

df.agg(min("age")).show()

+--------+
|min(age)|
+--------+
|      29|
+--------+

Edit, to add more context:

While the above method printed the result, I faced issues when assigning the result to a variable for later reuse, because show() only prints and returns None.

Hence, to assign just the value to a variable:

from pyspark.sql.functions import max, min

maxValueA = df.agg(max("A")).collect()[0][0]
maxValueB = df.agg(max("B")).collect()[0][0]

edited Jun 11, 2020 at 12:49

answered May 5, 2020 at 20:51

proutray

1 Comment

MegaIng Over a year ago

Please add a bit of context and explanation around your solution.

ZygD · Accepted Answer · 2021-09-14 13:59:39Z

5

To get just the value, use any of these:

df1.agg({"x": "max"}).collect()[0][0]
df1.agg({"x": "max"}).head()[0]
df1.agg({"x": "max"}).first()[0]

Alternatively, using the functions API (shown here for min):

from pyspark.sql.functions import min, max
df1.agg(min("id")).collect()[0][0]
df1.agg(min("id")).head()[0]
df1.agg(min("id")).first()[0]

edited Sep 14, 2021 at 13:59

ZygD

answered Apr 18, 2020 at 6:50

Blue Clouds

Boern · Accepted Answer · 2016-11-22 10:29:34Z

4

In case someone wonders how to do it using Scala (using Spark 2.0+), here you go:

scala> df.createOrReplaceTempView("TEMP_DF")
scala> val myMax = spark.sql("SELECT MAX(x) as maxval FROM TEMP_DF").
    collect()(0).getInt(0)
scala> print(myMax)
117

answered Nov 22, 2016 at 10:29

Boern

Jean-François Corbett · Accepted Answer · 2019-01-14 08:35:51Z

3

I believe the best solution will be using head()

Considering your example:

+---+---+
|  A|  B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+

Using agg with PySpark's max function, we can get the value as follows:

from pyspark.sql.functions import max
df.agg(max(df.A)).head()[0]

This will return: 3.0

Make sure you have the correct import:

from pyspark.sql.functions import max

The max function used here is the PySpark SQL function, not Python's built-in max.

edited Jan 14, 2019 at 8:35

Jean-François Corbett

answered Jul 6, 2018 at 19:17

Vyom Shrivastava

1 Comment

Vyom Shrivastava Over a year ago

Make sure you have the correct import: from pyspark.sql.functions import max. The max used here is the PySpark SQL function, not Python's built-in max. It is better to import it under an alias: from pyspark.sql.functions import max as mx
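
The name clash described in this comment can be seen without Spark. In this sketch a local function named max stands in for pyspark.sql.functions.max (the stand-in and its return value are assumptions for illustration only). Python's built-in stays reachable through the builtins module.

```python
import builtins

def max(col):
    """Hypothetical stand-in for pyspark.sql.functions.max: after
    `from pyspark.sql.functions import max`, the name `max` refers to the
    Spark function in the same way this definition takes over the name."""
    return f"max({col})"

shadowed = max("A")                       # calls the stand-in, not the built-in
builtin_result = builtins.max([1, 2, 3])  # the built-in is still available
print(shadowed, builtin_result)
```

Importing the module under an alias, for example from pyspark.sql import functions as F and then calling F.max("A"), avoids the clash altogether.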

Grant Shannon · Accepted Answer · 2019-02-04 20:01:50Z

1

In PySpark you can also do this, though note that it collects the whole column to the driver before taking the max:

max(df.select('ColumnName').rdd.flatMap(lambda x: x).collect())

answered Feb 4, 2019 at 20:01

Grant Shannon

user 923227 · Accepted Answer · 2018-09-12 18:58:44Z

0

Here is a lazy way of doing this, by just computing table statistics:

df.write.mode("overwrite").saveAsTable("sampleStats")
Query = "ANALYZE TABLE sampleStats COMPUTE STATISTICS FOR COLUMNS " + ','.join(df.columns)
spark.sql(Query)

df.describe('ColName')

or

spark.sql("Select * from sampleStats").describe('ColName')

or you can open a hive shell and

describe formatted table sampleStats;

You will see the statistics in the properties - min, max, distinct, nulls, etc.

edited Sep 12, 2018 at 18:58

answered Sep 12, 2018 at 4:11

user 923227

hello-world · Accepted Answer · 2019-01-18 03:31:30Z

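import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
// Assumes an existing SparkSession named `spark` (as in spark-shell);
// its implicits provide toDF, .as[...] and the $"col" syntax.
import spark.implicits._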

val testDataFrame = Seq(
  (1.0, 4.0), (2.0, 5.0), (3.0, 6.0)
).toDF("A", "B")

val (maxA, maxB) = testDataFrame.select(max("A"), max("B"))
  .as[(Double, Double)]
  .first()
println(maxA, maxB)

The result is (3.0,6.0), the same values as testDataFrame.agg(max($"A"), max($"B")).collect()(0). However, that expression returns a Row, printed as [3.0,6.0], rather than a typed tuple.
