
Let's suppose that we have the following dataframe schema:

root
 |-- AUTHOR_ID: integer (nullable = false)
 |-- NAME: string (nullable = true)
 |-- Books: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- BOOK_ID: integer (nullable = false)
 |    |    |-- Chapters: array (nullable = true) 
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- NAME: string (nullable = true)
 |    |    |    |    |-- NUMBER_PAGES: integer (nullable = true)
 • How to find the authors that have books with NUMBER_PAGES < 100?

Thanks


1 Answer


According to your data structure, the NUMBER_PAGES for a given BOOK_ID is equal to the sum of NUMBER_PAGES for each of its chapters.
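
For illustration, a small dataframe matching that schema could be built like this (the author names and page counts below are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows: (AUTHOR_ID, NAME, Books = [(BOOK_ID, Chapters = [(NAME, NUMBER_PAGES)])])
data = [
    (1, "Author A", [(10, [("Intro", 20), ("Body", 30)])]),   # book totalling 50 pages
    (2, "Author B", [(20, [("Intro", 80), ("Body", 120)])]),  # book totalling 200 pages
]

schema = (
    "AUTHOR_ID INT, NAME STRING, "
    "Books ARRAY<STRUCT<BOOK_ID: INT, "
    "Chapters: ARRAY<STRUCT<NAME: STRING, NUMBER_PAGES: INT>>>>"
)

df = spark.createDataFrame(data, schema)

With this sample data, only AUTHOR_ID 1 should be kept by the filters below.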

You can use the aggregate function to compute the number of pages of each book, then use filter with the exists function:

from pyspark.sql import functions as F

# Keep authors that have at least one book whose chapters sum to fewer than 100 pages
# (F.exists and F.aggregate are available from Spark 3.1)
df1 = df.filter(
    F.exists(
        "Books",
        lambda book: F.aggregate(
            book["Chapters"], F.lit(0), lambda acc, c: acc + c["NUMBER_PAGES"]
        ) < F.lit(100)
    )
)
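
If you want to sanity-check the intermediate totals first, a quick sketch like the following (also assuming Spark >= 3.1; BOOK_PAGE_TOTALS is just an illustrative column name) lists the summed pages per book for every author:

# Illustrative only: show each author's per-book page totals before filtering
df.withColumn(
    "BOOK_PAGE_TOTALS",
    F.transform(
        "Books",
        lambda b: F.aggregate(b["Chapters"], F.lit(0), lambda acc, c: acc + c["NUMBER_PAGES"])
    )
).select("AUTHOR_ID", "BOOK_PAGE_TOTALS").show(truncate=False)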

For Spark < 3.1, you need to use expr for the aggregate and exists functions:

df1 = df.filter(
    F.expr("exists(Books, b -> aggregate(b.Chapters, 0, (acc, c) -> acc + c.NUMBER_PAGES) < 100)")
)
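
Note that if you instead want the authors that have a book containing any single chapter shorter than 100 pages (another possible reading of your question), you could nest two exists calls, for example:

# Alternative reading: at least one chapter in at least one book has fewer than 100 pages
df2 = df.filter(
    F.exists(
        "Books",
        lambda b: F.exists(b["Chapters"], lambda c: c["NUMBER_PAGES"] < F.lit(100))
    )
)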