I am new to Python and Spark, and I am currently working through this
tutorial
on Spark's explode operation for array/map fields of a DataFrame.
Based on the very first section (1. PySpark explode array or map column to rows), it's very intuitive. The minimum working example DataFrame is created in the Annex below. Its schema and contents are:
>>> df.printSchema()
root
|-- name: string (nullable = true)
|-- knownLanguages: array (nullable = true)
| |-- element: string (containsNull = true)
|-- properties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
>>> df.show(truncate=False)
+----------+-------------------+-----------------------------+
|name |knownLanguages |properties |
+----------+-------------------+-----------------------------+
|James |[Java, Scala] |{eye -> brown, hair -> black}|
|Michael |[Spark, Java, null]|{eye -> null, hair -> brown} |
|Robert |[CSharp, ] |{eye -> , hair -> red} |
|Washington|null |null |
|Jefferson |[1, 2] |{} |
+----------+-------------------+-----------------------------+
The explode function is illustrated as follows:
>>> df \
... .select(df.name,explode(df.knownLanguages)) \
... .show()
+---------+------+
|name |col |
+---------+------+
|James |Java |
|James |Scala |
|Michael |Spark |
|Michael |Java |
|Michael |null |
|Robert |CSharp|
|Robert | |
|Jefferson|1 |
|Jefferson|2 |
+---------+------+
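For completeness, the map column explodes analogously, into key and value columns (same pattern as the tutorial's section 1; output elided):

df.select(df.name, explode(df.properties)).show()
# Produces one row per map entry, with columns name, key, value;
# Washington (null map) and Jefferson (empty map) contribute no rows.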
The explode function is shown in the context of a SELECT query,
however, which I find very unintuitive. A SELECT query prunes away rows
and never increases the height of a data frame. Only joins can
potentially increase the height, but even there, the filtering of rows is
applied to a Cartesian join [1], so the result is still a potential reduction in
height rather than an increase. Correct me if I'm wrong, but the
above SELECT is not being applied to a join, since it is invoked as a
method of DataFrame df.
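Interestingly, the closest I could get to a join-like reading is Spark SQL's own syntax for the same query, which uses a LATERAL VIEW clause (my own experiment, using the df from the Annex; the view name people and the alias lang are mine):

>>> df.createOrReplaceTempView('people')
>>> spark.sql("""
...     SELECT name, lang
...     FROM people
...     LATERAL VIEW explode(knownLanguages) t AS lang
... """).show()

This produces the same nine rows as the tutorial's select above, which at least hints that explode is closer in spirit to a (lateral) join than to a plain column selection.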
I tried to better understand how explode fits into SELECT through the latter's
doc string: "Projects a set of expressions and returns a new
:class:DataFrame". Projection refers to choosing column
expressions. I unsuccessfully
tried to gain insight into how the above explode code fits the SELECTion
by examining its content:
explode(df.knownLanguages) # Shows no columnar data
Out[114]: Column<'explode(knownLanguages)'>
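For comparison (my own check, not from the tutorial), a plain column reference carries no data either; a Column only produces rows once it is used inside a DataFrame operation:

from pyspark.sql.functions import explode

df.name                       # a plain column reference
# Out: Column<'name'> -- also just an expression, with no data attached
df.select(explode(df.knownLanguages).alias('language')).show(3)
# Only here, inside select(), does the expression yield rows:
# Java, Scala, Spark (plus "only showing top 3 rows")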
Later, I found that it is not possible to examine the columnar data
content of a Column object, as described
here.
The prototype for explode returns a Column object, while the doc
string says that it "Returns a new row for each element in the given
array or map". It's difficult to picture such a column, as there is no
single "given array or map" -- there are as many heterogeneous arrays/maps as
there are records in DataFrame df.
Even if we accept that the Column object doesn't itself contain such
columnar data, it is still
necessary to picture how such a column would be conceptually
constructed in order to see how the SELECT makes sense.
I can't come up with any column of data that would make sense in
the SELECT query, because no matter how the exploded column is constructed, it will be
of a different height than DataFrame df.
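The height mismatch doesn't seem specific to select, either. As a small experiment of my own, the same expression also works in withColumn, where Spark replicates all the other columns so that the heights line up again:

df.withColumn('language', explode(df.knownLanguages)) \
  .select('name', 'language') \
  .show()
# Yields the same nine rows as the tutorial's select(), with the
# exploded column named 'language' instead of the default 'col'.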
Would it be correct to conclude that explode() can yield no column
expression that would fit SELECT's
projection/selection operation
as applied to DataFrame df, and that it is simply a signal to the
select() method to create a new DataFrame by replicating each
record $i$ by $n_i$ times, where $n_i$ is the number of items in the
record's array/map?
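To make that replicate-each-record reading concrete, here is roughly the computation I have in mind, spelled out at the RDD level with flatMap (my own sketch; like explode, it skips the null and empty arrays):

df.rdd \
  .flatMap(lambda r: [(r.name, lang) for lang in (r.knownLanguages or [])]) \
  .toDF(['name', 'col']) \
  .show()
# Reproduces the nine (name, col) rows shown above.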
I'm just starting to find my way around Spark. I anticipate, however, that if explode() breaks the projection/selection model of SELECT, it may be difficult to craft queries more complex than those in the tutorial based on knowledge of the designed-for behaviour.
Notes
[1] SELECT filters a Cartesian join in concept, though of course, not in execution. This is reflected by the fact that early SQL used WHERE in place of ON. All the WHERE clauses are (conceptually) applied to a Cartesian join.
Annex: Create minimum working example DataFrame table
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('pyspark-by-examples').getOrCreate()

# Rows deliberately mix normal values with None elements, empty strings,
# a null array/map (Washington), and an empty map (Jefferson)
arrayData = [
    ('James', ['Java', 'Scala'], {'hair': 'black', 'eye': 'brown'}),
    ('Michael', ['Spark', 'Java', None], {'hair': 'brown', 'eye': None}),
    ('Robert', ['CSharp', ''], {'hair': 'red', 'eye': ''}),
    ('Washington', None, None),
    ('Jefferson', ['1', '2'], {})
]
df = spark.createDataFrame(data=arrayData, schema=['name', 'knownLanguages', 'properties'])
Comments
"explode is not a select operation, but a column operation that returns a new column, as in the doc: pyspark.sql.functions.explode(col: ColumnOrName) → pyspark.sql.column.Column. You can use it in withColumn. All of this is a projection, because a new column is generated."
My reply: the new column would then belong to df, because it is being invoked as its method, but that can't be right, because the explode() column doesn't even match the height of df. A SELECT operation in relational algebra specifies columns from a single relational data table, and it only makes sense if the columns are all of the same height. Even when selecting from a JOIN, the SELECTion is being done from the single table that results from the JOIN. The DataFrame.select() method is described as a projection, so this concept of projection should still apply.
"The explain() method of DataFrame can help you somewhat. Try it."
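Following up on that last suggestion, explain() does show explode being compiled into a dedicated Generate node rather than an ordinary projection (plan abbreviated; the exact text and the #-numbered IDs vary by Spark version):

df.select(df.name, explode(df.knownLanguages)).explain()
# == Physical Plan ==
# Generate explode(knownLanguages#...), [name#...], false, [col#...]
# +- ... scan of the existing rows ...

The Generate node is what replicates each input row once per array/map element, which matches the row-replicating reading above.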