
I have two tables with the below example schemas. The keys for table A are nested in a list in table B. I would like to join table A and table B based on the table A keys to generate table C. The values from table A should be a nested structure in table C based on the list of keyAs in table B. How can I do this using pyspark? Thanks!

Table A

root 
|-- item1: string (nullable = true) 
|-- item2: long (nullable = true) 
|-- keyA: string (nullable = true) 

Table B

root 
|-- item1: string (nullable = true) 
|-- item2: long (nullable = true) 
|-- keyB: string (nullable = true) 
|-- keyAs: array (nullable = true) 
| |-- element: string (containsNull = true)

Table C

root 
|-- item1: string (nullable = true) 
|-- item2: long (nullable = true) 
|-- keyB: string (nullable = true) 
|-- keyAs: array (nullable = true) 
| |-- element: string (containsNull = true) 
|-- valueAs: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- item1: string (nullable = true) 
| | |-- item2: long (nullable = true) 
| | |-- keyA: string (nullable = true)

1 Answer


To join A and B, you need to explode B.keyAs first, like this:

from pyspark.sql.functions import explode

tableB.withColumn('keyA', explode('keyAs')).join(tableA, 'keyA')

To rebuild the nested structure, group back by table B's columns and collect the matching table A rows into an array of structs; see this answer for details.
