
I have two tables with the below example schemas. The keys for table A are nested in a list in table B. I would like to join table A and table B based on the table A keys to generate table C. The values from table A should be a nested structure in table C based on the list of keyAs in table B. How can I do this using pyspark? Thanks!

Table A

root 
|-- item1: string (nullable = true) 
|-- item2: long (nullable = true) 
|-- keyA: string (nullable = true) 

Table B

root 
|-- item1: string (nullable = true) 
|-- item2: long (nullable = true) 
|-- keyB: string (nullable = true) 
|-- keyAs: array (nullable = true) 
| |-- element: string (containsNull = true)

Table C

root 
|-- item1: string (nullable = true) 
|-- item2: long (nullable = true) 
|-- keyB: string (nullable = true) 
|-- keyAs: array (nullable = true) 
| |-- element: string (containsNull = true) 
|-- valueAs: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- item1: string (nullable = true) 
| | |-- item2: long (nullable = true) 
| | |-- keyA: string (nullable = true)

1 Answer


To join A and B, you need to explode B.keyAs first, like this:

from pyspark.sql.functions import explode

tableB.withColumn('keyA', explode('keyAs')).join(tableA, 'keyA')

To rebuild the nested structure, group back by table B's columns and collect the matching table A rows into an array of structs; see this answer for details.
