
I have a raw PySpark DataFrame with a nested column. I need to loop over all the columns to unwrap them. I don't know the column names and they can change, so I need a generic algorithm. The problem is that I can't use a classic (for) loop, because I need the code to run in parallel.

Example of Data:

Timestamp | Layers
1456982   | [[1, 2], [3, 4]]
1486542   | [[3, 5], [5, 5]]

Layers is a column which contains other columns (with their own column names). My goal is to have something like this:

Timestamp | label | number1 | text | value
1456982   | 1     | 2       | 3    | 4
1486542   | 3     | 5       | 5    | 5
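
Here is a minimal sketch of how such a DataFrame might look (assuming Layers is a struct whose fields are themselves structs; the outer names layer1/layer2 are just placeholders):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# Assumed shape of the raw data: Layers is a struct of structs, with the inner
# field names (label, number1, text, value) taken from the desired output.
df = spark.createDataFrame([
    Row(Timestamp=1456982,
        Layers=Row(layer1=Row(label=1, number1=2), layer2=Row(text=3, value=4))),
    Row(Timestamp=1486542,
        Layers=Row(layer1=Row(label=3, number1=5), layer2=Row(text=5, value=5))),
])
df.printSchema()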

How can I loop over the columns with PySpark functions?

Thanks for any advice.

1 Answer

You can use the reduce function for this. I don't know exactly what you want to do, but let's suppose you want to add 1 to all columns:

from functools import reduce
from pyspark.sql import functions as F

def add_1(df, col_name):
    return df.withColumn(col_name, F.col(col_name)+1) # using same column name will update column

reduce(add_1, df.columns, df)  # applies add_1 once per column name, folding the DataFrame through
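
As a quick illustration of the pattern (the toy DataFrame and its column names a and b are only for this example, assuming a SparkSession called spark is available):

# Toy DataFrame with two integer columns, only to show the reduce pattern.
toy = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])

# reduce threads the DataFrame through add_1 once per column name; that loop
# runs on the driver, while the actual computation is still done by Spark in parallel.
out = reduce(add_1, toy.columns, toy)
out.show()  # rows become (2, 3) and (4, 5)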

Edit: I am not sure how to solve it without converting to an RDD. Maybe this can be helpful:

from pyspark.sql import Row

# flatten one level of nesting: [[1, 2], [3, 4]] -> [1, 2, 3, 4]
flatF = lambda col: [item for l in col for item in l]

df \
    .rdd \
    .map(lambda row: Row(Timestamp=row['Timestamp'],
                         **dict(zip(col_names, flatF(row['Layers']))))) \
    .toDF()
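
If Layers really is a struct of structs, the col_names used in the map above do not have to be hard-coded. A sketch of deriving them from the schema (this assumes that nesting, which the question does not state explicitly):

# Hedged sketch: read the inner field names from the schema so the code stays
# generic when the column names change.
layers_type = df.schema['Layers'].dataType          # StructType of the Layers column
col_names = [inner.name
             for outer in layers_type.fields        # each outer field is itself a struct
             for inner in outer.dataType.fields]    # the inner fields hold the real names
# e.g. col_names == ['label', 'number1', 'text', 'value'] for the example data

col_names can then be passed to the map above.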