
I have a raw PySpark DataFrame with a nested column. I need to loop over all the columns to unwrap them. I don't know the column names and they can change, so I need a generic algorithm. The problem is that I can't use a classic (for) loop, because I need the code to run in parallel.

Example of Data:

Timestamp | Layers
1456982   | [[1, 2], [3, 4]]
1486542   | [[3, 5], [5, 5]]

Layers is a column which contains other columns (with their own column names). My goal is to have something like this:

Timestamp | label | number1 | text | value
1456982   | 1     | 2       | 3    | 4
1486542   | 3     | 5       | 5    | 5
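
Here is a minimal sketch of how such a DataFrame might look (assuming Layers is a struct whose fields are themselves structs; the outer names layer1/layer2 are just placeholders):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# Assumed shape of the raw data: Layers is a struct of structs, with the inner
# field names (label, number1, text, value) taken from the desired output.
df = spark.createDataFrame([
    Row(Timestamp=1456982,
        Layers=Row(layer1=Row(label=1, number1=2), layer2=Row(text=3, value=4))),
    Row(Timestamp=1486542,
        Layers=Row(layer1=Row(label=3, number1=5), layer2=Row(text=5, value=5))),
])
df.printSchema()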

How can I loop over the columns with PySpark functions?

Thanks for any advice.

1 Answer

You can use the reduce function for this. I don't know exactly what you want to do, but let's suppose you want to add 1 to all columns:

from functools import reduce
from pyspark.sql import functions as F

def add_1(df, col_name):
    return df.withColumn(col_name, F.col(col_name)+1) # using same column name will update column

reduce(add_1, df.columns, df)  # applies add_1 once per column name, folding the DataFrame through
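
As a quick illustration of the pattern (the toy DataFrame and its column names a and b are only for this example, assuming a SparkSession called spark is available):

# Toy DataFrame with two integer columns, only to show the reduce pattern.
toy = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])

# reduce threads the DataFrame through add_1 once per column name; that loop
# runs on the driver, while the actual computation is still done by Spark in parallel.
out = reduce(add_1, toy.columns, toy)
out.show()  # rows become (2, 3) and (4, 5)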

Edit: I am not sure how to solve it without converting to an RDD. Maybe this can be helpful:

from pyspark.sql import Row

# flatten one level of nesting: [[1, 2], [3, 4]] -> [1, 2, 3, 4]
flatF = lambda col: [item for l in col for item in l]

df \
    .rdd \
    .map(lambda row: Row(Timestamp=row['Timestamp'],
                         **dict(zip(col_names, flatF(row['Layers']))))) \
    .toDF()
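
If Layers really is a struct of structs, the col_names used in the map above do not have to be hard-coded. A sketch of deriving them from the schema (this assumes that nesting, which the question does not state explicitly):

# Hedged sketch: read the inner field names from the schema so the code stays
# generic when the column names change.
layers_type = df.schema['Layers'].dataType          # StructType of the Layers column
col_names = [inner.name
             for outer in layers_type.fields        # each outer field is itself a struct
             for inner in outer.dataType.fields]    # the inner fields hold the real names
# e.g. col_names == ['label', 'number1', 'text', 'value'] for the example data

col_names can then be passed to the map above.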