I have the following Dataframe:

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext('local')

df_pd = pd.DataFrame([[11, 'abc', 1, 114],
                      [11, 'abc', 2, 104],
                      [11, 'def', 9, 113],
                      [12, 'abc', 1,  14],
                      [12, 'def', 3, 110],
                      [14, 'abc', 1, 194],
                      [14, 'abc', 2, 164],
                      [14, 'abc', 3, 104],],
                      columns=['id', 'str', 'num', 'val'])

sql_sc = SQLContext(sc)

df_spark = sql_sc.createDataFrame(df_pd)
df_spark.show()

Which prints:

+---+---+---+---+
| id|str|num|val|
+---+---+---+---+
| 11|abc|  1|114|
| 11|abc|  2|104|
| 11|def|  9|113|
| 12|abc|  1| 14|
| 12|def|  3|110|
| 14|abc|  1|194|
| 14|abc|  2|164|
| 14|abc|  3|104|
+---+---+---+---+
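
(Side note: SQLContext is the legacy entry point; on PySpark 2.0 and later the same setup is usually written with SparkSession. A minimal equivalent sketch:)

from pyspark.sql import SparkSession

# SparkSession bundles SparkContext and SQLContext into a single entry point
spark = SparkSession.builder.master('local').getOrCreate()
df_spark = spark.createDataFrame(df_pd)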

My goal is to transform it to this:

+---+-----+-----+-----+-----+-----+
| id|abc_1|abc_2|abc_3|def_3|def_9|
+---+-----+-----+-----+-----+-----+
| 11|  114|  104|  NaN|  NaN|  113|
| 12|   14|  NaN|  NaN|  110|  NaN|
| 14|  194|  164|  104|  NaN|  NaN|
+---+-----+-----+-----+-----+-----+

(One row per id; the column names are str + '_' + str(num); the table is filled with the corresponding vals, and all other entries are NaN.)

How would I achieve this? I started with

from pyspark.sql.functions import concat, col, lit

column = df_spark.select(concat(col("str"), lit("_"), col("num")))

by which I get the column names.
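
(As a sketch, assuming the goal is a plain Python list of those names: the same expression can be aliased, deduplicated, and collected. The alias 'name' is just illustrative.)

from pyspark.sql.functions import concat, col, lit

# Collect the distinct generated names (e.g. 'abc_1') into a Python list
names = [r['name'] for r in
         df_spark.select(concat(col('str'), lit('_'), col('num')).alias('name'))
                 .distinct()
                 .collect()]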

df_spark.select('id').distinct()

gives the distinct ids.

But I don't know how to build the new DataFrame from these pieces, or how to fill it with the values.

Edit: The difference to the possible duplicate is that I didn't know about the pivot functionality, whereas the other question asked where to find the function "pivot" in PySpark. I don't know whether that makes this a duplicate, but I didn't find the other question because I didn't know what to look for.

1 Answer

I am not sure which kind of aggregation you want to use for the val field. I used sum; here is the solution:

import pyspark.sql.functions as F

# Build the future column names, e.g. 'abc_1', from str and num
df_spark = df_spark.withColumn('col', F.concat(F.col("str"), F.lit("_"), F.col("num")))

# Pivot on those names, aggregating val with sum
df_spark.groupBy('id').pivot('col').agg({'val':'sum'}).orderBy('id').show()

+---+-----+-----+-----+-----+-----+
| id|abc_1|abc_2|abc_3|def_3|def_9|
+---+-----+-----+-----+-----+-----+
| 11|  114|  104| null| null|  113|
| 12|   14| null| null|  110| null|
| 14|  194|  164|  104| null| null|
+---+-----+-----+-----+-----+-----+
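
As a sketch of a possible refinement (assuming the column names are known up front, and relying on the uniqueness of (id, str, num) mentioned in the comment below): passing the values to pivot explicitly spares Spark the extra job it otherwise runs to discover them, and first works in place of sum when each cell holds at most one value.

import pyspark.sql.functions as F

# Explicit pivot values avoid the extra pass over the data;
# 'first' is valid because each (id, col) cell holds at most one val
names = ['abc_1', 'abc_2', 'abc_3', 'def_3', 'def_9']
df_spark.groupBy('id').pivot('col', names).agg(F.first('val')).orderBy('id').show()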

1 Comment

Thank you very much, I will try it (there shouldn't be two entries with the same id, str and num).
