
Here is a list of values I would like my dataframe to have as columns:

 cols=['USA','CAN','UK','DEN']

My current df:

| ID | USA | DEN | VEN | NOR |
|----|-----|-----|-----|-----|
| 98 | 1   | 0   | 1   | 1   |
| 99 | 0   | 1   | 0   | 0   |

I want to check whether my existing df has all the values in the list as columns; if not, create those columns and fill them with 0, like:

| ID | USA | DEN | VEN | NOR | CAN | UK |
|----|-----|-----|-----|-----|-----|----|
| 98 | 1   | 0   | 1   | 1   | 0   | 0  |
| 99 | 0   | 1   | 0   | 0   | 0   | 0  |

2 Answers


Use a for loop with an if check: for each value in the list, test whether the column already exists in df.columns and, if not, add it filled with 0.

from pyspark.sql.functions import lit

df = spark.createDataFrame([(98, 1, 0, 1, 1)], ['ID', 'USA', 'DEN', 'VEN', 'NOR'])
cols = ['USA', 'CAN', 'UK', 'DEN']

# append each requested column that is missing, filled with a literal 0
for c in cols:
    if c not in df.columns:
        df = df.withColumn(c, lit(0))

df.show()

#+---+---+---+---+---+---+---+
#| ID|USA|DEN|VEN|NOR|CAN| UK|
#+---+---+---+---+---+---+---+
#| 98|  1|  0|  1|  1|  0|  0|
#+---+---+---+---+---+---+---+

2 Comments

For some reason, it doesn't create columns for all the missing elements in cols. It created only one extra column.
Could you post a test case which fails? I added 2 more items to cols and the solution above worked.
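For what it's worth, here is a minimal sketch of such a test case (assuming a local SparkSession bound to spark; the extra 'MEX' entry is purely illustrative and not from the question), where three requested columns are missing and the loop appends all three:

from pyspark.sql.functions import lit

df = spark.createDataFrame([(98, 1, 0, 1, 1), (99, 0, 1, 0, 0)],
                           ['ID', 'USA', 'DEN', 'VEN', 'NOR'])
cols = ['USA', 'CAN', 'UK', 'DEN', 'MEX']   # 'MEX' is illustrative only

for c in cols:
    if c not in df.columns:           # only add columns the DataFrame lacks
        df = df.withColumn(c, lit(0))

df.show()
# Expected: CAN, UK and MEX are appended after NOR, with every new cell set to 0.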

You can use a simple select expression:

from pyspark.sql.functions import lit

# keep the existing columns and add a literal 0, aliased to each requested column that is missing
select_cols = df.columns + [lit(0).alias(c) for c in cols if c not in df.columns]

df.select(*select_cols).show()
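As a variation, the same select can be wrapped in a small reusable helper (the name add_missing_columns is just an illustration, not an existing API). A single select builds one projection instead of chaining a withColumn call per missing column:

from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

def add_missing_columns(df: DataFrame, wanted: list) -> DataFrame:
    # keep the existing columns and append a literal-0 column for each missing name
    missing = [lit(0).alias(c) for c in wanted if c not in df.columns]
    return df.select(*df.columns, *missing)

add_missing_columns(df, cols).show()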

