Rearranging columns in PySpark

Question

I have a DataFrame with a lot of columns, and I want to change the order of the columns.
A number of columns must come first (in a certain order), followed by the rest of the columns sorted by column name (not listed manually, because there are many).

How can I achieve this using PySpark?

My guess is to sort them first and then move some into a specific order:

df.orderBy(cols, ascending=True)

Assume current column order:

col_a, col_k, col_c, col_h, col_e, col_f, col_g, col_d, col_j, col_i, col_b

Desired new order:

col_c, col_j, col_a, col_g :: col_b, col_d, col_e, col_f, col_h, col_i, col_k

The columns before :: are in a specific order; the columns after it are the remaining ones, sorted by column name.

blackbishop · Accepted Answer · 2021-02-06 07:32:47Z

You can list the specific columns that go first, sort the rest using Python's sorted, then select them in that order from your df:

first_cols = ["col_c", "col_j", "col_a", "col_g"]
other_cols = sorted([c for c in df.columns if c not in first_cols], key=str.lower)

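rearranged_cols = first_cols + other_cols

Then select the columns in that order:

df = df.select(*rearranged_cols)

Do not use df.toDF(*rearranged_cols) for this: toDF assigns new names to the existing columns by position, so it would relabel the data instead of reordering it.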
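The ordering logic can be checked without a Spark session, since it only works on a list of column names. The sketch below is one way to wrap it; the helper name ordered_columns and its check for missing columns are illustrative assumptions, not part of PySpark, and the column list is the one from the question.

```python
def ordered_columns(columns, first_cols):
    """Return first_cols in the given order, then the remaining columns sorted by name."""
    # Fail early if a pinned column does not exist (an added safety check, not in the answer).
    missing = [c for c in first_cols if c not in columns]
    if missing:
        raise ValueError(f"columns not found: {missing}")
    pinned = set(first_cols)
    # Case-insensitive sort of everything that is not pinned to the front.
    rest = sorted((c for c in columns if c not in pinned), key=str.lower)
    return list(first_cols) + rest


current = ["col_a", "col_k", "col_c", "col_h", "col_e", "col_f",
           "col_g", "col_d", "col_j", "col_i", "col_b"]
new_order = ordered_columns(current, ["col_c", "col_j", "col_a", "col_g"])
print(new_order)

# With a real DataFrame (assumed here to be named df), the list would be applied as:
# df = df.select(*ordered_columns(df.columns, ["col_c", "col_j", "col_a", "col_g"]))
```

For the column list in the question, this should produce the desired order: col_c, col_j, col_a, col_g, followed by col_b through col_k sorted by name.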
