I have a DataFrame with a lot of columns. Now I want to adjust the order of the columns.
A number of columns must come first (in a specific order), followed by the rest of the columns sorted by column name (not listed manually, because there are many).

How can I achieve this using PySpark?

I guess I should sort them first and then move some into a specific order:

df.orderBy(cols, ascending=True)

Assume current column order:

col_a, col_k, col_c, col_h, col_e, col_f, col_g, col_d, col_j, col_i, col_b

Desired new order:

col_c, col_j, col_a, col_g :: col_b, col_d, col_e, col_f, col_h, col_i, col_k

The columns before :: are in a specific order; the columns after it are the remaining ones, sorted by column name.

1 Answer

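You can list the specific columns first, sort the rest with Python's sorted, and then select them from your df in that order:

first_cols = ["col_c", "col_j", "col_a", "col_g"]
other_cols = sorted([c for c in df.columns if c not in first_cols], key=str.lower)

rearranged_cols = first_cols + other_cols

Then:

df = df.select(*rearranged_cols)

Note that key must be the function itself (str.lower), not a call to it (str.lower()), which raises a TypeError. Also, do not use df.toDF(*rearranged_cols) here: toDF renames the columns positionally and leaves the data where it is, so the values would end up under the wrong column names.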
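As a minimal sketch, the ordering logic can be wrapped in a small helper that works on plain lists of column names, so it can be checked without a Spark session. The name reorder_columns and the check for missing columns are assumptions added for illustration; they are not part of the original answer.

```python
def reorder_columns(columns, first_cols):
    """Return first_cols in the given order, then the remaining columns sorted by name."""
    # Fail early if a pinned column does not exist (an added safety check).
    missing = [c for c in first_cols if c not in columns]
    if missing:
        raise ValueError(f"columns not found: {missing}")
    pinned = set(first_cols)
    # Case-insensitive sort of everything that is not pinned to the front.
    rest = sorted((c for c in columns if c not in pinned), key=str.lower)
    return list(first_cols) + rest


# Column order taken from the question.
current = ["col_a", "col_k", "col_c", "col_h", "col_e", "col_f",
           "col_g", "col_d", "col_j", "col_i", "col_b"]
new_order = reorder_columns(current, ["col_c", "col_j", "col_a", "col_g"])
print(new_order)

# With a PySpark DataFrame named df (assumed to exist):
# df = df.select(*reorder_columns(df.columns, ["col_c", "col_j", "col_a", "col_g"]))
```

Because select only changes the projection, this reorders the columns without touching the rows, whereas orderBy sorts rows and does not affect column order.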