
I am 'translating' Python code to PySpark. I would like to use an existing column as the index for a DataFrame; in Python I did this with pandas. The small piece of code below shows what I did. Thanks for helping.

df.set_index('colx', drop=False, inplace=True)
# Sort the index
df.sort_index(inplace=True)

I expect the result to be a DataFrame with 'colx' as the index.


2 Answers


That is not how Spark works; a DataFrame has no concept of an index.

You can add an index column with zipWithIndex by converting the DataFrame to an RDD and back, but that gives you a new column, so it is not the same thing.
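A minimal sketch of that approach, assuming an existing DataFrame df (the sample data and the column name row_idx are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["colx", "coly"])

# zipWithIndex pairs every Row with a sequential index on the RDD side
rdd_with_idx = df.rdd.zipWithIndex()

# Flatten back into a DataFrame: the original columns plus the new index column
df_with_idx = rdd_with_idx.map(lambda pair: (*pair[0], pair[1])) \
                          .toDF(df.columns + ["row_idx"])
df_with_idx.show()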


Add an index to the PySpark DataFrame as a column and use it:

rdd_df = df.rdd.zipWithIndex()
df_index = rdd_df.toDF()
# the original Row lands in '_1' and the generated index in '_2';
# extract the columns back out of the struct
df_index = df_index.withColumn('colA', df_index['_1'].getItem("colA"))
df_index = df_index.withColumn('colB', df_index['_1'].getItem("colB"))
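
One possible continuation, assuming the column names above: after toDF() the generated index sits in '_2', so renaming it and ordering by it roughly mirrors the set_index + sort_index pattern from the question.

# '_2' holds the index produced by zipWithIndex; '_1' is the original Row struct
df_index = df_index.withColumnRenamed('_2', 'index').drop('_1')
df_index = df_index.orderBy('index')
df_index.show()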

