PySpark: Converting a sample to a pandas DataFrame

Question

I am trying to extract a sample from a Spark DataFrame (df_spark) with 100 million rows and convert it to a pandas DataFrame using the code below:

df = df_spark.sample(withReplacement = False, fraction = 0.05, seed = 11).collect().toPandas()

Unfortunately, I'm getting the following error:

AttributeError: 'list' object has no attribute 'toPandas'

I also tried converting it to an RDD and then to pandas, and got the same error.

Once I have the sample as a list, what is the correct way to convert it to a pandas DataFrame or a Spark DataFrame?

Pengshe · Accepted Answer · 2022-01-26 09:53:38Z

1

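There is no need to call collect() here. collect() returns a plain Python list of Row objects, which is why the error complains that 'list' has no attribute 'toPandas'. sample() already returns a DataFrame, so toPandas() can be called on it directly: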

df = df_spark.sample(withReplacement = False, fraction = 0.05, seed = 11).toPandas()
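For reuse, the same chain can be wrapped in a small function. This is only a sketch: sample_to_pandas is a hypothetical helper name, and df_spark is assumed to be a pyspark.sql.DataFrame. Note that toPandas() brings every sampled row to the driver, so the sample has to fit in driver memory.

```python
def sample_to_pandas(df_spark, fraction=0.05, seed=11):
    """Return a pandas DataFrame holding a random sample of df_spark.

    Hypothetical helper: df_spark is assumed to be a pyspark.sql.DataFrame.
    """
    # sample() is a transformation that returns another Spark DataFrame,
    # so toPandas() can be chained onto it without calling collect().
    sampled = df_spark.sample(withReplacement=False, fraction=fraction, seed=seed)
    # toPandas() collects the sampled rows to the driver as a pandas DataFrame.
    return sampled.toPandas()
```

It would be called as df = sample_to_pandas(df_spark), or with a smaller fraction if 5% of the data is too large for the driver.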

answered Jan 26, 2022 at 9:17


danimille · Accepted Answer · 2022-01-25 17:24:51Z

0

I solved this by first converting the sample to an RDD, then to a Spark DataFrame, and finally to pandas, as in the code below:

df = (df_spark.sample(withReplacement = False, fraction = 0.05, seed = 11)
              .rdd
              .toDF()
              .toPandas())
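If the rows have already been collected into a Python list, as in the question, one possible route back is SparkSession.createDataFrame, which accepts a list of Row objects. The sketch below assumes spark is an active SparkSession and rows is a non-empty list returned by collect(); rows_to_pandas is a hypothetical helper name.

```python
def rows_to_pandas(spark, rows):
    """Convert an already collected list of Row objects to pandas.

    Hypothetical helper: spark is assumed to be an active SparkSession and
    rows a non-empty list of pyspark.sql.Row, e.g. the result of collect().
    """
    # createDataFrame infers the schema from the Row objects; with an empty
    # list there is nothing to infer from, so a schema would have to be passed.
    df_from_rows = spark.createDataFrame(rows)
    return df_from_rows.toPandas()
```

Since the rows are already on the driver at that point, this round trip mainly helps when a Spark DataFrame is also needed afterwards; otherwise calling toPandas() on the sampled DataFrame avoids the extra step.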

answered Jan 25, 2022 at 17:24
