
I'm trying to convert a list into a dataframe in PySpark so that I can then join it onto a larger dataframe as a column. The data in the list are randomly generated names, like so:

from faker import Faker
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql.types import *

faker = Faker("en_GB")

list1 = [faker.first_name() for _ in range(0, 100)]
firstname = sc.parallelize([list1])

schema = StructType([
    StructField('FirstName', StringType(), True)
])

df = spark.createDataFrame(firstname, schema)

display(df)

But I'm getting this error:

PythonException: 'ValueError: Length of object (100) does not match with length of fields (1)'.

Any ideas on what's causing this and how to fix it appreciated!

Many thanks,

Carolina

2 Answers


You're getting a ValueError because you're passing parallelize a list with a single element (a list of 100 names) instead of a list of 100 elements, each of which is a list containing one name.

If, for instance, faker.first_name() returns 'John', then 'Henry', then 'Jade', and so on, your [list1] argument is [['John', 'Henry', 'Jade', ...]].

When you pass such a list to the createDataFrame method, it tries to create a dataframe with a single row of 100 columns. Since your schema defines only one column, it fails.
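To make the shape mismatch concrete, here is a minimal sketch (the names shown are purely illustrative):

# What the original code builds: one RDD record holding all the names,
# so Spark expects as many schema fields as there are names
wrong_shape = [['John', 'Henry', 'Jade']]      # i.e. [list1]

# What a one-column schema expects: one record per name,
# each record being a list (or tuple/Row) of length one
right_shape = [['John'], ['Henry'], ['Jade']]  # i.e. [[name] for name in list1]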

The solution is either to create the dataframe directly from list1, as in PApostol's answer, or to change how you build list1 so that you have a list of 100 single-name lists instead of a list containing one list of 100 names:

from faker import Faker
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql.types import *

faker = Faker("en_GB")

list1 = [[faker.first_name()] for _ in range(0, 100)]  # each element is a one-name list
firstname = sc.parallelize(list1)                      # one RDD record per name

schema = StructType([
    StructField('FirstName', StringType(), True)
])

df = spark.createDataFrame(firstname, schema)

display(df)
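Since the original goal was to join these names onto a larger dataframe as a new column, one common pattern is to give both dataframes a positional index and join on it. The snippet below is only a sketch, assuming the two dataframes have the same number of rows and no natural join key; big_df is a hypothetical stand-in for the larger dataframe:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hypothetical larger dataframe to attach the names to
big_df = spark.range(100).withColumnRenamed("id", "CustomerID")

# Give both dataframes a positional row index to join on
w = Window.orderBy(F.monotonically_increasing_id())
big_indexed = big_df.withColumn("row_idx", F.row_number().over(w))
names_indexed = df.withColumn("row_idx", F.row_number().over(w))

# Attach FirstName to the larger dataframe and drop the helper index
joined = big_indexed.join(names_indexed, on="row_idx", how="inner").drop("row_idx")
display(joined)

Note that this pairs rows by their current physical order, which is arbitrary unless you order by an actual column, so it only makes sense when any pairing of names to rows is acceptable.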

1 Comment

This worked perfectly and makes sense! Thank you! :)

This is probably because PySpark tries to create a dataframe with 100 columns (the length of the inner list inside firstname), but your schema provides only one column. Try without parallelize:

list1 = [faker.first_name() for _ in range(0, 100)]
df = spark.createDataFrame(list1, schema)

or if you do want to parallelize, try:

from pyspark.sql import Row

list1 = [faker.first_name() for _ in range(0, 100)]
firstname = sc.parallelize([list1])

firstname_row = firstname.map(lambda x: Row(x))
df = spark.createDataFrame(firstname_row, schema)

1 Comment

Hi thanks for your response! This works but I end up with all the values in one row. I would like each of the 100 values spread over 100 rows. Any ideas on how to achieve this? Thanks :)
