
I'm trying to convert a list into a dataframe in PySpark so that I can then join it onto a larger dataframe as a column. The data in the list are randomly generated names, like so:

from faker import Faker
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql.types import *

faker = Faker("en_GB")

list1 = [faker.first_name() for _ in range(0, 100)]
firstname = sc.parallelize([list1])

schema = StructType([
    StructField('FirstName', StringType(), True)
])

df = spark.createDataFrame(firstname, schema)

display(df)

But I'm getting this error:

PythonException: 'ValueError: Length of object (100) does not match with length of fields (1)'.

Any ideas on what's causing this and how to fix it appreciated!

Many thanks,

Carolina

2 Answers


You're getting a ValueError because you're passing parallelize a list with a single element (a list of 100 names) instead of a list of 100 elements, each of which is a list containing one name.

If, for instance, faker.first_name() returns 'John', then 'Henry', then 'Jade', and so on, your [list1] argument is [['John', 'Henry', 'Jade', ...]].

When you pass such a list to the createDataFrame method, it tries to create a dataframe with a single row of 100 columns. Since your schema defines only one column, it fails.
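To make the shape mismatch concrete, here is a minimal sketch (the names shown are purely illustrative):

# What the original code builds: one RDD record holding all the names,
# so Spark expects as many schema fields as there are names
wrong_shape = [['John', 'Henry', 'Jade']]      # i.e. [list1]

# What a one-column schema expects: one record per name,
# each record being a list (or tuple/Row) of length one
right_shape = [['John'], ['Henry'], ['Jade']]  # i.e. [[name] for name in list1]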

The solution is either to create the dataframe directly from list1, as in PApostol's answer, or to change how you build list1 so that you have a list of 100 single-name lists instead of a list containing one list of 100 names:

from faker import Faker
from pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql.types import *

faker = Faker("en_GB")

list1 = [[faker.first_name()] for _ in range(0, 100)]  # each element is a one-name list
firstname = sc.parallelize(list1)                      # one RDD record per name

schema = StructType([
    StructField('FirstName', StringType(), True)
])

df = spark.createDataFrame(firstname, schema)

display(df)
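Since the original goal was to join these names onto a larger dataframe as a new column, one common pattern is to give both dataframes a positional index and join on it. The snippet below is only a sketch, assuming the two dataframes have the same number of rows and no natural join key; big_df is a hypothetical stand-in for the larger dataframe:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hypothetical larger dataframe to attach the names to
big_df = spark.range(100).withColumnRenamed("id", "CustomerID")

# Give both dataframes a positional row index to join on
w = Window.orderBy(F.monotonically_increasing_id())
big_indexed = big_df.withColumn("row_idx", F.row_number().over(w))
names_indexed = df.withColumn("row_idx", F.row_number().over(w))

# Attach FirstName to the larger dataframe and drop the helper index
joined = big_indexed.join(names_indexed, on="row_idx", how="inner").drop("row_idx")
display(joined)

Note that this pairs rows by their current physical order, which is arbitrary unless you order by an actual column, so it only makes sense when any pairing of names to rows is acceptable.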

1 Comment

This worked perfectly and makes sense! Thank you! :)

This is probably because PySpark tries to create a dataframe with 100 columns (the length of the inner list inside firstname), but your schema provides only one column. Try without parallelize:

list1 = [faker.first_name() for _ in range(0, 100)]
df = spark.createDataFrame(list1, schema)

or if you do want to parallelize, try:

from pyspark.sql import Row

list1 = [faker.first_name() for _ in range(0, 100)]
firstname = sc.parallelize([list1])

firstname_row = firstname.map(lambda x: Row(x))
df = spark.createDataFrame(firstname_row, schema)

1 Comment

Hi thanks for your response! This works but I end up with all the values in one row. I would like each of the 100 values spread over 100 rows. Any ideas on how to achieve this? Thanks :)
