I am trying to create a new DataFrame with an ArrayType() column. I tried both with and without defining a schema, but couldn't get the desired result. My code with a schema is below:
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

l = [[1, 2, 3], [3, 2, 4], [6, 8, 9]]
schema = StructType([
    StructField("data", ArrayType(IntegerType()), True)
])
df = spark.createDataFrame(l, schema)
df.show(truncate=False)
This gives the error:
ValueError: Length of object (3) does not match with length of fields (1)
Desired output:
+---------+
|data     |
+---------+
|[1,2,3]  |
|[3,2,4]  |
|[6,8,9]  |
+---------+
Edit:
I found something strange (at least to me):
The following code gives the expected result:
import pyspark.sql.functions as f
data = [
    ('person', ['john', 'sam', 'jane']),
    ('pet', ['whiskers', 'rover', 'fido'])
]
df = spark.createDataFrame(data, ["type", "names"])
df.show(truncate=False)
This gives the following expected output:
+------+-----------------------+
|type  |names                  |
+------+-----------------------+
|person|[john, sam, jane]      |
|pet   |[whiskers, rover, fido]|
+------+-----------------------+
But if we remove the first column, it gives an unexpected result:
import pyspark.sql.functions as f
data = [
    (['john', 'sam', 'jane']),
    (['whiskers', 'rover', 'fido'])
]
df = spark.createDataFrame(data, ["names"])
df.show(truncate=False)
This gives the following output:
+--------+-----+----+
|names   |_2   |_3  |
+--------+-----+----+
|john    |sam  |jane|
|whiskers|rover|fido|
+--------+-----+----+
Use (['john', 'sam', 'jane'],) instead. The comma makes the tuple, not the parentheses: 1, is a tuple, while (1) is just the integer 1.
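A minimal plain-Python sketch of the point in the comment above, with no Spark session needed. The names not_a_tuple, one_tuple and rows are only illustrative. It shows that parentheses around a list leave it a list, while a trailing comma produces a one-element tuple, and how the inner lists from the first example could be wrapped so that each row has exactly one field.

```python
# Parentheses alone only group an expression; the trailing comma creates the tuple.
not_a_tuple = (['john', 'sam', 'jane'])
one_tuple = (['john', 'sam', 'jane'],)

print(type(not_a_tuple).__name__)  # list
print(type(one_tuple).__name__)    # tuple
print(len(one_tuple))              # 1

# Applied to the first example: wrap each inner list in a one-element tuple
# so that every row carries a single field holding the whole list.
l = [[1, 2, 3], [3, 2, 4], [6, 8, 9]]
rows = [(values,) for values in l]
print(rows)  # [([1, 2, 3],), ([3, 2, 4],), ([6, 8, 9],)]

# Assumption: with an active SparkSession named `spark` and the schema from the
# question, these rows would then be passed as spark.createDataFrame(rows, schema).
```

Without the comma, each row is a bare list of three items, so Spark appears to treat it as three fields, which would match both the length error in the first example and the extra _2 and _3 columns in the last one.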