
I am trying to create a new dataframe with an ArrayType() column. I tried with and without defining a schema, but couldn't get the desired result. My code with a schema:

from pyspark.sql.types import *
l = [[1,2,3],[3,2,4],[6,8,9]]
schema = StructType([
  StructField("data", ArrayType(IntegerType()), True)
])
df = spark.createDataFrame(l,schema)
df.show(truncate = False)

This gives error:

ValueError: Length of object (3) does not match with length of fields (1)

Desired output:

+---------+
|data     |
+---------+
|[1,2,3]  |
|[3,2,4]  |
|[6,8,9]  |
+---------+
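A minimal sketch of how the schema version above can be made to work, assuming the same spark session and schema: each row passed to createDataFrame must supply one value per StructField, so each list has to be wrapped in a one-element tuple (the Spark call is left as a comment since it needs an active session).

```python
# A bare [1, 2, 3] is read as a row with three fields, which conflicts with
# the one-field schema; a 1-tuple ([1, 2, 3],) is a row with a single field.
l = [([1, 2, 3],), ([3, 2, 4],), ([6, 8, 9],)]

# With the ArrayType schema defined above (and an active SparkSession):
# df = spark.createDataFrame(l, schema)
# df.show(truncate=False)
```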

Edit:

I found something strange (at least to me): if we use the following code, it gives the expected result:

import pyspark.sql.functions as f
data = [
    ('person', ['john', 'sam', 'jane']),
    ('pet', ['whiskers', 'rover', 'fido'])
]

df = spark.createDataFrame(data, ["type", "names"])
df.show(truncate=False)

This gives the following expected output:

+------+-----------------------+
|type  |names                  |
+------+-----------------------+
|person|[john, sam, jane]      |
|pet   |[whiskers, rover, fido]|
+------+-----------------------+

But if we remove the first column, it gives an unexpected result.

import pyspark.sql.functions as f
data = [
    (['john', 'sam', 'jane']),
    (['whiskers', 'rover', 'fido'])
]

df = spark.createDataFrame(data, ["names"])
df.show(truncate=False)

This gives the following output:

+--------+-----+----+
|names   |_2   |_3  |
+--------+-----+----+
|john    |sam  |jane|
|whiskers|rover|fido|
+--------+-----+----+
1 Comment

To create a tuple with a single element, add a comma at the end: (['john', 'sam', 'jane'],). The comma makes the tuple, not the parentheses; 1, is a tuple. Commented Sep 24, 2020 at 8:37
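Concretely, the commenter's suggested fix can be sketched as follows (the createDataFrame call assumes an active spark session, so it is left as a comment):

```python
# Trailing commas turn each parenthesized list into a one-element tuple,
# so each row has exactly one field instead of three.
data = [
    (['john', 'sam', 'jane'],),
    (['whiskers', 'rover', 'fido'],),
]

# With an active SparkSession this now produces a single "names" column:
# df = spark.createDataFrame(data, ["names"])
# df.show(truncate=False)
```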

2 Answers


I think you already have the answer to your question. Another solution is:

>>> l = [([1,2,3],), ([3,2,4],),([6,8,9],)]
>>> df = spark.createDataFrame(l, ['data'])
>>> df.show()

+---------+
|     data|
+---------+
|[1, 2, 3]|
|[3, 2, 4]|
|[6, 8, 9]|
+---------+

or

>>> from pyspark.sql.functions import array

>>> l = [[1,2,3],[3,2,4],[6,8,9]]
>>> df = spark.createDataFrame(l)
>>> df = df.withColumn('data',array(df.columns))
>>> df = df.select('data')
>>> df.show()
+---------+
|     data|
+---------+
|[1, 2, 3]|
|[3, 2, 4]|
|[6, 8, 9]|
+---------+

Regarding the strange thing: it is not that strange, but you need to keep in mind that parentheses around a single value are just the value itself; only a trailing comma makes it a tuple.

>>> (['john', 'sam', 'jane'])
['john', 'sam', 'jane']

>>> type((['john', 'sam', 'jane']))
<class 'list'>

so createDataFrame sees a list, not a tuple, and interprets each element of the list as a separate field.
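The same point can be checked in plain Python, where only the trailing comma produces a tuple:

```python
# Parentheses alone do not make a tuple; the trailing comma does.
names = (['john', 'sam', 'jane'])       # still just a list
names_row = (['john', 'sam', 'jane'],)  # a 1-tuple containing the list

print(type(names))      # <class 'list'>
print(type(names_row))  # <class 'tuple'>
print(len(names_row))   # 1
```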


2 Comments

So, createDataFrame takes a tuple for each row, and a single-element tuple is denoted by a trailing , . Did I get it right?
Yes, according to the documentation the comma is one way to construct a tuple: docs.python.org/3.3/library/stdtypes.html?highlight=tuple#tuple

This is how you can build a pyspark dataframe containing a struct or a list of structs:

from pyspark.sql import Row
# createDataFrame expects a list of rows, so wrap the Row in a list
df = spark.createDataFrame([Row(events=[Row(a=278724874, b="toto")], id="toto")])
df.printSchema()
df.show()

gives:

root
|-- events: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- a: long (nullable = true)
|    |    |-- b: string (nullable = true)
|-- id: string (nullable = true)

+-------------------+----+
|             events|  id|
+-------------------+----+
|[{278724874, toto}]|toto|

