Spark: Convert column of string to an array

Question

How do I convert a column that has been read as a string into a column of arrays? That is, convert from the schema below:

scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)

+---+---+
|  a|  b|
+---+---+
|  1|2,3|
+---+---+
|  2|4,5|
+---+---+

To:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|  b  |
+---+-----+
|  1|[2,3]|
+---+-----+
|  2|[4,5]|
+---+-----+

Please share both Scala and Python implementations if possible. On a related note, how do I take care of this while reading from the file itself? I have data with ~450 columns, and I want to specify a few of them in this format. Currently I am reading in PySpark as below:

df = spark.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true', delimiter='|').load(input_file)

Thanks.

Ged · Accepted Answer · 2019-07-28 18:03:58Z

30

There are various methods.

The best way is to use the split function and cast the result to array<long>:

import org.apache.spark.sql.functions.{col, split}
data.withColumn("b", split(col("b"), ",").cast("array<long>"))

You can also create a simple UDF to convert the values:

import org.apache.spark.sql.functions.udf
val tolong = udf((value: String) => value.split(",").map(_.toLong))

data.withColumn("newB", tolong(data("b"))).show

Hope this helps!

edited Jul 28, 2019 at 18:03

Ged

answered Jun 22, 2017 at 4:40

koiralo

3 Comments

Defcon Over a year ago

Would anyone have an example of doing the reverse of this, converting an array of strings to a tab-separated column?

Mohit Sharma Over a year ago

What if the 2,3 above is a tuple (2,3) and I then need to create an array? @thebluephantom @koiralo

koiralo Over a year ago

What do you mean by tuple? Can you share the schema? You could use the array() function to create an array from columns.

Ariana Bermúdez · Accepted Answer · 2018-04-24 16:30:12Z

3

In Python (PySpark) it would be:

from pyspark.sql.functions import col, split

# Split on commas (allowing whitespace after them) and cast each element to long
test = test.withColumn(
    "b",
    split(col("b"), r",\s*").cast("array<long>")
)

answered Apr 24, 2018 at 16:30

Ariana Bermúdez

himanshuIIITian · Accepted Answer · 2017-06-22 04:47:36Z

Using a UDF gives you exactly the required schema, like this:

val toArray = udf((b: String) => b.split(",").map(_.toLong))

val test1 = test.withColumn("b", toArray(col("b")))

It gives you the following schema:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|  b  |
+---+-----+
|  1|[2,3]|
+---+-----+
|  2|[4,5]|
+---+-----+

As for applying the schema while reading the file itself, I think that is a tough task. So, for now, you can apply the transformation after the file has been read into the DataFrame test.

I hope this helps!
