
I have the following data and I want to split the genre values in a way that I can query them later. As a first step I know how to split the column, but the issue is that when I apply the split I only get one genre value, not all of them.

id,genre,rating
1,"lorem_1, lorem_2, lorem_3",5
1,"lorem_1, lorem_2, lorem_3, lorem_4, lorem_5",5
1,"lorem_1, lorem_2, lorem_3, lorem_4",5
1,"lorem_1, lorem_2, lorem_3, lorem_4, lorem_5",5
...

Preferred outcome

id,genre,rating
1,[lorem_1, lorem_2, lorem_3],5
1,[lorem_1, lorem_2, lorem_3, lorem_4, lorem_5],5
...

Or any other outcome that is easy to query

3 Answers


I guess you have a text file with the information provided in the question. I can suggest two ways: 1) use a DataFrame and split, or 2) use an RDD and split.

1) dataframe way

import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax

// Spark 2.x reads CSV natively; the com.databricks.spark.csv format is not needed
val df = spark.read
  .option("header", "true")
  .csv("path to your csv file")
  .withColumn("genre", split($"genre", ","))

You should have the following output

+---+-------------------------------------------------+------+
|id |genre                                            |rating|
+---+-------------------------------------------------+------+
|1  |[lorem_1,  lorem_2,  lorem_3]                    |5     |
|1  |[lorem_1,  lorem_2,  lorem_3,  lorem_4,  lorem_5]|5     |
|1  |[lorem_1,  lorem_2,  lorem_3,  lorem_4]          |5     |
|1  |[lorem_1,  lorem_2,  lorem_3,  lorem_4,  lorem_5]|5     |
+---+-------------------------------------------------+------+
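Once genre is an array column, it can be queried directly with Spark SQL functions such as array_contains, e.g. df.filter(array_contains($"genre", "lorem_4")). As a plain-Scala sketch of what that filter does (the sample rows are made up):

```scala
// Rows as (id, genres, rating) after the split; hypothetical sample data.
val rows = List(
  (1, Array("lorem_1", "lorem_2", "lorem_3"), 5),
  (1, Array("lorem_1", "lorem_2", "lorem_3", "lorem_4"), 5))

// Keep only rows whose genre array contains "lorem_4" -- the same
// membership test array_contains performs on an array column.
val hits = rows.filter(_._2.contains("lorem_4"))
```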

2) rdd way

val rdd = sc
  .textFile("path to your csv file")
  // split only on commas that fall outside the double-quoted genre field
  .map(x => x.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)"))
  // strip the surrounding quotes before splitting the genre field itself
  .map(x => (x(0), x(1).replaceAll("\"", "").split(","), x(2)))

You should have the following output

(id,[genre],rating)
(1,[lorem_1,  lorem_2,  lorem_3],5)
(1,[lorem_1,  lorem_2,  lorem_3,  lorem_4,  lorem_5],5)
(1,[lorem_1,  lorem_2,  lorem_3,  lorem_4],5)
(1,[lorem_1,  lorem_2,  lorem_3,  lorem_4,  lorem_5],5)
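The quote-aware regex can be sanity-checked in plain Scala, outside Spark, on a single sample line from the question:

```scala
// One sample CSV line from the question.
val line = "1,\"lorem_1, lorem_2, lorem_3\",5"

// Split only on commas followed by an even number of quotes, i.e. commas
// that sit outside the quoted genre field.
val fields = line.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)")

// Strip the surrounding quotes and split the genre field itself.
val genres = fields(1).replaceAll("\"", "").split(", ")
```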

I hope the answer is helpful




Assuming the datatype of 'id,genre,rating' is List[(Int,String,Int)]

val a = List[(Int,String,Int)]() // Contains (id,genre,Rating)

the above can be converted to the required form as follows

val b = a.map(x=>(x._1,x._2.split(","),x._3)) // datatype of b is List[(Int,Array[String],Int)]
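For example, with made-up sample values (adding a trim to drop the space each comma leaves behind):

```scala
// Hypothetical sample data in the assumed List[(Int, String, Int)] shape.
val a = List(
  (1, "lorem_1, lorem_2, lorem_3", 5),
  (1, "lorem_1, lorem_2, lorem_3, lorem_4", 5))

// Split each genre string and trim the leading space left by the comma.
val b = a.map(x => (x._1, x._2.split(",").map(_.trim), x._3))
```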



The easiest way is using the split function of the DataFrame API:

import org.apache.spark.sql.functions.split
import spark.implicits._ // for the $"..." column syntax

val df2 = df.withColumn("genre", split($"genre", ", "))

Since you have a csv file, the data can be read as a dataframe as follows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val df = spark.read
  .format("csv")
  .option("header", "true") // read the header row
  .load("/path/to/csv")

After loading, the genre column can be split as described above. If you want to save the result as a csv file afterwards, the following command can be used:

df.write.format("csv").save("/path/to/save/csv")
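One caveat: Spark's CSV writer cannot serialize an array column, so if genre has been split into an array it needs to be joined back into a single delimited string before writing, e.g. with concat_ws("|", $"genre"). A plain-Scala sketch of that joining step:

```scala
// Genres as an array after the split; joining with a delimiter makes the
// value CSV-safe again (what concat_ws does for an array column).
val genres = Array("lorem_1", "lorem_2", "lorem_3")
val joined = genres.mkString("|")
```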

Spark 2.x conventions are used for both loading and saving. Older versions relied on the external spark-csv package, but CSV support is built into newer versions of Spark.

Comments

Is it possible to split genre while the data is in its native format?
@geek-tech What do you mean by native format? Is it in a csv file? I assumed the data was in a dataframe since you have the apache-spark tag.
I've just got it from a .csv file (that's what I mean by native format), and as it is I want to achieve the mentioned outcome ...
@geek-tech I see, you want to use apache-spark for this; then you can load the data as a dataframe and save it after the transformation. The other alternative is to use pure Scala. In both cases you need to load the file and then save it.
Ok, then how can I read this file? I mean as a dataframe, so I can run your piece of code ...
