
I have the following data and I want to split the genre values in a way that I can query them later. As a first step I know how to split the column, but the issue is that when I apply the split I only get one genre value, not all of them.

id,genre,rating
1,"lorem_1, lorem_2, lorem_3",5
1,"lorem_1, lorem_2, lorem_3, lorem_4, lorem_5",5
1,"lorem_1, lorem_2, lorem_3, lorem_4",5
1,"lorem_1, lorem_2, lorem_3, lorem_4, lorem_5",5
...

Preferred outcome

id,genre,rating
1,[lorem_1, lorem_2, lorem_3],5
1,[lorem_1, lorem_2, lorem_3, lorem_4, lorem_5],5
...

Or any other outcome that is easy to query

3 Answers


I guess you have a text file with the information provided in the question. I can suggest two ways: 1) use a DataFrame and split, or 2) use an RDD and split.

1) dataframe way

import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax

// Spark 2.x reads CSV natively; the com.databricks.spark.csv format is not needed
val df = spark.read
  .option("header", "true")
  .csv("path to your csv file")
  .withColumn("genre", split($"genre", ","))

You should have the following output

+---+-------------------------------------------------+------+
|id |genre                                            |rating|
+---+-------------------------------------------------+------+
|1  |[lorem_1,  lorem_2,  lorem_3]                    |5     |
|1  |[lorem_1,  lorem_2,  lorem_3,  lorem_4,  lorem_5]|5     |
|1  |[lorem_1,  lorem_2,  lorem_3,  lorem_4]          |5     |
|1  |[lorem_1,  lorem_2,  lorem_3,  lorem_4,  lorem_5]|5     |
+---+-------------------------------------------------+------+
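Once genre is an array column, it can be queried directly with Spark SQL functions such as array_contains, e.g. df.filter(array_contains($"genre", "lorem_4")). As a plain-Scala sketch of what that filter does (the sample rows are made up):

```scala
// Rows as (id, genres, rating) after the split; hypothetical sample data.
val rows = List(
  (1, Array("lorem_1", "lorem_2", "lorem_3"), 5),
  (1, Array("lorem_1", "lorem_2", "lorem_3", "lorem_4"), 5))

// Keep only rows whose genre array contains "lorem_4" -- the same
// membership test array_contains performs on an array column.
val hits = rows.filter(_._2.contains("lorem_4"))
```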

2) rdd way

val rdd = sc
  .textFile("path to your csv file")
  // split only on commas that fall outside the double-quoted genre field
  .map(x => x.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)"))
  // strip the surrounding quotes before splitting the genre field itself
  .map(x => (x(0), x(1).replaceAll("\"", "").split(","), x(2)))

You should have the following output

(id,[genre],rating)
(1,[lorem_1,  lorem_2,  lorem_3],5)
(1,[lorem_1,  lorem_2,  lorem_3,  lorem_4,  lorem_5],5)
(1,[lorem_1,  lorem_2,  lorem_3,  lorem_4],5)
(1,[lorem_1,  lorem_2,  lorem_3,  lorem_4,  lorem_5],5)
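The quote-aware regex can be sanity-checked in plain Scala, outside Spark, on a single sample line from the question:

```scala
// One sample CSV line from the question.
val line = "1,\"lorem_1, lorem_2, lorem_3\",5"

// Split only on commas followed by an even number of quotes, i.e. commas
// that sit outside the quoted genre field.
val fields = line.split(",(?=([^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*$)")

// Strip the surrounding quotes and split the genre field itself.
val genres = fields(1).replaceAll("\"", "").split(", ")
```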

I hope the answer is helpful




Assuming the datatype of 'id,genre,rating' is List[(Int,String,Int)]

val a = List[(Int,String,Int)]() // Contains (id,genre,Rating)

the above can be converted to the required form as follows

val b = a.map(x=>(x._1,x._2.split(","),x._3)) // datatype of b is List[(Int,Array[String],Int)]
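For example, with made-up sample values (adding a trim to drop the space each comma leaves behind):

```scala
// Hypothetical sample data in the assumed List[(Int, String, Int)] shape.
val a = List(
  (1, "lorem_1, lorem_2, lorem_3", 5),
  (1, "lorem_1, lorem_2, lorem_3, lorem_4", 5))

// Split each genre string and trim the leading space left by the comma.
val b = a.map(x => (x._1, x._2.split(",").map(_.trim), x._3))
```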



The easiest way is using the split function of the DataFrame API:

import org.apache.spark.sql.functions.split
import spark.implicits._ // for the $"..." column syntax

val df2 = df.withColumn("genre", split($"genre", ", "))

Since you have a csv file, the data can be read as a dataframe as follows:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val df = spark.read
  .format("csv")
  .option("header", "true") // read the header row
  .load("/path/to/csv")

After loading, the genre column can be split as described above. If you want to save the result as a csv file afterwards, the following command can be used:

df.write.format("csv").save("/path/to/save/csv")
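One caveat: Spark's CSV writer cannot serialize an array column, so if genre has been split into an array it needs to be joined back into a single delimited string before writing, e.g. with concat_ws("|", $"genre"). A plain-Scala sketch of that joining step:

```scala
// Genres as an array after the split; joining with a delimiter makes the
// value CSV-safe again (what concat_ws does for an array column).
val genres = Array("lorem_1", "lorem_2", "lorem_3")
val joined = genres.mkString("|")
```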

Spark 2.x conventions are used for both loading and saving. Older versions relied on the external spark-csv package, but CSV support is built into newer versions of Spark.

Comments

Is it possible to split genre while the data is in its native format?
@geek-tech What do you mean by native format? Is it in a csv file? I assumed the data was in a dataframe since you have the apache-spark tag.
I've just got it from a .csv file (that's what I mean by native format), and as it is I want to achieve the mentioned outcome ...
@geek-tech I see, you want to use apache-spark for this; then you can load the data as a dataframe and save it after the transformation. The other alternative is to use pure Scala. In both cases you need to load the file and then save it.
Ok, then how can I read this file? I mean as a dataframe, so I can run your piece of code ...
