I am trying to split a DataFrame into multiple arrays according to their id.

I have a table:

id|name
12|a
12|b
12|c
13|x
13|y
13|z

and I want to get multiple vectors that look like:

<a,b,c> <x,y,z> 

So I have managed to get all the different IDs using:

val ids=dataframe.select("id").distinct.collect.flatMap(_.toSeq)

That returns 12 and 13. I then tried to get the names for each of them:

val namesArray=ids.map(id=>dataframe.where($"id"===id))

But that doesn't seem to be the right approach, since the result still includes the column names, and I need a way to get only the name values out of it.

1 Answer

If your data is already in a DataFrame, you can group by id and collect the values of each group into a list:

val yourDataSet = sc.parallelize(List((12, "a"), (12, "b"), (13, "y"), (13, "z"))).toDF("id", "val")

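import org.apache.spark.sql.functions.collect_list

val requiredDataSet = yourDataSet
  .groupBy("id")
  .agg(collect_list("val").as("vals"))  // alias the aggregate so it can be selected by a stable name
  .select("vals")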
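
To make the grouping step concrete, here is a minimal sketch of a complete program that builds one local Vector of names per id. It is an illustration under assumptions rather than a definitive implementation: the object name `CollectNamesById`, the helper `namesById`, the alias `names`, and the local `SparkSession` settings are all made up for this example, and the sample rows mirror the expected output in the question. Note that `collect_list` does not guarantee the order of elements within a group.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.collect_list

// Hypothetical example object; names are illustrative only.
object CollectNamesById {

  // Assumes column 0 ("id") is an Int and the "name" column is a String.
  def namesById(df: DataFrame): Map[Int, Vector[String]] =
    df.groupBy("id")
      .agg(collect_list("name").as("names"))   // one array of names per id
      .collect()                                // bring the small grouped result to the driver
      .map(row => row.getInt(0) -> row.getSeq[String](1).toVector)
      .toMap

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; a real job would get its session elsewhere.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-names-by-id")
      .getOrCreate()
    import spark.implicits._  // needed for toDF on a local Seq

    val df = Seq((12, "a"), (12, "b"), (12, "c"), (13, "x"), (13, "y"), (13, "z"))
      .toDF("id", "name")

    namesById(df).foreach { case (id, names) => println(s"$id -> $names") }

    spark.stop()
  }
}
```

Collecting to the driver is only reasonable when the grouped result is small; for large data it is usually better to keep working with the grouped DataFrame.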

Alternatively, you can get the underlying RDD and transform it:

val valueVectorRdd = dataframe.rdd
  .map(row => (row.getInt(0), row.getString(1)))  // (id, name) pairs; assumes id is an Int column
  .groupByKey()                                   // RDD[(Int, Iterable[String])]
  .map { case (_, names) => names.toVector }      // keep only the names of each group
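
The key-then-group logic of the RDD version can be tried out without a cluster. The following is a rough sketch using only plain Scala collections, not Spark: `GroupByKeySketch` and `namesById` are hypothetical names introduced for illustration, and the standard library's `groupBy` stands in for `groupByKey`.

```scala
// Plain Scala collections only; no Spark dependency.
object GroupByKeySketch {

  // Groups (id, name) pairs into one Vector of names per id.
  // Within each id, names keep the order in which they appear in the input.
  def namesById(pairs: Seq[(Int, String)]): Map[Int, Vector[String]] =
    pairs
      .groupBy(_._1)                                           // id -> all pairs with that id
      .map { case (id, group) => id -> group.map(_._2).toVector }

  def main(args: Array[String]): Unit = {
    val rows = Seq((12, "a"), (12, "b"), (12, "c"), (13, "x"), (13, "y"), (13, "z"))

    namesById(rows).toSeq.sortBy(_._1).foreach { case (id, names) =>
      println(s"$id -> ${names.mkString("<", ",", ">")}")
    }
  }
}
```

Unlike this local version, Spark's `groupByKey` shuffles data across partitions, so the order of names within a group is not guaranteed there.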

2 Comments

Thank you @Saravesh Kumar Singh for your reply. collect_list is not being recognized by the compiler. What did you mean by that?
org.apache.spark.sql.functions.collect_list
