1

I want to append the rows into a single string

+--------------------+
|   defectDescription|
+--------------------+
|ACEView NA : Daework|
|ACEView NA : Documen|
|ACEView NA : ACev   |
|ACEView NA : Dragdro|
+--------------------+

Expected Output: ACEView NA : Daework ACEView NA : Documen ACEView NA : ACev ACEView NA : Dragdro

2 Comments

  • Possible duplicate of Cannot print the contents of RDD – Commented Feb 15, 2017 at 9:37
  • I tried your solution, it's not working – Commented Feb 15, 2017 at 9:58

4 Answers

10

If you indeed want to get all the data into a single string, you can do it using collect:

val rows = df.select("defectDescription").collect().map(_.getString(0)).mkString(" ")

You first select the relevant column (so the result contains only that column) and collect it, which gives you an array of rows. The map turns each row into its string value (there is just one column, at index 0). Finally, mkString joins them into a single string with a space as the separator.

Note that this brings the entire dataframe to the driver, which might cause out-of-memory errors. If you need just some of the data, you can use take(n) instead of collect to limit the number of rows to n, as in the sketch below.
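A minimal sketch of that variant (n = 2 is just an illustrative value, not from the answer):

// Bring at most the first 2 rows to the driver instead of the whole column.
val firstTwo = df.select("defectDescription")
  .take(2)              // Array[Row], at most 2 elements
  .map(_.getString(0))
  .mkString(" ")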


4 Comments

I'm getting an output like this "[Ljava.lang.String;@5eeee124"
Strange, it works for me. I created the dataframe with val df = Seq("a","b","c","d").toDF("defectDescription") and it simply worked.
No need to collect; indeed, if the data is big enough, the driver explodes.
@juanchito The collect is critical here, as the OP wanted everything in a SINGLE string. Of course, this will explode if the data is too big, but that is what the OP requested.
1

Another way to do this is as follows:

val str1 = df.select("defectDescription").collect.mkString(",")
val str  = str1.replaceAll("[\\[\\]]","")

The 1st line selects the particular column and then collects that subset. collect is an action that returns all the elements of the dataset as an array to the driver program; it is usually useful after a filter or another operation that returns a sufficiently small subset of the data.

mkString has an overload that lets you provide a delimiter to separate each element in the collection (here a comma).

The 2nd line just removes the extra square brackets that Row.toString puts around each value.
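For illustration, a minimal sketch of the intermediate values, using hypothetical two-row sample data and assuming an active SparkSession named spark (neither is part of the original answer):

import spark.implicits._                            // needed for toDF on a Seq

val df = Seq("a", "b").toDF("defectDescription")    // hypothetical stand-in data

val str1 = df.select("defectDescription").collect.mkString(",")  // "[a],[b]" - Row.toString adds the brackets
val str  = str1.replaceAll("[\\[\\]]", "")                       // "a,b"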


1
df.createTempView("table")
val res = spark.sqlContext.sql("select defectDescription from table")
  .collectAsList.toString.replace("[", "").replace("]", "")

Initially create a temporary view of the dataframe, run the query and collect the result as a list, then convert it to a string. Finally, remove the brackets to get the required output.
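A minimal sketch of the intermediate values (the two-row Seq sample, the toDF import, and the SparkSession named spark are assumptions added for illustration):

import spark.implicits._

val df = Seq("a", "b").toDF("defectDescription")    // hypothetical stand-in data
df.createTempView("table")

val list = spark.sqlContext.sql("select defectDescription from table").collectAsList  // [[a], [b]]
val res  = list.toString.replace("[", "").replace("]", "")                            // "a, b"

Note that the separator comes from java.util.List.toString, so the result is comma-separated rather than space-separated as in the expected output.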


0

Let's leverage parallel computing by not prematurely collecting the data while there is associative processing to be done:

import org.apache.spark.sql.Row

// Extract the single column value from a Row.
def str(r: Row) = r.getString(0)
// Concatenate two single-column Rows into one.
def cat(r0: Row, r1: Row) = Row(s"${str(r0)} ${str(r1)}")

str(df.select("defectDescription").reduce(cat))

This allows parallel concatenations to be done on all executors before concatenating their results in the driver.
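For example, reusing the str and cat helpers above on a hypothetical three-row DataFrame (the sample data and the SparkSession named spark are assumptions, not part of the answer):

import spark.implicits._

val sample = Seq("a", "b", "c").toDF("defectDescription")   // hypothetical data

// Each partition is reduced on its own executor; the driver only merges
// one partial Row per partition, so the order across partitions may vary.
println(str(sample.select("defectDescription").reduce(cat)))  // e.g. "a b c"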

