
I use the DataFrame API.

I have an existing DataFrame and a List object (an Array would also work). How can I add this List to the existing DataFrame as a new column? Should I use the Column class for this?

4 Answers


You should probably convert your List to a single-column RDD, turn it into a DataFrame, and join on criteria picked by you. A simple DataFrame conversion:

val df1 = sparkContext.makeRDD(yourList).toDF("newColumn")

If you need an additional column to perform the join on, you can add more columns by mapping over your list:

val df1 = sparkContext.makeRDD(yourList).map(i => (i, fun(i))).toDF("newColumn", "joinOnThisColumn")

I am not familiar with the Java version, but you could try JavaSparkContext.parallelize(yourList) and apply similar mapping operations based on this doc.
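The join approach above depends on pairing each list element with a key the existing rows can be joined on. Setting the Spark API aside, that pairing step (what RDD.zipWithIndex does) can be sketched in plain Java; the list contents here are made up for illustration:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ZipWithIndexSketch {
    // Pair each element with its position, mimicking RDD.zipWithIndex,
    // so the single-column data carries a key to join on.
    static <T> List<Map.Entry<Long, T>> zipWithIndex(List<T> values) {
        List<Map.Entry<Long, T>> indexed = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            indexed.add(new SimpleEntry<>((long) i, values.get(i)));
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String> newColumn = List.of("a", "b", "c");
        for (Map.Entry<Long, String> e : zipWithIndex(newColumn)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

If both the list and the existing DataFrame are indexed this way, an equi-join on the index column attaches element i of the list to row i.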


1 Comment

OK, thanks, I will try your solution. However, I have also found some functions in the Java API, not Scala. Thank you a lot.

Sorry, it was my fault; I have already found the function withColumn(String colName, Column col), which should solve my problem.

7 Comments

The only problem with withColumn is that it will be difficult to pick elements from your list sequentially and add them to selected rows. If you have a way to do this, it is probably better this way, but your question is too generic to tell ;)
Why? I will first convert my List to a Column object and add it as the second function argument. Is that not OK?
Interesting. Please post how you did it after you finish.
@Niemand, the main problem is that the cast (conversion) from ArrayList to a Column object is not working. Did you mean this before?
Yep, this is more or less what I was thinking about. I thought you had found a way to overcome this.
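For what it's worth, the positional merge the comments are circling around (element i of the list goes onto row i) can be sketched outside Spark with plain collections; the row representation and names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AddColumnByPosition {
    // Append one new value per row, matched by position; sizes must agree.
    static List<Map<String, Object>> addColumn(
            List<Map<String, Object>> rows, String name, List<?> values) {
        if (rows.size() != values.size()) {
            throw new IllegalArgumentException("row/value count mismatch");
        }
        List<Map<String, Object>> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            Map<String, Object> row = new HashMap<>(rows.get(i));
            row.put(name, values.get(i));
            out.add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("id", 1);
        System.out.println(addColumn(List.of(row), "newColumn", List.of("x")));
    }
}
```

This is exactly what withColumn cannot express directly, because a Column is a computed expression over existing columns, not a container of external values; hence the join-on-index workaround.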

Here is an example where we had a column date and wanted to add another column with the month:

import static org.apache.spark.sql.functions.*;

Dataset<Row> newData = data.withColumn("month", month(unix_timestamp(col("date"), "MM/dd/yyyy").cast("timestamp")));

Hoping this helps!

Cheers!

Comments


This thread is a little old, but I ran into a similar situation using Java. I think more than anything, there was a conceptual misunderstanding of how I should approach this problem.

To fix my issue, I created a simple POJO to back the new column of a Dataset (as opposed to trying to build on an existing one). Conceptually, I hadn't understood that it was best to generate the Dataset during the initial read, which is where the additional column needed to be added. I hope this helps someone in the future.

Consider the following:

JavaRDD<MyPojo> myRdd = dao.getSession().read()
        .jdbc("jdbcurl", "mytable", someObject.getProperties())
        .javaRDD()
        .map(new Function<Row, MyPojo>() {

            private static final long serialVersionUID = 1L;

            @Override
            public MyPojo call(Row row) throws Exception {
                Integer curDos = calculateStuff(row); // manipulate my data

                MyPojo pojoInst = new MyPojo();
                pojoInst.setBaseValue(row.getAs("BASE_VALUE_COLUMN"));
                pojoInst.setKey(row.getAs("KEY_COLUMN"));
                pojoInst.setCalculatedValue(curDos);

                return pojoInst;
            }
        });

Dataset<Row> myRddRFF = dao.getSession().createDataFrame(myRdd, MyPojo.class);

// continue load or other operation here...
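Stripped of the Spark plumbing, the per-row mapping in that answer amounts to the following plain-Java sketch; MyPojo's fields and calculateStuff are simplified stand-ins for the answer's versions:

```java
import java.util.ArrayList;
import java.util.List;

public class PojoMappingSketch {
    // Minimal stand-in for the answer's POJO: base columns plus one computed column.
    static class MyPojo {
        final String key;
        final int baseValue;
        final int calculatedValue;

        MyPojo(String key, int baseValue, int calculatedValue) {
            this.key = key;
            this.baseValue = baseValue;
            this.calculatedValue = calculatedValue;
        }
    }

    // Hypothetical computation standing in for the answer's calculateStuff(row).
    static int calculateStuff(int baseValue) {
        return baseValue * 2;
    }

    // The map() body from the answer, minus the Spark Row plumbing:
    // carry the source columns through and attach the computed one.
    static MyPojo toPojo(String key, int baseValue) {
        return new MyPojo(key, baseValue, calculateStuff(baseValue));
    }

    public static void main(String[] args) {
        List<MyPojo> pojos = new ArrayList<>();
        pojos.add(toPojo("KEY_1", 10));
        pojos.add(toPojo("KEY_2", 20));
        for (MyPojo p : pojos) {
            System.out.println(p.key + " " + p.baseValue + " " + p.calculatedValue);
        }
    }
}
```

The point of the answer is that the extra column is computed per row while the Dataset is being built, so no positional stitching of a separate List is ever needed.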

Comments
