Hot Linked Questions

55 votes

4 answers

96k views

Append a column to Data Frame in Apache Spark 1.3

Is it possible to add a column to a Data Frame, and what would be the most efficient, neat method? More specifically, the column may serve as row IDs for the existing Data Frame. In a simplified case, ...

Oleg Shirokikh

3,653

asked Apr 7, 2015 at 3:59

4 votes

2 answers

13k views

add sequence number column in dataframe using scala

Below is the logic to add a sequence number column to a dataframe. It's working as expected when I am reading data from delimited files. Today I have a new task to read the data from an Oracle table and add ...

Pravinkumar Hadpad

97

asked Sep 11, 2017 at 6:37

4 votes

1 answer

16k views

Auto - Incrementing pyspark dataframe column values

I am trying to generate an additional column in a dataframe with auto-incrementing values based on the global value. However, all the rows are generated with the same value and the value is not ...

Arjun

271

asked May 4, 2018 at 11:48

1 vote

1 answer

8k views

Using monotonically_increasing_id won't give consecutively IDs (pyspark)

I want to create an ID column for my PySpark dataframe. I have a column A that has repeated numbers; I want to take all the distinct values and assign an ID to each value. I have: +----+ | A| +---...

Joe

611

asked Jun 19, 2019 at 20:46

0 votes

1 answer

2k views

Dataframe change first n rows

I've got a dataframe and I want to add another column which for the first n rows is one value, and for the rest is the value in another column... something like this frame.select("*") .withColumn("...

TheRealJimShady

4,365

asked Mar 16, 2017 at 22:56

2 votes

1 answer

2k views

How "stable" is monotonically_increasing_id() in Spark?

I'm looking for an inexpensive way to distinguish duplicates and/or uniquely identify rows. I've been looking at the Spark built-ins monotonically_increasing_id() and uuid(). The problem with uuid() ...

TrayMan

7,495

asked May 13, 2022 at 18:50

2 votes

0 answers

585 views

Generating unique ID for incremental update of Existing RDD in spark

I am attempting to do an incremental update to my RDD using union in Spark. For that I have RDD1 (already existing). RDD1: JavaPairRDD<String,String>(uniqueID, data) where first String ...

Null-Pointer-Exception

91

asked Jan 4, 2018 at 18:13

0 votes

0 answers

506 views

monotonically_increasing_id is generating 2 different unique IDs for same record in spark 2.3.1?

I am creating a column in my dataframe using monotonically_increasing_id. Over 2-3 transformations, the ID changes for a few of the records, e.g. val newDf = df.withColumn("rowId", ...

RichaDwivedi

343

asked Sep 18, 2018 at 9:55

0 votes

1 answer

310 views

data losing while reading a file of huge size in spark scala

val data = spark.read.text(filepath).toDF("val").withColumn("id", monotonically_increasing_id()); val count = data.count() This code works fine when I am reading a file that contains up to 50k+...

Sayantan

101

asked Mar 16, 2020 at 13:22

1 vote

0 answers

202 views

How do I add a persistent column of row ids to Spark DataFrame - #2

Basically I want the same thing as in this SO question. The accepted answer states that the issue is fixed with Spark 2.0 / Spark 2.1. I am using Spark 2.1.1. However, I still experience the same (a ...

akoeltringer

1,721

asked Jun 14, 2017 at 12:13
