Linked Questions

55 votes
4 answers
96k views

Append a column to Data Frame in Apache Spark 1.3

Is it possible, and what would be the most efficient, neat method, to add a column to a Data Frame? More specifically, the column may serve as row IDs for the existing Data Frame. In a simplified case, ...
Oleg Shirokikh
4 votes
2 answers
13k views

add sequence number column in dataframe using scala

Below is the logic to add a sequence number column to a dataframe. It's working as expected when I am reading data from delimited files. Today I have a new task to read the data from an Oracle table and add ...
Pravinkumar Hadpad
4 votes
1 answer
16k views

Auto - Incrementing pyspark dataframe column values

I am trying to generate an additional column in a dataframe with auto-incrementing values based on a global value. However, all the rows are generated with the same value and the value is not ...
Arjun
1 vote
1 answer
8k views

Using monotonically_increasing_id won't give consecutive IDs (pyspark)

I want to create an ID column for my PySpark dataframe. I have a column A that has repeated numbers; I want to take all the distinct values and assign an ID to each value. I have: +----+ | A| +---...
Joe
0 votes
1 answer
2k views

Dataframe change first n rows

I've got a dataframe and I want to add another column which for the first n rows is one value, and for the rest is the value in another column... something like this: frame.select("*") .withColumn("...
TheRealJimShady
2 votes
1 answer
2k views

How "stable" is monotonically_increasing_id() in Spark?

I'm looking for an inexpensive way to distinguish duplicates and/or uniquely identify rows. I've been looking at the Spark built-ins monotonically_increasing_id() and uuid(). The problem with uuid() ...
TrayMan
2 votes
0 answers
585 views

Generating unique ID for incremental update of Existing RDD in spark

I am attempting to do an incremental update to my RDD using union in Spark. For that I have RDD1 (already existing). RDD1: JavaPairRDD<String,String>(uniqueID,data), where the first String ...
Null-Pointer-Exception
0 votes
0 answers
506 views

monotonically_increasing_id is generating 2 different unique IDs for same record in spark 2.3.1?

I am creating a column in my dataframe using monotonically_increasing_id. Over 2-3 transformations, the ID changes for a few of the records, e.g. val newDf = df.withColumn("rowId", ...
RichaDwivedi
0 votes
1 answer
310 views

data losing while reading a file of huge size in spark scala

val data = spark.read .text(filepath) .toDF("val") .withColumn("id", monotonically_increasing_id()) val count = data.count() This code works fine when I am reading a file that contains up to 50k+...
Sayantan
1 vote
0 answers
202 views

How do I add a persistent column of row ids to Spark DataFrame - #2

Basically I want the same thing as in this SO question. The accepted answer states that the issue is fixed with Spark 2.0 / Spark 2.1. I am using Spark 2.1.1. However, I still experience the same (a ...
akoeltringer