Hot Linked Questions

1 vote

0 answers

2k views

What is the Use of monotonically_increasing_id in PySpark [duplicate]

I am trying to understand the use of monotonically_increasing_id in Spark SQL. Can anyone explain, with an example, why we need monotonically increasing IDs for DataFrames?

Nikhil Mishra

1,270

asked Oct 19, 2018 at 2:24

1 vote

2 answers

944 views

How to add new column not based on exist column in dataframe with Scala/Spark? [duplicate]

I have a DataFrame and I want to add a new column that is not based on an existing column. What should I do? This is my dataframe: +----+ |time| +----+ | 1| | 4| | 3| | 2| | 5| | 7| | 3| | 5| +--...

mentongwu

473

asked Jul 21, 2017 at 3:05

-1 votes

1 answer

763 views

Add an index to a dataframe. Pyspark 2.4.4 [duplicate]

There are a lot of examples that all give the same basic example. dfWithIndex = df.withColumn('f_index', \ pyspark.sql.functions.lit(1).cast(pyspark.sql.types.LongType())) rdd = df.rdd.zipWithIndex(...

PJ Evans

51

asked Feb 4, 2021 at 0:31

18 votes

2 answers

13k views

Does Spark preserve record order when reading in ordered files?

I'm using Spark to read in records (in this case in csv files) and process them. The files are already in some order, but this order isn't reflected by any column (think of it as a time series, but ...

Jason Evans

1,237

asked Aug 22, 2017 at 15:55

10 votes

3 answers

30k views

How to create an unique autogenerated Id column in a spark dataframe

I have a dataframe where I have to generate a unique ID in one of the columns. This ID has to be generated with an offset, because I need to persist this dataframe with the autogenerated ID. Now if ...

Ayan Biswas

1,665

asked Mar 25, 2019 at 15:32

2 votes

2 answers

10k views

How to use an existing column as index in Spark's Dataframe

I am 'translating' Python code to PySpark. I would like to use an existing column as the index for a dataframe. I did this in Python using pandas. The small piece of code below explains what I did. ...

Daniel Thereza

21

asked May 30, 2019 at 17:13

5 votes

1 answer

2k views

PySpark align model predictions with untransformed data: best practice

Using PySpark's ML module, the following steps often occur (after data cleaning, etc.): perform feature and target transform pipeline; create model; generate predictions from the model; merge predictions ...

Mike Williamson

3,558

asked Sep 3, 2020 at 14:08

2 votes

2 answers

3k views

Add index column to apache spark Dataset<Row> using java

The question below has solutions for Scala and PySpark, and the solution provided in this question is not for consecutive index values. Spark Dataframe :How to add a index Column : Aka Distributed ...

user0204

261

asked May 16, 2019 at 7:42

1 vote

1 answer

4k views

How to change a cell's value in dataframe with pySpark?

Here is my dataframe: I am looking for the right way to replace city's value based on the name, for example, case name when 'Alice' then 'New York' when 'Alex' then 'LA' when 'Aaron' then 'Beijing' ...

mdivk

3,747

asked Jul 25, 2016 at 3:36

0 votes

2 answers

236 views

How to combine 2 different dataframes together?

I have 2 DataFrames: Users (~29.000.000 entries) |-- userId: string (nullable = true) Impressions (~1000 entries) |-- modules: array (nullable = true) | |-- element: struct (containsNull = true) ...

esbej

27

asked Nov 13, 2017 at 11:46

1 vote

1 answer

117 views

Convert dictionary of columns to Dataframe in from different dataframes : pyspark

I am trying to combine columns from different dataframes into one for analysis. I am collecting all the columns I need into a dictionary. I now have a dictionary like this - newDFDict = { '...

raghhuveer-jaikanth

193

asked May 13, 2020 at 17:49

2 votes

1 answer

109 views

How to capitalize middle row of a column in PySpark or Pandas

I have a CSV file with three columns of values: 1st Column 2nd Column 3rd Column ram karthi bruce RAM KATHI BRUCE ram karthi bruce I want ...

himanish

21

asked Nov 12, 2020 at 7:49


