Linked Questions
12 questions linked to/from Spark Dataframe :How to add a index Column : Aka Distributed Data Index
1
vote
0
answers
2k
views
What is the Use of monotonically_increasing_id in PySpark [duplicate]
I am trying to understand the use of monotonically_increasing_id in Spark SQL.
Can anyone explain, with an example, why we need monotonically increasing IDs for DataFrames?
1
vote
2
answers
944
views
How to add new column not based on exist column in dataframe with Scala/Spark? [duplicate]
I have a DataFrame and I want to add a new column that is not based on an existing column. What should I do?
This is my dataframe:
+----+
|time|
+----+
|   1|
|   4|
|   3|
|   2|
|   5|
|   7|
|   3|
|   5|
+--...
-1
votes
1
answer
763
views
Add an index to a dataframe. Pyspark 2.4.4 [duplicate]
There are a lot of answers that all give the same basic example.
dfWithIndex = df.withColumn('f_index', \
pyspark.sql.functions.lit(1).cast(pyspark.sql.types.LongType()))
rdd = df.rdd.zipWithIndex(...
18
votes
2
answers
13k
views
Does Spark preserve record order when reading in ordered files?
I'm using Spark to read in records (in this case from CSV files) and process them. The files are already in some order, but this order isn't reflected by any column (think of it as a time series, but ...
10
votes
3
answers
30k
views
How to create an unique autogenerated Id column in a spark dataframe
I have a dataframe in which I have to generate a unique ID in one of the columns. This ID has to be generated with an offset.
Because I need to persist this dataframe with the autogenerated ID, now if ...
2
votes
2
answers
10k
views
How to use an existing column as index in Spark's Dataframe
I am 'translating' Python code to PySpark. I would like to use an existing column as the index for a dataframe. I did this in Python using pandas. The small piece of code below explains what I did. ...
5
votes
1
answer
2k
views
PySpark align model predictions with untransformed data: best practice
Using PySpark's ML module, the following steps often occur (after data cleaning, etc.):
Perform feature and target transform pipeline
Create model
Generate predictions from the model
Merge predictions ...
2
votes
2
answers
3k
views
Add index column to apache spark Dataset<Row> using java
The question below has solutions for Scala and PySpark, and the solution provided in this question does not give consecutive index values.
Spark Dataframe :How to add a index Column : Aka Distributed ...
1
vote
1
answer
4k
views
How to change a cell's value in dataframe with pySpark?
Here is my dataframe:
I am looking for the right way to replace the city's value based on the name, for example, case name when 'Alice' then 'New York' when 'Alex' then 'LA' when 'Aaron' then 'Beijing' ...
0
votes
2
answers
236
views
How to combine 2 different dataframes together?
I have 2 DataFrames:
Users (~29.000.000 entries)
|-- userId: string (nullable = true)
Impressions (~1000 entries)
|-- modules: array (nullable = true)
| |-- element: struct (containsNull = true)
...
1
vote
1
answer
117
views
Convert dictionary of columns to Dataframe in from different dataframes : pyspark
I am trying to combine columns from different dataframes into one for analysis. I am collecting all the columns I need into a dictionary.
I now have a dictionary like this:
newDFDict = {
'...
2
votes
1
answer
109
views
How to capitalize middle row of a column in PySpark or Pandas
I have a CSV file with three columns of values:
1st Column 2nd Column 3rd Column
ram karthi bruce
RAM KATHI BRUCE
ram karthi bruce
I want ...