Linked Questions

1 vote
0 answers
2k views

I am trying to understand the use of monotonically_increasing_id in Spark SQL. Can anyone explain, with an example, why we need monotonically increasing ids in the case of DataFrames?
Nikhil Mishra
1 vote
2 answers
944 views

I have a DataFrame and I want to add a new column, but not based on an existing column. What should I do? This is my dataframe: +----+ |time| +----+ | 1| | 4| | 3| | 2| | 5| | 7| | 3| | 5| +--...
mentongwu
-1 votes
1 answer
763 views

There are a lot of examples that all give the same basic example. dfWithIndex = df.withColumn('f_index', \ pyspark.sql.functions.lit(1).cast(pyspark.sql.types.LongType())) rdd = df.rdd.zipWithIndex(...
PJ Evans
18 votes
2 answers
13k views

I'm using Spark to read in records (in this case in csv files) and process them. The files are already in some order, but this order isn't reflected by any column (think of it as a time series, but ...
Jason Evans
10 votes
3 answers
30k views

I have a dataframe where I have to generate a unique id in one of the columns. This id has to be generated with an offset, because I need to persist this dataframe with the autogenerated id. Now if ...
Ayan Biswas
2 votes
2 answers
10k views

I am 'translating' Python code to PySpark. I would like to use an existing column as the index for a dataframe. I did this in Python using pandas. The small piece of code below explains what I did. ...
Daniel Thereza
5 votes
1 answer
2k views

Using PySpark's ML module, the following steps often occur (after data cleaning, etc): Perform feature and target transform pipeline Create model Generate predictions from the model Merge predictions ...
Mike Williamson
2 votes
2 answers
3k views

The question below has solutions for Scala and PySpark, but the solution provided in it does not produce consecutive index values. Spark Dataframe :How to add a index Column : Aka Distributed ...
user0204
1 vote
1 answer
4k views

Here is my dataframe: I am looking for the right way to replace the city's value based on the name, for example, case name when 'Alice' then 'New York' when 'Alex' then 'LA' when 'Aaron' then 'Beijing' ...
mdivk
0 votes
2 answers
236 views

I have 2 DataFrames: Users (~29.000.000 entries) |-- userId: string (nullable = true) Impressions (~1000 entries) |-- modules: array (nullable = true) | |-- element: struct (containsNull = true) ...
esbej
1 vote
1 answer
117 views

I am trying to combine columns from different dataframes into one for analysis. I am collecting all the columns I need into a dictionary. I now have a dictionary like this - newDFDict = { '...
raghhuveer-jaikanth
2 votes
1 answer
109 views

I have a CSV file with three columns of values: 1st Column 2nd Column 3rd Column ram karthi bruce RAM KATHI BRUCE ram karthi bruce I want ...
himanish