I have a LARGE, sorted PySpark dataframe "df" that I need to step through in order and, for each row, apply the following logic:
If "row['col1'] == nextrow['col1']:
If nextrow['col3'] == 1:
thisrow['col4'] == 1
For example, given:
# +---+----+----+----+
# |id |col1|col3|col4|
# +---+----+----+----+
# |1 |33 |1 |0 |
# |2 |33 |0 |0 |
# |3 |33 |0 |0 |
# |4 |11 |1 |0 |
# |5 |11 |1 |0 |
# |6 |22 |0 |0 |
# |7 |22 |1 |0 |
# +---+----+----+----+
it would generate:
# +---+----+----+----+
# |id |col1|col3|col4|
# +---+----+----+----+
# |1 |33 |1 |0 |
# |2 |33 |0 |0 |
# |3 |33 |0 |0 |
# |4 |11 |1 |1 |
# |5 |11 |1 |0 |
# |6 |22 |0 |1 |
# |7 |22 |1 |0 |
# +---+----+----+----+
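To pin down the rule, here is the same logic written as a plain Python loop over a list of row dicts (purely illustrative, using the example data above):

rows = [
    {"id": 1, "col1": 33, "col3": 1, "col4": 0},
    {"id": 2, "col1": 33, "col3": 0, "col4": 0},
    {"id": 3, "col1": 33, "col3": 0, "col4": 0},
    {"id": 4, "col1": 11, "col3": 1, "col4": 0},
    {"id": 5, "col1": 11, "col3": 1, "col4": 0},
    {"id": 6, "col1": 22, "col3": 0, "col4": 0},
    {"id": 7, "col1": 22, "col3": 1, "col4": 0},
]

# Look at each row together with the row that follows it.
for thisrow, nextrow in zip(rows, rows[1:]):
    # Only act when the next row is in the same col1 group and its col3 is 1.
    if thisrow["col1"] == nextrow["col1"] and nextrow["col3"] == 1:
        thisrow["col4"] = 1

# After the loop, the rows with id 4 and 6 have col4 == 1, matching the second table.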
I know Spark dataframes are immutable, so what is the best way to do this? I've thought about converting the dataframe to an RDD and applying a function via map with a lambda, but I don't know how to determine which row I'm on without adding an index column.
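One direction I'm considering instead (a rough, untested sketch, not a working solution) is a window function with lead over each col1 group; it assumes the id column captures the existing sort order within each group, which is an assumption on my part:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Within each col1 group, order by id and peek at the next row's col3.
w = Window.partitionBy("col1").orderBy("id")

result = df.withColumn(
    "col4",
    F.when(F.lead("col3").over(w) == 1, F.lit(1)).otherwise(F.col("col4")),
)

Would something along these lines be reasonable for a dataframe this size, or is the RDD route the better option?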