I have a DataFrame like the one below:
+---------+----------+--------+-----+------+------+----------------------------------------------------------------------------------------------------+
|firstname|middlename|lastname|id |gender|salary|meta |
+---------+----------+--------+-----+------+------+----------------------------------------------------------------------------------------------------+
|James | |Smith |36636|M |3000 |{"firstname":"James","middlename":"","lastname":"Smith","id":"36636","gender":"M","salary":3000} |
|Michael |Rose | |40288|M |4000 |{"firstname":"Michael","middlename":"Rose","lastname":"","id":"40288","gender":"M","salary":4000} |
|Robert | |Williams|42114|M |4000 |{"firstname":"Robert","middlename":"","lastname":"Williams","id":"42114","gender":"M","salary":4000}|
|Maria |Anne |Jones |39192|F |4000 |{"firstname":"Maria","middlename":"Anne","lastname":"Jones","id":"39192","gender":"F","salary":4000}|
|Jen |Mary |Brown | |F |-1 |{"firstname":"Jen","middlename":"Mary","lastname":"Brown","id":"","gender":"F","salary":-1} |
+---------+----------+--------+-----+------+------+----------------------------------------------------------------------------------------------------+
Now, there is a UDF to which I need to pass each row of the meta column, one row at a time. However, I am only able to pass the first row.
Below is the code:
def parse_and_post(col):
    for i in col.collect():
        print(i)
        result = json.loads(i)
        print(result["firstname"])
        # Below is a sample check
        if result["firstname"] == "James":
            return 200
        else:
            return -1
        # Actual check is as follows
        # Format the record in the form of list
        # get token
        # response = send the record to the API
        # return the response

new_df_status = new_df.withColumn("status_of_each_record", lit(parse_and_post(new_df.rdd.map(lambda x: x["meta"]))))
When I execute this code, I get the output below. However, only the first record should have status 200; the rest should be -1:
{"firstname":"James","middlename":"","lastname":"Smith","id":"36636","gender":"M","salary":3000}
James
+---------+----------+--------+-----+------+------+----------------------------------------------------------------------------------------------------+---------------------+
|firstname|middlename|lastname|id |gender|salary|meta |status_of_each_record|
+---------+----------+--------+-----+------+------+----------------------------------------------------------------------------------------------------+---------------------+
|James | |Smith |36636|M |3000 |{"firstname":"James","middlename":"","lastname":"Smith","id":"36636","gender":"M","salary":3000} |200 |
|Michael |Rose | |40288|M |4000 |{"firstname":"Michael","middlename":"Rose","lastname":"","id":"40288","gender":"M","salary":4000} |200 |
|Robert | |Williams|42114|M |4000 |{"firstname":"Robert","middlename":"","lastname":"Williams","id":"42114","gender":"M","salary":4000}|200 |
|Maria |Anne |Jones |39192|F |4000 |{"firstname":"Maria","middlename":"Anne","lastname":"Jones","id":"39192","gender":"F","salary":4000}|200 |
|Jen |Mary |Brown | |F |-1 |{"firstname":"Jen","middlename":"Mary","lastname":"Brown","id":"","gender":"F","salary":-1} |200 |
+---------+----------+--------+-----+------+------+----------------------------------------------------------------------------------------------------+---------------------+
How do I iterate over each row of the meta column? What exactly am I doing wrong here?
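
For reference, my current understanding of the behaviour is this: parse_and_post is called only once, on the driver, while the withColumn expression is being built. The return inside the for loop exits on the first collected element (James, so 200), and lit() then attaches that single constant to every row. Below is a minimal sketch of what a per-row version might look like. It assumes meta is a string column holding JSON, and the helper name add_status_column is hypothetical (not part of my original code). The function handles one value at a time and is wrapped with udf so that Spark calls it once per row.

```python
import json


def parse_and_post(meta_json):
    """Handle ONE meta string and return a status code for that row."""
    result = json.loads(meta_json)
    # Sample check from the question; the real API call would go here instead.
    if result["firstname"] == "James":
        return 200
    return -1


def add_status_column(df):
    """Return df with a per-row status column (assumes a string column "meta").

    Hypothetical helper, introduced only for this sketch.
    """
    # Imported here so parse_and_post can be exercised without a Spark install.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    # udf() wraps the plain function so Spark invokes it once per row.
    parse_and_post_udf = udf(parse_and_post, IntegerType())
    return df.withColumn("status_of_each_record", parse_and_post_udf(col("meta")))


# Usage with the DataFrame from the question:
# new_df_status = add_status_column(new_df)
# new_df_status.show(truncate=False)
```

With this shape there is no collect() and no loop: Spark does the iteration. One consequence is that the function body runs on the executors, so any print output would probably end up in the executor logs rather than in the driver console.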