1

I have a dataframe that contains the sales figures for different items across different sales outlets. The dataframe below shows only a few of the items and a few of the outlets. There is a benchmark of 100 sales per day for each item. Each item that sold more than 100 is marked "Yes", and each item below 100 is marked "No".

val df1 = Seq(
("Mumbai", 90,  109, , 101, 78, ............., "No", "Yes", "Yes", "No", .....),
("Singapore", 149,  129, , 201, 107, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Hawaii", 127,  101, , 98, 109, ............., "Yes", "Yes", "No", "Yes", .....),
("New York", 146,  130, , 173, 117, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Los Angeles", 94,  99, , 95, 113, ............., "No", "No", "No", "Yes", .....),
("Dubai", 201,  229, , 265, 317, ............., "Yes", "Yes", "Yes", "Yes", .....),
("Bangalore", 56,  89, , 61, 77, ............., "No", "No", "No", "No", .....))
.toDF("Outlet","Boys_Toys","Girls_Toys","Men_Shoes","Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", .....)

Now, I want to add a column "Count_of_Yes" whose value, for each sales outlet (each row), is the total number of "Yes" values in that row. How do I iterate over each row to get the count of Yes?

My expected dataframe is:

val output_df = Seq(
("Mumbai", 90,  109, , 101, 78, ............., "No", "Yes", "Yes", "No", ....., 2),
("Singapore", 149,  129, , 201, 107, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Hawaii", 127,  101, , 98, 109, ............., "Yes", "Yes", "No", "Yes", ....., 3),
("New York", 146,  130, , 173, 117, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Los Angeles", 94,  99, , 95, 113, ............., "No", "No", "No", "Yes", ....., 1),
("Dubai", 201,  229, , 265, 317, ............., "Yes", "Yes", "Yes", "Yes", ....., 4),
("Bangalore", 56,  89, , 61, 77, ............., "No", "No", "No", "No", ....., 0))
.toDF("Outlet","Boys_Toys","Girls_Toys","Men_Shoes","Ladies_shoes", ............., "BT>100", "GT>100", "MS>100", "LS>100", ....., "Count_of_Yes")

2 Answers

2

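You can convert the selected columns into an array of 1s (for "Yes") and 0s (for "No"), then sum the array elements with the aggregate higher-order function (available since Spark 2.4) in a SQL expression via selectExpr. The example below uses a simplified dataframe whose flag columns hold "Y"/"N" instead of "Yes"/"No":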

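import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(
  (1, 120, 80, 150, "Y", "N", "Y"),
  (2, 50, 90, 110, "N", "N", "Y"),
  (3, 70, 160, 90, "N", "Y", "N")
).toDF("id", "qty_a", "qty_b", "qty_c", "over100_a", "over100_b", "over100_c")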

val cols = df.columns.filter(_.startsWith("over100_"))

df.
  withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
  selectExpr("*", "aggregate(arr, 0, (acc, x) -> acc + x) as yes_count").
  show
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
// | id|qty_a|qty_b|qty_c|over100_a|over100_b|over100_c|      arr|yes_count|
// +---+-----+-----+-----+---------+---------+---------+---------+---------+
// |  1|  120|   80|  150|        Y|        N|        Y|[1, 0, 1]|        2|
// |  2|   50|   90|  110|        N|        N|        Y|[0, 0, 1]|        1|
// |  3|   70|  160|   90|        N|        Y|        N|[0, 1, 0]|        1|
// +---+-----+-----+-----+---------+---------+---------+---------+---------+

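Alternatively, use explode and groupBy/agg to sum the array elements. Note that the result contains only the grouping key ("id") and "yes_count", so join it back to df if you need the other columns: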

df.
  withColumn("arr", array(cols.map(c => when(col(c) === "Y", 1).otherwise(0)): _*)).
  withColumn("flattened", explode($"arr")).
  groupBy("id").agg(sum($"flattened").as("yes_count"))
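As a variation on the approaches above, the per-column 1/0 expressions can also be added together directly, without building an intermediate array. The following is only a sketch under some assumptions: it uses three of the question's rows with only the four named items (the elided columns are left out), it assumes the flag columns can be recognised by the ">100" suffix, and it creates its own local SparkSession so that it can be pasted as-is.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

// Local session for the sketch; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("count-yes").getOrCreate()
import spark.implicits._

// Subset of the question's data (assumption: only the four named items).
val df1 = Seq(
  ("Mumbai", 90, 109, 101, 78, "No", "Yes", "Yes", "No"),
  ("Singapore", 149, 129, 201, 107, "Yes", "Yes", "Yes", "Yes"),
  ("Bangalore", 56, 89, 61, 77, "No", "No", "No", "No")
).toDF("Outlet", "Boys_Toys", "Girls_Toys", "Men_Shoes", "Ladies_shoes",
  "BT>100", "GT>100", "MS>100", "LS>100")

// Assumption: every flag column name ends with ">100".
val flagCols = df1.columns.filter(_.endsWith(">100"))

// One 1/0 expression per flag column, added up into a single Column.
// Backticks guard against the special characters in the column names.
val yesCount = flagCols
  .map(c => when(col(s"`$c`") === "Yes", 1).otherwise(0))
  .foldLeft(lit(0))(_ + _)

val outputDf = df1.withColumn("Count_of_Yes", yesCount)
outputDf.show()
```

With this data, Count_of_Yes comes out as 2 for Mumbai, 4 for Singapore and 0 for Bangalore. Starting the fold from lit(0) keeps the expression valid even if no flag column matches the suffix.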

1 Comment

Leo C, thanks for this wonderful answer. It solved the problem.
-1

"How do I iterate over each row to get the count of Yes?"

You can use a map transformation to transform each record. In your case, the function passed to df.map() should count the number of "Yes" values in the row and emit a new record that carries this count as an additional column.

Pseudo code as follows -

df.map(row => <count the "Yes" values in row and append the count as a new field>)
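One possible way to turn the pseudo code into something runnable is sketched below. It is an assumption-laden illustration rather than the only way to do it: it maps over the underlying RDD (df.rdd.map) instead of calling df.map directly, which avoids having to supply an encoder for the widened row, and then rebuilds a dataframe with an extended schema. The sample rows keep only the flag columns from the question, and the flag columns are assumed to end with ">100".

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField}

// Local session for the sketch; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("count-yes-map").getOrCreate()
import spark.implicits._

// Reduced version of the question's data (assumption: flag columns only).
val df = Seq(
  ("Mumbai", "No", "Yes", "Yes", "No"),
  ("Singapore", "Yes", "Yes", "Yes", "Yes"),
  ("Bangalore", "No", "No", "No", "No")
).toDF("Outlet", "BT>100", "GT>100", "MS>100", "LS>100")

// Positions of the flag columns (assumption: their names end with ">100").
val flagIdx = df.columns.zipWithIndex.collect {
  case (name, i) if name.endsWith(">100") => i
}

// Map over every row: count the "Yes" flags and append the count as a new field.
val countedRows = df.rdd.map { row =>
  val yesCount = flagIdx.count(i => row.get(i) == "Yes")
  Row.fromSeq(row.toSeq :+ yesCount)
}

// The new rows have one more field, so the schema needs one more column.
val newSchema = df.schema.add(StructField("Count_of_Yes", IntegerType, nullable = false))
val result = spark.createDataFrame(countedRows, newSchema)
result.show()
```

With this data, Count_of_Yes is 2 for Mumbai, 4 for Singapore and 0 for Bangalore. Row-by-row mapping like this is opaque to Spark's optimizer, so a column-expression approach is usually preferable when one is available.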

1 Comment

Thanks Amit. Do you mind sharing code to achieve this? I do not fully grasp the concept. Anticipating your response. Thanks
