
Creating a sample dataset to reproduce the problem:

library(dplyr)
x <- c('MS','Google','MS','FB','Amazon','Google','IBM','IBM','IBM','MS')
item <- as.data.frame(x, stringsAsFactors = FALSE)
data <- item %>% group_by(x) %>% summarise(n = n())

# A tibble: 5 x 2
  x          n
  <chr>  <int>
1 Amazon     1
2 FB         1
3 Google     2
4 IBM        3
5 MS         3

Now my intent is to create a dataset where all rows with an 'n' count less than 2 are summarized into a single row called 'Other', with their n counts summed as well, like:

x          n
  <chr>  <int>
1 Other      2
2 Google     2
3 IBM        3
4 MS         3

I am able to achieve it with the code below, but I am sure it's not a good way to do this. Please suggest how I can do the same directly with a dplyr query.

data$x[data$n < 2] <- 'Other'
data <- aggregate(n~x, data, FUN = sum)

3 Answers


Here is an idea via dplyr,

library(dplyr)

data %>% 
 mutate(grp = cumsum(c(1, diff(n < 2) != 0)), 
        grp = replace(grp, n >=2, grp[n >= 2] + row_number()[n >= 2])) %>%
 group_by(grp) %>% 
 summarise(x = toString(x), n = sum(n)) %>% 
 ungroup() %>% 
 select(-grp)

which gives,

# A tibble: 4 x 2
  x              n
  <chr>      <int>
1 Amazon, FB     2
2 Google         2
3 IBM            3
4 MS             3

NOTE: If you really want to use 'Other', then add the following at the end of the pipe:

... %>% mutate(x = replace(x, grepl(',', x), 'Other'))

To 'decipher' the cumsum part of the grouping, let's break it down.

We want to create groups where all values within the group are less than 2. However, inevitably, we also create groups for values greater than (or equal to) 2. To avoid summarising those groups, we replace their group ids by adding an incremental value to them. This ensures that groups with values >= 2 contain only one element, so they won't get summarised at the end either.

The trick to getting the groups is to create a logical vector of values less than 2 and take the difference to find where it changes from TRUE to FALSE (hence the ... != 0 part). Since diff drops one value, we add it back manually with c(1, diff(...)). Note that instead of 1 you could put TRUE. The cumsum then creates the groups. To avoid summarising the groups with values >= 2, we replace them by adding their row_number to them. Why row_number? Because it is strictly increasing, which makes all those groups unique.

x <- c(1, 1, 3, 4, 2, 1, 1, 1, 5)

x < 2
#[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
diff(x < 2) != 0
#[1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
cumsum(c(1, diff(x < 2) != 0))
#[1] 1 1 2 2 2 3 3 3 4
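Applying the replace step on top of that completes the picture. A minimal sketch using the summarised counts from the question (the `df` tibble here is just those five rows typed in by hand):

```r
library(dplyr)

# The summarised counts from the question; the two n < 2 rows are adjacent
df <- tibble(x = c("Amazon", "FB", "Google", "IBM", "MS"),
             n = c(1, 1, 2, 3, 3))

df %>%
  mutate(grp = cumsum(c(1, diff(n < 2) != 0)),   # grp: 1 1 2 2 2
         grp = replace(grp, n >= 2, grp[n >= 2] + row_number()[n >= 2]))
# grp is now 1 1 5 6 7: the two n < 2 rows share group 1,
# while each n >= 2 row sits in a group of its own
```

After this, grouping by grp and summarising collapses only the rows that shared a group, i.e. the ones with n < 2.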

3 Comments

thanks buddy, the logic of 'cumsum' was beyond my knowledge, I am still figuring out how it performs such magic
@Vineet Just break it down. I added an example
the last one is also giving the result, please help me understand the issue with it

We could also use case_when within group_by to change 'x' values to 'Other' where 'n' is 1, and then take the sum of 'n' in summarise:

library(dplyr)
data %>% 
   group_by(x = case_when(n == 1 ~ 'Other', 
                          TRUE ~ x)) %>% 
   summarise(n = sum(n))
# A tibble: 4 x 2
#  x          n
#   <chr>  <int>
#1 Google     2
#2 IBM        3
#3 MS         3
#4 Other      2

1 Comment

Haha nice!! I really made it much more complicated than it is :). Oh well, at least they learned the perks of grouping with cumsum and diff

Another option is to combine bind_rows and filter:

library(dplyr)
x <- c('MS','Google','MS','FB','Amazon','Google','IBM','IBM','IBM','MS')
item <- as.data.frame(x,stringsAsFactors = F)
data <- item %>% group_by(x) %>% summarise(n = n())

data %>% {
  bind_rows(filter(., n >= 2), 
            filter(., n < 2) %>% summarise(x = "Other", n = sum(n)))
}

#  x          n
#  <chr>  <int>
#1 Google     2
#2 IBM        3
#3 MS         3
#4 Other      2

