
Creating a sample dataset to reproduce the problem:

library(dplyr)
x <- c('MS','Google','MS','FB','Amazon','Google','IBM','IBM','IBM','MS')
item <- as.data.frame(x, stringsAsFactors = FALSE)
data <- item %>% group_by(x) %>% summarise(n = n())

# A tibble: 5 x 2
  x          n
  <chr>  <int>
1 Amazon     1
2 FB         1
3 Google     2
4 IBM        3
5 MS         3

Now my intent is to create a dataset where all rows with an 'n' count less than 2 are summarized into a single row called 'Other', with their n counts summed as well, like:

x          n
  <chr>  <int>
1 Other      2
2 Google     2
3 IBM        3
4 MS         3

I am able to achieve it with the code below, but I am sure it's not a good way to do this. Please suggest how I can do the same directly with a dplyr query.

data$x[data$n < 2] <- 'Other'
data <- aggregate(n~x, data, FUN = sum)

3 Answers


Here is an idea via dplyr,

library(dplyr)

data %>% 
 mutate(grp = cumsum(c(1, diff(n < 2) != 0)), 
        grp = replace(grp, n >=2, grp[n >= 2] + row_number()[n >= 2])) %>%
 group_by(grp) %>% 
 summarise(x = toString(x), n = sum(n)) %>% 
 ungroup() %>% 
 select(-grp)

which gives,

# A tibble: 4 x 2
  x              n
  <chr>      <int>
1 Amazon, FB     2
2 Google         2
3 IBM            3
4 MS             3

NOTE: If you really want to use 'Other', then add the following at the end of the pipe:

... %>% mutate(x = replace(x, grepl(',', x), 'Other'))

To 'decipher' the cumsum part of the grouping, let's break it down.

We want to create groups where all values within the group are less than 2. However, inevitably, we also create groups for values greater than (or equal to) 2. To avoid summarising those groups, we replace their group ids by adding an incremental value to them. This ensures that groups with values >= 2 contain only one element, so they won't get summarised at the end either.

The trick to getting the groups is to create a logical vector of values less than 2 and take the difference to find where it changes from TRUE to FALSE (hence the ... != 0 part). Since diff drops one value, we add it back manually with c(1, diff(...)). Note that instead of 1 you could put TRUE. The cumsum then creates the groups. To avoid summarising the groups with values >= 2, we replace them by adding their row_number to them. Why row_number? Because it is strictly increasing, which makes all those groups unique.

x <- c(1, 1, 3, 4, 2, 1, 1, 1, 5)

x < 2
#[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
diff(x < 2) != 0
#[1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
cumsum(c(1, diff(x < 2) != 0))
#[1] 1 1 2 2 2 3 3 3 4
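Applying the replace step on top of that completes the picture. A minimal sketch using the summarised counts from the question (the `df` tibble here is just those five rows typed in by hand):

```r
library(dplyr)

# The summarised counts from the question; the two n < 2 rows are adjacent
df <- tibble(x = c("Amazon", "FB", "Google", "IBM", "MS"),
             n = c(1, 1, 2, 3, 3))

df %>%
  mutate(grp = cumsum(c(1, diff(n < 2) != 0)),   # grp: 1 1 2 2 2
         grp = replace(grp, n >= 2, grp[n >= 2] + row_number()[n >= 2]))
# grp is now 1 1 5 6 7: the two n < 2 rows share group 1,
# while each n >= 2 row sits in a group of its own
```

After this, grouping by grp and summarising collapses only the rows that shared a group, i.e. the ones with n < 2.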

3 Comments

thanks buddy, the logic of 'cumsum' was beyond my knowledge, I am still figuring out how it performs such magic
@Vineet Just break it down. I added an example
the last one is also giving the result, please help me understand the issue with it

We could also use case_when within group_by to change 'x' values to 'Other' where 'n' is 1, and then take the sum of 'n' in summarise:

library(dplyr)
data %>% 
   group_by(x = case_when(n == 1 ~ 'Other', 
                          TRUE ~ x)) %>% 
   summarise(n = sum(n))
# A tibble: 4 x 2
#  x          n
#   <chr>  <int>
#1 Google     2
#2 IBM        3
#3 MS         3
#4 Other      2

1 Comment

Haha nice!! I really made it much more complicated than it is :). Oh well, at least they learned the perks of grouping with cumsum and diff

Another option is to combine bind_rows and filter:

library(dplyr)
x <- c('MS','Google','MS','FB','Amazon','Google','IBM','IBM','IBM','MS')
item <- as.data.frame(x,stringsAsFactors = F)
data <- item %>% group_by(x) %>% summarise(n = n())

data %>% {
  bind_rows(filter(., n >= 2), 
            filter(., n < 2) %>% summarise(x = "Other", n = sum(n)))
}

#  x          n
#  <chr>  <int>
#1 Google     2
#2 IBM        3
#3 MS         3
#4 Other      2

