Filter rows based on the dplyr groupby, summarize output

Question

I have a dataset with two columns, metro and State. I run the following dplyr command:

data %>% group_by(metro, State) %>% summarise(count = n())

I get the following output:

metro           State         count 
A                OH            703
A                NJ              3
B                GA           1453
B                CA            456
B                WA            123

I now want to keep only the rows of the data frame that belong to the state with the maximum count in each metro, and drop the rest. After filtering, the following command should give this output:

data %>% group_by(metro, State) %>% summarise(count = n())

   metro           State         count 
    A                OH            703
    B                GA           1453

That is, every metro keeps only one state, the one with the maximum count, and the rest are removed.

Here is my attempt:

data %>% group_by(metro, State) %>% filter(n() == max(n()))

But this returns the same data frame as the input.

Can anybody help me with this? In the output, every metro should have a single state, the one with the maximum count, and the rows for the remaining states should be removed.

Thanks

@Psidom This still gives me the same output: when we group by metro alone, the counts add up, so the maximum entries can't be filtered out.
– haimen, Commented Jul 11, 2016 at 17:49
@Psidom I am able to filter after summarizing. My question is how to filter the original data frame down to the corresponding rows. If the data frame previously had 2738 rows, it should have only 2156 rows after filtering. The rows corresponding to the counts (3, 456, 123) should be removed.
– haimen, Commented Jul 11, 2016 at 17:51

akuiper · Accepted Answer · 2016-07-11 18:07:56Z

4

You need a two-stage group_by: first group by metro and State to get the count, then group by metro alone and keep only the rows whose count equals the maximum count within each metro:

data1 <- data %>% group_by(metro, State) %>% mutate(count = n()) %>% 
                  group_by(metro) %>% filter(count == max(count))

nrow(data1)
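
To make the difference from the original attempt concrete, here is a minimal sketch on made-up data (the toy `data` frame below is an assumption for illustration, not the asker's dataset). It shows that the single-stage filter keeps every row, while the two-stage version keeps only the rows of the top state per metro:

```r
library(dplyr)

# Hypothetical toy data: one row per observation
data <- data.frame(
  metro = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  State = c("OH", "OH", "OH", "NJ", "GA", "GA", "GA", "GA", "CA", "CA"),
  stringsAsFactors = FALSE
)

# Original attempt: inside each (metro, State) group n() is a single number,
# so n() == max(n()) is TRUE for every row and nothing is dropped
attempt <- data %>%
  group_by(metro, State) %>%
  filter(n() == max(n())) %>%
  ungroup()

# Two-stage version: attach the per-(metro, State) count to every row,
# then compare counts across states within each metro
data1 <- data %>%
  group_by(metro, State) %>%
  mutate(count = n()) %>%
  group_by(metro) %>%
  filter(count == max(count)) %>%
  ungroup()

nrow(attempt)  # 10, same as the input
nrow(data1)    # 7: three A/OH rows plus four B/GA rows
```

Note that if two states tie for the maximum count within a metro, this filter keeps the rows of both.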

edited Jul 11, 2016 at 18:07

answered Jul 11, 2016 at 17:52

akuiper


7 Comments

akuiper Over a year ago

I am not sure why you get different results. But this seems to be working for me. Is this what you need?

haimen Over a year ago

This is not what my question asks for. I have explained this in my comment. I want to remove the corresponding rows in the original data frame, not the summarized rows.

haimen Over a year ago

Doesn't this give me only the counts? I want to filter the entire original data frame and reduce its number of rows, not get the counts.

akuiper Over a year ago

I think you should try the answer first. It will still give you the whole data frame just with a few rows filtered out.
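
The point about keeping the whole data frame comes down to `mutate()` versus `summarise()`. A small sketch on made-up data (the toy data frame is an assumption, and the `.groups` argument assumes dplyr 1.0 or later):

```r
library(dplyr)

# Hypothetical toy data, five observations
data <- data.frame(
  metro = c("A", "A", "A", "B", "B"),
  State = c("OH", "OH", "NJ", "GA", "GA"),
  stringsAsFactors = FALSE
)

# summarise() collapses each (metro, State) group to a single row
summarised <- data %>%
  group_by(metro, State) %>%
  summarise(count = n(), .groups = "drop")

# mutate() keeps every original row and only adds the count column
mutated <- data %>%
  group_by(metro, State) %>%
  mutate(count = n()) %>%
  ungroup()

nrow(summarised)  # 3, one row per (metro, State) pair
nrow(mutated)     # 5, same as the input
```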


akrun · Answer · 2016-07-12 02:50:30Z

0

We can also use data.table:

library(data.table)
# Add the per-(metro, State) count by reference, then keep the max-count rows per metro
setDT(data)[, count := .N, by = .(metro, State)][, .SD[count == max(count)], by = metro]
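
A minimal sketch of the same data.table idea on made-up data (the toy table `dt` is an assumption for illustration, and the column is spelled `State` to match the question's code):

```r
library(data.table)

# Hypothetical toy data: one row per observation
dt <- data.table(
  metro = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  State = c("OH", "OH", "OH", "NJ", "GA", "GA", "GA", "GA", "CA", "CA")
)

# Step 1: add the size of each (metro, State) group to every row, by reference
dt[, count := .N, by = .(metro, State)]

# Step 2: within each metro, keep only the rows whose count is the maximum
res <- dt[, .SD[count == max(count)], by = metro]

nrow(res)  # 7: three A/OH rows plus four B/GA rows
```

Because `:=` modifies `dt` in place, the `count` column remains on `dt` after step 1; drop it with `dt[, count := NULL]` if it is not wanted.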

answered Jul 12, 2016 at 2:50

akrun
