I have a dataset with two columns, metro and State. I run the following dplyr command:

data %>% group_by(metro, State) %>% summarise(count = n())

I get the following output,

metro           State         count 
A                OH            703
A                NJ              3
B                GA           1453
B                CA            456
B                WA            123

I now want to keep only the rows that belong to the maximum count within each metro and drop the rest, removing the corresponding rows from the data frame. After filtering, the output of the following command should be:

data %>% group_by(metro, State) %>% summarise(count = n())

   metro           State         count 
    A                OH            703
    B                GA           1453

That is, every metro keeps only one state, the one with the maximum count, and the remaining states are removed.

This is what I tried:

data %>% group_by(metro, State) %>% filter(n() == max(n()))

But this returns the same data frame as the input.

Can anybody help me with this? In the output, every metro should have a single state, the one with the maximum count, and the entries for the remaining states should be removed.

Thanks

  • data %>% group_by(metro) %>% filter(count == max(count)) Commented Jul 11, 2016 at 17:44
  • @Psidom This still gives me the same output because when we group by metro, the count adds up and we can't filter out maximum entries. Commented Jul 11, 2016 at 17:49
  • @Psidom I am able to filter after summarizing. My question is how to filter the original data frame down to the corresponding rows. If the data frame previously had 2738 rows, I need it to have only 2156 rows after filtering. The rows corresponding to the counts (3, 456, 123) should be removed. Commented Jul 11, 2016 at 17:51
  • Please add a reproducible example. Commented Jul 11, 2016 at 17:59

2 Answers

You need a two-stage group by: first group by metro and State and add the count to every row with mutate(), then group by metro and keep only the rows whose count equals the maximum count within that metro:

data1 <- data %>% group_by(metro, State) %>% mutate(count = n()) %>% 
                  group_by(metro) %>% filter(count == max(count))

nrow(data1)
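
To see that this keeps the original rows rather than a summary, here is a minimal sketch on made-up data. The toy data frame below is an assumption that stands in for `data`, since the question does not include a reproducible example; only the column names metro and State come from the question.

```r
library(dplyr)

# Hypothetical toy data standing in for `data` (the real data set is not shown)
data <- data.frame(
  metro = c(rep("A", 5), rep("B", 6)),
  State = c("OH", "OH", "OH", "OH", "NJ",
            "GA", "GA", "GA", "CA", "CA", "WA"),
  stringsAsFactors = FALSE
)

# Stage 1: attach the per-(metro, State) count to every row; mutate() keeps all rows
# Stage 2: regroup by metro only and keep the rows whose count is the largest
data1 <- data %>%
  group_by(metro, State) %>%
  mutate(count = n()) %>%
  group_by(metro) %>%
  filter(count == max(count)) %>%
  ungroup()

nrow(data)   # 11 rows before filtering
nrow(data1)  # 7 rows after: 4 for A/OH and 3 for B/GA
```

The attempt in the question returns everything because n() is a single number per group, so n() == max(n()) is always TRUE. Note also that filter(count == max(count)) keeps ties: if two states in a metro share the maximum count, the rows for both are retained.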

7 Comments

I am not sure why you get different results. But this seems to be working for me. Is this what you need?
This is not the requirement of my question. I have given my explanation in the comment. I want to remove the corresponding rows in the data frame. Not the summarized rows.
I am able to filter after summarizing. My question is how to filter the original data frame down to the corresponding rows. If the data frame previously had 2738 rows, I need it to have only 2156 rows after filtering. The rows corresponding to the counts (3, 456, 123) should be removed.
Doesn't this give me only the counts? What I want is to filter the entire original data frame and reduce its number of rows, not to get the counts.
I think you should try the answer first. It will still give you the whole data frame just with a few rows filtered out.

We can also use data.table

library(data.table)
setDT(data)[, count := .N, by = .(metro, State)][, .SD[count == max(count)], by = metro]
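
As a rough check of the same idea, here is a sketch on made-up data. The table `dt` and its values are assumptions for illustration, with the columns assumed to be named metro and State as in the question's code.

```r
library(data.table)

# Hypothetical toy data; the real data set is not shown in the question
dt <- data.table(
  metro = c(rep("A", 5), rep("B", 6)),
  State = c("OH", "OH", "OH", "OH", "NJ",
            "GA", "GA", "GA", "CA", "CA", "WA")
)

# Add the per-(metro, State) row count by reference; no rows are dropped here
dt[, count := .N, by = .(metro, State)]

# Within each metro, keep the rows whose count equals the metro's maximum
res <- dt[, .SD[count == max(count)], by = metro]

nrow(res)  # 7 of the original 11 rows remain
```

Keep in mind that setDT() and := both modify the object by reference, so the original `data` is converted to a data.table and gains a count column as a side effect.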
