In a previous SO post, "How to identify partial duplicates of rows in R", I asked how to identify partially duplicated rows. Here's what I asked:
I would like to identify "partial" matches of rows in a dataframe. Specifically, I want to create a new column with a value of 1 if a particular row in a dataframe has a duplicate row somewhere else in the dataframe based on a match between a subset of columns. An added complexity is that one of the columns in the dataframe is numeric and I want to match if the absolute values match.
The issue is that a row should be identified as partially duplicated ONLY if the numeric column that's part of the match holds the mirror-opposite value (same magnitude, opposite sign), and not just a match on the absolute value. To make things clearer, here's the sample data from the previous post:
name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon")
state<-c("California", "Indiana", "Florida", "California")
num<-c("-258", "123", "42", "258")
date<-c("day 2", "day 15", "day 3","day 45")
(df<-as.data.frame(cbind(name,state,num, date)))
           name      state  num   date
1 Richard Nixon California -258  day 2
2  Bill Clinton    Indiana  123 day 15
3   George Bush    Florida   42  day 3
4 Richard Nixon California  258 day 45
Here was the solution to my previous post:
df$absnum = abs(as.numeric(as.character(df$num)))
df$newcol = duplicated(df[,c('name','state', 'absnum')]) |
  duplicated(df[,c('name','state', 'absnum')], fromLast = TRUE)
#            name      state  num   date absnum newcol
# 1 Richard Nixon California -258  day 2    258   TRUE
# 2  Bill Clinton    Indiana  123 day 15    123  FALSE
# 3   George Bush    Florida   42  day 3     42  FALSE
# 4 Richard Nixon California  258 day 45    258   TRUE
Note that rows 1 and 4 are labeled TRUE under newcol, which is fine. Here is new sample data that shows the added complexity:
name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon", "Bill Clinton")
state<-c("California", "Indiana", "Florida", "California", "Indiana")
num<-c("-258", "123", "42", "258", "123")
date<-c("day 2", "day 15", "day 3","day 45", "day 100")
(df<-as.data.frame(cbind(name,state,num, date)))
           name      state  num    date
1 Richard Nixon California -258   day 2
2  Bill Clinton    Indiana  123  day 15
3   George Bush    Florida   42   day 3
4 Richard Nixon California  258  day 45
5  Bill Clinton    Indiana  123 day 100
Note that observations 2 and 5 are partial duplicates too, but not in the same way as 1 and 4: their num values are identical rather than opposite in sign. I need TRUE only for those observations whose absolute values match BUT whose original values do NOT. So I want the result to be the following:
           name      state  num    date newcol
1 Richard Nixon California -258   day 2   TRUE
2  Bill Clinton    Indiana  123  day 15  FALSE
3   George Bush    Florida   42   day 3  FALSE
4 Richard Nixon California  258  day 45   TRUE
5  Bill Clinton    Indiana  123 day 100  FALSE
The solution provided in the previous SO post would also apply TRUE to rows 2 and 5, whereas I would like TRUE applied only to rows 1 and 4.
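For illustration, here is one way the rule described above (same name and state, same magnitude, opposite sign) might be sketched in base R. This is only a sketch under some assumptions: num parses cleanly as a number, the names num_val, own_key and mirror_key are made up for this example, and a tab character is assumed never to appear in name or state, so it can be used as a separator.

```r
# Sample data from the question (num is stored as character, as in the original).
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush",
            "Richard Nixon", "Bill Clinton"),
  state = c("California", "Indiana", "Florida", "California", "Indiana"),
  num   = c("-258", "123", "42", "258", "123"),
  date  = c("day 2", "day 15", "day 3", "day 45", "day 100"),
  stringsAsFactors = FALSE
)

# Hypothetical helper vector: num converted to numeric.
num_val <- as.numeric(df$num)

# One key per row as it is, and one key for its sign-flipped counterpart.
# Assumption: "\t" does not occur in name or state.
own_key    <- paste(df$name, df$state, num_val,  sep = "\t")
mirror_key <- paste(df$name, df$state, -num_val, sep = "\t")

# TRUE only when some row carries the opposite-signed value for the same
# name and state; the num_val != 0 guard stops a zero from matching itself.
df$newcol <- num_val != 0 & mirror_key %in% own_key

df
```

On the sample data this flags rows 1 and 4 only, because rows 2 and 5 share the same sign and therefore have no mirror key among the existing rows.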