How to identify mirrored duplicates of rows in R

Question

In an earlier SO post, "How to identify partial duplicates of rows in R", I asked how to identify partially duplicated rows. Here's what I asked:

I would like to identify "partial" matches of rows in a dataframe. Specifically, I want to create a new column with a value of 1 if a particular row in a dataframe has a duplicate row somewhere else in the dataframe, based on a match between a subset of columns. An added complexity is that one of the columns in the dataframe is numeric, and I want it to count as a match if the absolute values match.

The issue is that a row should be identified as partially duplicated ONLY if ONE of the columns in the match holds the mirror-opposite value (same magnitude, opposite sign), and not merely the same absolute value. To make things clearer, here's the sample data from the previous post:

name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon")
state<-c("California", "Indiana", "Florida", "California")
num<-c("-258", "123", "42", "258")
date<-c("day 2", "day 15", "day 3","day 45")
(df<-as.data.frame(cbind(name,state,num, date)))
           name      state  num   date
1 Richard Nixon California -258  day 2
2  Bill Clinton    Indiana  123 day 15
3   George Bush    Florida   42  day 3
4 Richard Nixon California  258 day 45

Here was the solution to my previous post:

df$absnum = abs(as.numeric(as.character(df$num)))
df$newcol = duplicated(df[,c('name','state', 'absnum')]) | 
  duplicated(df[,c('name','state', 'absnum')], fromLast = T)

#            name      state  num   date absnum newcol
# 1 Richard Nixon California -258  day 2    258   TRUE
# 2  Bill Clinton    Indiana  123 day 15    123  FALSE
# 3   George Bush    Florida   42  day 3     42  FALSE
# 4 Richard Nixon California  258 day 45    258   TRUE

Note that row 1 and row 4 are labeled TRUE under newcol, which is fine. And here is new sample data with the added complexity issue:

name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon", "Bill Clinton")
state<-c("California", "Indiana", "Florida", "California", "Indiana")
num<-c("-258", "123", "42", "258", "123")
date<-c("day 2", "day 15", "day 3","day 45", "day 100")
(df<-as.data.frame(cbind(name,state,num, date)))

  name           state      num   date
1 Richard Nixon  California -258  day 2
2 Bill Clinton   Indiana    123   day 15
3 George Bush    Florida    42    day 3
4 Richard Nixon  California 258   day 45
5 Bill Clinton   Indiana    123   day 100

Note that observations 2 and 5 are partial duplicates but not in the same way as 1 and 4. I need to apply TRUE only to those observations in which their absolute values match BUT NOT their original values. So I want the result to return the following:

  name           state      num   date    newcol
1 Richard Nixon  California -258  day 2   TRUE
2 Bill Clinton   Indiana    123   day 15  FALSE
3 George Bush    Florida    42    day 3   FALSE
4 Richard Nixon  California 258   day 45  TRUE
5 Bill Clinton   Indiana    123   day 100 FALSE

The solution from the previous SO post would also apply TRUE to rows 2 and 5, whereas I would like it applied only to rows 1 and 4.
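The over-flagging described above can be reproduced with a short sketch. It rebuilds the five-row sample data and applies the earlier absolute-value test unchanged; the variable `key` is a name introduced here purely for illustration.

```r
# Five-row sample data from the question, with columns kept as character
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush",
            "Richard Nixon", "Bill Clinton"),
  state = c("California", "Indiana", "Florida", "California", "Indiana"),
  num   = c("-258", "123", "42", "258", "123"),
  date  = c("day 2", "day 15", "day 3", "day 45", "day 100"),
  stringsAsFactors = FALSE
)

# Earlier approach: match on name, state and the absolute value of num only
df$absnum <- abs(as.numeric(df$num))
key <- df[, c("name", "state", "absnum")]   # illustrative helper variable
df$newcol <- duplicated(key) | duplicated(key, fromLast = TRUE)

df$newcol
# TRUE TRUE FALSE TRUE TRUE  (rows 2 and 5 are flagged as well)
```

Rows 2 and 5 come out TRUE because their absolute values match, even though their signed values are identical rather than mirrored.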

Do you always have at most two entries per name and state? What do you do in scenarios where there are 3+? Do you compare entries 1 & 2, then 2 & 3, and then 1 & 3?
– Chase, Commented Feb 13, 2019 at 2:54
@Chase I opened up a new SO question that tries to tackle the issue you mentioned: stackoverflow.com/questions/54665416/…
– Cyrus Mohammadian, Commented Feb 13, 2019 at 8:17

dww · Accepted Answer · 2019-02-13 03:00:41Z

In base R, you can use the same duplicated test as in your linked question on 'partial' duplicates, but then exclude rows whose signed values are the same:

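# Convert num (factor or character) to numeric, then take absolute values
df$numnum = as.numeric(as.character(df$num))
df$absnum = abs(df$numnum)
# TRUE where name, state and |num| match another row, but the signed num
# does not repeat within the same name and state
df$newcol = (duplicated(df[,c('name','state', 'absnum')]) | 
  duplicated(df[,c('name','state', 'absnum')], fromLast = TRUE)) &
  !(duplicated(df[,c('name','state', 'numnum')]) | 
    duplicated(df[,c('name','state', 'numnum')], fromLast = TRUE))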
#            name      state  num    date numnum absnum newcol
# 1 Richard Nixon California -258   day 2   -258    258   TRUE
# 2  Bill Clinton    Indiana  123  day 15    123    123  FALSE
# 3   George Bush    Florida   42   day 3     42     42  FALSE
# 4 Richard Nixon California  258  day 45    258    258   TRUE
# 5  Bill Clinton    Indiana  123 day 100    123    123  FALSE
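The same idea can be wrapped in a small function. This is only a sketch: `flag_mirrored`, `dup_any` and the sixth row of `df6` are hypothetical names and data introduced here, not part of the answer. It also assumes that signed values should be compared only among rows sharing the same name and state, so an unrelated row with the same num does not cancel a genuine mirror pair.

```r
# Hypothetical helper: flags rows whose key columns and absolute value match
# another row, while the signed value does not repeat within the same keys.
flag_mirrored <- function(data, keys, value) {
  v <- as.numeric(as.character(data[[value]]))   # safe for factor or character
  dup_any <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)
  abs_key <- data[keys]
  abs_key$abs_value <- abs(v)                    # keys + absolute value
  val_key <- data[keys]
  val_key$signed_value <- v                      # keys + signed value
  dup_any(abs_key) & !dup_any(val_key)
}

# Question's sample data plus one made-up row (row 6) that shares the value
# 258 with row 4 but belongs to a different name and state
df6 <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush",
            "Richard Nixon", "Bill Clinton", "George Bush"),
  state = c("California", "Indiana", "Florida",
            "California", "Indiana", "Florida"),
  num   = c("-258", "123", "42", "258", "123", "258"),
  stringsAsFactors = FALSE
)

flags <- flag_mirrored(df6, keys = c("name", "state"), value = "num")
flags
# TRUE FALSE FALSE TRUE FALSE FALSE
```

Row 4 stays TRUE here because the extra 258 in row 6 sits under a different name and state.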

Community · Accepted Answer · 2020-06-20 09:12:55Z

One option is to convert 'num' to numeric first, create another column with its absolute values ('num1'), group by 'name', 'state' and 'num1', and then use mutate to create the logical column. Within each group, it checks that there are exactly two rows (n() == 2) and that 'num' has more than one distinct sign (n_distinct(sign(num)) > 1):

library(tidyverse)
df %>%
    mutate(num = as.numeric(num), num1 = abs(num)) %>% 
    group_by(name, state, num1) %>% 
    mutate(newcol = n() == 2 & n_distinct(sign(num)) > 1) %>%
    ungroup %>% 
    select(-num1)
# A tibble: 5 x 5
#  name          state        num date    newcol 
#  <chr>         <chr>      <dbl> <chr>   <lgl>
#1 Richard Nixon California  -258 day 2   TRUE 
#2 Bill Clinton  Indiana      123 day 15  FALSE
#3 George Bush   Florida       42 day 3   FALSE
#4 Richard Nixon California   258 day 45  TRUE 
#5 Bill Clinton  Indiana      123 day 100 FALSE
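For readers without the tidyverse installed, the grouped check can be approximated in base R with `ave()`. This is a sketch rather than a drop-in replacement: the names `grp`, `group_size` and `n_signs` are introduced here for illustration, and the sample data is rebuilt with character columns.

```r
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush",
            "Richard Nixon", "Bill Clinton"),
  state = c("California", "Indiana", "Florida", "California", "Indiana"),
  num   = c("-258", "123", "42", "258", "123"),
  stringsAsFactors = FALSE
)

num <- as.numeric(df$num)

# One group per combination of name, state and absolute value
grp <- interaction(df$name, df$state, abs(num), drop = TRUE)

# Rows per group, and number of distinct signs per group
group_size <- ave(num, grp, FUN = length)
n_signs    <- ave(sign(num), grp, FUN = function(s) length(unique(s)))

# Flag groups of exactly two rows that contain both signs
newcol <- group_size == 2 & n_signs > 1
newcol
# TRUE FALSE FALSE TRUE FALSE
```

Because of the exactly-two-rows condition, a group holding three or more rows is never flagged, even if it contains a mirrored pair.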

NOTE: cbind creates a matrix, and a matrix can hold only a single type. Therefore, if there is any character column or element, the whole matrix becomes character class. Wrapping it with data.frame propagates that, and the columns are then converted to factor (stringsAsFactors = TRUE, the default before R 4.0.0) or kept as character (stringsAsFactors = FALSE).

data

df <- data.frame(name, state, num, date, stringsAsFactors = FALSE)
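The coercion described in the note can be seen in a short sketch. The two-row vectors and the names `m`, `f`, `codes`, `values` and `df_small` are illustrative only.

```r
name <- c("Richard Nixon", "Bill Clinton")
num  <- c("-258", "123")

# cbind() on vectors builds a character matrix: every column is stored as text
m <- cbind(name, num)
is.character(m)   # TRUE

# A factor stores integer codes, so as.numeric() on it returns those codes
# rather than the numbers shown in the labels
f <- factor(num)
codes  <- as.numeric(f)                 # level codes, not -258 and 123
values <- as.numeric(as.character(f))   # -258 123

# Building the data frame directly keeps num as character, so a plain
# as.numeric() is enough
df_small <- data.frame(name, num, stringsAsFactors = FALSE)
abs(as.numeric(df_small$num))           # 258 123
```

This is why the base R answer goes through as.character before as.numeric: that conversion is correct whether the column arrived as factor or as character.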

My apologies, it was fine. Thanks for your very thorough explanation.
@CyrusMohammadian. I changed n() == 2 based on your comments to @Chase
In retrospect, I need some type of sequential process that identifies matched rows that are additive inverses of one another along the num variable. I opened up a new SO question to that end, and I appreciate all the help: stackoverflow.com/questions/54665416/…
@CyrusMohammadian. Thanks for updating me. I posted a solution in the link. Hope it helps
