merging two dataframes based on one column without duplicating rows and preserving more data

Question

My goal is to merge two large dataframes on the column genus, with two conditions: rows must not be duplicated (my first try fails on this), and the information from both dataframes must be preserved (my second try fails on this). Please see the desired output below:

chromdata <- read.table(text="
 genus sp
1      Acosta       Acosta_1
2    Aguilera     Aguilera_1
3      Acosta       Acosta_2
4    Aguilera     Aguilera_2
5       other              1   # EDIT: new rows    
6       other              2",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

treedata <- read.table(text="
 genus sp
1      Acosta       Acosta_3
2    Aguilera     Aguilera_3
3      Acosta       Acosta_4
4    Aguilera     Aguilera_4
5       other              3",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

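# First try: rows are duplicated, because each chromdata row is paired with every treedata row of the same genus
merge(chromdata, treedata, by = "genus", all = FALSE)

# Second try: match() returns only the first match per genus, so Acosta_4 and Aguilera_4 are lost
chromdata$sp2 <- treedata$sp[match(chromdata$genus, treedata$genus)]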
chromdata
     genus         sp        sp2
1   Acosta   Acosta_1   Acosta_3
2 Aguilera Aguilera_1 Aguilera_3
3   Acosta   Acosta_2   Acosta_3 #Acosta_4 missing
4 Aguilera Aguilera_2 Aguilera_3 # Aguilera_4 missing
5    other          1          3
6    other          2          3

Desired Output:

     genus         sp        sp2
1   Acosta   Acosta_1   Acosta_3
2 Aguilera Aguilera_1 Aguilera_3
3   Acosta   Acosta_2   Acosta_4
4 Aguilera Aguilera_2 Aguilera_4
5    other          1          3 # EDIT: new rows
6    other          2          3

Frank · Accepted Answer · 2018-10-05 19:08:37Z


You can add a second column to merge on, namely the row number within each genus:

library(data.table)
merge(
  transform(chromdata, r = rowid(genus)), 
  transform(treedata, r = rowid(genus)), 
  by=c("r", "genus")
)

  r    genus       sp.x       sp.y
1 1   Acosta   Acosta_1   Acosta_3
2 1 Aguilera Aguilera_1 Aguilera_3
3 2   Acosta   Acosta_2   Acosta_4
4 2 Aguilera Aguilera_2 Aguilera_4

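If you don't want to load data.table, you can compute the same within-group row id in other ways, for example ave(seq_along(genus), genus, FUN = seq_along). Passing genus itself as the first argument, as in ave(genus, genus, FUN = seq_along), also works, but returns the ids as character strings.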
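As a rough base R sketch of the same idea (no data.table needed): the toy data below is rebuilt with data.frame() rather than read.table(), and the helper column name r is only kept for consistency with the code above.

```r
# Toy data mirroring the question, built with data.frame() for brevity
chromdata <- data.frame(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera"),
  sp    = c("Acosta_1", "Aguilera_1", "Acosta_2", "Aguilera_2"),
  stringsAsFactors = FALSE
)
treedata <- data.frame(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera"),
  sp    = c("Acosta_3", "Aguilera_3", "Acosta_4", "Aguilera_4"),
  stringsAsFactors = FALSE
)

# Within-group counter: 1, 2, ... for each successive occurrence of a genus
chromdata$r <- ave(seq_along(chromdata$genus), chromdata$genus, FUN = seq_along)
treedata$r  <- ave(seq_along(treedata$genus),  treedata$genus,  FUN = seq_along)

# Merging on both keys pairs the k-th row of a genus in one table
# with the k-th row of that genus in the other, so nothing is duplicated
res <- merge(chromdata, treedata, by = c("r", "genus"))
print(res)
```

Because the counter is numeric here, both tables end up with the same key type, which is what merge() needs to match the rows.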


2 Comments

Ferroao Over a year ago

I found a case in which the answer does not work, see edit.

Ferroao Over a year ago

Solved by adding all.x = TRUE to your merge() call and then filling the resulting missing values within each genus: library(tidyverse); df %>% group_by(genus) %>% fill(sp.y)
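For anyone landing here with the edited data, this is roughly what that fix looks like end to end. It is only a sketch: it uses base R instead of tidyverse, the data is rebuilt with data.frame(), and fill_down is a hypothetical helper standing in for tidyr's fill().

```r
# Edited data from the question: chromdata has two "other" rows, treedata one
chromdata <- data.frame(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera", "other", "other"),
  sp    = c("Acosta_1", "Aguilera_1", "Acosta_2", "Aguilera_2", "1", "2"),
  stringsAsFactors = FALSE
)
treedata <- data.frame(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera", "other"),
  sp    = c("Acosta_3", "Aguilera_3", "Acosta_4", "Aguilera_4", "3"),
  stringsAsFactors = FALSE
)

# Row number within each genus, used as the second merge key
chromdata$r <- ave(seq_along(chromdata$genus), chromdata$genus, FUN = seq_along)
treedata$r  <- ave(seq_along(treedata$genus),  treedata$genus,  FUN = seq_along)

# all.x = TRUE keeps chromdata rows that have no partner in treedata (sp.y is NA there)
merged <- merge(chromdata, treedata, by = c("r", "genus"), all.x = TRUE)

# Hypothetical helper: carry the last non-missing value forward
fill_down <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # position of the last non-NA seen so far
  idx[idx == 0] <- NA                      # leading NAs stay NA
  x[idx]
}

# Fill the gaps separately within each genus
merged$sp.y <- ave(merged$sp.y, merged$genus, FUN = fill_down)
print(merged)
```

The second "other" row ends up with sp.y = "3", as in the desired output.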

Cem · Accepted Answer · 2018-10-05 20:12:25Z

I want to elaborate on the data.table approach.

First, read your data and convert it directly to a data.table object:

library(data.table)

chromdata <- as.data.table(read.table(text="
 genus sp
1      Acosta       Acosta_1
2    Aguilera     Aguilera_1
3      Acosta       Acosta_2
4    Aguilera     Aguilera_2",header=TRUE,fill=TRUE,stringsAsFactors=FALSE))

treedata <- as.data.table(read.table(text="
 genus sp
1      Acosta       Acosta_3
2    Aguilera     Aguilera_3
3      Acosta       Acosta_4
4    Aguilera     Aguilera_4",header=TRUE,fill=TRUE,stringsAsFactors=FALSE))

After that, you need an extra key column so that the merge can produce your desired output:

chromdata[, N := seq_len(.N), by = genus]
treedata[, N := seq_len(.N), by = genus]

These lines give you the row ids within each genus group.
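To make it concrete what this counter looks like, here is a small sketch on a one-column toy table (dt and the column r are names made up for the illustration). It also shows that the grouped seq_len(.N) gives the same ids as data.table's rowid().

```r
library(data.table)

# Toy table with the same genus order as in the question
dt <- data.table(genus = c("Acosta", "Aguilera", "Acosta", "Aguilera"))

# Counter restarts at 1 for every genus
dt[, N := seq_len(.N), by = genus]

# rowid() computes the same within-group counter without a by clause
dt[, r := rowid(genus)]

print(dt)
```

Both columns count 1 for the first Acosta and Aguilera rows and 2 for the second ones.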

Lastly, data.table lets us join the two tables on their common columns:

chromdata[treedata, on = c("genus", "N")]

The final output:

      genus         sp N       i.sp
1:   Acosta   Acosta_1 1   Acosta_3
2: Aguilera Aguilera_1 1 Aguilera_3
3:   Acosta   Acosta_2 2   Acosta_4
4: Aguilera Aguilera_2 2 Aguilera_4
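One thing to keep in mind with the edited data from the question: X[Y, on = ...] returns one row per row of Y, so the direction of the join decides which table's unmatched rows survive. A small sketch, assuming the edited data rebuilt with data.table() (the names a and b are just for the illustration):

```r
library(data.table)

# Edited data from the question: chromdata has one more "other" row than treedata
chromdata <- data.table(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera", "other", "other"),
  sp    = c("Acosta_1", "Aguilera_1", "Acosta_2", "Aguilera_2", "1", "2")
)
treedata <- data.table(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera", "other"),
  sp    = c("Acosta_3", "Aguilera_3", "Acosta_4", "Aguilera_4", "3")
)

chromdata[, N := seq_len(.N), by = genus]
treedata[, N := seq_len(.N), by = genus]

# One row per treedata row: the second "other" row of chromdata is dropped
a <- chromdata[treedata, on = c("genus", "N")]

# One row per chromdata row: the unmatched "other" row is kept,
# with NA in sp (treedata's column); chromdata's sp appears as i.sp
b <- treedata[chromdata, on = c("genus", "N")]

print(nrow(a))
print(b)
```

If you need every chromdata row, the second form is the one to start from; the NA can then be filled within genus if that is what you want.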

@Ferroao I know; as I said, I wanted to elaborate on the data.table perspective.
