
I want to merge two large data frames on the column genus, with two conditions: rows must not be duplicated (my first try fails at this), and the information from both data frames must be preserved (my second try fails at this). The desired output is shown below.

chromdata <- read.table(text="
 genus sp
1      Acosta       Acosta_1
2    Aguilera     Aguilera_1
3      Acosta       Acosta_2
4    Aguilera     Aguilera_2
5       other              1   # EDIT: new rows    
6       other              2",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

treedata <- read.table(text="
 genus sp
1      Acosta       Acosta_3
2    Aguilera     Aguilera_3
3      Acosta       Acosta_4
4    Aguilera     Aguilera_4
5       other              3",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

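# First try: a many-to-many join on genus pairs every chromdata row with every
# matching treedata row, so rows are duplicated (10 rows for this data)
merge(chromdata, treedata, by = "genus", all = FALSE)

# Second try: match() returns only the first match per genus,
# so Acosta_4 and Aguilera_4 are never used
chromdata$sp2 <- treedata$sp[match(chromdata$genus, treedata$genus)]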
chromdata
     genus         sp        sp2
1   Acosta   Acosta_1   Acosta_3
2 Aguilera Aguilera_1 Aguilera_3
3   Acosta   Acosta_2   Acosta_3 #Acosta_4 missing
4 Aguilera Aguilera_2 Aguilera_3 # Aguilera_4 missing
5    other          1          3
6    other          2          3 

Desired Output:

     genus         sp        sp2
1   Acosta   Acosta_1   Acosta_3
2 Aguilera Aguilera_1 Aguilera_3
3   Acosta   Acosta_2   Acosta_4
4 Aguilera Aguilera_2 Aguilera_4
5    other          1          3 # EDIT: new rows
6    other          2          3

2 Answers

You can add another column to merge on:

library(data.table)
merge(
  transform(chromdata, r = rowid(genus)), 
  transform(treedata, r = rowid(genus)), 
  by=c("r", "genus")
)

  r    genus       sp.x       sp.y
1 1   Acosta   Acosta_1   Acosta_3
2 1 Aguilera Aguilera_1 Aguilera_3
3 2   Acosta   Acosta_2   Acosta_4
4 2 Aguilera Aguilera_2 Aguilera_4

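If you don't want to load data.table, the same within-group row id can be computed in base R, for example with ave(seq_along(genus), genus, FUN = seq_along). With a character genus, ave(genus, genus, FUN = seq_along) also works, but it returns the ids as character strings.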
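
As a minimal sketch of the base R route, assuming the same chromdata and treedata as in the question (without the later "other" rows), the within-group id can be built with ave() and used as a second merge key. The column name r follows the answer above; the data frames are rebuilt inline only to keep the example self-contained.

```r
chromdata <- data.frame(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera"),
  sp    = c("Acosta_1", "Aguilera_1", "Acosta_2", "Aguilera_2"),
  stringsAsFactors = FALSE
)
treedata <- data.frame(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera"),
  sp    = c("Acosta_3", "Aguilera_3", "Acosta_4", "Aguilera_4"),
  stringsAsFactors = FALSE
)

# Number the rows within each genus (1, 2, ...) in order of appearance
chromdata$r <- ave(seq_along(chromdata$genus), chromdata$genus, FUN = seq_along)
treedata$r  <- ave(seq_along(treedata$genus),  treedata$genus,  FUN = seq_along)

# Merging on both keys pairs the k-th row of a genus in one table
# with the k-th row of the same genus in the other table
res <- merge(chromdata, treedata, by = c("r", "genus"))
res
```

Passing seq_along(genus) as the first argument keeps r an integer, which is why it is used here instead of the character genus column itself.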

2 Comments

I found a case in which the answer does not work; see the edit to the question.
Solved by adding all.x = TRUE to the merge() call and then filling the resulting NA values within each genus: library(tidyverse); df %>% group_by(genus) %>% fill(sp.y)
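
The fix described in this comment can also be sketched in base R, without tidyverse. This assumes the edited data from the question, where "other" has two rows in chromdata but only one in treedata. fill_down is a hypothetical helper written for this example, standing in for tidyr's fill().

```r
chromdata <- data.frame(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera", "other", "other"),
  sp    = c("Acosta_1", "Aguilera_1", "Acosta_2", "Aguilera_2", "1", "2"),
  stringsAsFactors = FALSE
)
treedata <- data.frame(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera", "other"),
  sp    = c("Acosta_3", "Aguilera_3", "Acosta_4", "Aguilera_4", "3"),
  stringsAsFactors = FALSE
)

# Row id within each genus, as in the answer
chromdata$r <- ave(seq_along(chromdata$genus), chromdata$genus, FUN = seq_along)
treedata$r  <- ave(seq_along(treedata$genus),  treedata$genus,  FUN = seq_along)

# all.x = TRUE keeps chromdata rows with no partner in treedata (sp.y is NA)
res <- merge(chromdata, treedata, by = c("r", "genus"), all.x = TRUE)

# Hypothetical helper: replace each NA with the last non-NA value before it
fill_down <- function(v) {
  idx <- cumsum(!is.na(v))        # position of the last non-NA seen so far
  c(NA, v[!is.na(v)])[idx + 1]    # idx == 0 (a leading NA) stays NA
}

# Fill within each genus, in order of r
res <- res[order(res$genus, res$r), ]
res$sp.y <- ave(res$sp.y, res$genus, FUN = fill_down)
res
```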

I want to elaborate more on the data.table approach.

First, you can read your data and convert it directly to a data.table object:

library(data.table)

chromdata <- as.data.table(read.table(text="
 genus sp
1      Acosta       Acosta_1
2    Aguilera     Aguilera_1
3      Acosta       Acosta_2
4    Aguilera     Aguilera_2",header=TRUE,fill=TRUE,stringsAsFactors=FALSE))

treedata <- as.data.table(read.table(text="
 genus sp
1      Acosta       Acosta_3
2    Aguilera     Aguilera_3
3      Acosta       Acosta_4
4    Aguilera     Aguilera_4",header=TRUE,fill=TRUE,stringsAsFactors=FALSE))

After that, you need an extra column for the merge operation that produces your desired output:

chromdata[, N := seq_len(.N), genus]
treedata[, N := seq_len(.N), genus]

These lines give you the row ids within each genus group.

Finally, data.table's join syntax merges the two tables on their common columns:

chromdata[treedata, on = c("genus", "N")]

The final output:

      genus         sp N       i.sp
1:   Acosta   Acosta_1 1   Acosta_3
2: Aguilera Aguilera_1 1 Aguilera_3
3:   Acosta   Acosta_2 2   Acosta_4
4: Aguilera Aguilera_2 2 Aguilera_4
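
Note that chromdata[treedata, on = ...] returns one row per row of treedata, so a chromdata row without a partner is dropped. As a sketch, assuming the edited data from the question (with the extra "other" rows), swapping the two tables keeps every chromdata row instead, and the unmatched row gets NA. The tables are rebuilt with data.table() only to keep the example self-contained.

```r
library(data.table)

chromdata <- data.table(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera", "other", "other"),
  sp    = c("Acosta_1", "Aguilera_1", "Acosta_2", "Aguilera_2", "1", "2")
)
treedata <- data.table(
  genus = c("Acosta", "Aguilera", "Acosta", "Aguilera", "other"),
  sp    = c("Acosta_3", "Aguilera_3", "Acosta_4", "Aguilera_4", "3")
)

# Row id within each genus, as above
chromdata[, N := seq_len(.N), by = genus]
treedata[, N := seq_len(.N), by = genus]

# X[Y, on = ...] returns one row per row of Y, so chromdata goes in the Y position
res <- treedata[chromdata, on = c("genus", "N")]

# Here sp comes from treedata and i.sp from chromdata;
# the second "other" row has no partner, so its sp is NA
res
```

The NA can then be filled within each genus if that is the desired behaviour, as discussed in the comments on the first answer.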

3 Comments

This is similar to the rowidv function.
@Ferroao I know. As I said, I wanted to elaborate more on the data.table perspective.
rowidv is also part of data.table.
