Merge two dataframes with repeated columns

Question

I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:

names(data.jan)

ID     AGE      CITY      GENDER

names(data.feb)

ID     AGE      CITY      GENDER

To simplify, I will consider only two months, january and february, but my real set of csv files go from january to november:

Considering a "customer X",I have three possible scenarios:

1- Customer X is listed in the january database, but he left and now is not listed in february 2- Customer X is listed in both january and february databases 3- Customer X entered the database in february, so he is not listed in january

I am stuck on the following problem: I need to create a single database with all customers and their respective information that are listed in both dataframes. However, considering a customer that is listed in both dataframes, I want to pick his information from his first entry, that is, january.

When I use merge, I have four options, acording to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html

Merge options

data <- merge(data.jan,data.feb, by="ID", all=TRUE)

Regardless of which all, all.x or all.y I choose, I get the same undesired output called data:

data[1,]

ID     AGE.x      CITY.x      GENDER.x       AGE.y      CITY.y      GENDER.y
123      25         NY           M            25          NY            M

I think that what would work here is to merge both databases with this type of join:

enter image description here

Then, merge the resulting dataframe with data.jan with the full outer join. But I don't know how to code this in R.

Thanks,

Bernardo

From manual for merge: "By default the data frames are merged on the columns with names they both have..." So you might want to drop by= option, so it will merge on common columns. — zx8754
– zx8754, Commented Dec 4, 2013 at 11:48
Do your columns for all files always the same as ID AGE CITY GENDER? — zx8754
– zx8754, Commented Dec 4, 2013 at 11:50
IMHO, merging all your data seems like a bad approach. You should instead concatenate all your files into one long data.frame: with many records per customer, but just four variables: dat <- do.call(rbind, all_your_files). You will still be able to do per-customer analysis using things like aggregate, plyr, or data.table. — flodel
– flodel, Commented Dec 4, 2013 at 12:01
Actually, I was trying to avoid building a large database, since each .csv contains more than 500.000 rows. Multiplying that by eleven months gives me a total of 5.500.000 rows with 13 columns (I tried to simplify the data in my question, but these are the real numbers). Probably my poor PC won't be able to run this in a reasonable time. — Bernardo
– Bernardo, Commented Dec 4, 2013 at 13:04
A few million rows shouldn't be problematic with data like this on even very modest computers. Since it's not hard to try, give it a shot and reopen back before deciding that's not the right route. — JBecker
– JBecker, Commented Dec 4, 2013 at 13:46

user1317221_G · Accepted Answer · 2013-12-04 12:40:41Z

1

 d1 <- data.frame(x=1:9,y=1:9,z=1:9)
 d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
 d3 <- merge(d1,d2, by="x", all=TRUE) #merge


# keep the original columns from janary (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from february (i.e. y.y,z.y )
d3[is.na(d3[,2]) ,][,2:3] <- d3[is.na(d3[,2]) ,][, 4:5]
#>  d3[, 1:3]
#    x y.x z.x
#1   1   1   1
#2   2   2   2
#3   3   3   3
#4   4   4   4
#5   5   5   5
#6   6   6   6
#7   7   7   7
#8   8   8   8
#9   9   9   9
#10 10  20  30

This may be tiresome for more than 2 months though, perhaps you should consider @flodel's comments, also note there are demons when your original Jan data has NAs (and you still want the first months data, NA or not, retained) although you never mentioned them in your question.

edited Dec 4, 2013 at 12:40

answered Dec 4, 2013 at 12:33

user1317221_G

15.5k3 gold badges54 silver badges80 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Bernardo Over a year ago

Thanks for your answer! I will see if my PC can run flodel's suggestion. Regarding the NA cases, I haven't mentioned them because there are no such cases in any of my databases, but I will edit my question to clarify that!

Troy · Accepted Answer · 2013-12-04 12:09:12Z

0

Try:

data <- merge(data.jan,data.frame(ID=data.feb$ID), by="ID")

although I haven't tested it since no data, but if you just join the ID col from Feb, it should only filter out anything that isn't in both frames

edited Dec 4, 2013 at 12:09

answered Dec 4, 2013 at 11:55

Troy

8,70133 silver badges33 bronze badges

1 Comment

zx8754 Over a year ago

You meant to say: merge(data.jan,data.frame(ID=data.feb$ID), by="ID")?

jlhoward · Accepted Answer · 2013-12-04 13:40:05Z

@user1317221_G's solution is excellent. If your tables are large (lots of customers), data tables might be faster:

library(data.table)
#  some sample data
jan <- data.table(id=1:10,  age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)

setkey(jan,id)
setkey(feb,id)

join <- data.table(merge(jan, feb, by="id", all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]

Edit: This adds processing for multiple months.

f <- function(x,y) {
  setkey(x,id)
  setkey(y,id)
  join <- data.table(merge(x,y,by="id",all=T))
  join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
  join[,names(join)[5:7]:=NULL]                # get rid of extra columns
  setnames(join,2:4,c("age","city","gender"))  # rename columns that remain
  return(join)
}

Reduce("f",list(jan,feb,mar))

Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.

Collectives™ on Stack Overflow

Merge two dataframes with repeated columns

3 Answers 3

1 Comment

1 Comment

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

1 Comment

Comments

Linked

Related