I have two files (file1.csv and file2.csv). As shown below, file1 contains two columns, date and the variable x1, with 365 observations (a whole year). file2 contains the same date column plus many other variables. I'm interested only in the variable x45, which has just 24 observations (2 per month).

file1

date          x1
1/01/2005     33
2/01/2005     24
3/01/2005     72
31/12/2005    52

file2

date          x2   x3   x45
1/01/2005               115
5/02/2005               125
13/04/2005              127
31/12/2005              138

So I'd like to add column x45 to file1.csv so that it looks like this:

date          x1   x45
1/01/2005     33   115
2/01/2005     24   NA
3/01/2005     72   NA
31/12/2005    52   138

I have tried using

file1= read.csv("D:/file1.csv")
file2= read.csv("D:/file2.csv")
file3 = merge(file1, file2)

However, file3 has only 24 rows (observations) and omits the remaining observations from file1.

Any help to get the result as described above would be much appreciated.

  • @RichardScriven No, they aren't. I just left their values out because I don't need them. Commented Jan 25, 2015 at 4:14

3 Answers

You can try left_join from dplyr:

library(dplyr)
left_join(df1, df2[c('date', 'x45')], by='date')
#         date x1 x45
#1  1/01/2005 33 115
#2  2/01/2005 24  NA
#3  3/01/2005 72  NA
#4 31/12/2005 52 138

Or using merge

merge(df1, df2[c('date', 'x45')], all.x=TRUE)
#       date x1 x45
#1  1/01/2005 33 115
#2  2/01/2005 24  NA
#3  3/01/2005 72  NA
#4 31/12/2005 52 138
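
Applied to files read from disk, the same merge might look like the sketch below. It writes two small stand-in CSVs to temporary files (the paths and the three-row contents are assumptions for illustration, not the asker's data) and contrasts the default merge, which keeps only dates present in both files, with all.x=TRUE, which keeps every row of the first file.

```r
# Stand-in data written to temporary files (hypothetical contents)
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
writeLines(c("date,x1", "1/01/2005,33", "2/01/2005,24", "31/12/2005,52"), f1)
writeLines(c("date,x2,x3,x45", "1/01/2005,,,115", "31/12/2005,,,138"), f2)

file1 <- read.csv(f1, stringsAsFactors = FALSE)
file2 <- read.csv(f2, stringsAsFactors = FALSE)

# Default merge() is an inner join: only dates found in both files survive
inner <- merge(file1, file2[c("date", "x45")], by = "date")

# all.x = TRUE keeps every row of file1 and fills unmatched x45 with NA
left <- merge(file1, file2[c("date", "x45")], by = "date", all.x = TRUE)

nrow(inner)  # 2
nrow(left)   # 3
```

Naming the key with by = "date" is optional here, since date is the only shared column, but it guards against an accidental join on other columns that happen to share a name.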

Update

The left_join from dplyr and join from plyr keep the original row order. merge sorts the result by the key column by default, so if you need the original order, one option is to create an "indx" column in "df1" before the merge and use it to reorder the merged result afterwards:

df1$indx <- seq_len(nrow(df1))
res <- merge(df1, df2[c('date', 'x45')], all.x=TRUE)
res[order(res$indx), -3]   # column 3 is indx
#         date x1 x45
#1  1/01/2005 33 115
#2  2/01/2005 24  NA
#3  3/01/2005 72  NA
#4 31/12/2005 52 138

Or using join from plyr

library(plyr)
join(df1, df2[c('date', 'x45')], by='date', type='left')

data

df1 <- structure(list(date = c("1/01/2005", "2/01/2005", "3/01/2005", 
"31/12/2005"), x1 = c(33L, 24L, 72L, 52L)), .Names = c("date", 
"x1"), class = "data.frame", row.names = c(NA, -4L))

df2 <- structure(list(date = c("1/01/2005", "5/02/2005", "13/04/2005", 
"31/12/2005"), x2 = c(NA, NA, NA, NA), x3 = c(NA, NA, NA, NA), 
x45 = c(115L, 125L, 127L, 138L)), .Names = c("date", "x2", 
 "x3", "x45"), class = "data.frame", row.names = c(NA, -4L))

10 Comments

left_join didn't work for me. However, merge did work. Many thanks for your time and help.
@aelwan Not sure why it didn't work. What is the error message?
There was no error message, but all the values of x45 were NA.
What if I have different x45 columns in different files?
@Akurn Your code changes the order of the rows. Any idea how to avoid that?

Just for completeness, you can join and update file1 both very quickly and by reference (without using <-) with the data.table package:

library(data.table)
setkey(setDT(file1), date)[file2, x45 := i.x45]
file1
#          date x1 x45
# 1:  1/01/2005 33 115
# 2:  2/01/2005 24  NA
# 3:  3/01/2005 72  NA
# 4: 31/12/2005 52 138

Here file1 is keyed by the date column and a binary join is performed against file2, pulling in only the x45 column. Note that setkey also sorts file1 by date.


The following will also work, without requiring a package, and without changing the original order of the rows in df1:

df1
#        date x1
#2  1/01/2005 33
#3  2/01/2005 24
#4  3/01/2005 72
#5 31/12/2005 52
df2
#        date x45
#1  1/01/2005  33
#2  2/01/2005  24
#3  3/01/2005  72
#4 31/12/2005  52

df1$x45 <- df2$x45[match(df1$date, df2$date)]

df1
#        date x1 x45
#2  1/01/2005 33  33
#3  2/01/2005 24  24
#4  3/01/2005 72  72
#5 31/12/2005 52  52
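
match compares the date strings exactly, so if the two files write the same day differently (a leading zero, trailing whitespace), every lookup comes back NA. A minimal sketch of one way around that, assuming both columns hold day/month/year text; the small data frames below are hypothetical and only illustrate the mismatch:

```r
# Hypothetical inputs: the same days, but formatted differently in each file
df1 <- data.frame(date = c("1/01/2005", "2/01/2005", "31/12/2005"),
                  x1 = c(33L, 24L, 52L), stringsAsFactors = FALSE)
df2 <- data.frame(date = c("01/01/2005", "31/12/2005 "),
                  x45 = c(115L, 138L), stringsAsFactors = FALSE)

# Matching the raw strings finds nothing, so x45 would be NA everywhere
raw_idx <- match(df1$date, df2$date)

# Strip whitespace and parse as day/month/year so equal days compare equal
d1 <- as.Date(trimws(df1$date), format = "%d/%m/%Y")
d2 <- as.Date(trimws(df2$date), format = "%d/%m/%Y")

df1$x45 <- df2$x45[match(d1, d2)]
df1$x45  # 115 NA 138
```

Comparing unique(df1$date) against unique(df2$date), or checking class(df1$date), is a quick way to see whether a formatting difference like this is the cause.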

1 Comment

Thanks for your help. However, I got NA for every value in the x45 column.