Both Uwe's and GKi's answers are correct. GKi received the bounty because Uwe's answer came in too late for it, but Uwe's solution runs about 15x as fast.
I have two datasets that contain scores for different patients on multiple measuring moments like so:
df1 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient3"),
"Days" = c(0,25,235,353,100,538),
"Score" = c(NA,2,3,4,5,6),
stringsAsFactors = FALSE)
df2 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient2","patient3"),
"Days" = c(0,25,248,353,100,150,503),
"Score" = c(1,10,3,4,5,7,6),
stringsAsFactors = FALSE)
> df1
ID Days Score
1 patient1 0 NA
2 patient1 25 2
3 patient1 235 3
4 patient1 353 4
5 patient2 100 5
6 patient3 538 6
> df2
ID Days Score
1 patient1 0 1
2 patient1 25 10
3 patient1 248 3
4 patient1 353 4
5 patient2 100 5
6 patient2 150 7
7 patient3 503 6
Column ID shows the patient ID, column Days shows the moment of measurement (days since patient inclusion), and column Score shows the measured score. Both datasets contain the same data at different moments in time (df1 is from 2 years ago; df2 has the same data with updates from this year).
I have to compare the scores for each patient and each moment between both datasets. However, in some cases the Days variable has minor changes over time, so comparing the datasets with a simple join does not work. Example:
library(dplyr)
> full_join(df1, df2, by=c("ID","Days")) %>%
+ arrange(.[[1]], as.numeric(.[[2]]))
ID Days Score.x Score.y
1 patient1 0 NA 1
2 patient1 25 2 10
3 patient1 235 3 NA
4 patient1 248 NA 3
5 patient1 353 4 4
6 patient2 100 5 5
7 patient2 150 NA 7
8 patient3 503 NA 6
9 patient3 538 6 NA
Here, rows 3 and 4 contain data for the same measurement (with score 3) but are not joined because the values for the Days column are different (235 vs 248).
Question: I'm looking for a way to set a threshold on the second column (say 30 days) which would result in the following output:
> threshold <- 30
> *** insert join code ***
ID Days Score.x Score.y
1 patient1 0 NA 1
2 patient1 25 2 10
3 patient1 248 3 3
4 patient1 353 4 4
5 patient2 100 5 5
6 patient2 150 NA 7
7 patient3 503 NA 6
8 patient3 538 6 NA
This output shows that rows 3 and 4 of the previous output have been merged (because 248-235 < 30) and have taken the value for Days of the second df (248).
Three main conditions to keep in mind are:
- Consecutive days that are within the threshold from within the same df (rows 1 and 2) are not merged.
- In some cases, up to four values of the Days variable exist within the threshold in the same dataframe, and these should not be merged with each other. One of these values may also exist within the threshold in the other dataframe, and those will have to be merged. See row 3 in the example below.
- Each score/days/patient combination can only be used once. If a merge satisfies all conditions but there is still a double-merge possible, the first one should be used.
> df1
ID Days Score
1 patient1 0 1
2 patient1 5 2
3 patient1 10 3
4 patient1 15 4
5 patient1 50 5
> df2
ID Days Score
1 patient1 0 1
2 patient1 5 2
3 patient1 12 3
4 patient1 15 4
5 patient1 50 5
> df_combined
ID Days Score.x Score.y
1 patient1 0 1 1
2 patient1 5 2 2
3 patient1 12 3 3
4 patient1 15 4 4
5 patient1 50 5 5
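To make the three conditions concrete, here is a minimal base R sketch of one way to read them. The function name `threshold_join`, the example objects `ex1`/`ex2`, and the two-pass strategy (identical Days first, then the first unused row within the threshold) are assumptions for illustration only. It also assumes that a gap exactly equal to the threshold counts as a match, that ID is character, and that Days is numeric. It is a loop-based sketch, not an efficient solution.

```r
# Hypothetical helper: greedy one-to-one matching of measurements per patient.
# Assumes columns ID (character), Days (numeric, no NA) and Score in both inputs.
threshold_join <- function(df1, df2, threshold = 30) {
  # match1[i] is the row of df2 paired with row i of df1 (NA = no partner yet)
  match1 <- rep(NA_integer_, nrow(df1))
  used2  <- rep(FALSE, nrow(df2))

  # Pass 1 pairs identical Days (max_gap = 0); pass 2 allows gaps up to the threshold
  for (max_gap in c(0, threshold)) {
    for (i in seq_len(nrow(df1))) {
      if (!is.na(match1[i])) next
      # Candidates: unused df2 rows of the same patient within max_gap days
      cand <- which(!used2 & df2$ID == df1$ID[i] &
                      abs(df2$Days - df1$Days[i]) <= max_gap)
      if (length(cand) > 0) {
        match1[i] <- cand[1]      # take the first candidate, use it only once
        used2[cand[1]] <- TRUE
      }
    }
  }

  matched <- !is.na(match1)
  out <- rbind(
    # Merged rows take Days from df2, as in the desired output
    data.frame(ID = df1$ID[matched], Days = df2$Days[match1[matched]],
               Score.x = df1$Score[matched],
               Score.y = df2$Score[match1[matched]],
               stringsAsFactors = FALSE),
    # Rows that only exist in df1
    data.frame(ID = df1$ID[!matched], Days = df1$Days[!matched],
               Score.x = df1$Score[!matched],
               Score.y = rep(NA, sum(!matched)),
               stringsAsFactors = FALSE),
    # Rows that only exist in df2
    data.frame(ID = df2$ID[!used2], Days = df2$Days[!used2],
               Score.x = rep(NA, sum(!used2)),
               Score.y = df2$Score[!used2],
               stringsAsFactors = FALSE)
  )
  out <- out[order(out$ID, out$Days), ]
  rownames(out) <- NULL
  out
}

# The five-row example from above, rebuilt as data frames
ex1 <- data.frame(ID = rep("patient1", 5), Days = c(0, 5, 10, 15, 50),
                  Score = 1:5, stringsAsFactors = FALSE)
ex2 <- data.frame(ID = rep("patient1", 5), Days = c(0, 5, 12, 15, 50),
                  Score = 1:5, stringsAsFactors = FALSE)
combined <- threshold_join(ex1, ex2, threshold = 30)
print(combined)
```

On this example the sketch is meant to pair Days 10 in the first dataframe with Days 12 in the second, while leaving 0, 5, 15 and 50 paired with their identical counterparts, as in df_combined. Running the exact-match pass first is a design choice that keeps identical Days from being consumed by a nearby but different measurement.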
EDIT FOR CHINSOON12
> df1
ID Days Score
1: patient1 0 1
2: patient1 116 2
3: patient1 225 3
4: patient1 309 4
5: patient1 351 5
6: patient2 0 6
7: patient2 49 7
> df2
ID Days Score
1: patient1 0 11
2: patient1 86 12
3: patient1 195 13
4: patient1 279 14
5: patient1 315 15
6: patient2 0 16
7: patient2 91 17
8: patient2 117 18
I wrapped your solution in a function like so:
library(data.table)
testSO2 <- function(DT1,DT2) {
setDT(DT1);setDT(DT2)
names(DT1) <- c("ID","Days","X")
names(DT2) <- c("ID","Days","Y")
DT1$Days <- as.numeric(DT1$Days)
DT2$Days <- as.numeric(DT2$Days)
DT1[, c("s1", "e1", "s2", "e2") := .(Days - 30L, Days + 30L, Days, Days)]
DT2[, c("s1", "e1", "s2", "e2") := .(Days, Days, Days - 30L, Days + 30L)]
byk <- c("ID", "s1", "e1")
setkeyv(DT1, byk)
setkeyv(DT2, byk)
o1 <- foverlaps(DT1, DT2)
byk <- c("ID", "s2", "e2")
setkeyv(DT1, byk)
setkeyv(DT2, byk)
o2 <- foverlaps(DT2, DT1)
olaps <- funion(o1, setcolorder(o2, names(o1)))[
is.na(Days), Days := i.Days]
outcome <- olaps[, {
if (all(!is.na(Days)) && any(Days == i.Days)) {
s <- .SD[Days == i.Days, .(Days = Days[1L],
X = X[1L],
Y = Y[1L])]
} else {
s <- .SD[, .(Days = max(Days, i.Days), X, Y)]
}
unique(s)
},
keyby = .(ID, md = pmax(Days, i.Days))][, md := NULL][]
return(outcome)
}
Which results in:
> testSO2(df1,df2)
ID Days X Y
1: patient1 0 1 11
2: patient1 116 2 12
3: patient1 225 3 13
4: patient1 309 4 14
5: patient1 315 4 15
6: patient1 351 5 NA
7: patient2 0 6 16
8: patient2 49 7 NA
9: patient2 91 NA 17
10: patient2 117 NA 18
As you can see, rows 4 and 5 are wrong. The value for Score in df1 is used twice (4). The correct output around those rows should be as follows, as each score (X or Y in this case) can only be used once:
ID Days X Y
4: patient1 309 4 14
5: patient1 315 NA 15
6: patient1 351 5 NA
Code for dataframes below.
df1 <- data.frame(
ID = rep(c("patient1", "patient2"), c(5L, 2L)),
Days = c("0", "116", "225", "309", "351", "0", "49"),
Score = 1:7
)
df2 <- data.frame(
ID = rep(c("patient1", "patient2"), c(5L, 3L)),
Days = c("0", "86", "195", "279", "315", "0", "91", "117"),
Score = 11:18
)