I am at a loss. I need to perform the following sequence of calculations across two dataframes, df1 and df2:
- For each unique combination of ID1 and ID2 in df2 there is a value X. Multiply this X by every X in df1 that has the same ID1 but a different ID2.
- Divide each of these products by the Y in the same row of df2.
- Sum the results per unique ID1, ID2 combination; this sum becomes the value Z in df2.
If X in df2 is NA, Z in the same row should also be NA. NA values of X in df1 are simply left out of the sum, as the example result shows.
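To make the rule concrete, here is a minimal base R sketch of it on the example data from this post. It is only one reading of the steps above and has not been run on the full data. It assumes that each ID1, ID2 pair occurs at most once in df1, that NA values of X in df1 count as 0 in the sum, and that an ID1 or pair missing from df1 contributes nothing. The helper names (`x1`, `total`, `own`, `key1`, `key2`) are made up for illustration. Instead of looping, it takes the total of X per ID1 in df1 and subtracts the X that belongs to the row's own ID2.

```r
# Example data copied from this post (Z is recomputed, so it is left out here).
df1 <- data.frame(
  ID1 = rep(c("i1000", "k4000", "j1944", "p2049"), each = 3),
  ID2 = c(1000, 2000, 3000, 1000, 2000, 4000, 2000, 3000, 4000, 1000, 3000, 4000),
  X   = c(5, NA, 1, 4, 2, 3, 7, 1, NA, 6, 2, 5),
  Y   = rep(c(15, 10, 40, 55), each = 3)
)
df2 <- data.frame(
  ID1 = df1$ID1,
  ID2 = df1$ID2,
  X   = c(2, 2, 3, NA, 2, 5, 1, 3, 2, 4, NA, 2),
  Y   = rep(c(10, 30, 20, 55), each = 3)
)

# Assumption: NA values of X in df1 are skipped, i.e. treated as 0 in the sum.
x1 <- ifelse(is.na(df1$X), 0, df1$X)

# Total of df1$X per ID1, looked up by name for every row of df2.
total_by_id1 <- tapply(x1, as.character(df1$ID1), sum)
total <- as.numeric(total_by_id1[as.character(df2$ID1)])
total[is.na(total)] <- 0  # ID1 absent from df1: nothing to sum

# X in df1 for the row's own (ID1, ID2) pair, to be subtracted from the total.
# Assumption: each (ID1, ID2) pair occurs at most once in df1.
key1 <- paste(df1$ID1, df1$ID2, sep = "_")
key2 <- paste(df2$ID1, df2$ID2, sep = "_")
own <- x1[match(key2, key1)]
own[is.na(own)] <- 0  # pair absent from df1: nothing to subtract

# Z = X (df2) * sum of X (df1) over the other ID2 values / Y (df2).
# An NA in df2$X propagates to Z on its own.
df2$Z <- df2$X * (total - own) / df2$Y

print(round(df2$Z, 2))
```

Under this reading, the row j1944 / 2000 evaluates to (1*1)/20 = 0.05 rather than the 0.2 listed in the expected result; all other rows match the expected values after rounding to two decimals.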
Both dataframes have 30 variables and hundreds of thousands of observations. I have spent quite some time searching Stack Overflow and other websites, and have tried for loops, apply, merge, grepl, match and other functions, but have not been able to write efficient code for this. I hope this is clear, and many thanks in advance for your help! It would greatly help me with the final analysis for my Master's thesis.
Here's an example of df1:
ID1 ID2 X Y Z
i1000 1000 5 15 0
i1000 2000 NA 15 0
i1000 3000 1 15 0
k4000 1000 4 10 0
k4000 2000 2 10 0
k4000 4000 3 10 0
j1944 2000 7 40 0
j1944 3000 1 40 0
j1944 4000 NA 40 0
p2049 1000 6 55 0
p2049 3000 2 55 0
p2049 4000 5 55 0
And of df2:
ID1 ID2 X Y Z
i1000 1000 2 10 NA
i1000 2000 2 10 NA
i1000 3000 3 10 NA
k4000 1000 NA 30 NA
k4000 2000 2 30 NA
k4000 4000 5 30 NA
j1944 2000 1 20 NA
j1944 3000 3 20 NA
j1944 4000 2 20 NA
p2049 1000 4 55 NA
p2049 3000 NA 55 NA
p2049 4000 2 55 NA
And the result I'm trying to get:
ID1 ID2 X Y Z
i1000 1000 2 10 0.2 ###((2*1)/10)
i1000 2000 2 10 1.2 ###(((2*5)+(2*1))/10)
i1000 3000 3 10 1.5 ###((3*5)/10)
k4000 1000 NA 30 NA ### X is NA, so Z is NA; etc.
k4000 2000 2 30 0.47
k4000 4000 5 30 1
j1944 2000 1 20 0.2
j1944 3000 3 20 1.05
j1944 4000 2 20 0.8
p2049 1000 4 55 0.51
p2049 3000 NA 55 NA
p2049 4000 2 55 0.29
CSV for the example data:
###df1
"","ID1","ID2","X","Y","Z"
"1","i1000",1000,5,15,0
"2","i1000",2000,NA,15,0
"3","i1000",3000,1,15,0
"4","k4000",1000,4,10,0
"5","k4000",2000,2,10,0
"6","k4000",4000,3,10,0
"7","j1944",2000,7,40,0
"8","j1944",3000,1,40,0
"9","j1944",4000,NA,40,0
"10","p2049",1000,6,55,0
"11","p2049",3000,2,55,0
"12","p2049",4000,5,55,0
###df2
"","ID1","ID2","X","Y","Z"
"1","i1000",1000,2,10,NA
"2","i1000",2000,2,10,NA
"3","i1000",3000,3,10,NA
"4","k4000",1000,NA,30,NA
"5","k4000",2000,2,30,NA
"6","k4000",4000,5,30,NA
"7","j1944",2000,1,20,NA
"8","j1944",3000,3,20,NA
"9","j1944",4000,2,20,NA
"10","p2049",1000,4,55,NA
"11","p2049",3000,NA,55,NA
"12","p2049",4000,2,55,NA
###results
"","ID1","ID2","X","Y","Z"
"1","i1000",1000,2,10,0.2
"2","i1000",2000,2,10,1.2
"3","i1000",3000,3,10,1.5
"4","k4000",1000,NA,30,NA
"5","k4000",2000,2,30,0.47
"6","k4000",4000,5,30,1
"7","j1944",2000,1,20,0.2
"8","j1944",3000,3,20,1.05
"9","j1944",4000,2,20,0.8
"10","p2049",1000,4,55,0.51
"11","p2049",3000,NA,55,NA
"12","p2049",4000,2,55,0.29