
I am at a loss. I need the following sequence of calculations across two data frames:

  1. For each unique combination of ID1 and ID2 in df2 there is a value X; multiply this X by every X in df1 that has the same ID1 but a different ID2
  2. Each of these products is then divided by the Y in the corresponding row of df2
  3. These products are summed per unique ID1, ID2 combination to become the value Z in df2

If X is NA, Z in the same row will also be NA.

Both data frames have 30 variables and hundreds of thousands of observations. I have now spent quite some time searching on Stack Overflow and other websites, and have tried for loops, apply, merge, grepl, match, and other functions, but have not been able to come up with efficient code. I hope this is clear; many thanks in advance for your help! It would greatly help me with the final analysis for my Master's thesis.
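Restating the rules as code may help: the Z for a single row of df2 can be sketched as a small base-R function (`z_for_row` is just an illustrative name, not an existing helper; dividing each product by Y before summing equals dividing the sum, since Y is fixed within the row):

```r
# Z for one df2 row: multiply its X by every df1 X that shares ID1 but not ID2,
# sum those products, and divide by this row's Y.
# An NA X in df2 makes Z NA; an NA X in df1 simply contributes nothing.
z_for_row <- function(id1, id2, x2, y2, df1) {
  if (is.na(x2)) return(NA_real_)
  others <- df1[df1$ID1 == id1 & df1$ID2 != id2, ]  # same ID1, different ID2
  x1 <- ifelse(is.na(others$X), 0, others$X)
  sum(x2 * x1) / y2
}
```

With the example data below, `z_for_row("i1000", 2000, 2, 10, df1)` gives 1.2, matching the expected output.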

Here's an example of df1:

ID1      ID2    X   Y   Z
i1000   1000    5   15  0
i1000   2000    NA  15  0
i1000   3000    1   15  0
k4000   1000    4   10  0
k4000   2000    2   10  0
k4000   4000    3   10  0
j1944   2000    7   40  0
j1944   3000    1   40  0
j1944   4000    NA  40  0
p2049   1000    6   55  0
p2049   3000    2   55  0
p2049   4000    5   55  0

And of df2:

ID1      ID2    X   Y   Z
i1000   1000    2   10  NA
i1000   2000    2   10  NA
i1000   3000    3   10  NA
k4000   1000    NA  30  NA
k4000   2000    2   30  NA
k4000   4000    5   30  NA
j1944   2000    1   20  NA
j1944   3000    3   20  NA
j1944   4000    2   20  NA
p2049   1000    4   55  NA
p2049   3000    NA  55  NA
p2049   4000    2   55  NA

And the result I'm trying to get:

ID1      ID2    X   Y   Z
i1000   1000    2   10  0.2 ###((2*1)/10)
i1000   2000    2   10  1.2 ###(((2*5)+(2*1))/10)
i1000   3000    3   10  1.5 ###((3*5)/10)
k4000   1000    NA  30  NA  ### X = NA, so Z = NA
k4000   2000    2   30  0.47
k4000   4000    5   30  1
j1944   2000    1   20  0.05
j1944   3000    3   20  1.05
j1944   4000    2   20  0.8
p2049   1000    4   55  0.51
p2049   3000    NA  55  NA
p2049   4000    2   55  0.29

CSV for the example data:

###df1
"","ID1","ID2","X","Y","Z"
"1","i1000",1000,5,15,0
"2","i1000",2000,NA,15,0
"3","i1000",3000,1,15,0
"4","k4000",1000,4,10,0
"5","k4000",2000,2,10,0
"6","k4000",4000,3,10,0
"7","j1944",2000,7,40,0
"8","j1944",3000,1,40,0
"9","j1944",4000,NA,40,0
"10","p2049",1000,6,55,0
"11","p2049",3000,2,55,0
"12","p2049",4000,5,55,0
###df2
"","ID1","ID2","X","Y","Z"
"1","i1000",1000,2,10,NA
"2","i1000",2000,2,10,NA
"3","i1000",3000,3,10,NA
"4","k4000",1000,NA,30,NA
"5","k4000",2000,2,30,NA
"6","k4000",4000,5,30,NA
"7","j1944",2000,1,20,NA
"8","j1944",3000,3,20,NA
"9","j1944",4000,2,20,NA
"10","p2049",1000,4,55,NA
"11","p2049",3000,NA,55,NA
"12","p2049",4000,2,55,NA
###results
"","ID1","ID2","X","Y","Z"
"1","i1000",1000,2,10,0.2
"2","i1000",2000,2,10,1.2
"3","i1000",3000,3,10,1.5
"4","k4000",1000,NA,30,NA
"5","k4000",2000,2,30,0.47
"6","k4000",4000,5,30,1
"7","j1944",2000,1,20,0.2
"8","j1944",3000,3,20,1.05
"9","j1944",4000,2,20,0.8
"10","p2049",1000,4,55,0.51
"11","p2049",3000,NA,55,NA
"12","p2049",4000,2,55,0.29

1 Answer


Let us first merge the variables we need for the calculations into a temporary data frame:

tmp = merge(data.frame(ID1 = df2$ID1, ID2_2 = df2$ID2, X2 = df2$X),
            data.frame(ID1 = df1$ID1, ID2_1 = df1$ID2, X1 = df1$X))

> head(tmp)
    ID1 ID2_2 X2 ID2_1 X1
1 i1000  1000  2  1000  5
2 i1000  1000  2  2000 NA
3 i1000  1000  2  3000  1
4 i1000  2000  2  1000  5
5 i1000  2000  2  2000 NA
6 i1000  2000  2  3000  1

You said "... all values of X in df1 that have the same ID1 but have a different ID2," so we don't need the rows in which ID2_2 equals ID2_1. Let's drop them.

tmp = tmp[which(tmp$ID2_1 != tmp$ID2_2), ]

Also, let's replace the NAs in the temporary data frame with 0, so that they contribute nothing to the sums.

tmp[is.na(tmp)] = 0

Now we can multiply the Xs of rows that have the same ID1 but different ID2s:

tmp$X = tmp$X1 * tmp$X2

We only need ID1, ID2, and X from this temporary data frame. Let's drop the other variables:

tmp = with(tmp, data.frame(ID1 = ID1, ID2 = ID2_2, X = X))

We need the Y variable from df2, so get it by merging a part of df2 with tmp:

tmp = merge(tmp, with(df2, data.frame(ID1 = ID1, ID2 = ID2, Y = Y)))

Divide the new X (the product of Xs from rows with the same ID1 but different ID2s) by Y:

tmp$Z = tmp$X / tmp$Y

Sum the products per unique ID1, ID2 combination:

tmp = aggregate(Z ~ ID1 + ID2, data = tmp, FUN = sum)

Create the final dataset ...

res = merge(with(df2, data.frame(ID1 = ID1, ID2 = ID2, X = X, Y = Y)), tmp)

... and set Z to NA in the rows where X is NA:

res[which(is.na(res$X)), ]$Z = NA

> res
     ID1  ID2  X  Y         Z
1  i1000 1000  2 10 0.2000000
2  i1000 2000  2 10 1.2000000
3  i1000 3000  3 10 1.5000000
4  j1944 2000  1 20 0.0500000
5  j1944 3000  3 20 1.0500000
6  j1944 4000  2 20 0.8000000
7  k4000 1000 NA 30        NA
8  k4000 2000  2 30 0.4666667
9  k4000 4000  5 30 1.0000000
10 p2049 1000  4 55 0.5090909
11 p2049 3000 NA 55        NA
12 p2049 4000  2 55 0.2909091

The resulting data frame matches yours, up to rounding and row order.


When df2 is too large to merge in one go, the following code splits it into smaller chunks, processes each chunk separately, and then combines the results:

ID1s = unique(df2$ID1)
chunk_size = ceiling(length(ID1s) / 2)  # edit this line to specify the chunk size

res = data.frame(ID1 = NA, ID2 = NA, X = NA, Y = NA, Z = NA)
while (length(ID1s) > 0) {
    tmp = df2[which(df2$ID1 %in% ID1s[1:chunk_size]), ]
    tmp = merge(data.frame(ID1 = tmp$ID1, ID2_2 = tmp$ID2, X2 = tmp$X),
                data.frame(ID1 = df1$ID1, ID2_1 = df1$ID2, X1 = df1$X))

    tmp = tmp[which(tmp$ID2_1 != tmp$ID2_2), ]
    tmp[is.na(tmp)] = 0
    tmp$X = tmp$X1 * tmp$X2
    tmp = with(tmp, data.frame(ID1 = ID1, ID2 = ID2_2, X = X))

    tmp = merge(tmp, with(df2, data.frame(ID1 = ID1, ID2 = ID2, Y = Y)))
    tmp$Z = tmp$X / tmp$Y

    tmp = aggregate(cbind(Z = Z) ~ ID1 + ID2, data = tmp, FUN="sum")
    res = rbind(res, merge(with(df2, data.frame(ID1 = ID1, ID2 = ID2, X = X, Y = Y)), tmp))
    
    ID1s = ID1s[-(1:chunk_size)]
}
res = res[-1, ]
res[which(is.na(res$X)), ]$Z = NA
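If memory is still tight, the cartesian merge can be avoided altogether with a grouped-sum identity: summing X2 * X1 over all df1 rows with the same ID1 but a different ID2 equals X2 * (S - X1_same), where S is the per-ID1 total of df1's X (NAs as 0) and X1_same is df1's X at the same (ID1, ID2) pair. A base-R sketch on the example data (only the columns that are used are constructed; variable names beyond the question's are illustrative):

```r
# Example data from the question (Z and df1's Y omitted; they are not used here).
df1 <- data.frame(
  ID1 = rep(c("i1000", "k4000", "j1944", "p2049"), each = 3),
  ID2 = c(1000, 2000, 3000,  1000, 2000, 4000,
          2000, 3000, 4000,  1000, 3000, 4000),
  X   = c(5, NA, 1,  4, 2, 3,  7, 1, NA,  6, 2, 5))
df2 <- data.frame(
  ID1 = df1$ID1, ID2 = df1$ID2,
  X   = c(2, 2, 3,  NA, 2, 5,  1, 3, 2,  4, NA, 2),
  Y   = rep(c(10, 30, 20, 55), each = 3))

x1 <- ifelse(is.na(df1$X), 0, df1$X)   # NA in df1 contributes nothing
S  <- tapply(x1, df1$ID1, sum)         # per-ID1 total of df1's X
# df1's X at the same (ID1, ID2) pair, 0 if that pair is absent
x_same <- x1[match(paste(df2$ID1, df2$ID2), paste(df1$ID1, df1$ID2))]
x_same[is.na(x_same)] <- 0
res <- df2
res$Z <- df2$X * (unname(S[as.character(df2$ID1)]) - x_same) / df2$Y
```

This runs in roughly linear time and never builds the per-ID1 cross product. On this data, `res$Z` reproduces the answer's Z column (0.2, 1.2, 1.5, NA, 0.47, 1, 0.05, 1.05, 0.8, 0.51, NA, 0.29, to two decimals), in df2's original row order.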

2 Comments

When attempting the first merge, R runs into the error 'Error: cannot allocate vector of size 4.0 Gb', as the temporary data frame would contain millions of rows. Are there perhaps ways to overcome this problem with a more memory-efficient method? Edit: I am running 64-bit Windows with 8 GB of RAM; even when allocating 7 GB to R, it cannot perform the merge.
When I have to deal with very large datasets, I split them into smaller chunks. The following code splits df2 into smaller chunks and processes each chunk separately. Please see my answer again; I edited it.
