Which components of this R loop are inefficient?

Question

I realize that there is a surfeit of posts on SE/SO about optimizing one's R for() loops. I'm afraid, though, that I don't understand all of the guidance in the answers there, or how to apply their lessons to the loop I'm working on.

First, the code:

HHCovarEvals <- data.frame()
for(k in 1:length(hhcovarmodels))
{
  q.lm <- lm(as.formula(c("DeltaHHCGI~",hhcovarmodels[k])))
  q.bptest <- bptest(q.lm)
  HHCovarEvals[k,1] <- hhcovarmodels[k]
  HHCovarEvals[k,2] <- length(split(seq(nchar(hhcovarmodels[k])),unlist(strsplit(hhcovarmodels[k],"")))[['+']])+2
  HHCovarEvals[k,3] <- summary(q.lm)$r.squared
  HHCovarEvals[k,4] <- summary(q.lm)$adj.r.squared
  HHCovarEvals[k,5] <- AIC(q.lm)
  HHCovarEvals[k,6] <- BIC(q.lm)
  HHCovarEvals[k,7] <- NA
  HHCovarEvals[k,8] <- max(colldiag(q.lm)$condindx)
  if(HHCovarEvals[k,2]==2) HHCovarEvals[k,9]<-NA else
    HHCovarEvals[k,9] <- max(vif(q.lm))
  HHCovarEvals[k,10] <- q.bptest$statistic
  HHCovarEvals[k,11] <- q.bptest$p.value
  q.moran <- lm.morantest(q.lm,q.listw)
  HHCovarEvals[k,12] <- q.moran$statistic[1,1]
  HHCovarEvals[k,13] <- q.moran$p.value[1,1]
  q.lagrange <- lm.LMtests(q.lm,q.listw,test=c("LMerr","RLMerr","LMlag","RLMlag","SARMA"))
  HHCovarEvals[k,14] <- q.lagrange$LMerr$statistic
  HHCovarEvals[k,15] <- q.lagrange$LMerr$p.value
  HHCovarEvals[k,16] <- q.lagrange$RLMerr$statistic
  HHCovarEvals[k,17] <- q.lagrange$RLMerr$p.value
  HHCovarEvals[k,18] <- q.lagrange$LMlag$statistic
  HHCovarEvals[k,19] <- q.lagrange$LMlag$p.value  
  HHCovarEvals[k,20] <- q.lagrange$RLMlag$statistic
  HHCovarEvals[k,21] <- q.lagrange$RLMlag$p.value
  HHCovarEvals[k,22] <- q.lagrange$SARMA$statistic
  HHCovarEvals[k,23] <- q.lagrange$SARMA$p.value
  q.err <- errorsarlm(formula=as.formula(c("DeltaHHCGI~",hhcovarmodels[k])),data=deltasnna,q.listw,tol.solve=1.0e-18)
  HHCovarEvals[k,24] <- q.err$LL
  HHCovarEvals[k,25] <- 2*(q.err$parameters-2)-2*q.err$LL
  HHCovarEvals[k,26] <- -2*q.err$LL+(q.err$parameters-2)*log(nrow(deltasnna))
  HHCovarEvals[k,27] <- summary(q.err)$Wald1$statistic
  HHCovarEvals[k,28] <- summary(q.err)$LR1$statistic[1] 
  q.bptest <- bptest.sarlm(q.err)
  HHCovarEvals[k,29] <- q.bptest$statistic
  HHCovarEvals[k,30] <- q.bptest$p.value
  q.durbin <- lagsarlm(formula=as.formula(c("DeltaHHCGI~",hhcovarmodels[k])),data=deltasnna,q.listw,tol.solve=1.0e-18,type="mixed")
  durbin.test <- LR.sarlm(q.durbin,q.err)
  HHCovarEvals[k,31] <- durbin.test$statistic
  HHCovarEvals[k,32] <- 1-pchisq(durbin.test[[1]][1],q.lm$rank-1)
  HHCovarEvals[k,33] <- (summary(q.err)$LR1$statistic)[1]
  HHCovarEvals[k,34] <- (summary(q.err)$LR1$p.value)[1]
  q.lmtest <- lm.LMtests(q.err$residuals,q.listw,test="LMerr")
  HHCovarEvals[k,35] <- q.lmtest$LMerr$statistic[1,1]
  HHCovarEvals[k,36] <- q.lmtest$LMerr$p.value[1,1]
}

This loop takes hhcovarmodels, a list of the right-hand sides of 5000 linear models (as characters, e.g. "Intercept + X + Y + Z"), fits lm objects and writes some diagnostics to a data frame, and then fits errorsarlm objects (the Spatial Error Model) and writes some other diagnostics to the same data frame.

One suggestion I've seen elsewhere is to instantiate the output data frame to its full size before running the loop, as opposed to growing it by accretion. Would this be properly accomplished with something like

HHCovarEvals <- data.frame(matrix(ncol=36,nrow=5000))

?

I've also seen it suggested that any code that can be vectorized should be vectorized. I'm afraid that I'm not sure what this means in my own example. Are there components that could clearly be vectorized?
Does my repeated use of as.formula to take a character-formatted definition of the right-hand side of these models and fit lm and errorsarlm slow things down?

(For reference, with 5000 iterations, this operation takes about 3 hours.)

I've also seen it suggested that it may be faster to write the results as a vector and then add that vector to the data frame. I understand how and why this might work if the result is a column that needs to be added to the output data frame. In this instance, though, my loop adds rows to the data frame, and I can't see how I would add to the data frame, except with rbind, which seems inefficient.

My apologies if the code is so profoundly ugly that it causes you aesthetic or moral harm =) I'm a beginner.

Thanks in advance ...

Tried to post a short answer here but, due to formatting issues, I'll post it as an answer. If you need help with it let me know.
– Oscar de León, Commented Feb 20, 2013 at 18:24
You should learn to use the R profiling tools. At the moment with no data to work with you're just asking for guesses. My guess is that it is the calls to regression functions and summary() calls that are using most of the cycles.
– DWin, Commented Feb 20, 2013 at 19:03
Thanks @Dwin - the approach described at this page is to use Rprof([output file]) and Rprof(NULL) around the code, and then consult the output file to see what's taking the most time. Are there other approaches you might recommend?
– dubhousing, Commented Feb 20, 2013 at 19:11
Operations on data.frames (subsetting, assignment,...) are slow. If all the results are numeric, you should use a matrix to store them. And definitely pre-allocate the output data structure.
– Roland, Commented Feb 21, 2013 at 8:26

Oscar de León · Accepted Answer · 2013-02-20 18:27:46Z

Instead of using a for loop, you could write a function fitModels that performs all the tasks for any given model, and call it with lapply:

lapply(X=hhcovarmodels, FUN=fitModels)
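
A minimal sketch of what such a function might look like, assuming the built-in mtcars data and the made-up model strings below stand in for the real data and hhcovarmodels. Only the name fitModels comes from the answer; its body, the data argument and the chosen diagnostics are illustrative. Each call returns a one-row data frame, and the rows are bound together once at the end, which avoids growing a data frame inside a loop.

```r
# Hypothetical stand-ins for the real model strings
models <- c("wt", "wt + hp", "wt + hp + qsec")

# Fit one model and return its diagnostics as a one-row data frame
fitModels <- function(rhs, data = mtcars) {
  fit <- lm(as.formula(paste("mpg ~", rhs)), data = data)
  s <- summary(fit)  # call summary() once and reuse the result
  data.frame(
    model = rhs,
    n.coef = length(coef(fit)),
    r.squared = s$r.squared,
    adj.r.squared = s$adj.r.squared,
    AIC = AIC(fit),
    BIC = BIC(fit),
    stringsAsFactors = FALSE
  )
}

# One list element per model, then a single rbind at the end
results <- do.call(rbind, lapply(X = models, FUN = fitModels))
print(results)
```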

Also, since it seems that each fit is independent of the rest, you could use a package that performs parallel computation, such as foreach via plyr:

library(package=plyr)
ldply(.data=hhcovarmodels, .fun=fitModels, .parallel=TRUE)

This way you can use more resources from the computer.
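
Note that plyr's .parallel=TRUE uses the parallel backend provided by foreach, so a backend has to be registered first. As an alternative sketch (a swapped-in technique, not the plyr call above), the base parallel package can spread the same kind of function over worker processes; a socket cluster like this one also works on Windows. The mtcars data, the model strings and the simplified fitModels body are assumptions for illustration.

```r
library(parallel)

models <- c("wt", "wt + hp", "wt + hp + qsec")  # hypothetical model strings

# Simplified, hypothetical per-model function; returns a named numeric vector.
# The data is passed explicitly so each worker receives its own copy.
fitModels <- function(rhs, data) {
  fit <- lm(as.formula(paste("mpg ~", rhs)), data = data)
  c(AIC = AIC(fit), BIC = BIC(fit))
}

cl <- makeCluster(2)                                      # two worker processes
fits <- parLapply(cl, models, fitModels, data = mtcars)   # one model per task
stopCluster(cl)                                           # always release the workers

# Combine the per-model vectors into one table
results <- data.frame(model = models, do.call(rbind, fits))
print(results)
```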

Explanation about suggesting plyr and parallelization

The way I see it, @dubhousing is basically fitting thousands of models and extracting information from them. I am assuming this is what he should do, and since this is a question on Stack Overflow and not on Cross Validated, I am providing practical advice on how to go about doing it. I would bet that refining the model selection based on what is known about the data would reduce execution time (by cutting down the number of models), but that is not the issue at hand.

That said, there is not much to optimize in his code other than what he already mentioned (i.e. pre-allocating the dataframe and avoiding an explicit for loop). The other things you could tweak are the model fitting functions, but that is neither advisable nor in the scope of this answer.
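
For the pre-allocation point, a minimal sketch, assuming all stored diagnostics are numeric: allocate a numeric matrix of the final size once, fill one row per iteration, and convert to a data frame only at the end. The mtcars data, the model strings and the column names are placeholders, not the real inputs.

```r
models <- c("wt", "wt + hp", "wt + hp + qsec")  # hypothetical model strings
stats <- c("r.squared", "adj.r.squared", "AIC", "BIC")

# Allocate the full result matrix once, filled with NA
out <- matrix(NA_real_, nrow = length(models), ncol = length(stats),
              dimnames = list(NULL, stats))

for (k in seq_along(models)) {
  fit <- lm(as.formula(paste("mpg ~", models[k])), data = mtcars)
  s <- summary(fit)  # computed once per model
  # Assign the whole row in one step instead of cell by cell
  out[k, ] <- c(s$r.squared, s$adj.r.squared, AIC(fit), BIC(fit))
}

# Attach the (character) model labels only after the loop
results <- data.frame(model = models, out)
print(results)
```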

So, the issue at hand: fit thousands of models independent of each other, possibly on a local dataset, assuming that the bulk of the computation is due to the large number of models and not to the model fitting itself. We can think of each iteration of the function as a relatively small operation, which we can distribute over an array of processing units (let's say some three or four processes per core, the model fitting not being very computationally demanding).

Given that, in appearance, the code that would constitute fitModels is not further optimizable, the next best alternative is parallel processing. Even if each run is non-optimal (or even arbitrarily slow), the serialized execution would be slower than a parallelized execution, since we assume that each run does not have a high overhead. Even the worst case (fitting only two models at a time, in parallel) would run slightly faster than the serialized code.

All of this, of course, would be useless if each fit consumes too many resources (huge dataset or very high overhead).
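
One rough way to check that assumption before parallelizing is to time a single fit and extrapolate. The sketch below uses system.time, with mtcars and a made-up model string as placeholders; the figure of 5000 models comes from the question.

```r
rhs <- "wt + hp"  # hypothetical model string

# Time one complete fit-and-summarise step
one.fit <- system.time({
  fit <- lm(as.formula(paste("mpg ~", rhs)), data = mtcars)
  s <- summary(fit)
})

# Rough serial estimate for 5000 models, in seconds
est.total <- unname(one.fit["elapsed"]) * 5000
cat("one fit:", one.fit["elapsed"], "s; estimated total:", est.total, "s\n")
```

If the estimate is far from the observed running time, a step not covered by the timed block is probably dominating, and profiling the full loop with Rprof would show where.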

This is where plyr comes into play. Although it adds overhead, it is a very convenient way to hand the work to a parallel backend with, as @DWin said, compact code.

Not being provided with example data and models, this is as far as I will go.

Suggestions to use plyr when the problem is speed seem dubious. It has serious advantages in compact expression, but rarely has it been a competitor in the speed department.
– DWin, Commented Feb 20, 2013 at 19:05
Thanks for your reply Oscar. Is it a general principle that a function encompassing a single iteration, when used with lapply, is more efficient than a loop of all iterations?
– dubhousing, Commented Feb 20, 2013 at 19:13
@Dwin After reading this post at librestats, I'm wondering if clusterApply might be a good approach for me, since I'm running Windows. Do you think that would be a promising approach?
– dubhousing, Commented Feb 20, 2013 at 19:14
To quote from that page: "Just remember to try basic optimization before you jump to parallelization. Slow serial code produces slow parallel code."
– DWin, Commented Feb 20, 2013 at 19:21
@DWin understood =) I will put one foot in front of another.
– dubhousing, Commented Feb 20, 2013 at 19:26
