Why doesn't aggregate() work here?

> aggregate(cbind(var1 = 1:10, var2 = 101:110), 
      by=list(range=cut(1:10, breaks=c(2,4,8,10))), 
      FUN = function(x) 
        { 
        c(obs=length(x[, "var2"]), avg=mean(x[, "var2"]), sd=sd(x[, "var2"])) 
        })

Error in x[, "var2"] (from #1) : incorrect number of dimensions

> cbind(var1 = 1:10, var2 = 101:110)[, "var2"]
 [1] 101 102 103 104 105 106 107 108 109 110

UPDATE

Returned aggregate() values after running the correct version:

> r = aggregate(data.frame(var1 = 1:10, var2 = 101:110), by=list(range=cut(1:10, breaks=c(2,4,8,10))), FUN = function(x) { c(obs=length(x), avg=mean(x), sd=sd(x)) })
> class(r)
[1] "data.frame"
> dim(r)
[1] 3 3
> r[,1]
[1] (2,4]  (4,8]  (8,10]
Levels: (2,4] (4,8] (8,10]
> r[,2]
     obs avg       sd
[1,]   2 3.5 0.707107
[2,]   4 6.5 1.290994
[3,]   2 9.5 0.707107
> r[,3]
     obs   avg       sd
[1,]   2 103.5 0.707107
[2,]   4 106.5 1.290994
[3,]   2 109.5 0.707107
> class(r[,2])
[1] "matrix"
> class(r[,3])
[1] "matrix"
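
Since var1 and var2 in the result are matrix columns, a single statistic can be reached by indexing the matrix by column name. One common idiom for turning such a result into ordinary columns is do.call(data.frame, r). The sketch below assumes the same data as above; the name flat is introduced here purely for illustration.

```r
r <- aggregate(
  data.frame(var1 = 1:10, var2 = 101:110),
  by = list(range = cut(1:10, breaks = c(2, 4, 8, 10))),
  FUN = function(x) c(obs = length(x), avg = mean(x), sd = sd(x))
)

# Each statistic is a named column of the matrix stored in r$var2.
r$var2[, "avg"]

# Expand the matrix columns into ordinary data.frame columns
# (flat is a hypothetical name used only in this sketch).
flat <- do.call(data.frame, r)
names(flat)
dim(flat)
```

After flattening, each statistic is an ordinary column such as flat$var2.avg.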
cbind with numeric arguments returns a matrix, not a data frame. Also, you should not expect to reference column names inside the anonymous function supplied to FUN. Commented Apr 27, 2015 at 19:13

2 Answers

Supply a data frame, and note that aggregate passes FUN only column vectors, one group at a time. Indexing with x[ , "colname"] is therefore doomed, because "x" is not a data frame:

 aggregate(data.frame(var1 = 1:10, var2 = 101:110), 
       by=list(range=cut(1:10, breaks=c(2,4,8,10))), 
       FUN = function(x) 
         { 
         c(obs=length(x), avg=mean(x), sd=sd(x)) 
         })
#------------
   range  var1.obs  var1.avg   var1.sd    var2.obs    var2.avg     var2.sd
1  (2,4] 2.0000000 3.5000000 0.7071068   2.0000000 103.5000000   0.7071068
2  (4,8] 4.0000000 6.5000000 1.2909944   4.0000000 106.5000000   1.2909944
3 (8,10] 2.0000000 9.5000000 0.7071068   2.0000000 109.5000000   0.7071068

6 Comments

Is "x" a matrix? I wouldn't know how to break into that part of the code to inspect the objects.
"x" would have been a (possibly named) numeric (atomic) vector at the point it/they were being passed to FUN. It would not have had a dimension so it would be neither a matrix nor a dataframe.
Interesting, so how do we end up with a length/mean/sd for each column in the original data.frame object (var1/var2)? If "x" is a simple vector, does it mean FUN is called once for each data.frame column?
FUN is called once per column for each category in the INDEX (rather, by) argument. That is the entire reason for aggregate's existence. So FUN is called length(dfrm) * length(unique(by-vector)) times, counting only the non-missing categories.
aggregate returns a first label column as the vector of unique (sorted) values in the by-argument and then basically rbinds the values from the multiple calls to FUN for each column. The function doing the actual "rbinding" is sapply.
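
The call pattern described in these comments can be checked with a small sketch. It assumes the data from the answer above; the counter calls and the log no_dim are hypothetical names introduced here, not part of aggregate().

```r
calls  <- 0L          # hypothetical counter: how often FUN is invoked
no_dim <- logical(0)  # hypothetical log: TRUE when x has no dim attribute

res <- aggregate(
  data.frame(var1 = 1:10, var2 = 101:110),
  by = list(range = cut(1:10, breaks = c(2, 4, 8, 10))),
  FUN = function(x) {
    calls <<- calls + 1L               # one call per column per group
    no_dim[calls] <<- is.null(dim(x))  # a plain vector has no dimensions
    c(obs = length(x), avg = mean(x), sd = sd(x))
  }
)

# Rows 1 and 2 fall outside every interval, so their group is NA and
# they are dropped: 2 columns * 3 remaining groups.
print(calls)
print(all(no_dim))
```

Because x never carries a dim attribute, x[, "var2"] fails inside FUN regardless of whether a matrix or a data frame was supplied.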

That's because aggregate doesn't pass data.frames to its FUN= argument. It passes each column's observations as a plain vector, and [, "name"] indexing fails on a vector because it has no dimensions. Make sure you pass in a data.frame rather than a matrix as in your example. Perhaps you want the by function instead:

by(data.frame(var1 = 1:10, var2 = 101:110), 
    list(range=cut(1:10, breaks=c(2,4,8,10))), 
    FUN = function(x) { c(obs=length(x[, "var2"]), avg=mean(x[, "var2"]), sd=sd(x[, "var2"])) })
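
As a rough sketch of what that by() call returns: each group yields one named vector, and the pieces can be stacked into a matrix with do.call(rbind, ...). The names res and stacked are introduced here for illustration only.

```r
res <- by(
  data.frame(var1 = 1:10, var2 = 101:110),
  list(range = cut(1:10, breaks = c(2, 4, 8, 10))),
  FUN = function(x) {
    # Here x is a data.frame holding the rows of one group,
    # so indexing by column name works.
    c(obs = length(x[, "var2"]), avg = mean(x[, "var2"]), sd = sd(x[, "var2"]))
  }
)

# Stack the per-group vectors: one row per group, one column per statistic.
stacked <- do.call(rbind, res)
stacked
```

Unlike aggregate, this summarises only the columns that FUN explicitly picks out, which suits functions that need several columns at once.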

2 Comments

I checked the aggregate code: it converts a matrix parameter to a data.frame if it's not a time series object. Which "vector of observations" does FUN receive, exactly?
It passes in columns as vectors. It only ever operates on one column at a time.
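
To see which vectors FUN receives for one column, the grouping can be approximated by hand with split(). This is only a sketch of the idea, not aggregate()'s actual source; dfrm, grp and pieces are names assumed here for illustration.

```r
dfrm <- data.frame(var1 = 1:10, var2 = 101:110)
grp  <- cut(1:10, breaks = c(2, 4, 8, 10))

# The per-group pieces of one column; values whose group is NA are dropped.
pieces <- split(dfrm$var2, grp)
pieces

# Applying the summary to each piece reproduces the var2 statistics,
# with one row per group after transposing.
t(sapply(pieces, function(x) c(obs = length(x), avg = mean(x), sd = sd(x))))
```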
