Yet Another plyr Example

Submitted by dylan on Thu, 2010-03-04 18:22.

[Figure: another plyr example. Quantiles (0.05, 0.25, 0.5, 0.75, 0.95) of DSC by temperature bin.]

There are plenty of good examples of how to use functions from the plyr package. Here is one more, demonstrating how to use ddply() with a custom function. Note that there are two places where the example function may fail if you pass in poorly formatted or unusual data: the calls to 1) t.test() and 2) quantile(). Also note the use of the transpose function, t(), to convert column-wise data into row-wise data, suitable for inclusion in a data frame containing a single row.
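To make the t() step concrete, here is a small sketch in base R using a made-up vector x (not the DSC data). It shows why the quantiles are transposed before being wrapped in data.frame(), and why the example function renames the columns afterwards.

```r
# quantile() returns a named numeric vector
x <- c(1, 2, 3, 4, 5)   # made-up values, for illustration only
q <- quantile(x, probs = c(0.25, 0.5, 0.75))

# without t(): one column, one row per quantile
dim(data.frame(q))

# with t(): a 1 x 3 matrix, which becomes a single-row data frame
d <- data.frame(t(q))
dim(d)

# data.frame() mangles names such as "25%" into "X25.",
# so assign readable names by hand
names(d) <- paste('q', c(25, 50, 75), sep = '_')
d
```

A single-row data frame is what ddply() needs from each group so that the pieces can be stacked into one result.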

 
Example Code

#libraries
library(plyr)
library(lattice)

# simulate DSC data from several files
r <- data.frame(temp=rep(1:100, times=2), dsc=rnorm(200))
r$file <- factor(rep(c('file 1','file 2'), each=100))

# bin temperature in 5 degree slices
groups <- seq(0, 100, by=5)
r$group <- cut(r$temp, groups)

# custom summary function
# updated to work with arbitrary column names
f <- function(i, column)
        {

        # conf interval
        # careful with this t.test -- may blow up with some datasets
        # (t.test() drops NA values itself; an na.action argument has no effect here)
        i.conf <- data.frame(t(t.test(i[, column], conf.level=0.95)$conf.int))
        names(i.conf) <- c('lower', 'upper')

        # quantiles
        p <- c(0.05, 0.25, 0.5, 0.75, 0.95)
        i.quant <- data.frame(t(quantile(i[, column], probs=p, na.rm=TRUE)))
        names(i.quant) <- paste('q', round(p * 100), sep='_')

        # make a dataframe
        d <- data.frame(
                mean=mean(i[, column], na.rm=TRUE),
                min=min(i[, column], na.rm=TRUE),
                max=max(i[, column], na.rm=TRUE),
                sd=sd(i[, column], na.rm=TRUE),
                i.quant,
                i.conf
                )

        # give back to caller
        return(d)
        }


# apply our function by groups
r.agg <- ddply(r, .(group), .fun=f, column='dsc')

# visualization of some of the results
xyplot(q_50 + q_5 + q_95 ~ group, data=r.agg, type='l', lty=c(1,2,2), col=1)
xyplot(q_5 + q_25 + q_50 + q_75 + q_95 ~ group, data=r.agg, type='l', lty=c(3,2,1,2,3), lwd=c(1,1,2,1,1), col=1)
xyplot(mean + lower + upper ~ group, data=r.agg, type='l', lty=c(1,2,2), col=1)
xyplot(mean + lower + upper ~ group, data=r.agg, type='p', cex=c(1,2,2), pch=c('o','-','-'), col=1)
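
One possible way to guard against the t.test() failure mentioned at the top is to wrap the call in tryCatch(). This is only a sketch: safe.conf() is a hypothetical helper, not part of plyr or of the function above, and returning NA for a failed group is an assumption about what you want in the output.

```r
# hypothetical helper: confidence interval of the mean, NA when t.test() fails
safe.conf <- function(x, conf.level = 0.95) {
  res <- tryCatch(
    as.numeric(t.test(x, conf.level = conf.level)$conf.int),
    # t.test() stops on too few observations or constant data
    error = function(e) c(NA_real_, NA_real_)
  )
  data.frame(lower = res[1], upper = res[2])
}

safe.conf(c(1, 2, 3, 4))   # ordinary case: two finite bounds
safe.conf(5)               # a single observation: NA, NA
safe.conf(c(2, 2, 2))      # constant data: NA, NA
```

Inside f(), the i.conf line could be swapped for a call like safe.conf(i[, column]), so that one bad group no longer stops the whole ddply() run. The same pattern could be wrapped around quantile() if needed.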


Instead of using:

# make a dataframe
d <- data.frame(
mean=mean(i$dsc, na.rm=TRUE),
min=min(i$dsc, na.rm=TRUE),
max=max(i$dsc, na.rm=TRUE),
sd=sd(i$dsc, na.rm=TRUE),
i.quant,
i.conf
)

you could use the colwise() function, also available in the plyr package. You would need to modify the i.quant and i.conf objects to make this work.
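
For reference, a minimal sketch of what a colwise() call might look like with the simulated data from the post. It covers only a single statistic (the mean of dsc); r.mean is a name made up for this illustration.

```r
library(plyr)

# simulated data, as in the post
r <- data.frame(temp = rep(1:100, times = 2), dsc = rnorm(200))
r$group <- cut(r$temp, seq(0, 100, by = 5))

# colwise() turns a function of a vector into a function that is
# applied to each selected column of a data frame; only 'dsc' is selected here
r.mean <- ddply(r, .(group), colwise(mean, .(dsc)))
head(r.mean)
```

The result has one row per temperature bin and keeps the original column name (dsc), so several statistics would need several calls or some renaming.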

RE: colwise()

Thanks for the tip. I think that colwise() will only work with functions that return a vector... However, the summarise() function from plyr will get us almost there, with less code:

ddply(r, .(group), summarise, max=max(dsc), min=min(dsc), mean=mean(dsc), q_05=quantile(dsc, probs=0.05), q_95=quantile(dsc, probs=0.95))

      group      max       min        mean      q_05      q_95
1    (0,10] 1.882492 -2.071292  0.18583434 -1.481344 1.4763494
2   (10,20] 1.913237 -1.705417  0.06888515 -1.276185 1.3142561
3   (20,30] 1.711792 -2.046275 -0.00865485 -1.895729 1.5027731
4   (30,40] 1.616174 -2.659262 -0.04254148 -2.106023 1.2285925
5   (40,50] 2.301109 -2.040825  0.32765215 -1.123423 1.8267938
6   (50,60] 1.175290 -2.520839 -0.49127679 -2.137796 0.7901546
7   (60,70] 1.718479 -2.331585 -0.14320696 -1.739600 1.5144656
8   (70,80] 1.641169 -1.528282 -0.01077159 -1.306301 1.2659325
9   (80,90] 2.560076 -1.157201  0.20236340 -1.098758 1.7695491
10 (90,100] 1.062150 -2.324590 -0.67143682 -2.206479 0.7240225
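
The confidence interval from the original function can probably be added in the same style by indexing into the result of t.test(). A sketch, assuming 10-degree bins (to match the group labels in the output above) and an arbitrary seed; r.ci is a made-up name:

```r
library(plyr)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
r <- data.frame(temp = rep(1:100, times = 2), dsc = rnorm(200))
r$group <- cut(r$temp, seq(0, 100, by = 10))

# conf.int is a length-2 vector: element 1 is the lower bound, element 2 the upper
r.ci <- ddply(r, .(group), summarise,
  mean  = mean(dsc),
  lower = t.test(dsc)$conf.int[1],
  upper = t.test(dsc)$conf.int[2]
)
head(r.ci)
```

This runs t.test() twice per group, which is wasteful but harmless for small data. The custom-function approach from the post computes it once.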