Computing Statistics from Poorly Formatted Data (plyr and reshape packages for R)

Posted: Thursday, July 9th, 2009

Premise
I was recently asked to verify the coefficients of a linear model fit to sets of data, where each row of the input file was a "site" and each column contained the dependent variable through time (i.e. column 1 = time step 1, column 2 = time step 2, etc.). This format is cumbersome in that it cannot be directly fed into the R lm() function for linear model fitting. Furthermore, we needed the output formatted with columns containing slope, intercept, and R-squared values for each site (rows). All of the re-formatting, and model fitting can be done by hand, using basic R functions, however this task seemed like a good case study for the use of the reshape and plyr packages for R. The reshape package can be used to convert between "wide" and "long" format-- the first step in the example presented below. The plyr package can be used to split a data set into subsets (based on a grouping factor), apply an arbitrary function to the subset, and finally return the combined results in several possible formats. The original input data, desired output, and R code are listed below.

Input

2.521 2.312 2.720 2.254 * 2.922 * * 2.291 2.038 * * 1.151
1.675 1.646 1.860 2.517 * 1.986 * * 3.279 3.420 * * 3.059
1.734 1.305 1.774 2.366 * 2.909 * * 2.863 2.958 * * 2.973
1.637 1.632 2.040 1.807 * 1.889 * * 2.081 2.267 * * 2.655
1.967 8.307 8.331 8.698 * 8.236 * * 7.990 8.255 * * 8.041
1.670 1.744 1.982 2.029 * 2.159 * * 3.330 2.945 * * 3.301
1.668 1.816 1.832 2.100 * 2.289 * * 2.745 2.703 * * 3.216
2.304 2.413 2.749 2.827 * 2.978 * * 3.011 3.244 * * 4.494
1.505 2.827 3.375 1.923 * 4.250 * * 1.542 3.094 * * 1.480

Output

 site intercept slope Rsq
1 1 2.8123 -0.0894 0.5115
2 2 1.5229 0.1512 0.7682
3 3 1.5499 0.1351 0.7445
4 4 1.5581 0.0738 0.8453
5 5 6.1738 0.2174 0.1727
6 6 1.4787 0.1527 0.9026
7 7 1.5340 0.1270 0.9871
8 8 2.1403 0.1437 0.8224
9 9 2.7546 -0.0425 0.0306

Add required libraries and load example data files

# these may need to be installed with install.packages()
library(plyr)
library(reshape)
# this one comes with the base install of R
library(lattice)
 
# read in the data as pasted from original format
d1 <- read.table('d1.txt', na.strings='*')
d2 <- read.table('d2.txt')

Reshape data

# transpose and convert to long format
d1.long <- melt(t(d1))
d2.long <- melt(t(d2))
 
# give resonable names
names(d1.long) <- c('obs','site','value')
names(d2.long) <- c('obs','site','value')
 
# add time variable
#  1:n obs * m sites
d1.long$time <- rep(1:13, 9)
d2.long$time <- rep(1:8, 7)

Visually check patterns

xyplot(value ~ time | factor(site), type=c('p','r'), data=d1.long, as.table=TRUE)
xyplot(value ~ time | factor(site), type=c('p','r'), data=d2.long, as.table=TRUE)
xyplot(value ~ time | factor(site), type=c('p','r'), data=d1.long, as.table=TRUE)
xyplot(value ~ time | factor(site), type=c('p','r'), data=d2.long, as.table=TRUE)

Extract linear model terms and R-squared for each subset

fit.summary <- function(i)
{
        # fit linear model to this set of the data
        l <- lm(value ~ time, data=i)
        # extract model terms
        l.coef <- coef(l)
        # extract R-squared
        l.rsq <- summary(l)$r.squared
        # combine model details into a single vector
        l.details <- c(l.coef, l.rsq)
        # rename elements of the vector
        names(l.details) <- c('intercept', 'slope', 'Rsq')
        # return rounded values to the calling function
        return(round(l.details, 4))
}
 
# compute lm details by site
ddply(d1.long, .(site), fit.summary)
ddply(d2.long, .(site), fit.summary)

Attachments:

d1.txt

d2.txt

Links:

Comparison of Slope and Intercept Terms for Multi-Level Model II: Using Contrasts

R: advanced statistical package

Creating a Custom Panel Function (R - Lattice Graphics)