Computing Statistics from Poorly Formatted Data (plyr and reshape packages for R)

Posted: Thursday, July 9th, 2009

Premise
I was recently asked to verify the coefficients of a linear model fit to sets of data, where each row of the input file was a "site" and each column contained the dependent variable through time (i.e. column 1 = time step 1, column 2 = time step 2, etc.). This format is cumbersome in that it cannot be directly fed into the R lm() function for linear model fitting. Furthermore, we needed the output formatted with columns containing slope, intercept, and R-squared values for each site (rows). All of the re-formatting, and model fitting can be done by hand, using basic R functions, however this task seemed like a good case study for the use of the reshape and plyr packages for R. The reshape package can be used to convert between "wide" and "long" format-- the first step in the example presented below. The plyr package can be used to split a data set into subsets (based on a grouping factor), apply an arbitrary function to the subset, and finally return the combined results in several possible formats. The original input data, desired output, and R code are listed below.

Input

```2.521 2.312 2.720 2.254 * 2.922 * * 2.291 2.038 * * 1.151
1.675 1.646 1.860 2.517 * 1.986 * * 3.279 3.420 * * 3.059
1.734 1.305 1.774 2.366 * 2.909 * * 2.863 2.958 * * 2.973
1.637 1.632 2.040 1.807 * 1.889 * * 2.081 2.267 * * 2.655
1.967 8.307 8.331 8.698 * 8.236 * * 7.990 8.255 * * 8.041
1.670 1.744 1.982 2.029 * 2.159 * * 3.330 2.945 * * 3.301
1.668 1.816 1.832 2.100 * 2.289 * * 2.745 2.703 * * 3.216
2.304 2.413 2.749 2.827 * 2.978 * * 3.011 3.244 * * 4.494
1.505 2.827 3.375 1.923 * 4.250 * * 1.542 3.094 * * 1.480```

Output

``` site intercept slope Rsq
1 1 2.8123 -0.0894 0.5115
2 2 1.5229 0.1512 0.7682
3 3 1.5499 0.1351 0.7445
4 4 1.5581 0.0738 0.8453
5 5 6.1738 0.2174 0.1727
6 6 1.4787 0.1527 0.9026
7 7 1.5340 0.1270 0.9871
8 8 2.1403 0.1437 0.8224
9 9 2.7546 -0.0425 0.0306```

```# these may need to be installed with install.packages()
library(plyr)
library(reshape)
# this one comes with the base install of R
library(lattice)

# read in the data as pasted from original format

Reshape data

```# transpose and convert to long format
d1.long <- melt(t(d1))
d2.long <- melt(t(d2))

# give resonable names
names(d1.long) <- c('obs','site','value')
names(d2.long) <- c('obs','site','value')

#  1:n obs * m sites
d1.long\$time <- rep(1:13, 9)
d2.long\$time <- rep(1:8, 7)```

Visually check patterns

```xyplot(value ~ time | factor(site), type=c('p','r'), data=d1.long, as.table=TRUE)
xyplot(value ~ time | factor(site), type=c('p','r'), data=d2.long, as.table=TRUE)```
```xyplot(value ~ time | factor(site), type=c('p','r'), data=d1.long, as.table=TRUE)
xyplot(value ~ time | factor(site), type=c('p','r'), data=d2.long, as.table=TRUE)```

Extract linear model terms and R-squared for each subset

```fit.summary <- function(i)
{
# fit linear model to this set of the data
l <- lm(value ~ time, data=i)
# extract model terms
l.coef <- coef(l)
# extract R-squared
l.rsq <- summary(l)\$r.squared
# combine model details into a single vector
l.details <- c(l.coef, l.rsq)
# rename elements of the vector
names(l.details) <- c('intercept', 'slope', 'Rsq')
# return rounded values to the calling function
return(round(l.details, 4))
}

# compute lm details by site
ddply(d1.long, .(site), fit.summary)
ddply(d2.long, .(site), fit.summary)```

d1.txt

d2.txt