Computing Statistics from Poorly Formatted Data (plyr and reshape packages for R)
Jul 9, 2009 metroadminPremise
I was recently asked to verify the coefficients of a linear model fit to sets of data, where each row of the input file was a "site" and each column contained the dependent variable through time (i.e. column 1 = time step 1, column 2 = time step 2, etc.). This format is cumbersome in that it cannot be directly fed into the R lm() function for linear model fitting. Furthermore, we needed the output formatted with columns containing slope, intercept, and R-squared values for each site (rows). All of the re-formatting, and model fitting can be done by hand, using basic R functions, however this task seemed like a good case study for the use of the reshape and plyr packages for R. The reshape package can be used to convert between "wide" and "long" format-- the first step in the example presented below. The plyr package can be used to split a data set into subsets (based on a grouping factor), apply an arbitrary function to the subset, and finally return the combined results in several possible formats. The original input data, desired output, and R code are listed below.
Input
2.521 2.312 2.720 2.254 * 2.922 * * 2.291 2.038 * * 1.1511.675 1.646 1.860 2.517 * 1.986 * * 3.279 3.420 * * 3.0591.734 1.305 1.774 2.366 * 2.909 * * 2.863 2.958 * * 2.9731.637 1.632 2.040 1.807 * 1.889 * * 2.081 2.267 * * 2.6551.967 8.307 8.331 8.698 * 8.236 * * 7.990 8.255 * * 8.0411.670 1.744 1.982 2.029 * 2.159 * * 3.330 2.945 * * 3.3011.668 1.816 1.832 2.100 * 2.289 * * 2.745 2.703 * * 3.2162.304 2.413 2.749 2.827 * 2.978 * * 3.011 3.244 * * 4.4941.505 2.827 3.375 1.923 * 4.250 * * 1.542 3.094 * * 1.480X
Output
site intercept slope Rsq1 1 2.8123 -0.0894 0.51152 2 1.5229 0.1512 0.76823 3 1.5499 0.1351 0.74454 4 1.5581 0.0738 0.84535 5 6.1738 0.2174 0.17276 6 1.4787 0.1527 0.90267 7 1.5340 0.1270 0.98718 8 2.1403 0.1437 0.82249 9 2.7546 -0.0425 0.0306X
Add required libraries and load example data files
# these may need to be installed with install.packages()library(plyr)library(reshape)# this one comes with the base install of Rlibrary(lattice)# read in the data as pasted from original formatd1 <- read.table('d1.txt', na.strings='*')d2 <- read.table('d2.txt')X
Reshape data
# transpose and convert to long formatd1.long <- melt(t(d1))d2.long <- melt(t(d2))# give resonable namesnames(d1.long) <- c('obs','site','value')names(d2.long) <- c('obs','site','value')# add time variable# 1:n obs * m sitesd1.long$time <- rep(1:13, 9)d2.long$time <- rep(1:8, 7)X
Visually check patterns
xyplot(value ~ time | factor(site), type=c('p','r'), data=d1.long, as.table=TRUE)xyplot(value ~ time | factor(site), type=c('p','r'), data=d2.long, as.table=TRUE)X
xyplot(value ~ time | factor(site), type=c('p','r'), data=d1.long, as.table=TRUE)xyplot(value ~ time | factor(site), type=c('p','r'), data=d2.long, as.table=TRUE)X
Extract linear model terms and R-squared for each subset
fit.summary <- function(i){# fit linear model to this set of the datal <- lm(value ~ time, data=i)# extract model termsl.coef <- coef(l)# extract R-squaredl.rsq <- summary(l)$r.squared# combine model details into a single vectorl.details <- c(l.coef, l.rsq)# rename elements of the vectornames(l.details) <- c('intercept', 'slope', 'Rsq')# return rounded values to the calling functionreturn(round(l.details, 4))}# compute lm details by siteddply(d1.long, .(site), fit.summary)ddply(d2.long, .(site), fit.summary)X
Attachments:
Links:
Comparison of Slope and Intercept Terms for Multi-Level Model II: Using Contrasts
R: advanced statistical package
Creating a Custom Panel Function (R - Lattice Graphics)