Computing Statistics from Poorly Formatted Data (plyr and reshape packages for R)
Jul 9, 2009 metroadmin
Premise
I was recently asked to verify the coefficients of linear models fit to several sets of data, where each row of the input file was a "site" and each column contained the dependent variable at one time step (i.e. column 1 = time step 1, column 2 = time step 2, etc.). This format is cumbersome because it cannot be fed directly into the R lm() function for linear model fitting. Furthermore, we needed the output formatted with one row per site and columns containing the slope, intercept, and R-squared values. All of the re-formatting and model fitting can be done by hand using basic R functions; however, this task seemed like a good case study for the reshape and plyr packages for R. The reshape package can be used to convert between "wide" and "long" formats, which is the first step in the example presented below. The plyr package can be used to split a data set into subsets (based on a grouping factor), apply an arbitrary function to each subset, and finally return the combined results in one of several possible formats. The original input data, desired output, and R code are listed below.
Input
Output
Add required libraries and load example data files
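A minimal sketch of this step, assuming the reshape and plyr packages are installed. The original example files are not reproduced here, so a small simulated data frame `x` stands in for them: the object name, the `site` column, and the `t1`..`t5` column names are all assumptions made for illustration. With the real file, a call such as read.csv() would take the place of the simulated data.

```r
# load the two packages discussed above
library(reshape)
library(plyr)

# hypothetical stand-in for the input file:
# one row per site, one column per time step (wide format)
set.seed(1)
x <- data.frame(
  site = paste("site", 1:4, sep = "_"),
  matrix(round(rnorm(4 * 5, mean = 10), 2), nrow = 4,
         dimnames = list(NULL, paste("t", 1:5, sep = "")))
)

# check the structure: 4 rows (sites), 1 id column + 5 time-step columns
str(x)
```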
Reshape data
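The wide-to-long conversion might look roughly like the following, using melt() from the reshape package. The data frame and its column names (`site`, `t1`..`t3`) are made up for illustration, and recovering a numeric time step by stripping the "t" prefix only works because of that assumed naming.

```r
library(reshape)

# hypothetical wide data: rows are sites, columns t1..t3 are time steps
x <- data.frame(site = c("A", "B"),
                t1 = c(1.0, 2.0),
                t2 = c(1.5, 2.1),
                t3 = c(2.1, 2.4))

# wide -> long: one row per site / time-step combination;
# melt() stores the old column names in 'variable' and the data in 'value'
x.long <- melt(x, id.vars = "site")

# recover a numeric time step from the old column names ("t1" -> 1, ...)
x.long$time <- as.numeric(gsub("t", "", as.character(x.long$variable)))

x.long
```

The long format has one observation per row, which is the layout that lm() and most plotting functions expect.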
Visually check patterns
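One way to do a quick visual check is a lattice plot with one panel per site. This is only a sketch: the original plotting code is not shown here, so the choice of xyplot() and the long-format data below (columns `site`, `time`, `value`) are assumptions.

```r
library(lattice)

# hypothetical long-format data: one row per site / time-step combination
x.long <- data.frame(
  site  = rep(c("A", "B"), each = 4),
  time  = rep(1:4, times = 2),
  value = c(1.0, 1.6, 2.1, 2.4, 3.0, 2.7, 2.2, 2.0)
)

# one panel per site: points ("p") plus a fitted regression line ("r")
p <- xyplot(value ~ time | site, data = x.long, type = c("p", "r"),
            xlab = "Time step", ylab = "Value")
print(p)
```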
Extract linear model terms and R-squared for each subset
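The split-apply-combine step can be sketched with ddply() from plyr, which splits a data frame by a grouping variable, applies a function to each piece, and returns a single data frame. The helper `fit_terms()` and the example data are hypothetical, introduced only to illustrate the idea.

```r
library(plyr)

# hypothetical long-format data: one row per site / time-step combination
x.long <- data.frame(
  site  = rep(c("A", "B"), each = 4),
  time  = rep(1:4, times = 2),
  value = c(1.1, 2.9, 5.2, 6.8, 10.2, 8.9, 8.1, 6.8)
)

# hypothetical helper: fit value ~ time on one subset and
# return the model terms as a one-row data frame
fit_terms <- function(d) {
  m <- lm(value ~ time, data = d)
  data.frame(intercept = coef(m)[[1]],
             slope     = coef(m)[[2]],
             r.squared = summary(m)$r.squared)
}

# split by site, fit each subset, combine into one row per site
res <- ddply(x.long, .(site), fit_terms)
res
```

The result has one row per site with intercept, slope, and R-squared columns, which matches the output layout described in the premise.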
Links:
Comparison of Slope and Intercept Terms for Multi-Level Model II: Using Contrasts
R: advanced statistical package
Creating a Custom Panel Function (R - Lattice Graphics)