dylan's blog

Interesting R Packages

Submitted by dylan on Mon, 2009-04-27 15:47.

CHNOSZ Chemical thermodynamics library and activity diagram software


( categories: )

Comparison of Slope and Intercept Terms for Multi-Level Model II: Using Contrasts

Submitted by dylan on Tue, 2009-02-17 04:43.


Small update to a similar thread from last week, on the comparison of slope and intercept terms fit to a multi-level model. I finally figured out (thanks R-Help mailing list!) how to efficiently use contrasts in R. The C() function can be called within a model formula, to reset the base level of an un-ordered factor. The UCLA Stats Library has an extensive description of this topic here. This approach can be used to sequentially test for differences between slope and intercept terms from a multi-level model, by re-setting the base level of a factor. See example data and figure below.

Note that the multcomp package has a much more robust approach to this type of operation. Details below.

Example Multi-Level Data

# need these

# replicate an important experimental dataset
x <- rnorm(100)
y1 <- x[1:25] * 2 + rnorm(25, mean=1)
y2 <- x[26:50] * 2.6 + rnorm(25, mean=1.5)
y3 <- x[51:75] * 2.9 + rnorm(25, mean=5)
y4 <- x[76:100] * 3.5 + rnorm(25, mean=5.5)
d <- data.frame(x=x, y=c(y1,y2,y3,y4), f=factor(rep(letters[1:4], each=25)))

# plot
xyplot(y ~ x, groups=f, data=d,
auto.key=list(columns=4, title='Beard Type', lines=TRUE, points=FALSE, cex=0.75),
type=c('p','r'), ylab='Number of Pirates', xlab='Distance from Land')

Example Multi-Level Model IIExample Multi-Level Model II

( categories: )

Aggregating Soil Survey Information: Available Water Holding Capacity

Submitted by dylan on Sat, 2009-01-31 23:53.

4km Grid of AWC: generated using PostGIS/GRASS, based on USDA-NCSS SSURGO data.4km Grid of AWC: generated using PostGIS/GRASS, based on USDA-NCSS SSURGO data.

Horizon thickness-weighted mean AWC (available water holding capacity), aggregated to a 4km grid, based on the detailed (SSURGO) soil survey database. Each grid cell is the component percentage / area fraction weighted mean of profile AWC. The variation in AWC tracks several important parent material induced patterns: with lower AWC in residual soils formed on steep granitic terrain (south flank of Sierra Nevada), and higher AWC in residual soils formed on the gentler slopes of meta-volcanic and meta-sedimentary terrain (central and northern flanks of Sierra Nevada). The higher AWC values one the east side of the San Joaquin Valley correspond with the characteristically finer soils formed from coast range alluvium. High AWC values of the Sacramento Valley correspond with the fine textured soils derived from a mixture of coast range alluvium, and meta-volcanic/sedimentary alluvium from the Sierra Nevada.

( categories: )

Comparison of Slope and Intercept Terms for Multi-Level Model

Submitted by dylan on Thu, 2009-01-29 18:23.


When the relationship between two variable is (potentially) dependent on a third, categorical variable ANCOVA (analysis of covariance), or some variant, is commonly used. There are several approaches to testing for differences in slope/intercepts (in the case of a simple linear model) between levels of the stratifying variable. In R the following formula notation is usually used to test for interaction between levels of a factor (f) and the relationship between two continuous variables x and y: y ~ x * f. A simple graphical exploration of this type of model can be done through examination of confidence intervals computed for slope and intercept terms, for each level of our grouping factor (f). An example of a fictitious dataset is presented below. Note that this a rough approximation for testing differences in slope/intercept within a multi-level model. A more robust approach would take into account that we are trying to make several pair-wise comparisons, i.e. something akin to Tukey's HSD. Something like this can be done with the multcomp package. For any real data set you should always consult a real statistician.

Example Multi-Level Model: each panel represents a model fit to y ~ x, for group fExample Multi-Level Model: each panel represents a model fit to y ~ x, for group f

Example Multi-Level Data

# need this for xyplot()

# make some fake data:
x <- rnorm(100, mean=3, sd=6)
y <- x * runif(100, min=1, max=7) + runif(100, min=1.8, max=5)
d <- data.frame(x, y, f=rep(letters[1:10], each=10))

# check it out
xyplot(y ~ x | f, data=d, type=c('p','r'))

( categories: )

Some Ideas on Interpolation of Categorical Data

Submitted by dylan on Thu, 2009-01-15 04:36.



Wanted to make something akin to an interpolated surface for some spatially auto-correlated categorical data (presence/absence). I quickly generated some fake spatially auto-correlated data to work with using r.surf.fractal in GRASS. These data were converted into a binary map using an arbitrary threshold that looked about right-- splitting the data into something that looked 'spatially clumpy'.

Categorical Interpolation 1: Simulated auto-correlated, categorical variable, with sampling points and derived voronoi polygons.Fig. 1: Simulated auto-correlated, categorical variable, with sampling points and derived voronoi polygons.

I had used voronoi polygons in the past to display connectivity of categorical data recorded at points, even though sparsely sampled areas tend to be over emphasized. Figure 1 shows the fake spatially auto-correlated data (grey = presence /white = not present), sample points (yellow boxes), and voronoi polygons. The polygons with thicker, red boundaries represent the "voronoi interpolation" of the categorical feature.

( categories: )

Some Interesting Links

Submitted by dylan on Tue, 2008-12-30 02:45.

Found some interesting material today, mostly related to evaluation of statistical tests and such.

( categories: )

Python Image Module Example: How much ink on that page?

Submitted by dylan on Fri, 2008-12-26 21:26.

Thought it would be fun to compute how much ink a given poster requires, per unit area of paper, when sending to the department large-format printer. The Python Imaging Library provides several modules suitable for low-level operation on image data. A simple (and probably very inefficient) script was developed to compute the white/black percentage of an image. A script like this could be used to adjust a per-poster "ink cost", which would hopefully prevent people from wasting ink. Obviously, this computation is scale-dependent, so standardized rasterization parameters would have to be set in order for the "ink cost" calculation to be fair. More generalized or efficient approaches are always welcomed.

( categories: )

Traveling Salesman Approach to Visiting Data-loggers II

Submitted by dylan on Tue, 2008-12-23 01:46.

Updated version of a previous attempt to use a traveling-salesman approach to visiting data-loggers in steep terrain. This time we use a vector/raster approach (v.rast.stats) to compute the total "cost" incurred along each segment of a network connecting all of our data-loggers. Although this approach requires more computation time (v.rast.stats is slow), we can generate a more detailed cost surface to model movement. In this case we are using a cumulative-traversed-slope derived cost, with a 90% reduction along a local road.

Traveling Salesman Approach to Visiting Data-loggers

Submitted by dylan on Sat, 2008-12-20 06:44.

Optimal Route in 3DOptimal Route in 3D

Several data-loggers in steep terrain; how can we visit all of them in the most efficient manner such that we traverse the least cumulative slope gradient? In a sense this is a traveling salesman style problem, with travel costs defined by slope gradient. The v.net.salesman module in GRASS can be used to compute the optimal route between data-loggers, given: 1) a network of lines connecting all points to be visited and 2) transferal of slope cost information to the network. An example approach is listed below. Note that due to a bug in v.net.salesman, we need to use one extra step-- details below.

Additive Time Series Decomposition in R: Soil Moisture and Temperature Data

Submitted by dylan on Mon, 2008-10-27 16:08.

Decagon Sensors: EC-5 (moisture) and ECT (temperature)

Simple demonstration of working with time-series data collected from Decagon Devices soil moisture and temperature sensors. These sensors were installed in a potted plant, that was semi-regularly watered, and data were collected for about 80 days on an hourly basis. Several basic operations in R are demonstrated:

  • reading raw data in CSV format
  • converting date-time values to R's date-time format
  • applying a calibration curve to raw sensor values
  • initialization of R time series objects
  • seasonal decomposition of additive time series (trend extraction)
  • plotting of decomposed time series, ACF, and cross-ACF