Open Source Software Tools for Soil Scientists

 
Abstract:
A collection of articles, methods, and examples of how open source software tools can assist with research in the field of soil science. Additional links to methods in environmental sciences within the realm of statistics, interpolation, morphometrics, and more.

 
UCD Resources:
Garrett and I are leading a seminar-style course covering some of this material, Fall quarter, 2006.

 
Articles:

 
Reference Material

General Purpose Programming with Scripting Languages

 
Some initial hints of use of various scripting languages.

 
Python

 
Ruby

Python Image Module Example: How much ink on that page?

 
Premise:
Thought it would be fun to compute how much ink a given poster requires, per unit area of paper, when sending to the department large-format printer. The Python Imaging Library provides several modules suitable for low-level operation on image data. A simple (and probably very inefficient) script was developed to compute the white/black percentage of an image. A script like this could be used to adjust a per-poster "ink cost", which would hopefully prevent people from wasting ink. Obviously, this computation is scale-dependent, so standardized rasterization parameters would have to be set in order for the "ink cost" calculation to be fair. More generalized or efficient approaches are always welcomed.

 
Implementation: (when copying/pasting note whitespace in blocks)

#!/usr/bin/env python

# invoke like this: ./white.py image.png
# idea from http://www.halfbakery.com/user/lurch
import sys
import Image

# cheap and non-robust way to access first argument
file = sys.argv[1]

# open the image, load into memory
pic = Image.open( file )
imgdata = pic.load()

# extract the dimensions of the image
xsize, ysize = pic.size

# init some counters
black = 0
white = 0

# define some colors (R,G,B)
b_color = (0,0,0)
w_color = (255,255,255)

# loop over columns
for x in xrange(xsize):
        # loop over rows
        for y in xrange(ysize):
                if imgdata[x,y] == w_color:
                        white+=1
                if imgdata[x,y] == b_color:
                        black+=1
         

# compute percentages
total_pixels = xsize * ysize
pct_white = (float(white) / float(total_pixels)) * 100
pct_black = (float(black) / float(total_pixels)) * 100

# formatted printing
print "%.3f%% of image is white!" % (pct_white)
print "%.3f%% of image is black!" % (pct_black)

Simple Python Interface to USGS TNM Elevation Service

 
Premise
Wanted a simpler way to access the USGS seamless elevation look-up service. Python seemed like a logical start. Note that the response from the USGS webservice is not correctly identified as valid XML by the python XML-parser. Therefore there is a small amount of scrubbing used to coerce the response into valid XML. Comments on why this is, or is not, a good idea are welcome. Here is a link to an interactive version.
 
Update It looks like the USGS service does not accept POST-style requests. I have made some small changes to the script below.

 
Example Listing

# nice CLI interface
import sys
from optparse import OptionParser

# url access
import urllib, urllib2

# simple XML parsing
from xml.dom import minidom

# CSV file parsing
import csv


# define service URL
base_url = 'http://gisdata.usgs.gov/XMLWebServices/TNM_Elevation_Service.asmx/getElevation?X_Value=%f&Y_Value=%f&Elevation_Units=meters&Source_Layer=-1&Elevation_Only=1'


#process command line (optparse)
parser = OptionParser()
parser.add_option("-f", "--file", dest="infile", help="input csv file containing WGS84 (lon,lat,site_id) record", metavar="FILE")


# process args
(options, args) = parser.parse_args()

#require an input file
if not options.infile:
        print "ERROR: must supply an input file!"
        sys.exit(1)


#open input file
try:
        infile = options.infile
        reader = csv.reader(open(infile, "rb"))
except:
        print "ERROR: Cannot open: " + infile
        sys.exit(1)


# read the csv file
for row in reader:
        lon = float(row[0])
        lat = float(row[1])
        site_id = row[2]
       
        # the base URL of the web service
        url = 'http://gisdata.usgs.gov/XMLWebServices/TNM_Elevation_Service.asmx/getElevation'
       
        # url GET args
        values = {'X_Value' : lon,
        'Y_Value' : lat,
        'Elevation_Units' : 'meters',
        'Source_Layer' : '-1',
        'Elevation_Only' : '1', }
       
        # make some fake headers, with a user-agent that will
        # not be rejected by bone-headed servers
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent' : user_agent}
       
        # encode the GET arguments
        data = urllib.urlencode(values)
       
        # make the URL into a qualified GET statement:
        get_url = url + '?' + data
       
        # make the request: note that by ommitting the url arguments
        # we force a GET request, instead of a POST
        req = urllib2.Request(url=get_url, headers=headers)
        response = urllib2.urlopen(req)
        the_page = response.read()
       
        # convert the HTML back into plain XML
        for entity, char in (('lt', '<'), ('gt', '>'), ('amp', '&')):
                the_page = the_page.replace('&%s;' % entity, char)

        # clean some cruft... XML won't parse with this stuff in there...
        the_page = the_page.replace('<string xmlns="http://gisdata.usgs.gov/XMLWebServices/">', '')
        the_page = the_page.replace('<?xml version="1.0" encoding="utf-8"?>\r\n', '')
        the_page = the_page.replace('</string>', '')
        the_page = the_page.replace('<!-- Elevation Values of -1.79769313486231E+308 (Negative Exponential Value) may mean the data source does not have values at that point.  --> <USGS_Elevation_Web_Service_Query>', '')
       
        # parse the cleaned XML
        dom = minidom.parseString(the_page)
        children = dom.getElementsByTagName('Elevation_Query')[0]
       
        # extract the interesting parts
        elev = float(children.getElementsByTagName('Elevation')[0].firstChild.data)
        data_source = children.getElementsByTagName('Data_Source')[0].firstChild.data
       
        # print to stdout
        print "%f,%f,%s,%f,%s" % (lon, lat, site_id, elev, data_source)

 
Incantation

python elev_lookup.py -f ca_cities.txt

 
Input File CA cities in WGS84 (lon,lat,id) records

-122.32,37.78,Alameda NAS
-120.53,41.48,Alturas
-124.1,40.98,Arcata
-119.05,35.43,Bakersfield
-121.45,39.13,Beale AFB
-116.95,33.93,Beaumont
-116.62,35.28,Bicycle Lk
-116.68,34.27,Big Bear Apt
-118.6,37.60,Bishop
-120.7,39.28,Blue Canyon
-114.72,33.62,Blythe
-118.37,34.20,Burbank

 
Output

-122.320000,37.780000,Alameda NAS,2.123548,NED Contiguous U. S. 1/3W arc second elevation data
-120.530000,41.480000,Alturas,1331.056396,NED Contiguous U. S. 1/3W arc second elevation data
-124.100000,40.980000,Arcata,63.749836,NED Contiguous U. S. 1/3W arc second elevation data
-119.050000,35.430000,Bakersfield,148.473618,NED Contiguous U. S. 1/3W arc second elevation data
-121.450000,39.130000,Beale AFB,29.950783,NED Contiguous U. S. 1/3W arc second elevation data
-116.950000,33.930000,Beaumont,792.840881,NED Contiguous U. S. 1/3W arc second elevation data
-116.620000,35.280000,Bicycle Lk,711.765869,NED Contiguous U. S. 1/3W arc second elevation data
-116.680000,34.270000,Big Bear Apt,1697.037720,NED Contiguous U. S. 1/3W arc second elevation data
-118.600000,37.600000,Bishop,2209.536865,NED Contiguous U. S. 1/3W arc second elevation data
-120.700000,39.280000,Blue Canyon,1603.887573,NED Contiguous U. S. 1/3W arc second elevation data
-114.720000,33.620000,Blythe,120.626007,NED Contiguous U. S. 1/3W arc second elevation data
-118.370000,34.200000,Burbank,224.287033,NED Contiguous U. S. 1/3W arc second elevation data
-117.350000,33.300000,Camp Pendlet,21.895588,NED Contiguous U. S. 1/3W arc second elevation data

Text processing with the UNIX toolbox

 
Working with DBF files

  • Extract the text contents of a DBF file with dbfxtrct, select out the 3rd and 2nd columns with awk, keep only unique rows with sort and uniq, and finally saving to a new file called 'ca_fips-name_list.txt' with shell redirection.
    # dbfxtrct utility
    dbfxtrct -ica_counties_aea83.dbf | awk -F"," '{print $3","$2}' | sort -g | uniq > ca_fips-name_list.txt
  • Other DBF tools include:

    • dbf2mysql
    • dbf2pg
    • dbfadd
    • dbfcreate
    • dbfdump
    • dbfutil1
    • dbfxtrct

Graphviz

http://www.graphviz.org/

 
Example DOT Code:

digraph soil {
        size="8,8";
        node [fontsize=12, color = lightblue2, style = filled];
       
        "solar radiation budget" [color = yellow];
        "Cs log(As)" [color = yellow];
        "landscape position" [color = yellow];
        "aspect" [color = yellow];
        "geology" [color = yellow];
        "VEG" [color = yellow];
        "landscape / landform age" [color = red];
       
        subgraph cluster0 {
                "surrounding terrain";
                "precip";
                "landscape position";
                "aspect";
                "local weather";
                "VEG";
                "solar radiation budget";
                "Cs log(As)";
                "soil moisture status";
                "xy water routing";
                "energy flux";
                "erosional / depositional surface";
        }

        subgraph cluster1 {
                "geology";
                "parent material";
                "soil texture";
                "infiltration rate";
                "leaching";
                "weathering rate";
                "landscape / landform age";
                "vertical soil development";
                "soil properties";
                "taxonomy";
        }
       
        "erosional / depositional surface" -> "weathering rate" ;
        "parent material" -> "landscape / landform age" -> "soil texture";
        "parent material" -> "weathering rate";
        "geology" -> "parent material" -> "soil texture";
       
        "landscape position" -> "xy water routing";
        "local weather" -> "solar radiation budget";
        "solar radiation budget" -> "soil moisture status" -> "VEG";
        "landscape position" -> "aspect" -> "solar radiation budget";
        "landscape position" -> "Cs log(As)" -> "erosional / depositional surface";
        "surrounding terrain" -> "solar radiation budget";
        "surrounding terrain" -> "landscape position";
        "surrounding terrain" -> "precip" -> "Cs log(As)";
        "Cs log(As)" -> "xy water routing";
        "xy water routing" -> "soil moisture status";
        "xy water routing" -> "energy flux" [style = dotted];
        "VEG" -> "energy flux";
        "soil moisture status" -> "energy flux" [style = dotted];
        "energy flux" -> "weathering rate";
        "soil texture" -> "infiltration rate";
        "soil texture" -> "weathering rate";
        "infiltration rate" -> "weathering rate";
        "infiltration rate" -> "leaching" -> "vertical soil development";

        "weathering rate" -> "vertical soil development";
        "landscape / landform age" -> "vertical soil development" [style = dotted];
        "vertical soil development" -> "soil properties";
        "vertical soil development" -> "taxonomy" [style = dotted];
        "soil properties" -> "taxonomy";
}

 
Example DOT Code:

# if you have a recent copy of graphviz:
# save directly to PDF
dot -Tpdf -o output.pdf input.dot

Graphviz Example OutputGraphviz Example Output

LaTeX Tips and Tricks

A collection of links to documents which I have found helpful in working with the LaTeX document preparation system. An internal link to all documents related to LaTeX can be accessed here.

 
Getting Started

 
CV Creation

 
Journal Submission

 
Collaborative Writing Ideas

Customizing BibTeX

 
Premise:

See attached files for a bibliography style (.bst) file that is compatible with the SSSAJ (Soil Science Society of America Journal) style guidelines.

 
General Notes:

  • when citing named chapters in a book, use the incollection entry type
  • when using makebst get the style as close as you can to the target, then tweak the resulting .bst file

 
Documentation:

http://www.andy-roberts.net/misc/latex/latextutorial3.html

 
A Custom Bib Style:

latex makebst

Latex to ODT conversion tips

 
Conversion of a .tex document to open document format (RTF)

latex2rtf file.tex

 
Conversion of a .tex document to open document format (ODT)

mk4ht oolatex file.tex

 
Some notes:

  1. disable hyperref package to prevent odd bug in citation listings
  2. input document should be latex-safe, i.e. eps figures
  3. need a working java installation
  4. commands like \onehalfspacing or begin{singlespace}...end{singlespace} may not work (setspace package)

 
Conversion of PDF, EPS, or Postscript figures to EMF (enhanced metafile format) for windows apps.

Note that these instructions are for unix-like systems. EMF support can be added to pstoedit via libEMF. The libEMF library contains several windows-specific coding practices, along with some sloppy use of include files. Instructions for compiling with a modern version of GCC (4.x) can be found here.

  1. Some related pages
    • http://www.physik.tu-cottbus.de/~george/unix/emf4unix.html
    • http://boshoff.za.net/linux/archives/2003_08.html
  2. download, hack, and compile the libEMF code (details)
  3. download, configure, and install pstoedit, note that you may have to manually define where the libEMF header and lib files are:
    ./configure --with-libemf-include=/usr/local/include/libEMF/ \
    --with-libemf-lib=/usr/local/lib/
  4. run ldconfig as root to refresh the library cache
  5. [Finally] Convert a PDF file into an OpenOffice-compatible EMF file (well sort of...)
    pstoedit -pta -f emf:"-OO -p -drawbb" file.pdf file.emf
  6. It seems like figures are clipped to smaller-than-expected bounding boxes unless the special flag -drawbb is used. One side-effect of this however is a box drawn around your figure. Also, the -pta flag is needed to ensure quality text placement. Even though the resulting EMF files may look odd on a Linux machine, they appear to work as expected in Windows.
  7. 

Making better looking tables with \ctable

 
Premise:
The standard Latex table environment can be difficult to extend, especially when one wants to use modern constructs such as table foot notes, etc. The ctable package is a convenient approach to solving this problem. A complete worked example of a table typset using the \ctable command is presented below. Note that \ctable is a command, and therefore does not allow blank newline characters. One way to maintain readability is to trick Latex by adding a comment character to all blank lines. Link to PDF manual for ctable.

Example of table produced with ctableExample of table produced with ctable

 
Code used to produce the example above. This example was produced with TexLive.

%
% start the table: note that we cannot have extra newline characters in the ctable defs
\ctable[
	cap     = {logistic regression parameters},
	% 
	caption = {Logistic regression model (M3) parameters. Coefficients, standard error, z-values, and p-values are included for each term used in the model. A separate slope and intercept term was fit to each geologic class.},
	% 
	label   = {aspect_effect:table:glm_model_properties},
	%
	%
]{lccc}{
	%
	\tnote[$\ast$]{Marginal p-values are used to determine whether each term is significantly different than 0.}
	%
}{ \FL
% 
Model Term & Value & Std. Error & Marginal p-values\tmark[$\ast$]      \ML
% 
Intercept & & &  \\  \cmidrule(r){1-1} 
% 
\;\;  andesite         &          1.146e+01   &   6.920e+00     &   0.09781  \\
\;\;  clastic\_volcanic   &        3.471e+00   &   2.234e+00    &       0.12024  \\
\;\;  coarse\_sedimentary  &       4.446e+00   &   2.390e+00    &    0.06285  \\
\;\;  felsic\_intrusive     &      5.384e+00   &   2.714e+00    &      0.04725  \\
\;\;  fine\_sedimentary   &        8.527e+00   &   4.013e+00      &   0.03361  \\
\;\;  rhyolite       &            1.075e+01   &   4.071e+00    &      0.00828 \\
\;\;  tuff              &         1.657e+01   &   9.116e+03    &      0.99855   \\
% 
% 
\\
Slope & & & \\   \cmidrule(r){1-1} 
% 
\;\;  andesite       &      -1.356e-03   &   8.448e-04     &  0.10858  \\ 
\;\;  clastic\_volcanic    &  -3.588e-04   &   2.851e-04     &   0.20825  \\
\;\;  coarse\_sedimentary  &  -8.165e-04   &   3.336e-04   &   0.01438  \\
\;\;  felsic\_intrusive    &   -6.352e-04    &  3.503e-04    &     0.06977  \\
\;\;  fine\_sedimentary    &   -1.122e-03   &   5.139e-04   &    0.02897  \\
\;\;  rhyolite       &       -1.361e-03  &    4.873e-04   &   0.00522  \\
\;\;  tuff         &         6.155e-14   &   1.200e+00   &   1.00000
% 
%  
\LL}

PostGIS: Spatially enabled Relational Database Sytem

PostGIS is a set of extensions to the relational database management system PostgreSQL, which provide access to spatial constructs, operators, and functions. In other words, PostGIS is a spatially-aware version of Postgresql. Paul Ramsey of Refractions Research has written up a nice summary of how were are using PostGIS.

 
Tuning Tips:

  1. Adjust kernel resources to allow larger memory allocation: (documentation)
  2. Increase the shared_buffers to a larger number (16000)
  3. See attached PDF at the bottom of this page for excellent ideas from Kevin Neufeld

 
PostGIS Syntax Examples:

  1. PostGIS Spatial SQL cheat-sheet
  2. Official Documentation
  3. PostGIS Wiki

 
PostGIS Presentations:

  1. Spatial Analysis with PostGIS presentation at PGCon2009 (Leo Hsu and Regina Obe)
  2. Tips for the PostGIS Power User (Kevin Neufeld, Refractions Research)
  3. PostGIS Spatial Database: Introduction and Case Studies (Paul Ramsey)
  4. Transitioning Low Earth Orbit Satellite Archive Data from Informix (Geodetic DataBlade) to PostgreSQL (PostGIS) (Churngwei Chu, NASA/SSAI)
  5. Intro to PostgreSQL/PostGIS and MapServer (Arnulf Christ)

Importing and Exporting

 

Tips

  • Attribute data stored with a shapefile (the .dbf file) can only use column names shorter than 14 characters. If you try and export a postGIS table that has column names longer than 14 characters, they will be truncated when converted to a shapefile.
  • If a column in postGIS is defined as numeric, ogr2ogr will not correctly interpret the values in this column: i.e. floating point values less than 1 will be truncated to 0. In order to avoid this problem always define your colums as double precision or float. For dynamically generated tables, cast affected columns to a defined numeric precision: select colum_a::numeric(7,3) where '7' refers to the total number of digits in the column, and '3' refers to the number of decimal places.

 

Geometry and attribute data to GIS file format

 
GDAL/OGR tools This approach allows simultaneous conversion of coordinate systems, but is less flexible with respect to generation of new tables in PostGIS.

  • Import
     
    ogr2ogr -f "PostgreSQL" PG:'dbname=ssurgo_combined user=xxxx password=xxxx host=postgis.server.edu' input_file.shp
  • Export
     
    ogr2ogr output_file.shp PG:'dbname=ssurgo_combined user=xxxx password=xxxx host=postgis.server.edu' tablename

     
    Note that tables must be correctly 'registered' in the geometry_columns table for this to work:
    INSERT INTO geometry_columns VALUES ('','public','tablename','wkb_geometry',2,SRID,'geomtype');

 
PostGIS Loader/Dumper This approach is the simplest, but does not allow on-the-fly conversion of coordinate systems.

  • Import
     
    shp2pgsql -s SRID -c -g wkb_geometry -I shapefile.shp schema.table | psql -U username -h host database

     
    Note that SRID is the PostGIS 'spatial ref. sys. id' (see the spatial_ref_sys table). See the manual page for shp2pgsql for a complete list of arguments and their meanings.

  • Export
     
    pgsql2shp -f shapefile.shp -h host -u username -P password -k -g wkb_geometry database schema.table

     
    See the manual page for pgsql2shp for a complete list of arguments and their meanings.

 
Where tablename is your newly created table, SRID is the SRID (spatial reference ID) for the geometry in this table, and geomtype is the type of geometry: POINT, LINE, POLYGON, etc.

 

Attribute data to text format

 
CSV format, from within the psql client
\copy tablename TO 'filename.csv' CSV HEADER

 
CSV format, via psql client
echo "select column_list from table_list " | psql --tuples --no-align  -F ","  database > file.csv

 
Tabular data to HTML format, via psql client See output below:
echo "select column_list from table_list " | psql --html database > file.html

 
HTML output from psql

area compname
132472.230854819 Hilmar variant
322819.967391312 Oneil
362729.418301135 Carranza
431948.171760353 Tuff rockland
448784.927049035 Gravel pits
500763.225267798 Snelling variant
518860.954990617 Foster
571640.132661382 Alamo
648973.748756059 Toomes
924327.631201791 Dumps

(10 rows)

Example Spatial SQL Operations on Point, Line and Polygon Geometry Types

  • Points
    • Example
    • Example
    • Example
  • Lines
    • Example
    • Example
    • Example
  • Polygons
    • Example
    • Example
    • Example
  • Mixed
    • Example
    • Example
    • Example

Affine Transformation Operations in PostGIS

 
Overview
The ST_Affine() function from PostGIS is useful for manipulating geometries, but requires the elements of a transformation matrix. This page documents progress on automating the computation of the transformation matrix by least-squares (Bruce Rindahl) via SQL. This would allow a pure PostGIS solution to computing and applying affine transformations to geometry data.

 
An open-source algorithm for computing the transformation matrix
Example code from GRASS (v.transform) was used as a template.

 
Approach
Compute transformation matrix based on a table of control points, stored as numbers.

 
Evaluation of results
Comparable to output from a similar analysis done in R, and the original algorithm as implemented in v.transform (GRASS).

Case Study: Fixing Bad TIGER Line data with R and PostGIS

Example of bad Tiger data in Stanislaus County: Red lines are the original road network, green lines are the corrected road network.Example of bad Tiger data in Stanislaus County: Red lines are the original road network, green lines are the corrected road network.

 
The Problem
The US Census does a nice job of collecting all sorts of geographic and demographic information every 10 years. This data is available free of charge in the rather complex and soon to be replaced TIGER/LINE format. While this data covers the entire US down to the local road level, there are numerous errors and even extreme cases of coordinate-shift. Here is an example from Stanislaus County, California. The original TIGER data (red lines) are offset several hundred meters from the imagery. While it is not clear what may have caused the problem, it can be fixed without much effort using an affine transformation. We do not have the transformation matrix, however it can be 'fit' to a set of control points by several methods. The general form of the affine transform can be conveniently represented in homogeneous coordinates as:


Affine Matrix Form: new coordinates on the left-hand side, old coordinates on the right-hand sideAffine Matrix Form in homogeneous coordinates:New coordinates on the left-hand side, old coordinates on the right-hand side. The transformation matrix is the 3x3 matrix in the center.

 
The Solution

  • transform using GRASS and a set of control points with v.transform. This is the natural choice if the data is already in GRASS.
  • Using the same control points, compute the transformation matrix in R, then apply in PostGIS with the Affine() function. This is the ideal approach for our scenario. Note that the format of the 'fitted' transformation matrix returned by R is in a slightly different format:


    Transformation Matrix from RTransformation Matrix from R

We first need a set of control points, good and bad coordinates. This can be accomplished in several ways, we used the d.where command in GRASS:

# load road subset:
v.in.ogr dsn=PG:'dbname=tiger2005se user=xxx password=xxx host=xxx' layer=stan_roads out=s
#
# pick up some control points: note that they will be interleaved with this approach
# odd line numbers are "bad" coordinates
# even line numbers are "good" coordinates
d.where > stan.c_points
#
# we can separate the "bad" and "good" points in R...

Computing the transformation matrix can be done with a simple regression between 'good' and 'bad' coordinates in R. Note that this approach was suggested by Prof. Brian Ripley on the R-help mailing list.

 
Compute the Affine Transformation Matrix in R

## read in control points
## the good and bad points are interleaved
## odd numbers are 'bad' points
## s <- read.table('stan.c_points')
&nbsp;
## s.bad <- s[seq(from=1, to=nrow(s), by=2), ]
## s.good <- s[seq(from=2, to=nrow(s), by=2), ]
##
## make a composite dataframe
## x,y = bad data
## nx,ny = good data
## c <- data.frame(x=s.bad$V1, y=s.bad$V2, nx=s.good$V1, ny=s.good$V2)
##
## save a local copy in the format that v.transform uses
## write.table(c, file='grass_control_pts.txt', row.names=F, col.names=F)
##
&nbsp;
## from now on just use the simplified format:
c <- read.table('grass_control_pts.txt')
names(c) <- c('x','y','nx','ny')

&nbsp;
## compute transformation matrix, and check model fit:
l <- lm(cbind(nx,ny) ~ x + y, data=c)
##
## see at bottom of page -->
summary(l)

&nbsp;
## convert to affine transform matrix to the form needed by PostGIS:
## transpose the regression coefficient matrix:
t(coef(l))
 (Intercept)           x          y
nx    5017.082  1.00231639 0.00918962
ny  -28138.395 -0.01352029 0.99740077


&nbsp;
## check graphically:
x <- seq(-2080000, -2040000, by=1000)
y <- seq(-20000, 5000, by=2000)
d <- expand.grid(x=x, y=y)
p <- predict(l, d)

&nbsp;
par(mfcol=c(1,2), pty='s')
plot(y ~ x, data=c, main='Control Points', cex=.4, pch=16)
points(ny ~ nx, data=c, col='red', cex=.4, pch=16)
## arrows(c$x, c$y, c$nx, c$ny, len=0.05, col='gray')

&nbsp;
plot(d, cex=0.2, main='Shifted Grid')
## arrows(d$x, d$y, p[,1], p[,2], len=0.05, col='gray')
points(p, col='red', cex=.2)

&nbsp;
## sample transformation
print(predict(l, data.frame(x=-2078417.35893968, y=-14810.57043808)), digits=10)
            nx           ny
 -2078350.804 -14809.67334

Establishing the transformation based on control points: Red points represent where the coordinates should be. Black points are the original and incorect coordinates.Establishing the transformation based on control points: Red points represent where the coordinates should be. Black points are the original and incorect coordinates.

 
Check Affine Transform Results in PostGIS

-- matrix format:

-- R:
-- | xoff a b |
-- | yoff d e |

-- postgis wants:
-- ST_Affine(geom, a, b, d, e, xoff, yoff)

-- as a query:
SELECT asText(
ST_Affine(
'POINT(-2078417.35893968 -14810.57043808)',
1.00231639, 0.00918962, -0.01352029, 0.99740077, 5017.082, -28138.395
)
) ;

                   astext                  
--------------------------------------------
 POINT(-2078350.80564006 -14809.6639251817)

-- This is the same as what R says!

 
Perform Affine Transformation in PostGIS

--
-- Step 1. Alter the geometry of the bad data in-place
-- note that we have a backup in 'stan_roads'
--
UPDATE roads SET wkb_geometry =  ST_Affine(wkb_geometry, 1.00231639, 0.00918962, -0.01352029, 0.99740077, 5017.082, -28138.395 ) WHERE module = 'TGR06099' ;

 
Regression Diagnostics

Response nx :

Call:
lm(formula = nx ~ x + y, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-207.088  -23.856    8.614   21.245  161.610 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 5.017e+03  1.369e+03    3.666  0.00079 ***
x           1.002e+00  6.654e-04 1506.386  < 2e-16 ***
y           9.190e-03  9.419e-04    9.756 1.20e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 52.25 on 36 degrees of freedom
Multiple R-Squared:     1,      Adjusted R-squared:     1 
F-statistic: 1.322e+06 on 2 and 36 DF,  p-value: < 2.2e-16 


Response ny :

Call:
lm(formula = ny ~ x + y, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.835 -18.459  -4.556  15.311  94.226 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.814e+04  7.409e+02  -37.98   <2e-16 ***
x           -1.352e-02  3.602e-04  -37.54   <2e-16 ***
y            9.974e-01  5.099e-04 1956.22   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 28.28 on 36 degrees of freedom
Multiple R-Squared:     1,      Adjusted R-squared:     1 
F-statistic: 2.187e+06 on 2 and 36 DF,  p-value: < 2.2e-16

Comparision with output from v.transform

 
First the output from R:
Looking at the residuals from the regression model used to map bad coordinates (x,y) to good coordinates (nx,ny):

## we made the linear model object 'l' above
## extract as dataframe
## residuals are computed for x and y separately
r <- as.data.frame(resid(l))

&nbsp;
## compute vector product of the (x,y) residuals:
c$resid <- sqrt(r$nx^2 + r$ny^2)
print(c)

          x          y       nx         ny      resid
1  -2078417 -14810.570 -2078314 -14838.378  46.617600
2  -2078743 -16057.955 -2078636 -16081.790  62.041274
3  -2077261 -16435.348 -2077170 -16463.156  40.905132
4  -2076709 -14405.369 -2076606 -14433.177  29.406399
5  -2074179 -15830.248 -2074084 -15901.558  33.111981
6  -2073850 -15707.435 -2073763 -15798.554  37.362736
7  -2073450 -13873.171 -2073359 -13920.712  21.623235
8  -2072359 -15204.613 -2072276 -15323.138  38.997678
9  -2072545 -14402.596 -2072450 -14513.219  32.918889
10 -2072189 -16022.434 -2072098 -16129.106  33.834074
11 -2071991 -16856.058 -2071928 -16942.976   6.277554
12 -2068407 -12999.396 -2068296 -13133.170   6.579285
13 -2069870 -12613.813 -2069764 -12731.848   2.357631
14 -2067635 -13188.253 -2067517 -13337.765  11.604519
15 -2066931 -13377.110 -2066809 -13518.753  22.719625
16 -2067411 -15084.692 -2067313 -15190.924  41.907273
17 -2066795 -18714.093 -2066741 -18846.019  14.541358
18 -2066384 -17080.538 -2066299 -17212.464  26.717495
19 -2068634 -19742.339 -2068580 -19835.464  27.483654
20 -2053326 -16930.710 -2053276 -17226.351  65.746074
21 -2051797 -17321.500 -2051899 -17579.762 227.516944
22 -2068307   2826.921 -2068066   2638.276  12.587853
23 -2067543   2648.205 -2067328   2449.631  37.729747
24 -2067126   4276.510 -2066904   4081.246  46.774630
25 -2066748   4170.604 -2066527   4001.816  59.509843
26 -2066068   2292.295 -2065860   2094.699  46.681553
27 -2065337   2107.872 -2065126   1900.397  43.386956
28 -2064606   1913.570 -2064378   1692.922  26.460389
29 -2064199   3558.561 -2063961   3356.401  47.742696
30 -2037464   6512.455 -2037076   5864.398  50.762994
31 -2036722   6825.682 -2036338   6199.227  22.699467
32 -2036876   6366.642 -2036498   5742.888  22.120176
33 -2040225   7150.180 -2039706   6575.029 161.631199
34 -2041064   7144.779 -2040732   6569.629  26.657582
35 -2044702 -15548.033 -2044564 -16024.903  23.844817
36 -2043992 -15723.521 -2043824 -16223.282  48.063840
37 -2043790 -14907.119 -2043611 -15383.990  34.844851
38 -2040616 -14820.445 -2040453 -15349.974  21.196233
39 -2039595 -15081.427 -2039485 -15588.263  47.263287

 
The Root-Mean-Square-Error (RMSE) for the fitted transform (in meters) is:

apply( (fitted(l) - c[, c('nx', 'ny')])^2  , 2, function(x) sqrt(mean(x)) )
      nx       ny
50.19954 27.17333

&nbsp;
#converting to the vector sum of RMSE (as reported in v.transform):
sqrt(50.19954^2 + 27.17333^2)
57.08225

 
The output from v.transform on the same set of control points:

Transformation Matrix
| xoff a b |
| yoff d e |
-------------------------------------------
5301.399323 1.002469 0.009172 
-28155.882288 -0.013530 0.997547 
-------------------------------------------

 
full output including the residuals:

               CHECK MAP RESIDUALS
                Current Map                  New Map
 POINT      X coord    Y coord  |        X coord   Y coord    |      residuals

  1.   -2078417.36    -14810.57 |  -2078314.07      -14838.38 |        46.81
  2.   -2078743.11    -16057.95 |  -2078635.85      -16081.79 |        62.22
  3.   -2077261.34    -16435.35 |  -2077169.97      -16463.16 |        41.05
  4.   -2076709.16    -14405.37 |  -2076605.87      -14433.18 |        29.59
  5.   -2074178.76    -15830.25 |  -2074083.67      -15901.56 |        33.21
  6.   -2073849.93    -15707.44 |  -2073762.78      -15798.55 |        37.42
  7.   -2073449.80    -13873.17 |  -2073358.68      -13920.71 |        21.62
  8.   -2072358.86    -15204.61 |  -2072275.89      -15323.14 |        39.02
  9.   -2072544.55    -14402.60 |  -2072449.73      -14513.22 |        32.97
 10.   -2072188.97    -16022.43 |  -2072098.11      -16129.11 |        33.87
 11.   -2071991.43    -16856.06 |  -2071928.22      -16942.98 |         6.27
 12.   -2068406.55    -12999.40 |  -2068296.38      -13133.17 |         6.60
 13.   -2069870.19    -12613.81 |  -2069763.96      -12731.85 |         2.33
 14.   -2067635.38    -13188.25 |  -2067517.35      -13337.76 |        11.63
 15.   -2066931.10    -13377.11 |  -2066809.13      -13518.75 |        22.74
 16.   -2067411.11    -15084.69 |  -2067312.75      -15190.92 |        41.93
 17.   -2066795.16    -18714.09 |  -2066740.84      -18846.02 |        14.64
 18.   -2066383.87    -17080.54 |  -2066298.50      -17212.46 |        26.74
 19.   -2068634.37    -19742.34 |  -2068580.05      -19835.46 |        27.53
 20.   -2053326.48    -16930.71 |  -2053275.51      -17226.35 |        66.09
 21.   -2051797.30    -17321.50 |  -2051899.25      -17579.76 |       227.91
 22.   -2068307.24      2826.92 |  -2068065.64        2638.28 |        12.41
 23.   -2067542.73      2648.21 |  -2067327.61        2449.63 |        37.44
 24.   -2067125.72      4276.51 |  -2066903.98        4081.25 |        46.40
 25.   -2066748.43      4170.60 |  -2066526.69        4001.82 |        59.12
 26.   -2066067.79      2292.29 |  -2065860.31        2094.70 |        46.35
 27.   -2065336.69      2107.87 |  -2065125.92        1900.40 |        43.07
 28.   -2064605.58      1913.57 |  -2064378.35        1692.92 |        26.16
 29.   -2064199.15      3558.56 |  -2063961.13        3356.40 |        47.43
 30.   -2037464.39      6512.45 |  -2037075.56        5864.40 |        50.66
 31.   -2036721.82      6825.68 |  -2036338.39        6199.23 |        22.54
 32.   -2036875.74      6366.64 |  -2036497.71        5742.89 |        21.95
 33.   -2040224.67      7150.18 |  -2039706.23        6575.03 |       161.54
 34.   -2041064.45      7144.78 |  -2040732.32        6569.63 |        26.74
 35.   -2044701.68    -15548.03 |  -2044564.34      -16024.90 |        23.64
 36.   -2043992.10    -15723.52 |  -2043824.24      -16223.28 |        47.60
 37.   -2043789.90    -14907.12 |  -2043610.60      -15383.99 |        34.35
 38.   -2040615.94    -14820.44 |  -2040453.30      -15349.97 |        20.77
 39.   -2039594.70    -15081.43 |  -2039485.02      -15588.26 |        47.85


  Number of points: 39
  Residual mean average: 57.082951

Higher Order Transformations

 
... Continuing from the initial example session in R ...

An affine transformation is based on a linear mapping between two coordinate-spaces. Testing for non-linearity (i.e. higher order model terms) can be a useful diagnostic in choosing the simpler affine transform.

 
Compute the difference between the good and bad coordinates

## compute vector difference and save back to original data frame:
c$sqdiff <- with(c, sqrt((nx-x)^2 + (ny-y)^2))

 
Generate two linear models:

## generate simple model
l.simple <- lm(sqdiff ~ nx + ny, data=c)
## model with higher order terms (up to 3rd-order):
## note to self: poly() computes orthagonal polynomials
l.poly <- lm(sqdiff ~ poly(nx, 3) + poly(ny, 3), data=c)

&nbsp;
## summarize the complex model:
summary(l.poly)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    302.38       3.97  76.174  < 2e-16 ***

poly(nx, 3)1  1091.17      35.82  30.466  < 2e-16 ***
poly(nx, 3)2   165.59      32.78   5.051 1.71e-05 ***
poly(nx, 3)3   -51.21      28.74  -1.782   0.0843 .  

poly(ny, 3)1   417.35      31.97  13.056 2.30e-14 ***
poly(ny, 3)2   -18.50      30.04  -0.616   0.5425    
poly(ny, 3)3    19.41      35.49   0.547   0.5882    

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

 
Is one model significantly better than the other?

## check via RMSE:
## simple model
sqrt( mean((fitted(l.simple) - c$sqdiff)^2) )
36.54747

&nbsp;
## model with 3rd-order polynomial
sqrt( mean((fitted(l.poly) - c$sqdiff)^2) )
22.45530

&nbsp;
## the anova function can compare nested models:
anova(l.simple, l.poly)

Analysis of Variance Table

Model 1: sqdiff ~ nx + ny
Model 2: sqdiff ~ poly(nx, 3) + poly(ny, 3)
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1     36 52093                                  
2     32 19665  4     32428 13.192 1.865e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 
Check visually:

## re-make grid of points to predict on:
## and save two versions to 'stack-up'
x <- seq(-2080000, -2040000, by=1000)
y <- seq(-20000, 5000, by=2000)
d.simple <- expand.grid(x=x, y=y)
d.poly <- expand.grid(x=x, y=y)

&nbsp;
## and predict the trend in differences:
d.simple$diff <- predict(l.simple, data.frame(nx=d$x, ny=d$y))
d.poly$diff <- predict(l.poly, data.frame(nx=d$x, ny=d$y))

&nbsp;
## stack for grouped plotting
library(lattice)
library(RColorBrewer)
d.combined <- make.groups(simple=d.simple, poly=d.poly)

&nbsp;
## plot:
wireframe(diff ~ x * y | which, data=d.combined, scales=list(arrows=TRUE), drape=TRUE, col.regions=brewer.pal(n=9, name='Oranges'), cuts=8, col='white')

Testing for linearity: Two visualizations of the deviance between coordinates positions, at the control point locations.Testing for linearity: Two visualizations of the deviance between coordinates positions, at the control point locations.

 
Conclusions:
It seems that a second order term was only warranted along the x-direction. The more complex model based on 3rd-order polynomials results in a significantly lower RMSE (about 10 meters lower), and is shown to be a better descriptor of variance in the test of nested models.

At the map scale in which the corrected data will be presented, the extra accuracy suggested by the more complex model (coordinate transformation function) is negligible. This allows for the simpler model, which can be directly used by the convenient ST_Affine() function in PostGIS for the heavy-lifting.

Transformation parameters query

The following is a proof of concept query showing how a PostgreSQL query could give the transformation parameters required for an affine transformation. The source of the procedure is from GRASS at transform.c.

Note that the v.transform code expects the input file (control points) to be in the form ax ay bx by, where 'a' is the starting coordinate system (the bad coordinates in the previous example) and 'b' is the target system (the good coordinates). The

 
SQL version is as follows:

 
Sample Session

  1. CREATE TABLE link (
        gid integer NOT NULL,
        a_x double precision,
        a_y double precision,
        b_x double precision,
        b_y double precision
    );
    -- load data into table
    -- see attached file at bottom of page
    -- possibly use the file 'grass_control_pts.csv' ???
    COPY link FROM 'control_points.txt' ;
  2. CREATE TEMPORARY TABLE cc AS
    SELECT
    (SELECT count(b_x) FROM link) AS cc00,
    (SELECT sum(b_x) FROM link) AS cc01,
    (SELECT sum(b_y) FROM link) AS cc02,
    sum(l1.b_x * l2.b_x) AS cc11,
    sum(l1.b_x * l2.b_y) AS cc12,
    sum(l1.b_y * l2.b_y) AS cc22
     FROM link l1 , link l2 WHERE l1.gid = l2.gid;
  3. CREATE TEMPORARY TABLE aa AS
    SELECT
    (SELECT sum(a_y) FROM link) AS aa0,
    sum(l1.a_y * l2.b_x) AS aa1,
    sum(l1.a_y * l2.b_y) AS aa2
     FROM link l1 , link l2 WHERE l1.gid = l2.gid;
  4. CREATE TEMPORARY TABLE bb AS
    SELECT
    (SELECT sum(a_x) FROM link) AS bb0,
    sum(l1.a_x * l2.b_x) AS bb1,
    sum(l1.a_x * l2.b_y) AS bb2
     FROM link l1 , link l2 WHERE l1.gid = l2.gid;
  5. CREATE TEMPORARY TABLE det AS
    SELECT
    cc00*cc11*cc22 +
    cc01*cc12*cc02 +
    cc02*cc01*cc12 -
    cc00*cc12*cc12 -
    cc01*cc01*cc22 -
    cc02*cc11*cc02 AS det FROM cc;
  6. CREATE TEMPORARY TABLE inv_cc AS
    SELECT
    (cc11*cc22-cc12*cc12)/det AS cc00,
    (cc12*cc02-cc01*cc22)/det AS cc01,
    (cc01*cc12-cc11*cc02)/det AS cc02,
    (cc00*cc22-cc02*cc02)/det AS cc11,
    (cc01*cc02-cc00*cc12)/det AS cc12,
    (cc00*cc11-cc01*cc01)/det AS cc22
    FROM cc,det;
  7. SELECT
    (aa0*cc00+aa1*cc01+aa2*cc02) AS b0,
    (aa0*cc01+aa1*cc11+aa2*cc12) AS b1,
    (aa0*cc02+aa1*cc12+aa2*cc22) AS b2,

    (bb0*cc00+bb1*cc01+bb2*cc02) AS b3,
    (bb0*cc01+bb1*cc11+bb2*cc12) AS b4,
    (bb0*cc02+bb1*cc12+bb2*cc22) AS b5
    FROM aa,bb,inv_cc;

This query requires a table called link with the following fields - gid (primary key), a_x, a_y , b_x, b_y. The 'a' values are the 'from' coordinates and the 'b' values are the 'to' coordinates. Using the attached control points the result of this query is:

        b0        |         b1          |        b2         |        b3        |        b4        |         b5          
------------------+---------------------+-------------------+------------------+------------------+---------------------
 -28138.394850347 | -0.0135202854235867 | 0.997400773420259 | 5017.08164289594 | 1.00231638907948 | 0.00918961946271679

 
These are the exact results from 'R'

User defined Function in PL/pgSQL to compute the transformation parameters

Based on the results of the proof of concept example developed previously, a single function was developed in the procedural language for the PostgreSQL database system called PL/pgSQL. The only input parameter to this procedure is a text string that results in a table in the following format:

gid (integer), from_x (float), from_y (float), to_x (float), to_y (float)

Note the table must have the above fields but they can be in any order and can have additional fields. The gid field must be unique for each record.

The use of a SQL query found to be the simplest way to avoid the difficulties in programing the procedure without the need for temporary tables. An added benefit is the control point data can be in almost any format as long as it can be arranged in the format specified above. For example if the "from" points are in a geometry column (the_geom) in a table called from_pts and the corresponding "to" points are in a similar table called to_pts with a common attribute called "link_id", then the query would be:

SELECT from_pts.link_id AS gid, x(from_pts.the_geom) AS from_x, y(from_pts.the_geom) AS from_y, x(to_pts.the_geom) AS to_x, y(to_pts.the_geom) AS to_y FROM from_pts, to_pts WHERE from_pts.link_id = to_pts.link_id

 
Other table layouts and queries are possible depending on the manner in which the control points are collected.

 
The following is the SQL code to add a new procedure called trans_param() into a PostGIS database:

-- Function: trans_param(IN sql text, OUT a double precision, OUT b double precision, OUT d double precision, OUT e double precision, OUT xoff double precision, OUT yoff double precision)

-- DROP FUNCTION trans_param(IN sql text, OUT a double precision, OUT b double precision, OUT d double precision, OUT e double precision, OUT xoff double precision, OUT yoff double precision);

CREATE OR REPLACE FUNCTION trans_param(IN sql text, OUT a double precision, OUT b double precision, OUT d double precision, OUT e double precision, OUT xoff double precision, OUT yoff double precision) AS
$BODY$
DECLARE
        cc_row record;

        cc_det float;

        inv_cc00 float;
        inv_cc01 float;
        inv_cc02 float;
        inv_cc11 float;
        inv_cc12 float;
        inv_cc22 float;

BEGIN
        EXECUTE 'SELECT
        count(a.to_x) as cc00,
        sum(a.to_x) as cc01,
        sum(a.to_y) as cc02,
        sum(a.to_x * b.to_x) as cc11,
        sum(a.to_x * b.to_y) as cc12,
        sum(a.to_y * b.to_y) as cc22,
        sum(a.from_y) as aa0,
        sum(a.from_y * b.to_x) as aa1,
        sum(a.from_y * b.to_y) as aa2,
        sum(a.from_x) as bb0,
        sum(a.from_x * b.to_x) as bb1,
        sum(a.from_x * b.to_y) as bb2
        from ('
|| sql || ') a,(' || sql|| ') b WHERE a.gid = b.gid' INTO cc_row;


        SELECT INTO cc_det
                cc_row.cc00*cc_row.cc11*cc_row.cc22 +
                cc_row.cc01*cc_row.cc12*cc_row.cc02 +
                cc_row.cc02*cc_row.cc01*cc_row.cc12 -
                cc_row.cc00*cc_row.cc12*cc_row.cc12 -
                cc_row.cc01*cc_row.cc01*cc_row.cc22 -
                cc_row.cc02*cc_row.cc11*cc_row.cc02;

        SELECT INTO inv_cc00 (cc_row.cc11*cc_row.cc22-cc_row.cc12*cc_row.cc12)/cc_det;
        SELECT INTO inv_cc01 (cc_row.cc12*cc_row.cc02-cc_row.cc01*cc_row.cc22)/cc_det;
        SELECT INTO inv_cc02 (cc_row.cc01*cc_row.cc12-cc_row.cc11*cc_row.cc02)/cc_det;
        SELECT INTO inv_cc11 (cc_row.cc00*cc_row.cc22-cc_row.cc02*cc_row.cc02)/cc_det;
        SELECT INTO inv_cc12 (cc_row.cc01*cc_row.cc02-cc_row.cc00*cc_row.cc12)/cc_det;
        SELECT INTO inv_cc22 (cc_row.cc00*cc_row.cc11-cc_row.cc01*cc_row.cc01)/cc_det;

        SELECT INTO a cc_row.bb0*inv_cc01+cc_row.bb1*inv_cc11+cc_row.bb2*inv_cc12;
        SELECT INTO b cc_row.bb0*inv_cc02+cc_row.bb1*inv_cc12+cc_row.bb2*inv_cc22;
        SELECT INTO d cc_row.aa0*inv_cc01+cc_row.aa1*inv_cc11+cc_row.aa2*inv_cc12;
        SELECT INTO e cc_row.aa0*inv_cc02+cc_row.aa1*inv_cc12+cc_row.aa2*inv_cc22;
        SELECT INTO xoff cc_row.bb0*inv_cc00+cc_row.bb1*inv_cc01+cc_row.bb2*inv_cc02;
        SELECT INTO yoff cc_row.aa0*inv_cc00+cc_row.aa1*inv_cc01+cc_row.aa2*inv_cc02;

END;

$BODY$
  LANGUAGE 'plpgsql' VOLATILE;
ALTER FUNCTION trans_param(IN sql text, OUT a double precision, OUT b double precision, OUT d double precision, OUT e double precision, OUT xoff double precision, OUT yoff double precision) OWNER TO postgres;

 
Currently there is no error checking in the code if the determinant is zero.

 
To use the procedure simply use: SELECT trans_param('my SQL text')

 
The data from this series of articles is stored in a table called link. The id of the points is gid and the "from" values are b_x and b_y. The "to" values are a_x and a_y. Thus the query is:

SELECT gid,b_x AS from_x, b_y AS from_y, a_x AS to_x, a_y AS to_y FROM link

 
The result of the procedure is:

SELECT * FROM trans_param(
'SELECT gid,b_x as from_x, b_y as from_y, a_x as to_x, a_y as to_y from link')

0.997546509279282;-0.00917177514909895;0.0135300872142122;1.00246938174737;-5301.39933295548;28155.8822879205

 
Which matches the results returned from GRASS and R. Additional queries will be developed to give a table of residuals, the RMS error and the actual transformation of geometry.

Control Points Table Format

This is a start for discussions to create a series of function to perform an affine transformation of a PostGIS data set using a table of control points.

The first step is what is the format or layout of the control points?

I don't think point geometries is necessarily a good idea. The points must have an exact 1 to 1 relationship. To assure this you either have to maintain absolute integrity on the keys between two tables or have two geometries in one table. Both would be a hassle. An interleaved table format also would give me problems because the queries get really difficult and what if one version used "good" points in the odd rows and the next one put them in the even ones? In addition, if you use a seguential id and add one row "good" then add a "bad" one, then realize the bad one is really bad and delete it then the gid's will not be odd/even anymore in relation to "good" vs "bad". I think ESRI has the right idea here (God did I actually say that?). Look at the link table interface when you are georeferencing an image. It gives the id, XSource, YScource, Xmap, YMap, and the residual error. This could easily be done with the code - just one more query.

Given a table in this format a query could be done to give the RMS error. Also the table could be returned with the residual errors. When the user is happy then the table and the geometry could be input to transform the geometry.

Bruce

Analysis of SSURGO Data in PostGIS: An Overview

 
News
It looks like the NCSS has constructed a web-based, SQL interface to their main database. This new tool was recently highlighted in issue 40 of the NCSS Newsletter and looks like a promising new tool with good documentation. A listing of "web services" offered by the NCSS is presented on this page.

 
Overview
The analysis of SSURGO data is complicated by a series of hierarchical, one-to-many tabular relationships between spatial and attribute data. Understanding the structure of the SSURGO database is critical for correct interpretation and aggregation of soil properties. Before undertaking any SSURGO-based analysis please take some time to become familliar with the details on the product, including intended uses, minimum mapping units sizes, and other important details. In addition, becoming familliar with the SSURGO metadata is a must. The metadata page includes detailed descriptions of table structure, column names and units, as well as important information on the sources of much of the tabular data included in a SSURGO database.

SSURGO data can be downloaded by survey area from The Soil Datamart, with spatial data delivered in shapefile format and attribute data delivered as plain text. Unfortunately, an M.S. Access database template is required to utilize SSURGO attribute data as delivered from the Soil Datamart. For assistance with this procedure please see the NRCS SSURGO page. Using this approach, most analysis of SSURGO must be done survey-by-survey within a GIS environment. For general instructions see this document.

Several online applications allow for simple interaction with the SSURGO database without the need for a GIS or RDBMS knowledge. Some examples include:

 
An Open Source Approach to SSURGO
We have developed an alternate approach to working with SSURGO data using PostGIS, a spatially-enabled version of the popular and open source relational database management system Postgresql, to store all spatial and tabular data for 138 survey areas. This approach facillitates rapid access, analysis, and aggregation of over half a million soil polygons. SQL (structured query language) is used to directly interact with both soil spatial and attribute data. If other forms of spatial data are imported into PostGIS (such as landcover, climatic data, etc.) nearly all spatial and attribute analysis can be done entirely from PostGIS. A series of examples illustrating common tasks will be presented in the following pages.

 
General Approach to Working with SSURGO (also outlined in this document) (more ideas on map unit composition)

  1. Identify the soil properties that are to be included in some analysis
  2. Decide on the appropriate form of aggregation to be used to summarize horizons
    • depth weighted (i.e. to calculate average percent clay through horizons)
    • top 1m (i.e. to summarize shrink-swell capacity of the topsoil)
    • top horizon (i.e. to summarize surface organic carbon)
    • profile sum (i.e. to calculate total water holding capacity)
    • most limiting (i.e. to calculate the most limiting hydraulic conductivity within the soil profile)
  3. Aggregate horizon data: after filtering NULL values and weights
  4. Decide on the appropriate form of aggregation to be used to summarize components
    • component percent weighted (i.e. this will include information from each component, weighted by their estimated percentage of the entire map unit)
    • largest component (i.e. this usually result in the selection of the 'dominant' component, however when there are multuple components with the same estimated percentage ties will occur)
    • major component flag (i.e. components flagged as a 'major component' by the NRCS represent dominant soil types within a map unit. note that there are sometimes multiple components flagged as 'major components'.)
    • dominant condition (i.e. this is usually used for categorical data like hydric conditions. aggregation is performed by selecting the most frequent condition within a map unit)
  5. Aggregation of component data: after filtering NULL values and weights
  6. join the above aggregated data (2x aggregation process for horizon data) to the map unit polygons
  7. See the diagram at the bottom of this page for a graphical summary

SSURGO discontinuity example: Boundary between Glenn and Colusa counties, illustrating differences in soil survey vintage.SSURGO discontinuity example: Boundary between Glenn and Colusa counties, illustrating differences in soil survey vintage.

 
Notes on the Format of SSURGO

  • Adjacent surveys may have been composed by different individuals, and may be of widely different vintages. Any given survey must comply with basic standards, but older surveys reflect a more generalized approach than more modern surveys. The figure to the right illustrates such differences.
  • Polygons represent a repeating pattern of legend entries: groups of map-able soil concepts called map units
  • Map unit data is stored in the mapunit table, and is referenced by the field mukey
  • Pre-aggregated map unit data is stored in the muaggatt table, and is referenced by the field mukey
  • Map units are comprised of multiple, unmapped soil types called 'components'
  • Component data is stored in the component table, and is referenced by the field cokey
  • Soil components (or soil type) are associated with multiple horizons
  • Horizon data is stored in the chorizon table, and is referenced by the field cokey
  • Since there is a 1:many:many (mapunit:component:horizon) relationship between spatial and horizon-level soil property data two aggregation steps are required in order to produce a thematic map
  • This diagram illustrates the hierarchy of scales involved in soil survey information.

SSURGO Database Structure DiagramSSURGO Database Structure Diagram

Logistics: Getting Connected and Executing Queries

 
Building and Saving SQL code snippets
A simple text editor is the best environment for working on (and saving!) your SQL queries. Linux users are encouraged to use either 'Kate' or 'Kwrite'. Windows users should look into Notetab Light or SciTE. Mac users should check out TextWrangler.

 
Connecting to the database
In general, the simplest way to interact with our composite soil survey database is by connecting to the best with an SSH client. Mac/Linux users have this functionality built-in. Windows users will need to use something like Putty, or Xming. Once an ssh connection with the beast has been setup, you can connect to the database with the following:

psql -U soil ssurgo_combined

 
where psql is the postgresql client program, -U soil means connect as the user called "soil", and ssurgo_combined is the name of the actual database.

 
You can quit from the database shell using \q (followed by enter). Typing \? (followed by enter) will give a list of commands available in the psql shell.

 
Query structure and SQL
Numerous resources exisit for learning about SQL. See the attached PDF presentation at the bottom of this page for some SSURGO-related examples. I would recommend looking at the first couple chapters from the PostgreSQL book by Douglas and Douglas (the big purple book on the shelf). For an interactive learning approach SQL Zoo seems like a good start. In general most queries will have the format:

SELECT column_x, column_y, column_z
FROM table_x
WHERE column_x = 'something'
-- optional
GROUP BY column_x
ORDER BY column_x ; -- semi-colon denotes end of SQL statement

 
A simple query: column selection and filtering
Once connected it is possible to issue queries to the database. The results of a query are normally returned to the screen. For example, asking for the horizon boundaries and horizon names from the componenent identified with the given cokey (467038:646635) might look like this:

-- the query
SELECT cokey, hzname, hzdept_r, hzdepb_r
FROM chorizon
WHERE cokey = '467038:646635' ;

-- the results
     cokey     | hzname | hzdept_r | hzdepb_r
---------------+--------+----------+----------
 467038:646635 | Ap     |        0 |       18
 467038:646635 | Bw     |       18 |       41
 467038:646635 | Bk1    |       41 |       69
 467038:646635 | Bk2    |       69 |      109
 467038:646635 | Bk3    |      109 |      145
 467038:646635 | Bk4    |      145 |      183

 
A more complicated query: column calculation and aggregation
Compute the total water holding capacity for a given component, identified by (467038:646635). Note that several operations are being performed on the data:

  1. compute the available water holding capacity (in centemeters) for each horizon [green text]
  2. compute the profile depth (in centemeters) for each horizon [green text]
  3. aggregate the profile data by summing AHC and depth of each horizon [red text]
-- the query
select cokey, sum( (hzdepb_r - hzdept_r) * awc_r) as component_whc, sum( (hzdepb_r - hzdept_r)  ) as depth 
from chorizon 
where cokey = '467038:646635'
group by cokey ;

-- the results
     cokey     | component_whc | depth 
---------------+---------------+-------
 467038:646635 |         27.91 |   183

 
Note that it is also possible to save the results of a query into a new table. This is the usual strategy when performing any analysis that returns geometry (soil polygons, etc.) as geometry data cannot be visualized in a text display. Tables created in this manner can be exported from the database for map creation or further analysis. Spatial tables can be viewed directly by applications like QGIS or Mapserver. A simple method of accessing PostGIS data is not yet possible with ArcGIS.

Checking Type Locations

 
Just Checking

-- NAD27 to NAD83 
echo 119d7\'4\"W 36d23\'13\"N | cs2cs +proj=latlong +datum=NAD27 +to +proj=latlong +datum=NAD83 -f "%.6f"
-119.118718     36.386894 0.000037

 

-- the relevent SRIDs

 srid |                     proj4text                      
------+----------------------------------------------------
 4267 | +proj=longlat +ellps=clrk66 +datum=NAD27 +no_defs
 4269 | +proj=longlat +ellps=GRS80 +datum=NAD83 +no_defs
 4326 | +proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs


-- NAD27 type location point
\SET pt Transform(SetSRID(ST_MakePoint(-119.1177777, 36.38694444), 4267), 9001)

-- find intersecting map unit, and the NAD83 coords
SELECT AsText(Transform(:pt, 4269)) AS NAD83_pt, mapunit_poly.mukey, muname
FROM mapunit_poly
JOIN mapunit
ON mapunit_poly.mukey = mapunit.mukey
AND st_intersects(wkb_geometry, :pt) ;

                 nad83_pt                  | mukey  |                         muname                        
-------------------------------------------+--------+--------------------------------------------------------
 POINT(-119.118718055326 36.3868941884307) | 467194 | Grangeville sandy loam, drained, 0 TO 2 percent slopes

 
 
-- 250 m buffer around NAD27 type location coords
\SET box st_buffer(Transform(SetSRID(ST_MakePoint(-119.1177777, 36.38694444), 4267), 9001), 250)
 
-- select overlapping map units and their associated percent overlap
SELECT mapunit_poly.mukey, muname,
round((ST_Area(ST_Intersection(wkb_geometry, :box)) / ST_Area(:box)) * 100) AS area_pct
FROM mapunit_poly
JOIN mapunit
ON mapunit_poly.mukey = mapunit.mukey
AND st_intersects(wkb_geometry, :box)
ORDER BY area_pct DESC ;


 mukey  |                         muname                         | area_pct
--------+--------------------------------------------------------+----------
 467194 | Grangeville sandy loam, drained, 0 TO 2 percent slopes |       70
 463596 | Grangeville silt loam, drained                         |       12
 467206 | Riverwash                                              |       11
 467210 | Tujunga loamy sand, 0 TO 2 percent slopes              |        7

Getting Parent Material Data out of SSURGO

 
Parent material data is stored within the copm and copmgrp tables. The copm table can be linked to the copmgrp table via the 'copmgrpkey' field, and the copmgrp table can be linked to the component table via the 'cokey' field. The following queries illustrate these table relationships, and show one possible strategy for extracting the parent material information associated with the largest component of each map unit.

 
Several of the example queries are based on this map unit:

 
Query copmgrp table

SELECT *
FROM copmgrp
WHERE cokey = '461573:631329' ;

 areasymbol | pmgroupname | rvindicator |     cokey     |  copmgrpkey  
------------+-------------+-------------+---------------+--------------
 ca011      | alluvium    | Yes         | 461573:631329 | 461573:78270

 
Query copm table

SELECT *
FROM copm
WHERE copmgrpkey = '461573:78270' ;

 areasymbol | pmorder | pmmodifer | pmgenmod |  pmkind  | pmorigin |  copmgrpkey  |    copmkey    
------------+---------+-----------+----------+----------+----------+--------------+---------------
 ca011      |         |           |          | Alluvium |          | 461573:78270 | 461573:101001

 
Join parent material tables

SELECT cokey, pmgroupname, rvindicator, pmkind
FROM copmgrp
LEFT JOIN copm USING (copmgrpkey)
WHERE cokey = '461573:631329' ;

     cokey     | pmgroupname | rvindicator |  pmkind  
---------------+-------------+-------------+----------
 461573:631329 | alluvium    | Yes         | Alluvium

 
Join component table to parent material tables

SELECT mukey, cokey, comppct_r, taxorder, taxsuborder, taxgrtgroup, taxsubgrp, compname, pm.pmgroupname, pm.rvindicator, pm.pmkind
FROM
component
LEFT JOIN
        (
        -- get parent material data for each cokey
        SELECT cokey, pmorder, pmmodifer, pmgenmod, pmgroupname, rvindicator, pmkind
        FROM copmgrp
        LEFT JOIN copm USING (copmgrpkey)
        ) AS pm
USING (cokey)
WHERE mukey = '461573' ;

 mukey  |     cokey     | comppct_r | taxorder  | taxsuborder | taxgrtgroup  |      taxsubgrp      | compname | pmgroupname | rvindicator |  pmkind  
--------+---------------+-----------+-----------+-------------+--------------+---------------------+----------+-------------+-------------+----------
 461573 | 461573:631329 |        90 | Alfisols  | Xeralfs     | Palexeralfs  | Typic Palexeralfs   | Hillgate | alluvium    | Yes         | Alluvium
 461573 | 461573:631330 |         5 | Alfisols  | Xeralfs     | Haploxeralfs | Typic Haploxeralfs  | Arbuckle | alluvium    | Yes         | Alluvium
 461573 | 461573:631331 |         2 | Entisols  | Fluvents    | Xerofluvents | Mollic Xerofluvents | Arand    | alluvium    | Yes         | Alluvium
 461573 | 461573:631332 |         1 |           |             |              |                     | Unnamed  | alluvium    | Yes         | Alluvium
 461573 | 461573:631333 |         2 | Mollisols | Xerolls     | Haploxerolls | Pachic Haploxerolls | Westfan  | alluvium    | Yes         | Alluvium

 
Extract component data and corresponding parent material data for the largest component of each map unit

CREATE TEMP TABLE temp_component_data AS
SELECT DISTINCT ON (mukey) mukey, cokey, comppct_r, taxorder, taxsuborder, taxgrtgroup, taxsubgrp, compname, pm.pmgroupname, pm.rvindicator, pm.pmkind, pm.pmorder, pm.pmmodifer, pm.pmgenmod
FROM
component
LEFT JOIN
        (
        -- get parent material data for each cokey
        SELECT cokey, pmorder, pmmodifer, pmgenmod, pmgroupname, rvindicator, pmkind
        FROM copmgrp
        LEFT JOIN copm USING (copmgrpkey)
        ) AS pm
USING (cokey)
WHERE component.areasymbol IN ('ca624', 'ca628')
AND majcompflag = 'Yes'
ORDER BY mukey, comppct_r DESC ;

-- add an index so that joining with geometry is faster
CREATE INDEX temp_component_data_mukey_idx ON temp_component_data (mukey) ;
VACUUM ANALYZE temp_component_data;

 
Combine with geometry and prepare for export

-- combine with geometry
CREATE TABLE temp_combined_data AS
SELECT ogc_fid, wkb_geometry, temp_component_data.*
FROM mapunit_poly
LEFT JOIN temp_component_data USING (mukey)
WHERE mapunit_poly.areasymbol IN ('ca624', 'ca628') ;

 
Export to shapefile

# export
pgsql2shp -k -f data.shp -h the_host_machine -P the_username -u the_password -g wkb_geometry ssurgo_combined temp_combined_data

 
Remove temporary tables

-- clean-up
DROP TABLE temp_combined_data ;

Learning by Example: Common Queries

 
Selection, Filtering and Sorting

  • Selecting a summary of horizon thickness, sand, silt, clay and AWC for a given component (chorizon table).
    SELECT  hzname AS name, hzdepb_r - hzdept_r AS thickness, sandtotal_r AS sand, silttotal_r AS silt, claytotal_r AS clay, awc_r AS awc
    FROM chorizon
    WHERE cokey = '467038:646635'
    ORDER BY hzdept_r;

    Query Results
    name | thickness | sand | silt | clay | awc
    --------+-----------+------+------+------+------
    Ap | 18 | 35 | 37 | 28 | 0.16
    Bw | 23 | 35 | 40 | 25 | 0.15
    Bk1 | 28 | 35 | 40 | 25 | 0.16
    Bk2 | 40 | 35 | 40 | 25 | 0.16
    Bk3 | 36 | 35 | 40 | 25 | 0.16
    Bk4 | 38 | 60 | 25 | 15 | 0.13
  • Select some common fields from the component table:
     SELECT cokey, majcompflag, comppct_r, wei, weg, tfact
    FROM component
    WHERE mukey = '467038'
    ORDER BY comppct_r DESC;

    Query Results
    cokey | majcompflag | comppct_r | wei | weg | tfact
    ---------------+-------------+-----------+-----+-----+-------
    467038:646635 | Yes | 85 | 48 | 6 | 5
    467038:646636 | No | 5 | | |
    467038:646638 | No | 4 | | |
    467038:646640 | No | 2 | | |
    467038:646639 | No | 2 | | |
    467038:646637 | No | 2 | | |
  • Create a new table 'yolo_comp' using the component table, for a specific survey area (Yolo County), with components sorted by their component percentages.
    CREATE TABLE yolo_comp AS
    SELECT * FROM component
    WHERE areasymbol = 'ca113'
    ORDER BY mukey, comppct_r DESC ;

     
    The resulting table can be copied to a CSV file (in the current working directory) like this: \copy yolo_comp TO 'yolo_comp.csv' CSV HEADER

     
    If you are done with the table, remove it with the following SQL: DROP TABLE yolo_comp ;

Identifying the Largest Components

 
Overview:
Two methods for the selection of the largest components (based on the comppct_r column) within map units. This approach to selecting a single component per map unit is appropriate in situations where a single representative feature or property is sought. The partitioning of components within a map unit is closely related to the map unit type: complex, association, consociation or an "undifferentiated group". A breakdown of the number of components per each map unit type is summarized by the following query (26400 map units / 45971 major components):

SELECT mapunit.mukind, round(avg(n_components)) AS avg, min(n_components), max(n_components)
FROM mapunit JOIN
        (
        SELECT mapunit.mukey, count(component.mukey) AS n_components
        FROM
        mapunit JOIN component
        ON mapunit.mukey = component.mukey
        WHERE component.majcompflag = 'Yes' AND mapunit.mukind IS NOT NULL
        GROUP BY mapunit.mukey
        ) AS a
ON a.mukey = mapunit.mukey
GROUP BY mapunit.mukind ;

Query Results
mukind | avg | min | max
------------------------+-----+-----+-----
Consociation | 1 | 1 | 3
Complex | 2 | 1 | 4
Association | 3 | 1 | 4
Undifferentiated group | 2 | 1 | 4

 
Method 1: Filtering component percentages with the DISTINCT keyword.

  1. Use of the SELECT DISTINCT ON operator. Note that the ordering is done before the DISTINCT filter is applied, resulting in the largest (by component percentage) component. Note that results from ties are unpredictable.
    • SELECT DISTINCT ON (mukey) mukey, cokey, comppct_r, compname, taxorder
      FROM component
      WHERE areasymbol = 'ca654' AND majcompflag = 'Yes'
      ORDER BY mukey, comppct_r DESC ;

      Query Results
      mukey | cokey | comppct_r | compname | taxorder
      ---------+----------------+-----------+------------+-----------
      1487066 | 1487066:638252 | 80 | Auberry | Alfisols
      1487067 | 1487067:638274 | 80 | Auberry | Alfisols
      1487068 | 1487068:638383 | 85 | Crouch | Mollisols
      1487069 | 1487069:632318 | 85 | Cajon | Entisols
      1487070 | 1487070:632372 | 85 | Excelsior | Entisols
      1487071 | 1487071:632519 | 85 | Kimberlina | Entisols
      1487072 | 1487072:632642 | 85 | Nord | Mollisols
      1487076 | 1487076:632832 | 85 | Wasco | Entisols
      1487077 | 1487077:632866 | 85 | Whitewolf | Entisols
      464171 | 464171:640646 | 85 | Academy | Alfisols
    • We can inspect possible ties or other sources of error by looking at the smallest components.

      SELECT * FROM
              (
              SELECT DISTINCT ON (mukey) mukey, cokey, comppct_r, compname, taxorder
              FROM component
              WHERE areasymbol = 'ca654' AND majcompflag = 'Yes'
              ORDER BY mukey, comppct_r DESC
              ) AS a
      ORDER BY a.comppct_r ASC ;

      Query Results
      mukey | cokey | comppct_r | compname | taxorder
      ---------+----------------+-----------+---------------------+-------------
      464432 | 464432:1380237 | 10 | Ramona | Alfisols
      464184 | 464184:640686 | 30 | Ahwahnee | Alfisols
      464211 | 464211:640756 | 30 | Auberry | Alfisols
      464210 | 464210:640752 | 35 | Auberry | Alfisols
      464529 | 464529:641600 | 35 | Vista | Inceptisols
      464530 | 464530:641604 | 35 | Vista | Inceptisols
      464531 | 464531:641609 | 35 | Rock outcrop |

 
Finding the largest component (based on comppct_r) per mukey

  1. Identify how many map units there are in a given region or survey area. In this case San Joaquin County; areasymbol = 'ca077'.
    SELECT count(mukey) FROM mapunit WHERE areasymbol = 'ca113';

    Query Results
    count
    -------
    186
  2. Identify which map units contain components where a conventional, select largest component percentage type approach will not work. This commonly happens when there are several major components with the same comppct_r value, within the same map unit. Any returned mukey values need to be evaluated.
    SELECT component.mukey, count(component.mukey)
    FROM component
            JOIN
            (
            SELECT mukey, max(comppct_r) AS comppct FROM component WHERE majcompflag = 'Yes' AND areasymbol = 'ca077' GROUP BY mukey
            ) AS a
    ON component.comppct_r = a.comppct AND component.mukey = a.mukey
    GROUP BY component.mukey
    HAVING count(component.mukey) > 1 ;

    Query Results
    mukey | count
    --------+-------
    462043 | 2
    462068 | 2
  3. Evaluation of the two "trouble" map units identified above. Note the use of the where mukey in ('462043', '462068') syntax.
    SELECT mukey, cokey, comppct_r, compname, taxorder
    FROM component
    WHERE majcompflag = 'Yes' AND mukey IN ('462043', '462068')
    ORDER BY mukey, comppct_r DESC ;

    Query Results
    mukey | cokey | comppct_r | compname | taxorder
    --------+---------------+-----------+------------+-----------
    462043 | 462043:634006 | 45 | Dumps |
    462043 | 462043:634007 | 45 | Tailings |
    462068 | 462068:634159 | 30 | Honker | Alfisols
    462068 | 462068:634160 | 30 | Vallecitos | Alfisols
    462068 | 462068:634161 | 25 | Gonzaga | Mollisols

Profile water storage as calculated from SSURGO

 
Overview
This example illustrates a component-percent weighted mean water storage, based on total profile storage values. The comppct_r column is used as a weight to adjust the influence of the profile water storage values for each component: larger weights (larger components) result in more influence. When performing weighted means, use care in the selection of an appropriate weighting parameter. A selection of a weighting variable not direclty related to the property which is being averaged can result in an effect known as Simpson's paradox.

 
Some background on Soil Water Holding Capacity and Irrigation Management.

 
The general formula for computing profile water storage is defined as follows. Note that the awc_r column in the SSURGO database (chorizon table) is pre-adjusted to compensate for coarse fragments (see relevant SSURGO metadata below).

 
Calculate the water storage within a given component
component_whc = profile_sum_awc = sum( (hzdepb_r - hzdept_r) * awc_r )

 
Calculate the weighted mean value of the profile water storage, for a given map unit
weighted mean = sum(weights * x) / sum(weights)
mapunit_whc = sum( comppct_r * component_whc ) / sum( comppct_r )
Note: weights corresponding to non-soil components must be filtered out

 
From the SSURGO 2.1 table definitions:

Column Label: AWC - Representative Value (awc_r)
The amount of water that an increment of soil depth, inclusive of fragments, can store that is available to plants. AWC is expressed as a volume fraction, and is commonly estimated as the difference between the water contents at 1/10 or 1/3 bar (field capacity) and 15 bars
(permanent wilting point) tension and adjusted for salinity, and fragments.
  1. Reduce to 1:1 relationship with cokey by aggrigating AWC horizon data in the chorizon table, and linking to the component table.
     SELECT mukey, compname, comppct_r, a.* FROM component
    JOIN  
            (
            SELECT cokey, sum( (hzdepb_r - hzdept_r) * awc_r) AS component_whc, sum((hzdepb_r - hzdept_r)) AS depth
            FROM chorizon  WHERE areasymbol = 'ca113'
            GROUP BY cokey
            ) AS a
    ON component.cokey = a.cokey
    WHERE component.areasymbol = 'ca113'
    ORDER BY mukey ;

    Query Results
    mukey | compname | comppct_r | cokey | component_whc | depth
    --------+---------------------+-----------+----------------+---------------+-------
    459225 | Millsholm | 30 | 459225:624008 | 6.82 | 58
    459225 | Dibble | 55 | 459225:624007 | 12.36 | 86
    459226 | Millsholm | 40 | 459226:624013 | 6.82 | 58
    459226 | Dibble | 45 | 459226:624012 | 12.36 | 86
    459227 | Millsholm | 30 | 459227:624019 | 6.82 | 58
    459227 | Dibble | 60 | 459227:624018 | 12.36 | 99
    ...
  2. Reduce to 1:1 relationship with mukey by aggreigating based on component percent:
    SELECT mukey,
    -- note that weights from non-soil components must be removed
    -- otherwise, weighted mean values will be too low
    SUM(comppct_r * component_whc) / SUM(comppct_r) AS comppct_weighted_whc_cm
    FROM component
    JOIN
            (
            SELECT cokey, sum( (hzdepb_r - hzdept_r) * awc_r) AS component_whc,
            sum((hzdepb_r - hzdept_r)) AS depth
            FROM chorizon
            WHERE areasymbol = 'ca113'
            GROUP BY cokey
            ) AS a
    USING (cokey)
    WHERE component.areasymbol = 'ca113'
    -- filter out components that are missing soils data
    AND a.component_whc IS NOT NULL
    GROUP BY mukey ;

    Query Results
    mukey | comppct_weighted_whc_cm
    --------+-------------------------
    459225 | 10
    459226 | 10
    459227 | 11
    ...
  3. Link to polygons and setup access permissions:
    -- create the new table with both geometry and attributes
    CREATE TABLE yolo_whc AS
    SELECT  ogc_fid, wkb_geometry AS wkb_geometry, b.mukey, b.comppct_weighted_whc_cm
    FROM mapunit_poly
    -- use LEFT JOIN to include non-soil polygons in the result set
    -- alternatively use JOIN to ignore non-soil polygons
    LEFT JOIN
            (
            SELECT mukey,
            -- note that weights from non-soil components must be removed
            -- otherwise, weighted mean values will be too low
            SUM(comppct_r * component_whc) / SUM(comppct_r) AS comppct_weighted_whc_cm
            FROM component
            JOIN
              (
              SELECT cokey, sum( (hzdepb_r - hzdept_r) * awc_r) AS component_whc,
              sum((hzdepb_r - hzdept_r)) AS depth
              FROM chorizon
              WHERE areasymbol = 'ca113'
              GROUP BY cokey
              ) AS a
            USING (cokey)
            WHERE component.areasymbol = 'ca113'
            -- filter out components that are missing soils data
            AND a.component_whc IS NOT NULL
            GROUP BY mukey
            ) AS b
    -- JOIN constraint
    USING (mukey)
    -- optional constraint to limit geometry search in mapunit_poly table
    WHERE mapunit_poly.areasymbol = 'ca113' ;

     
    Create indexes and register the new geometry:

    -- create attribute and spatial index:
    CREATE UNIQUE INDEX yolo_whc_idx ON yolo_whc (ogc_fid) ;
    CREATE INDEX yolo_whc_spatial_idx ON yolo_whc USING gist (wkb_geometry gist_geometry_ops);

    -- register in geometry_columns table:
    INSERT INTO geometry_columns VALUES ('','public','yolo_whc','wkb_geometry',2,9001,'POLYGON');

    -- optional:
    -- fix permissions
    -- GRANT SELECT on table yolo_whc to [user] ;

    -- cleanup
    vacuum analyze yolo_whc  ;

    Yolo County Water Holding Capacity Map: Profile total water holding capacity as computed by component percentage weighted average.Yolo County Water Holding Capacity Map: Profile total water holding capacity as computed by component percentage weighted average.

  4. Dump to shapefile with ogr2ogr:
     ogr2ogr ywhc.shp PG:"dbname=ssurgo_combined user=xxxx password=xxxx" yolo_whc

     
    More examples of exporting data can be found here.

Soil Texture Example

 
Premise

Compute a series of weighted-average soil texture fractions (sand, silt, clay), for every component, of every map unit in Yolo County. These values will be further weighted by the spatial distribution of each map unit.

CREATE TABLE yolo_wt_mean_texture AS
-- join with polygons, and compute areas weights
SELECT mapunit_poly.mukey,
sum(ST_Area(wkb_geometry)) / (SELECT ST_Area(wkb_geometry) FROM mapunit_bound_poly WHERE areasymbol = 'ca113') AS area_wt,
sand, silt, clay
FROM
mapunit_poly
JOIN
        (
        -- compute component percent weighted mean
        SELECT mukey,
        sum(comppct_r * sand) / sum(comppct_r) AS sand,
        sum(comppct_r * silt) / sum(comppct_r) AS silt,
        sum(comppct_r * clay) / sum(comppct_r) AS clay
        FROM
        component
        JOIN
                (
                -- compute hz thickness weighted mean
                SELECT cokey,
                sum((hzdepb_r - hzdept_r) * sandtotal_r) / sum(hzdepb_r - hzdept_r) AS sand,
                sum((hzdepb_r - hzdept_r) * silttotal_r) / sum(hzdepb_r - hzdept_r) AS silt,
                sum((hzdepb_r - hzdept_r) * claytotal_r) / sum(hzdepb_r - hzdept_r) AS clay
                FROM chorizon
                WHERE sandtotal_r IS NOT NULL
                AND silttotal_r IS NOT NULL
                AND claytotal_r IS NOT NULL
                AND areasymbol = 'ca113'
                GROUP BY cokey
                ) AS co_agg
        ON component.cokey = co_agg.cokey
        GROUP BY component.mukey
        ) AS mu_agg
ON mapunit_poly.mukey = mu_agg.mukey
GROUP BY mapunit_poly.mukey, sand, silt, clay;

 
Simple Visualization with R

# dump the data:
# echo 'select * from yolo_wt_mean_texture' |  psql -A -F "," -U xxx -h xxx dbname > temp/yolo_texture.csv
#
# remember to trim off the last line!

# load some libs:
library(plotrix)

# read in the data
x <- read.csv('yolo_texture.csv')

# simple soil texture, with symbol size weighted by area weight
soil.texture(x[,3:5], cex=sqrt(50*x$area_wt), pch=16, col.symbol=rgb(65,105,225, alpha=100, max=255),
show.lines=T, show.names=T, col.lines='black', col.names='black', main='Yolo County Soil Textures')

triax.points(cbind(weighted.mean(x$sand, x$area_wt), weighted.mean(x$silt, x$area_wt), weighted.mean(x$clay, x$area_wt)),
col.symbols='orange', pch=16, cex=2)

Yolo County Common Soil Textures: Blue symbols mark common soil textures and their relative size denotes abundance. The orange symbol marks the weighted average soil texture for all of Yolo County.Yolo County Common Soil Textures: Blue symbols mark common soil textures and their relative size denotes abundance. The orange symbol marks the weighted average soil texture for all of Yolo County.

Identification of Dated Surfaces via Soil Series

East Side Alluvial FormationsEast Side Alluvial Formations

 
Overview
A simple association between dated landforms and soil series name [1] was used to extract soil polygons from a composite soil survey database.

Soil Series Associated Formation Approximate Age (1000 yrs ago)
Redding Laguna 1600 - 2000 kya
Corning Laguna 1600 - 2000 kya
Keyes Laguna 1600 - 2000 kya
Whitney Turlock Lake 500 - 700 kya
Montpellier Turlock Lake 500 - 700 kya
Rocklin Turlock Lake 500 - 700 kya
Snelling Riverbank 100 - 300 kya
San Joaquin Riverbank 100 - 300 kya
Exiter Riverbank 100 - 300 kya
Madera Riverbank 100 - 300 kya
Hanford Modesto 10 - 40 kya
Grangeville Holocene < 10 kya
  1. Create a soil series - dated landform lookup table:

    -- create a lookup table
    CREATE TABLE dated_landforms (
    soil_series varchar(20),
    formation varchar(20),
    approx_age varchar(30)
    );
    -- populate table
    INSERT INTO dated_landforms VALUES ('Redding','Laguna','1600 - 2000 kya') ;
    INSERT INTO dated_landforms VALUES ('Corning','Laguna','1600 - 2000 kya') ;
    INSERT INTO dated_landforms VALUES ('Keyes','Laguna','1600 - 2000 kya') ;
    INSERT INTO dated_landforms VALUES ('Whitney','Turlock Lake','500 - 700 kya') ;
    INSERT INTO dated_landforms VALUES ('Montpellier','Turlock Lake','500 - 700 kya') ;
    INSERT INTO dated_landforms VALUES ('Rocklin','Turlock Lake','500 - 700 kya') ;
    INSERT INTO dated_landforms VALUES ('Snelling','Riverbank','100 - 300 kya') ;
    INSERT INTO dated_landforms VALUES ('San Joaquin','Riverbank','100 - 300 kya') ;
    INSERT INTO dated_landforms VALUES ('Exiter','Riverbank','100 - 300 kya') ;
    INSERT INTO dated_landforms VALUES ('Madera','Riverbank','100 - 300 kya') ;
    INSERT INTO dated_landforms VALUES ('Hanford','Modesto','10 - 40 kya') ;
    INSERT INTO dated_landforms VALUES ('Grangeville','Holocene','< 10 kya') ;
  2. Select map units (mukey) by suitible series concepts, associated with major components. Note that the DISTINCT ON (mukey) ... ORDER BY mukey, comppct_r DESC pattern can be used to select the largest component for each map unit. Pattern matching is used to safely join variants with the soil series names in our look-up table: i.e. 'San Joaquin variant' will match 'San Joaquin'.

    -- keep only the largest formation (by total component percent) within each map unit key
    SELECT DISTINCT ON (mukey) mukey, formation, sum(comppct_r) AS formation_pct
    FROM component
    JOIN
    dated_landforms
    -- use pattern matching to also include variants
    ON compname ~~* lower(soil_series || '%') = 't'
    -- subset to select survey areas on the east side of the valley
    AND component.areasymbol IN ('ca654', 'ca651', 'ca649', 'ca644', 'ca648', 'ca077')
    AND majcompflag = 'Yes'
    -- combine components of the same formation
    GROUP BY mukey, formation
    -- use in conjunction with the DISTINCT statement to keep only the largest
    ORDER BY mukey, formation_pct DESC;
  3. Create a new classified table called east_side_all. This query involves 6 survey areas, 11423 polygons, and requires about 11 seconds to complete.

    CREATE TABLE east_side_all AS
    -- select geom column and the feature id
    -- along with our formation names
    SELECT wkb_geometry AS wkb_geometry, ogc_fid, a.*
    FROM mapunit_poly
    JOIN
    (
    SELECT DISTINCT ON (mukey) mukey, formation, sum(comppct_r) AS formation_pct
    FROM component
    JOIN
    dated_landforms
    ON compname ~~* lower(soil_series || '%') = 't'
    AND component.areasymbol IN ('ca654', 'ca651', 'ca649', 'ca644', 'ca648', 'ca077')
    AND majcompflag = 'Yes'
    GROUP BY mukey, formation
    ORDER BY mukey, formation_pct DESC
    ) AS a
    ON mapunit_poly.mukey = a.mukey
    -- limit polygon selection by survey area
    AND mapunit_poly.areasymbol IN ('ca654', 'ca651', 'ca649', 'ca644', 'ca648', 'ca077') ;
  4. Index, register geometry, and setup permissions
     -- indexing
    CREATE INDEX east_side_all_fid_idx ON east_side_all  (ogc_fid) ;
    CREATE INDEX east_side_all_gis_idx ON east_side_all USING gist (wkb_geometry gist_geometry_ops) ;
    &nbsp;
    -- register geometry
    INSERT INTO geometry_columns VALUES ('','public','east_side_all','wkb_geometry',2,9001,'POLYGON');
    &nbsp;
    -- permissions
    GRANT SELECT ON TABLE east_side_all TO soil  ;
  5. Tabulate acreage for each formation. (approximately 1/3 second to complete)
     
    SELECT * FROM
            (
            SELECT round(sum(area(wkb_geometry)) * 0.000247) AS area_ac, formation
            FROM east_side_all
            GROUP BY formation
            ) AS a
    ORDER BY a.area_ac DESC;

    area_ac formation
    295868 Riverbank
    253424 Modesto
    151149 Turlock Lake
    121085 Laguna
    96981 Holocene

    (5 rows)

  6. Dump to local file in SHP format
     ogr2ogr east_side_all.shp PG:"dbname=ssurgo_combined user=xxxx password=xxxx host=xxxx" east_side_all

 
References:

  1. Smith, D.W. & Verrill, W.L. Witham, C.; Bauder, E.; Belk, D.; Ferren Jr., W. & Ornduff, R. (ed.) Vernal Pool-Soil-Landform Relationships in the Central Valley, California 1998, 15-23

Seasonally Wet Soils and Shrink-Swell Potential

PostGIS: ssurgo example- wet polygons
Example 1 map
PostGIS: ssurgo example- shrink swell potential
Example 2 map

 
Example 1: The location of seasonaly wet soils via two methods: hydricrating and USDA Soil Taxonomy interpretation.

-- optionally link polygons here...
-- compute the percent of each map unit that contains components that may be seasonally wet
SELECT mukey, wet_flag, sum(comppct_r) AS pct_mu
FROM
        (
        SELECT mukey, cokey, comppct_r,
        -- slightly less restrictive than hydricrating alone
        CASE WHEN ((taxclname ~~* '%aqu%') = 't') OR (hydricrating = 'Yes') then 1 ELSE 0 END AS wet_flag
        FROM component  
        -- subset to Yolo County
        WHERE areasymbol = 'ca113'
        ) AS yolo_wet_components
-- only keep map unit with some component that meets the criteria
WHERE wet_flag = 1
-- aggregate by map unit, wet_flag
GROUP BY yolo_wet_components.mukey, yolo_wet_components.wet_flag
ORDER BY mukey;

 

 
Example 1.1 Extract a list of wet components, and sum area based on component (series) name

-- compute estimated, total acreage of seasonally wet map units in Yolo Co.
SELECT mapunit.musym, mapunit.mukey, mapunit.muname,
round(sum(ST_Area(wkb_geometry) * (yolo_mu_data.pct_mu::numeric/100.0) * 0.000247)::numeric, 2) AS wet_area_ac
FROM
mapunit_poly
JOIN
        (
        -- compute the percent of each map unit that contains components that may be seasonally wet
        SELECT mukey, wet_flag, sum(comppct_r) AS pct_mu
        FROM
                (
                SELECT mukey, cokey, comppct_r,
                -- slightly less restrictive than hydricrating alone
                CASE WHEN ((taxclname ~~* '%aqu%') = 't') OR (hydricrating = 'Yes') then 1 ELSE 0 END AS wet_flag
                FROM component  
                -- subset to Yolo County
                WHERE areasymbol = 'ca113'
                ) AS yolo_wet_components
        -- only keep map unit with some component that meets the criteria
        WHERE wet_flag = 1
        -- aggregate by map unit, wet_flag
        GROUP BY yolo_wet_components.mukey, yolo_wet_components.wet_flag
        ) AS yolo_mu_data
ON mapunit_poly.mukey = yolo_mu_data.mukey
JOIN mapunit
ON mapunit.mukey = yolo_mu_data.mukey
-- aggregate by map unit
GROUP BY mapunit.mukey, mapunit.muname, mapunit.musym
ORDER BY wet_area_ac DESC;

 musym | mukey  |                             muname                             | wet_area_ac 
-------+--------+----------------------------------------------------------------+-------------
 Sc    | 459268 | Sacramento clay                                                |    35710.90
 Mf    | 459244 | Marvin silty clay loam                                         |    18877.18
 Sg    | 459272 | Sacramento soils, flooded                                      |    12273.51
 Cn    | 459219 | Clear Lake soils, flooded                                      |    11665.84
 Cc    | 459216 | Capay soils, flooded                                           |    11029.83
 Sv    | 459288 | Sycamore complex, drained                                      |     8410.41
 St    | 459286 | Sycamore silty clay loam, drained                              |     6900.77
 Ck    | 459218 | Clear Lake clay                                                |     6737.88
 Sa    | 459266 | Sacramento silty clay loam                                     |     6014.48
 Sp    | 459283 | Sycamore silt loam, drained                                    |     5537.18
 Sw    | 459289 | Sycamore complex, flooded                                      |     5041.96
 Wb    | 459301 | Willows clay                                                   |     4950.34
 Ss    | 459285 | Sycamore silty clay loam                                       |     4859.59
 Pb    | 459254 | Pescadero silty clay, saline-alkali                            |     4700.77
 So    | 459282 | Sycamore silt loam                                             |     3953.42
 Rh    | 459262 | Riverwash                                                      |     3698.88
 Pc    | 459255 | Pescadero soils, flooded                                       |     3589.01
 Tb    | 459292 | Tyndall very fine sandy loam                                   |     3299.03
 Ob    | 459252 | Omni silty clay                                                |     3174.44

[...]

 
Example 2: Classify soils according to shrink-swell capacity of the top 1 meter of soil, weighted by horizon thickness and component percent.

-- set a lower boundary for the query
\SET lwr_bdy 100

-- add a class label, based on NRCS guidelines
SELECT mapunit.musym, mapunit.muname, mapunit.muacres,
round(mu_wt_lep::numeric, 2) AS lep,
CASE WHEN mu_wt_lep < 3 THEN 'Low'
WHEN mu_wt_lep >= 3 AND mu_wt_lep < 6 THEN 'Moderate'
WHEN mu_wt_lep >= 6 AND mu_wt_lep < 9 THEN 'High'
WHEN mu_wt_lep >= 9 THEN 'Very High'
END AS lep_class
FROM
        (
        -- compute map unit lep, weighted by component percent, to set depth
        SELECT component.mukey,
        sum(component.comppct_r * co_wt_mean_lep) / sum(component.comppct_r) AS mu_wt_lep
        FROM
                (
                -- compute a horizon-thickness weighted mean lep to a set depth
                SELECT cokey, sum(thick * lep_r) / sum(thick) AS co_wt_mean_lep
                FROM
                        (
                        -- compute horizon thickness, but only to a set depth
                        SELECT cokey, hzdept_r, hzdepb_r, lep_r,
                        CASE WHEN hzdepb_r > :lwr_bdy THEN (:lwr_bdy - hzdept_r)
                        ELSE (hzdepb_r - hzdept_r) END AS thick
                        FROM chorizon
                        WHERE areasymbol = 'ca113'
                        AND lep_r IS NOT NULL
                        AND hzdept_r <= :lwr_bdy
                        ) AS hz_lep
                GROUP BY cokey
                ) AS co_lep
        JOIN component
        ON co_lep.cokey = component.cokey
        GROUP BY mukey
        ) AS mu_lep
JOIN mapunit
ON mu_lep.mukey = mapunit.mukey
ORDER BY muacres DESC, lep DESC;

 musym |                              muname                              | muacres | lep  | lep_class 
-------+------------------------------------------------------------------+---------+------+-----------
 Ya    | Yolo silt loam                                                   |   39698 | 2.52 | Low
 Sc    | Sacramento clay                                                  |   34886 | 7.50 | High
 Ca    | Capay silty clay                                                 |   33465 | 7.50 | High
 MrG2  | Millsholm rocky loam, 15 to 75 percent slopes, eroded            |   30118 | 1.50 | Low
 Rg    | Rincon silty clay loam                                           |   24580 | 6.36 | High
 BrA   | Brentwood silty clay loam, 0 to 2 percent slopes                 |   23045 | 7.50 | High
 CtD2  | Corning gravelly loam, 2 to 15 percent slopes, eroded            |   22080 | 5.34 | Moderate
 Mf    | Marvin silty clay loam                                           |   20970 | 6.60 | High
 DaF2  | Dibble clay loam, 30 to 50 percent slopes, eroded                |   18612 | 7.11 | High
 SmE2  | Sehorn-Balcom complex, 15 to 30 percent slopes, eroded           |   17794 | 6.17 | High
 TaA   | Tehama loam, 0 to 2 percent slopes                               |   16622 | 3.75 | Moderate
 BdF2  | Balcom-Dibble complex, 30 to 50 percent slopes, eroded           |   16405 | 5.73 | Moderate
 SmD   | Sehorn-Balcom complex, 2 to 15 percent slopes                    |   16117 | 6.50 | High
 BaF2  | Balcom silty clay loam, 30 to 50 percent slopes, eroded          |   12637 | 4.50 | Moderate
 Sg    | Sacramento soils, flooded                                        |   12258 | 6.27 | High
 Cn    | Clear Lake soils, flooded                                        |   11666 | 6.92 | High
 SmF2  | Sehorn-Balcom complex, 30 to 50 percent slopes, eroded           |   11226 | 6.33 | High
 Cc    | Capay soils, flooded                                             |   11030 | 7.50 | High
 Sv    | Sycamore complex, drained                                        |    9241 | 4.18 | Moderate
 Ms    | Myers clay                                                       |    8938 | 7.50 | High
 PfF2  | Positas gravelly loam, 30 to 50 percent slopes, eroded           |    7920 | 5.34 | Moderate
 St    | Sycamore silty clay loam, drained                                |    7839 | 4.50 | Moderate
 Ck    | Clear Lake clay                                                  |    6946 | 7.50 | High
 Ra    | Reiff very fine sandy loam                                       |    6847 | 1.50 | Low

[...]

Making Database Diagrams with PostgreSQL Autodoc

http://www.rbt.ca/autodoc/index.html


postgresql_autodoc -d db_name

see dot output.
dot -Tpng -o output.png db_name.dot 

PROJ: forward and reverse geographic projections

Data points collected in the field via GPS, or in the office via maps can often come from a variety of projections and or datum/ellipsoids. The PROJ tools and libraries can be used to perform forward and reverse projection on lists of points stored in a text file. More information on PROJ can be found here: http://trac.osgeo.org/proj/ . Many applications that perform forward and inverse projection operations rely on the Proj library for the actual transformatoin of coordinates. Some examples include GDAL/OGR and GRASS, which provide a convenient interface for converting large datasets from one coordinate system to another. For problems see the FAQ. Tests confirm that datum transforms within the continental US are nearly identical with results from NADCON.

Command names are in boldface type, input files are red, and output files are in green.


Forward and Inverse Projections



Conversions Between Coordinate Systems


GDAL and OGR: geodata conversion and re-projection tools

 
Overview

Operations on geographic data are most efficient when the input files have identical spatial parameters: i.e. coordinates system and datum / ellispoid. The GDAL/OGR tools and libraries can be used to "reproject" downloaded geographic data from one coordinate system to another, both for vector and raster data. More information on this software can be found here: http://www.remotesensing.org/gdal/index.html The following examples illustrate some of the ways in which the GDAL/OGR tools can be used to manipulate the spatial parameters of some common types of geographic data: SSURGO (ESRI Shapefile format) and aerial imagery (GeoTiff format).

 
How Does it Compare with ArcToolbox?

A SSURGO boundary polygon was used in a quick test to see how similar the results of a vector re-projection done by ArcMap and OGR were. The input vector (GCS NAD83) was projected to UTM zone 11 NAD83 by both OGR and ArcMap 9.0, and then imported into GRASS. Both vectors were then exported as ascii, for simple access to the line vertex coordinates. A simple AWK script was used to numerically compare the differences in both X and Y coordinates:

 paste arcmap.ascii ogr.ascii | \
 awk '
 /\./{print ($1-$3), ($2-$4) ; x+=($1-$3) ; y+=($2-$4); n++ } 
 END{print "avg dx: "(x/n), "avg dy: "(y/n), "n: "n}
     '

The resulting average dx and dy values were: dx: 2.98838e-06 m and dy: -1.23948e-07 m with 5035 samples: looks good to me. Next time, a datum shift may be more telling of any subtle differences to be aware of. Update: NAD83 -> NAD27 Datum shift experiment looks like datum shifts are comparible as well.

Examples:
Command names are in boldface type, input files are red, and output files are in green. Figures are provided for additional context, and are referenced by input/output file name in the example code.

ssurgo utm example
ssurgo_utm.shp
ssurgo ScA
ssurgo_ScA.shp

Vector Operations

image CA SPC IV

884084.tif

image utm zone 10

884084-utm.tif



Raster Operations [raster tutorial]

 
Reading MrSid format files details






R: advanced statistical package

 
About R
R is a general-purpose, command-line based, environment for working with data. R is based on a functional approach to working with vectors and matrices of data. R is a convenient environment for processing, analyzing, and plotting data.

 
Soils-Related R Packages
The 'aqp' (Algorithms for Quantitative Pedology) package was developed to facilitate numerical extensions to classical studies of soil geography, genesis and classification. [CRAN] [R-Forge]

 
R in the News

 
Getting Started

 
Searching for Information

 
R with Geographic Data

 
Misc. Articles

Access Data Stored in a Postgresql Database

 
Overview
Perform some temporal aggregation (by day and by week) of the amount of data entry completed in Postgresql, and plot the results in R. See resulting figure at the bottom of the page. Note that this requires the Rdbi and RdbiPgSQL packages. Hints on installing these packages can be found on this page...

 
Weekly Aggregation hints from the psql manual page

 SELECT week, count(week) AS entered
FROM
(
SELECT pedon_id, creation_date, extract( week FROM creation_date) AS week
FROM description
ORDER BY creation_date ASC
) AS a
GROUP BY a.week
ORDER BY week;

 
Daily Aggregation hints from the psql manual page

SELECT doy, count(doy) AS entered
FROM
(
SELECT pedon_id, creation_date, extract( doy FROM creation_date) AS doy
FROM description
ORDER BY creation_date ASC
) AS a
GROUP BY a.doy
ORDER BY doy;

 
R Example

 ##### load the samme data in from PgSQL #####

library(Rdbi)
library(RdbiPgSQL)

# conn becomes an object which contains the DB connection:
conn <- dbConnect(PgSQL(), host="localhost", dbname="xxx", user="xxx", password="xxx")
# see if the connection works (should report the list of table(s) if table(s) are existing):
# dbListTables(conn)

## create an object which contains the SQL query:
query <- dbSendQuery(conn, "select pedon_id, hz_number, name, top, bottom, ((bottom - top)/2 + top) as avgdepth, matrix_wet_color_hue, matrix_wet_color_value, matrix_wet_color_chroma, matrix_dry_color_hue, matrix_dry_color_value, matrix_dry_color_chroma from horizon  order by pedon_id, hz_number ")

# fetch data according to query:
x <- dbGetResult(query)

# create an object which contains the SQL query:
query <- dbSendQuery(conn, "select doy, count(doy) as entered from (select pedon_id, creation_date, extract( doy from creation_date) as doy from description order by creation_date asc ) as a group by a.doy order by doy;")

# fetch data according to query:
y <- dbGetResult(query)

# setup plot environment
par(mfrow=c(2,1))

# plot cumulative progress, by week
plot(x$week, cumsum(x$entered), type='b', xlab='Week', ylab='Pedon Forms Completed', main='Weekly Progress')

# plot cumulative progress, by day
plot(y$doy, cumsum(y$entered), type='b', xlab='Day of Year', ylab='Pedon Forms Completed', main='Daily Progress')


Pedon entry progressPedon entry progress

Additive Time Series Decomposition in R: Soil Moisture and Temperature Data

Decagon Sensors: EC-5 (moisture) and ECT (temperature)

 
Premise
Simple demonstration of working with time-series data collected from Decagon Devices soil moisture and temperature sensors. These sensors were installed in a potted plant, that was semi-regularly watered, and data were collected for about 80 days on an hourly basis. Several basic operations in R are demonstrated:

  • reading raw data in CSV format
  • converting date-time values to R's date-time format
  • applying a calibration curve to raw sensor values
  • initialization of R time series objects
  • seasonal decomposition of additive time series (trend extraction)
  • plotting of decomposed time series, ACF, and cross-ACF

 
Process the raw sensor values with standard calibrations

## data from office plant: in potting soil
# raw data dump -- need to convert datetime + values:
x1 <- read.csv('office_plant_2.csv', head=FALSE)

# datetime is seconds from jan 1st 2000
t_0 <- as.POSIXlt(strptime('2000-01-01 00:00:00', format='%Y-%m-%d %H:%M:%S'))

# calibration for potting soil
raw_to_vwc <- function(d) {vwc <- (d * 0.00119) - 0.401 ; vwc }

# calibration for deg C
raw_to_temp <- function(d) {t <- log( (4095/d) - 1 ) ; t_c <-  25.02 + t * (-22.84 + t * (1.532 + (-0.08372 * t))) ; t_c}

# convert values
y1 <- data.frame(date=t_0 + x1$V1, m=raw_to_vwc(x1$V2), t=raw_to_temp(x1$V5))

# make a nice time axis
d.range <- range(y1$date)
d.list <- seq(d.range[1], d.range[2], by='week')

# note that there are several tricks here:
# stacking two plots that share an axis
# customized x-axis
# and manually adding a title with mtext()
par(mar = c(0.5, 4, 0, 1), oma = c(3, 0, 4, 0), mfcol = c(2,1))
plot(m ~ date, data=y1, type='l', ylab='VWC (EC-5 Sensor)', xaxt='n', las=2, cex.axis=0.75)
plot(t ~ date, data=y1, type='l', ylab='Deg. C (EC-T Sensor)', xaxt='n', las=2, cex.axis=0.75)
axis.POSIXct(at=d.list, side=1, format="%b-%d", cex.axis=0.75)
mtext('Potted Plant Experiment', outer=TRUE, line=2, font=2)

# save copy of raw data
dev.copy2pdf(file='raw_data.pdf')

 
Decompose each time series into additive components

# look at components of time series:
# we recorded measurements once and hour, so lets consider these data a on a daily-cycle
temp.ts <- ts(y1$t, freq=24)
vwc.ts <- ts(y1$m, freq=24)

# decompose additive time series with STL
# (Seasonal Decomposition of Time Series by Loess)
temp.stl <- stl(temp.ts, s.window=24)
vwc.stl <- stl(vwc.ts, s.window=24)

# these are referenced by day, so we need a new index for
# plotting meaningful dates on the x-axis
# generate the difference in days, from the first observations, at each date label
date.day_idx <- as.numeric((d.list - d.range[1]) / 60 / 60 / 24)

# note special syntax
par(mar = c(0, 4, 0, 3), oma = c(5, 0, 4, 0), mfcol = c(4,1), xaxt='n')
plot(temp.stl , main='Temperature (deg C)')
mtext(at=date.day_idx, text=format(d.list, "%b-%d"), side=1, cex=0.75)
dev.copy2pdf(file='temperature-ts_plot.pdf')

# note special syntax
par(mar = c(0, 4, 0, 3), oma = c(5, 0, 4, 0), mfcol = c(4,1), xaxt='n')
plot(vwc.stl , main='Volumetric Water Content')
mtext(at=date.day_idx, text=format(d.list, "%b-%d"), side=1, cex=0.75)
dev.copy2pdf(file='vwc-ts_plot.pdf')

Additive Time Series Decomposition: TemperatureAdditive Time Series Decomposition: Temperature

Additive Time Series Decomposition: Volumetric Water ContentAdditive Time Series Decomposition: Volumetric Water Content

 
Auto-Correlation Function (ACF)

# look at ACF: ind. time series, and cross-ACF
acf( ts.union(temp.ts, vwc.ts) )

# extract seasonal components from each sensor, union, and plot together
temp_vwc.ts <- ts.union(Temperature=temp.stl$time.series[,1], VWC=vwc.stl$time.series[,1])
plot(temp_vwc.ts, main='Seasonal Components', mar.multi= c(1, 5.1, 1, 1))

Soil Moisture and Temperature ACF: Auto-correlation function of each time series, and cross-ACF.Soil Moisture and Temperature ACF: Auto-correlation function of each time series, and cross-ACF.

 
Interesting Results
Variation in temperature with time dominated by diurnal fluctuations superposed over underlying fluctuations caused by building heating/cooling system. The magnitude of the diurnal cycle appears to be related to the moisture content- as expected due to high heat capacity of water. Diurnal variation in moisture values appears to account for less than < 2% absolute change in volumetric water content.

 
 
 

Aggregating SSURGO Data in R

 
Premise
SSURGO is a digital, high-resolution (1:24,000), soil survey database produced by the USDA-NRCS. It is one of the largest and most complete spatial databases in the world; and is available for nearly the entire USA at no cost. These data are distributed as a combination of geographic and text data, representing soil map units and their associated properties. Unfortunately the text files do not come with column headers, so a template is required to make sense of the data. Alternatively, one can use an MS Access template to attach column names, generate reports, and other such tasks. CSV file can be exported from the MS Access database for further use. A follow-up post with text file headers, and complete PostgreSQL database schema will contain details on implementing a SSURGO database without using MS Access.

If you happen to have some of the SSURGO tabular data that includes column names, the following R code may be of general interest for resolving the 1:many:many hierarchy of relationships required to make a thematic map.

 
This is the format we want the data to be in

    mukey     clay      silt      sand water_storage
   458581 20.93750 20.832237 20.861842     14.460000
   458584 43.11513 30.184868 26.700000     23.490000
   458593 50.00000 27.900000 22.100000     22.800000
   458595 34.04605 14.867763 11.776974     18.900000

 
So we can make a map like this
So we can make a map like this

 
Loading Data Into R

# need this for ddply()
library(plyr)

# load horizon and component data
chorizon <- read.csv('chorizon_table.csv')

# only keep some of the columns from the component table
component <- read.csv('component_table.csv')[,c('mukey','cokey','comppct_r')]

 
Function Definitions

# custom function for calculating a weighted mean
# values passed in should be vectors of equal length
wt_mean <- function(property, weights)
        {
        # compute thickness weighted mean, but only when we have enough data
        # in that case return NA
       
        # save indices of data that is there
        property.that.is.na <- which( is.na(property) )
                property.that.is.not.na <- which( !is.na(property) )
       
        if( length(property) - length(property.that.is.na) >= 1)
                prop.aggregated <- sum(weights[property.that.is.not.na] * property[property.that.is.not.na], na.rm=TRUE) / sum(weights[property.that.is.not.na], na.rm=TRUE)
        else
                prop.aggregated <- NA

        return(prop.aggregated)
        }

profile_total <- function(property, thickness)
        {
        # compute profile total
        # in that case return NA
       
        # save indices of data that is there
        property.that.is.na <- which( is.na(property) )
                property.that.is.not.na <- which( !is.na(property) )
       
        if( length(property) - length(property.that.is.na) >= 1)
                prop.aggregated <- sum(thickness[property.that.is.not.na] * property[property.that.is.not.na], na.rm=TRUE)
        else
                prop.aggregated <- NA

        return(prop.aggregated)
        }

# define a function to perfom hz-thickness weighted aggregtion
component_level_aggregation <- function(i)
        {

        # horizon thickness is our weighting vector
        hz_thick <- i$hzdepb_r - i$hzdept_r

        # compute wt.mean aggregate values
        clay <- wt_mean(i$claytotal_r, hz_thick)
        silt <- wt_mean(i$silttotal_r, hz_thick)
        sand <- wt_mean(i$sandtotal_r, hz_thick)
        # compute profile sum values
        water_storage <- profile_total(i$awc_r, hz_thick)

        # make a new dataframe out of the aggregate values
        d <- data.frame(cokey=unique(i$cokey), clay=clay, silt=silt, sand=sand, water_storage=water_storage)

        return(d)
        }

mapunit_level_aggregation <- function(i)
        {
        # component percentage is our weighting vector
        comppct <- i$comppct_r

        # wt. mean by component percent
        clay <- wt_mean(i$clay, comppct)
        silt <- wt_mean(i$silt, comppct)
        sand <- wt_mean(i$sand, comppct)
        water_storage <- wt_mean(i$water_storage, comppct)

        # make a new dataframe out of the aggregate values
        d <- data.frame(mukey=unique(i$mukey), clay=clay, silt=silt, sand=sand, water_storage=water_storage)

        return(d)
        }

 
Performing the Aggregation

# aggregate horizon data to the component level
chorizon.agg <- ddply(chorizon, .(cokey), .fun=component_level_aggregation, .progress='text')

# join up the aggregate chorizon data to the component table
comp.merged <- merge(component, chorizon.agg, by='cokey')

# aggregate component data to the map unit level
component.agg <- ddply(comp.merged, .(mukey), .fun=mapunit_level_aggregation, .progress='text')

# save data back to CSV
write.csv(component.agg, file='something.csv', row.names=FALSE)

Cluster Analysis 1: finding groups in a randomly generated 2-dimensional dataset

Cluster Analysis 1: 2 class example
Figure 1. Two class example
Cluster Analysis 1: 4 class example
Figure 2: Four class example
Cluster Analysis 1: 2 class example with 2-way fuzzy membership
Figure 3: 2-way fuzzy membership

 
Examples based on a random data set (see example code below), illustrating some of the differences between the K-means and C-means clustering methods as implemented in R. Next time an example with soil profile data collected from the Pinnacles National Monument soil survey efforts. An online version of the PINN soil survey will be available soon here.
 
Articles:

 
Example in R:

## load required packages:
 require(cluster)
 require(e1071)
## make a dateset with 5 populations
x <- matrix( c(
rnorm(50, mean=.3, sd=.5),
rnorm(50, mean=.16, sd=.1),
rnorm(50, mean=.4, sd=.3),
rnorm(50, mean=.6, sd=.2),
rnorm(50, mean=.2, sd=.2)
), ncol=2)

## load function membership() : see attached file at bottom of page
source('cluster_demo_function.R')

## run an example with 2, then 4 classes: See Figures 1 and 2
membership(x,2)
membership(x,4)

## two-way fuzzy membership illustrated with color: See Figure 3
## display 2-way fuzzy membership
plot(x, main="C-means: 2-way Fuzzy Membership", type="n", xlab="Variable 1", ylab="Variable 2")
points(cc$centers, col = c("red", "blue"), pch = 8, cex=2)
points(x, col = rgb(cc$membership[,1], 0 ,cc$membership[,2]) , cex=0.5, pch=16)

Color Functions

Sample functions and ideas for accessing the R built-in colors. Further examples on converting soil colors to RGB triplets, or for the selection of optimal colors for a thematic map please see the examples linked at the bottom of this page. An excellent discussion on the use of color for presenting scientific data is presented in this paper by Zeileis, Achim and Hornik, Kurt.

R Color Selection: Simple figure illustrating the layout() function to create a plot of the built-in R colors palettes.R Color Selection: Simple figure illustrating the layout() function to create a plot of the built-in R colors palettes.

Simple Color Display

#make a color wheel
pie(rep(1,12), col=rainbow(12))


#make a list of the common color palettes:
demo.pal <- function(n, border = if (n<32) "light gray" else NA,
main = paste("color palettes;  n=",n),
ch.col = c("rainbow(n, start=.7, end=.1)", "heat.colors(n)",
"terrain.colors(n)", "topo.colors(n)", "cm.colors(n)"))
{
        nt <- length(ch.col)
        i <- 1:n; j <- n / nt; d <- j/6; dy <- 2*d
        plot(i,i+d, type="n", yaxt="n", ylab="", main=main)
        for (k in 1:nt) {
                rect(i-.5, (k-1)*j+ dy, i+.4, k*j,
                        col = eval(parse(text=ch.col[k])), border = border)
                text(2*j,  k * j +dy/4, ch.col[k])
        }
}
n <- if(.Device == "postscript") 64 else 16
        # Since for screen, larger n may give color allocation problem
demo.pal(n)

A Queryable color picker (as suggested by Gabor Grothendieck on the R-help mailing list)

#make a color lookup function
getColorName <- function(colorNumber) colors()[colorNumber]

# pch = NA means no points plotted.  pch = 20 plots small dots.
# n is the number of points to identify interactively with mouse
printColorSampler <- function(n = 0, pch = NA, bg = "white") {
   i <- seq(colors())
   k <- ceiling(sqrt(length(i)))
   xy <- cbind(floor(i/k)*2, i %% k)
   opar <- par(bg = bg)
   on.exit(par = opar)
   plot(xy, axes = FALSE, xlab = "", ylab = "", pch=pch, col=colors())
   text(xy[,1]+.5, xy[,2]+.2, i, col = colors(), cex = 0.7)
   if (n > 0)
      colors()[identify(xy, n = n, labels = colors(), plot = FALSE)]
}

# test
printColorSampler(0)
printColorSampler(1)
printColorSampler(pch=20, bg="black")

 
Setup the plot layout, and plot both examples

#setup the layout, and print pane boundaries:
nf <- layout(matrix(c(1,1,2,2), 2, 2, byrow=FALSE), respect=TRUE, widths=c(1,2)) ; layout.show(nf)

#plot the pie chart:
pie(rep(1,12), col=rainbow(12))
#plot the palette chart:
demo.pal(n)

#save a copy to an EPS file:
dev.copy2eps()

Convert Munsell colors to computer-friendly RGB triplets

Soil color conversion: Munsell in LUV colorspace
Figure 1: Munsell color chips.
Soil color conversion: LUV colorspace
Figure 2: Common soil colors.
Soil color conversion: RGB colorspace
Figure 3: Commom soil colors in RGB.
Soil color conversion: soil color matrix
Figure 4: Soil colors in multiple color spaces
Soil color conversion: Soil Profile in RGB colorspace
Figure 5: Soil profile colors.

The Munsell color system was designed as a series of discrete color chips which closely approximation to the color sensitivity of the human eye. The description of color via three variables tied to perceptible properties (hue, value, and chroma) under a standardized illuminant (sunlight on a clear day) makes the Munsell system a good choice for recording and interpreting soil color data. However, numerical analysis of colors encoded in the Munsell system is difficult because they are from a discrete set of color chips and referenced by values that include both letters and numbers. Rossel et. al. (2006) give a good background on various color space models and their relative usefulness in the realm of soil science. The conversion of Munsell soil colors to RGB triplets, suitable for displaying on a computer screen or printing, is made complicated by the numerous operations involved in converting between color spaces. Figure 1 shows all possible (both real and unreal) Munsell color chips in the L*U*V color space. Figure 2 shows some of the common soil color chips in the same color space. Figures 2 through 5 depict common soil colors in the RGB color space, visualized both in R and POVRAY. Example R code on the conversion is given below.

 
Munsell color data can be downloaded here.
 
Color conversion equations here.

 
References:

  1. Rossel, R.A.V.; Minasny, B.; Roudier, P. & McBratney, A.B. Colour space models for soil science Geoderma, 2006, 133, 320-337.

Manual Conversion in R

 
Setup environment and load lookup table data

## load some libs
library(plotrix)
library(colorspace)

## munsell data comes with a lookup table in xyY colorspace
## url: http://www.cis.rit.edu/mcsl/online/munsell.php

## note:
## Munsell chroma, CIE x, y, and Y. The chromaticity coordinates were calculated using illuminant C and the CIE 1931 2 degree observer.
all <- read.table("munsell-all.dat", header=T)

 
Convert xyY to XYZ [Equation Reference]

## x and y are approx (0,1)
## Y is approx (0,100)

## need manually rescale Y to (0,1)
all$Y <- all$Y/100.0

## do the conversion
X <- (all$x * all$Y ) / all$y
Y <- all$Y
Z <- ( (1- all$x - all$y) * all$Y )  / all$y

## combine to form matrix for simple manipulation
mun_XYZ_C <- matrix(c(X,Y,Z), ncol=3)

## test for y == 0
## X,Y,Z should then be set to 0
mun_XYZ_C[which(all$y==0),] <- c(0,0,0)

 
Perform Chromatic Adaption Functions in the colorspace package, and sRGB profiles assume a D65 illuminant [Reference]

## conversion matrix, from reference above
## this has been revised as of Jan, 2008
M_adapt_C_to_D65 <- matrix(c(0.990448, -0.012371, -0.003564, -0.007168, 1.015594, 0.006770, -0.011615, -0.002928, 0.918157), ncol=3, byrow=TRUE)


## perform the chromatic adaption: convert from C -> D65 using Bradford method
mun_XYZ_D65 <- mun_XYZ_C %*% M_adapt_C_to_D65


## how different are the two?
summary( (mun_XYZ_D65 - mun_XYZ_C)  )

 
Convert XYZ (D65) to sRGB (D65), step 1 this assumes that XYZ is scaled to (0,1) [Reference Primaries for sRGB]

## first get the reference primaries transformation matrix from above
##
## sRGB profile transformation:
M_XYZ_to_sRGB_D65 <- matrix(c(3.24071, -0.969258, 0.0556352, -1.53726, 1.87599, -0.203996, -0.498571, 0.0415557, 1.05707), ncol=3, byrow=TRUE)

## apply the conversion matrix
mun_sRGB_D65 <- mun_XYZ_D65 %*% M_XYZ_to_sRGB_D65

 
Convert XYZ (D65) to sRGB (D65), step 2 (sRGB, gamma = 2.4) [Conversion Function to sRGB]

## define the transformation functions:
## these are applied on a conditional basis:
fun1 <- function(col_comp) { 1.055 * ( col_comp ^ ( 1 / 2.4 ) ) - 0.055 }
fun2 <- function(col_comp) { 12.92 * col_comp }

## the specific function is contingent on the absolute value of r,g,b components
R <- ifelse(mun_sRGB_D65[,1] > 0.0031308, fun1(mun_sRGB_D65[,1]), fun2(mun_sRGB_D65[,1]))  
G <- ifelse(mun_sRGB_D65[,2] > 0.0031308, fun1(mun_sRGB_D65[,2]), fun2(mun_sRGB_D65[,2]))  
B <- ifelse(mun_sRGB_D65[,3] > 0.0031308, fun1(mun_sRGB_D65[,3]), fun2(mun_sRGB_D65[,3]))  


##clip values to range {0,1}
R_clip <- ifelse(R < 0, 0, R)  
G_clip <- ifelse(G < 0, 0, G)  
B_clip <- ifelse(B < 0, 0, B)  

R_clip <- ifelse(R > 1, 1, R_clip)  
G_clip <- ifelse(G > 1, 1, G_clip)  
B_clip <- ifelse(B > 1, 1, B_clip)


## add these back to the original table:
all$R <- R_clip
all$G <- G_clip
all$B <- B_clip

## done with the conversion

## the manually converted data
plot( as(RGB(R_clip,G_clip,B_clip), 'LUV'), cex=0.5)

Using ColorBrewer to assist with thematic map color selection

RColorBrewer Color Combinations
RColorBrewer color combinations
RColorBrewer Color Combinations: 3 colors
Figure 2: 3 colors per combination
RColorBrewer Color Combinations: 9 colors
Figure 3: 9 colors per combination

Choosing the right colors for classes in a thematic map can be a difficult task. The ColorBrewer website provides an interactive tool for browsing numerous color combinations. Each of the color combinations presented on the ColorBrewer website are the culmination of numerous color interpretation studies. In addition, there is a list of special color combinations suitible for audiences which may include color blind individuals.

The R package RColorBrewer adds the color brewer color combinations as well as functions for generating new combinations to R. Figure 1 demonstrates the available color combinations, as returned by the function display.brewer.all.

 
An example R session:

#load the RColorBrewer package [must be installed first with install.packages()]
library(RColorBrewer)
&nbsp;
#display the "sequential" color combinations, with 3 colors per combination
#See Figure 2
display.brewer.all(n=3,type="seq",exact.n=TRUE)
title("Sequential Color Combinations: 3 Colors per Combination")
&nbsp;
#display the "sequential" color combinations, with 9 colors per combination
#See Figure 3
display.brewer.all(n=9,type="seq",exact.n=TRUE)
title("Sequential Color Combinations: 9 Colors per Combination")

&nbsp;
# convert R colors into RGB triplets;
col2rgb( brewer.pal("Accent", n=5) )






Comparison of Slope and Intercept Terms for Multi-Level Model

Premise

When the relationship between two variable is (potentially) dependent on a third, categorical variable ANCOVA (analysis of covariance), or some variant, is commonly used. There are several approaches to testing for differences in slope/intercepts (in the case of a simple linear model) between levels of the stratifying variable. In R the following formula notation is usually used to test for interaction between levels of a factor (f) and the relationship between two continuous variables x and y: y ~ x * f. A simple graphical exploration of this type of model can be done through examination of confidence intervals computed for slope and intercept terms, for each level of our grouping factor (f). An example of a fictitious dataset is presented below. Note that this a rough approximation for testing differences in slope/intercept within a multi-level model. A more robust approach would take into account that we are trying to make several pair-wise comparisons, i.e. something akin to Tukey's HSD. Something like this can be done with the multcomp package. For any real data set you should always consult a real statistician.

Example Multi-Level Model: each panel represents a model fit to y ~ x, for group fExample Multi-Level Model: each panel represents a model fit to y ~ x, for group f

 
Example Multi-Level Data

# need this for xyplot()
library(lattice)

# make some fake data:
x <- rnorm(100, mean=3, sd=6)
y <- x * runif(100, min=1, max=7) + runif(100, min=1.8, max=5)
d <- data.frame(x, y, f=rep(letters[1:10], each=10))

# check it out
xyplot(y ~ x | f, data=d, type=c('p','r'))

Implementation

Example Multi-Level Model: Confidence Intervals: parameter estimates along with 95% confidence interval, within each level of our grouping factor (f).Example Multi-Level Model: Confidence Intervals: parameter estimates along with 95% confidence interval, within each level of our grouping factor (f).

 
Automated Plotting of Parameter Confidence Intervals

# split by factor
d.l <- split(d, d$f)
# fit model for each level of factor
fits <- lapply(d.l, function(d_i) {lm(y ~ x, data=d_i)})

# extract coefs
est <- lapply(fits, coef)

# compute confints
ci <- lapply(fits, confint)

ci.mat <- do.call('rbind', ci)
est.mat <- do.call('rbind', est)
ci.df <- data.frame(f=rep(colnames(sapply(ci, '[')), each=2))
ci.df$lower <- ci.mat[,1]
ci.df$upper <- ci.mat[,2]

# re-attach estimate label
ci.df$which <- row.names(ci.mat)

# add dummy column for estimate
ci.df$estimate <- NA

# make a data frame for the estimates
est.df <- data.frame(which=rep(colnames(est.mat), each=nrow(est.mat)))
est.df$estimate <- as.vector(c(est.mat[,1], est.mat[,2]))
est.df$f <- rep(row.names(est.mat), 2)

# add dummy columns for upper and lower conf ints
est.df$upper <- NA
est.df$lower <- NA

# combine estimate with confints
combined <- rbind(est.df, ci.df)

# combined plot of estimate +/- confint
dotplot(f ~ estimate + lower + upper | which, data=combined, scales=list(relation='free'), xlab="Estimate", ylab="Group", auto.key=list(columns=3),
par.settings=list(superpose.symbol=list(col=c(1), pch=c(16,1,1), cex=c(1,0.75,0.75))))

Formal Evaluation with lm()

The first two lines in the output below are testing the hypothesis that the slope and intercept term for level 'a' are not different than 0. Subsequent hypothesis tests are relative to the first 'level' in the dataset. In this case we are testing the hypothesis that intercept and slope terms for levels 'b' through 'j' are not different than the corresponding terms for level 'a'. From the output below we can see that none of the intercept terms (levels 'b' through 'j') are different than for 'a', and that the slope term for level 'd' is only marginally "different" (p=0.0625) than the slope term for 'a'.

 
Testing Model Terms

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.99570    4.10909   1.216   0.2276    
x            4.40546    0.68230   6.457 7.68e-09 ***
fb          -4.66364    7.28233  -0.640   0.5237    
fc           1.10173    6.52890   0.169   0.8664    
fd           1.51033    6.20212   0.244   0.8082    
fe          -5.28549    6.62921  -0.797   0.4276    
ff          -1.37673    6.39280  -0.215   0.8300    
fg          -7.69480    5.93011  -1.298   0.1982    
fh          -2.34349    5.70703  -0.411   0.6824    
fi           1.14558    6.84805   0.167   0.8676    
fj          -1.12319    7.87523  -0.143   0.8869    
x:fb         0.92661    0.94257   0.983   0.3285    
x:fc         0.43454    1.04819   0.415   0.6796    
x:fd        -1.75956    0.93137  -1.889   0.0625 .  
x:fe        -0.08193    0.96216  -0.085   0.9323    
x:ff        -0.42669    0.99172  -0.430   0.6682    
x:fg         0.57531    0.99279   0.579   0.5639    
x:fh         1.63650    1.02319   1.599   0.1137    
x:fi        -0.38424    0.97753  -0.393   0.6953    
x:fj        -0.89373    1.14337  -0.782   0.4367    

Comparison of Slope and Intercept Terms for Multi-Level Model II: Using Contrasts

Premise

Small update to a similar thread from last week, on the comparison of slope and intercept terms fit to a multi-level model. I finally figured out (thanks R-Help mailing list!) how to efficiently use contrasts in R. The C() function can be called within a model formula, to reset the base level of an un-ordered factor. The UCLA Stats Library has an extensive description of this topic here. This approach can be used to sequentially test for differences between slope and intercept terms from a multi-level model, by re-setting the base level of a factor. See example data and figure below.

Note that the multcomp package has a much more robust approach to this type of operation. Details below.

 
Example Multi-Level Data

# need these
library(lattice)

# replicate an important experimental dataset
set.seed(10101010)
x <- rnorm(100)
y1 <- x[1:25] * 2 + rnorm(25, mean=1)
y2 <- x[26:50] * 2.6 + rnorm(25, mean=1.5)
y3 <- x[51:75] * 2.9 + rnorm(25, mean=5)
y4 <- x[76:100] * 3.5 + rnorm(25, mean=5.5)
d <- data.frame(x=x, y=c(y1,y2,y3,y4), f=factor(rep(letters[1:4], each=25)))

# plot
xyplot(y ~ x, groups=f, data=d,
auto.key=list(columns=4, title='Beard Type', lines=TRUE, points=FALSE, cex=0.75),
type=c('p','r'), ylab='Number of Pirates', xlab='Distance from Land')

Example Multi-Level Model IIExample Multi-Level Model II

 
Default Contrasts (contr.treatment for regular factors, contr.poly for ordered factors)

# standard comparison to base level of f
summary(lm(y ~ x * f, data=d))

# output:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.0747     0.1889   5.689 1.51e-07 ***
x             1.9654     0.1799  10.927  < 2e-16 ***
fb            0.3673     0.2724   1.348   0.1808    
fc            4.1310     0.2714  15.221  < 2e-16 ***
fd            4.4309     0.2731  16.223  < 2e-16 ***
x:fb          0.5951     0.2559   2.326   0.0222 *  
x:fc          1.0914     0.2449   4.456 2.35e-05 ***
x:fd          1.3813     0.2613   5.286 8.38e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 
Setting the "base level" in the Model Formula This allows us to compare all slope and intercept terms to the slope and intercept from level 4 of our factor ('d' in our example).

# compare to level 4 of f
summary(lm(y ~ x * C(f, base=4), data=d))

# output:
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.5055     0.1972  27.911  < 2e-16 ***
x                   3.3467     0.1896  17.653  < 2e-16 ***
C(f, base = 4)1    -4.4309     0.2731 -16.223  < 2e-16 ***
C(f, base = 4)2    -4.0635     0.2783 -14.603  < 2e-16 ***
C(f, base = 4)3    -0.2999     0.2773  -1.081  0.28230    
x:C(f, base = 4)1  -1.3813     0.2613  -5.286 8.38e-07 ***
x:C(f, base = 4)2  -0.7862     0.2628  -2.992  0.00356 **
x:C(f, base = 4)3  -0.2899     0.2521  -1.150  0.25327    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 
Testing with Multcomp Package using data from above example

# need these
library(multcomp)
library(sandwich)

# open this vignette, lots of good information
vignette("generalsiminf", package = "multcomp")

# fit two models
l.1 <- lm(y ~ x + f, data=d)
l.2 <- lm(y ~ x * f, data=d)

# note that: tests are AGAINST the null hypothesis
summary(glht(l.1))

# see the plotting methods:
plot(glht(l.1))
plot(glht(l.2))

# pair-wise comparisons
summary(glht(l.1, linfct=mcp(f='Tukey')))

# pair-wise comparisons
# may not be appropriate for model with interaction
summary(glht(l.2, linfct=mcp(f='Tukey')))

# when variance is not homogenous between groups:
summary(glht(l.1, linfct=mcp(f='Tukey'), vcov=sandwich))

Computing Statistics from Poorly Formatted Data (plyr and reshape packages for R)

 
Premise
I was recently asked to verify the coefficients of a linear model fit to sets of data, where each row of the input file was a "site" and each column contained the dependent variable through time (i.e. column 1 = time step 1, column 2 = time step 2, etc.). This format is cumbersome in that it cannot be directly fed into the R lm() function for linear model fitting. Furthermore, we needed the output formatted with columns containing slope, intercept, and R-squared values for each site (rows). All of the re-formatting, and model fitting can be done by hand, using basic R functions, however this task seemed like a good case study for the use of the reshape and plyr packages for R. The reshape package can be used to convert between "wide" and "long" format-- the first step in the example presented below. The plyr package can be used to split a data set into subsets (based on a grouping factor), apply an arbitrary function to the subset, and finally return the combined results in several possible formats. The original input data, desired output, and R code are listed below.

 
Input

2.521 2.312 2.720 2.254 * 2.922 * * 2.291 2.038 * * 1.151
1.675 1.646 1.860 2.517 * 1.986 * * 3.279 3.420 * * 3.059
1.734 1.305 1.774 2.366 * 2.909 * * 2.863 2.958 * * 2.973
1.637 1.632 2.040 1.807 * 1.889 * * 2.081 2.267 * * 2.655
1.967 8.307 8.331 8.698 * 8.236 * * 7.990 8.255 * * 8.041
1.670 1.744 1.982 2.029 * 2.159 * * 3.330 2.945 * * 3.301
1.668 1.816 1.832 2.100 * 2.289 * * 2.745 2.703 * * 3.216
2.304 2.413 2.749 2.827 * 2.978 * * 3.011 3.244 * * 4.494
1.505 2.827 3.375 1.923 * 4.250 * * 1.542 3.094 * * 1.480

 
Output

site intercept slope Rsq
1 1 2.8123 -0.0894 0.5115
2 2 1.5229 0.1512 0.7682
3 3 1.5499 0.1351 0.7445
4 4 1.5581 0.0738 0.8453
5 5 6.1738 0.2174 0.1727
6 6 1.4787 0.1527 0.9026
7 7 1.5340 0.1270 0.9871
8 8 2.1403 0.1437 0.8224
9 9 2.7546 -0.0425 0.0306

 
Add required libraries and load example data files

# these may need to be installed with install.packages()
library(plyr)
library(reshape)
# this one comes with the base install of R
library(lattice)

# read in the data as pasted from original format
d1 <- read.table('d1.txt', na.strings='*')
d2 <- read.table('d2.txt')

 
Reshape data

# transpose and convert to long format
d1.long <- melt(t(d1))
d2.long <- melt(t(d2))

# give resonable names
names(d1.long) <- c('obs','site','value')
names(d2.long) <- c('obs','site','value')

# add time variable
#  1:n obs * m sites
d1.long$time <- rep(1:13, 9)
d2.long$time <- rep(1:8, 7)

 
Visually check patterns

xyplot(value ~ time | factor(site), type=c('p','r'), data=d1.long, as.table=TRUE)
xyplot(value ~ time | factor(site), type=c('p','r'), data=d2.long, as.table=TRUE)

 
Extract linear model terms and R-squared for each subset

fit.summary <- function(i)
{
        # fit linear model to this set of the data
        l <- lm(value ~ time, data=i)
        # extract model terms
        l.coef <- coef(l)
        # extract R-squared
        l.rsq <- summary(l)$r.squared
        # combine model details into a single vector
        l.details <- c(l.coef, l.rsq)
        # rename elements of the vector
        names(l.details) <- c('intercept', 'slope', 'Rsq')
        # return rounded values to the calling function
        return(round(l.details, 4))
}

# compute lm details by site
ddply(d1.long, .(site), fit.summary)
ddply(d2.long, .(site), fit.summary)

Creating a Custom Panel Function (R - Lattice Graphics)

 
The Experiment
It was necessary (for the purposes of this exercise) to generate some grouped data worthy of a creative panel function. An experiment was designed to test the coordination of 4 individuals (each a panel in the figure below), as a function of "clarity of mind" (symbol color in the figure below). The actual details of the experiment can be deduced from the attached data file, and careful inspection of the resulting plot. A similar experiment was conducted some time ago to demonstrate the Spatstat package in R.

 
A Customized Panel Function for Lattice Graphics -- "panel.bulls_eye()"
Lattice graphics are one of several possible visualization methods in available in R that are most useful when working with grouped data. Plots are generated via a formula interface, often in the format of y ~ x | f -- where y is the dependent variable, x is the independent variable, and f is a grouping factor. Have a look at the attached file (bottom of page) for an example of data in this format. Each panel in the plot is generated by a panel function, using a subset of the original data as defined by the grouping variable. In most situations the standard panel functions, such as panel.xyplot, are sufficient. However, when working with more "interesting" data, a customized panel function is the way to go.

 
In order to try the sample code out, you will need to:

  1. install the required packages
  2. copy and paste the panel.bulls_eye function source into an R session
  3. download the sample data file
  4. run the code listed in the sample R session

 
Since panel functions are made to be generic, any data source that is similar in nature to the sample can be directly plotted using this code-- i.e. if the experiment were to be repeated using 8 subjects instead of 4. Enjoy.

 
Panel Function Source

panel.bulls_eye <- function(x, y, groups, subscripts, ...)
{
# setup the initial plot, and add the raw data
panel.xyplot(jitter(x), jitter(y), groups=groups, subscripts=subscripts, cex=1.25, pch=3, col=c(1,2,3), ...)

# add the bull's eye
panel.points(0, 0, pch=16, cex=0.25, col='grey')
panel.points(0, 0, pch=1, cex=1.75, col='grey')
panel.points(0, 0, pch=1, cex=4, col='grey')
panel.points(0, 0, pch=1, cex=7, col='grey')

# compute the mean cartesian distance from the bull's eye to all points
z <- signif(mean(sqrt(x^2 + y^2)), 3)
z.text <- paste(z, 'cm')

# compute the mean angle between all points and bull's eye
theta <- circ.mean(atan2(y, x))
theta.text <- paste(signif(theta* 180/pi, 2), 'deg')

# generate a displacement vector
x_prime <- z * cos(theta)
y_prime <- z * sin(theta)

# add the vector to the plot
panel.segments(0, 0, 3, 0, col='grey')
panel.arrows(0, 0, x_prime, y_prime, length=0.1, col='black')

# annotate with accuracy and displacement angle
grid.text(label = z.text, gp=gpar(fontsize=16), just='left',
              x = unit(0.05, "npc"),
              y = unit(0.95, "npc"))
grid.text(label = theta.text, gp=gpar(fontsize=16), just='right',
              x = unit(0.90, "npc"),
              y = unit(0.95, "npc"))
}

 
Example Session (note that several packages are required)

# load required libraries
library(spatstat)
library(lattice)
library(grid)
library(CircStats)

# read in our data (see attached file)
x <- read.csv('beer_battle.csv')

# plot the data, as stratified by person
xyplot(y ~ x | person, groups=beer, data=x, panel=panel.bulls_eye,
key=list(points=list(col=c(1,2,3), pch=c(3,3,3)), text=list(c('0 beers', '1 beer', '3 beers')), columns=3),
main='Beer Battle 1'
)

Results: example output from the panel.bulls_eye() function used with xyplot().Results: example output from the panel.bulls_eye() function used with xyplot().

Customized Scatterplot Ideas

 
Panel function for visualizing univariate statistics

panel.dist_summary <- function(x, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 3) )
  #hist(x, probability=T, add=T)
&nbsp;  
  #default color and line style for density plot
  density.col = 'gray'
  density.lty = 3
&nbsp;  
  # is this a normally distributed dataset?
  # if so, change the color of the density plot
  # The test rejects the null hypothesis if W is too small.
  s.W <- shapiro.test(x)$statistic
 if( (s.W > 0.91) == TRUE)
   {
   density.col = 'gray'
   density.lty = 1
   }
&nbsp;  
  # compute and plot density
  d <- density(x)
  dy <- d$y / max(d$y) * .5
  lines(d$x, dy, col=density.col, lty=density.lty)
&nbsp;  
  # get a small increment to use in the next tests:
  delta <- abs(min(x) - max(x)) / 100
&nbsp;  
  y_mean <- dy[d$x < mean(x) + delta & d$x > mean(x) - delta][1]
  y_median <- dy[d$x < median(x) + delta & d$x > median(x) - delta][1]
&nbsp;  
  debug
  #print(y_median)
&nbsp;  
  #add points on the density plot for the mean and median
  points( c(mean(x), median(x)), c(y_mean, y_median), col=c('red', 'orange'), pch=16)
&nbsp;  
  #add a boxplot
  boxplot(x, horizontal=TRUE, boxwex=0.3, add=T)
&nbsp;  
  #debugging
  #print(s.W)
}

 
Panel function for printing joint correlation statistic

panel.cor <- function(x, y, digits=2, prefix="", cex.cor, cor.method="pearson")
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y, method=cor.method))
  txt <- format(c(r, 0.123456789), digits=digits)[1]
  txt <- paste(prefix, txt, sep="")
  if(missing(cex.cor)) cex <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex * r, col='gray')
&nbsp;  
  # might be interesting to use ks.test
  # http://www.physics.csbsju.edu/stats/KS-test.html
}

 
Example usage with built-in datasets

# enforce square plotting area
par(pty='s')
&nbsp;
pairs(USJudgeRatings[1:5], upper.panel=panel.cor, lower.panel=function(...) panel.smooth(..., col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
&nbsp
pairs(iris[1:4], upper.panel=panel.cor, lower.panel=function(...) panel.smooth(..., col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
&nbsp;
# use spearman correlation calculation instead of default person:
pairs(iris[1:4], upper.panel=function(...) panel.cor(..., cor.method="spearman"), lower.panel=function(...) panel.smooth(..., col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
&nbsp;
# color iris specis: note location of 'col=' argument
pairs(iris[1:4], upper.panel=panel.cor, lower.panel=function(...) panel.smooth(..., col.smooth=gray(.5), lty=1, pch=16, col=c("red4", "green3", "blue4")[unclass(iris$Species)]), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2 )
&nbsp;
pairs(trees, upper.panel=panel.cor, lower.panel=function(...) panel.smooth(..., col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
&nbsp;
pairs(swiss, upper.panel=panel.cor, lower.panel=function(...) panel.smooth(...,
col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
&nbsp;
# using formula notation:
pairs( ~ Fertility + Education + Catholic, data=swiss, upper.panel=panel.cor, lower.panel=function(...) panel.smooth(..., col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
&nbsp;
pairs(longley, upper.panel=panel.cor, lower.panel=function(...) panel.smooth(...,
col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)

Estimating Missing Data with aregImpute() {R}

 
Missing Data
Soil scientists routinely sample, characterize, and summarize patterns in soil properties in space, with depth, and through time. Invariably, some samples will be lost or sufficient funds required for complete characterization can run out. In these cases the scientist is left with a data table that contains holes (so to speak) in the rows/columns that are missing data. If the data are used within a regression, missing values in any of the predictor or the response variable result in row-wise deletion-- even if 9/10 variables are present. Furthermore, common multivariate methods (PCA, RDA, dissimilarity metrics, etc.) cannot effectively deal with missing data. The scientist is left with a couple options: 1) row-wise deletion of cases missing any variable, 2) re-sampling or re-characterizing the missing samples, or 3) estimating the missing values from other variables in the dataset. This last option is called missing data imputation. This is a broad topic with countless books and scientific papers written about it. Here is a fairly simple introduction to the topic of imputation. Fortunately for us non-experts, there is an excellent function (aregImpute()) in the Hmisc package for R.

Below is an example of filling missing data in a soil characterization database with the aregImpute function. For each missing value, 10 candidate multiple imputations are returned. Otherwise, the function is using default parameters-- there are a lot of options, so reading the manual page is highly recommended! From the example below, it looks like we are able to adequately predict missing observations in most variables-- R2 values are all > 0.5 - 0.6. Note that we are using the aregImpute function to automatically find the "best model" for predicting missing values (for each variable).

 
Implementation

## impute missing data: with aregImpute
# updated version of methods used in transcan
# multiple impution, requesting 10 candidate values per NA
x.ar <- aregImpute(~ L + A + B + clay + silt + sand + ph + fe_d + fe_o + mn_d + mn_o + Fe + Ca + K + Al + Si + Ti + Zr + Rb + S + Zn, data=x, n.impute=10)


#
# R-squares for Predicting Non-Missing Values for Each Variable Using Last Imputations of Predictors
# not bad!
#
    L     A     B  clay  silt  sand    ph  fe_d  fe_o  mn_d  mn_o    Fe    Ca
0.949 0.933 0.934 1.000 1.000 1.000 0.567 0.950 0.597 0.906 0.902 0.913 0.844
    K    Al    Si    Ti    Zr    Rb     S    Zn
0.860 0.839 0.829 0.885 0.886 0.885 0.680 0.730

I am interested in replacing missing data with the mean of the multiple imputations for each case. The following code below demonstrates one possible approach. However, this is not the suggested approach for incorporating the imputed values into subsequent analysis! Regression models should be iteratively fit to data containing a single value of each multiple imputation, and model coefficients combined according to rules for mixture distributions. (Thanks for the tip Cyrus). There are functions within the Hmisc, rms, and Zelig packages for automating these procedures.

 
Implementation (slightly improper use of multiple imputation)

# get a list of those variables with imputed values
imp.vars <- names(x.ar$imputed)

# compute mean imputed value for each variable
# and extract the original indices of the missing data, by variable
imp.vars.mean <- lapply(x.ar$imputed, function(i) apply(i, 1, mean))
imp.vars.idx <- lapply(imp.vars.mean, function(i) as.integer(names(i)))

# copy origial data
x.no.na <- x

# loop over imputed variables
for(i in imp.vars)
        {
        print(i)
        # get the mean imputations for this variable
        imp.i <- imp.vars.mean[[i]]
        # get the original indices for NA
        idx.i <- imp.vars.idx[[i]]
        # replace original NA with imputed values
        x.no.na[idx.i, i] <- imp.i
        }

Exploration of Multivariate Data

library(gclus)
library(car)
library(MASS)
library(cluster)
library(lattice)
library(TeachingDemos)

data(wine)

# chernoff faces
faces(aggregate(wine, list(wine$Class), FUN=mean)[,-c(1,2)], ncol=3, nrow=1)

# LDA on wine data
l <- lda(Class ~ . , data=wine)
plot(l, col=wine$Class)



## Some soils data from the car package
# LDA: all horizons
l <- lda(Contour ~ pH + N + Dens + P + Ca + Ca + Mg + K + Na, data=Soils)
plot(l, col=as.numeric(Soils$Contour))

# just the top horizon
l <- lda(Contour ~ pH + N + Dens + P + Ca + Ca + Mg + K + Na, data=Soils, subset=Depth=='0-10')
plot(l, col=as.numeric(Soils$Contour[Soils$Depth == '0-10']))

Interactive 3D plots with the rgl package

 
Overview
Sample application of the RGL package. This package allows for the creation of interactive, 3D figures, complete with lighting and material effects. Try demo(rgl) for an idea of what is possible.

A random number generator sphere (RNG sphere) was created based on the suggestions in Keys to Infinity by Clifford A. Pickover, pp. 237-239. The RNG sphere can be used to test the robustness of a random number generator. Three random number generators were tested: runif() from R, rand from Excel, and a logistic-derived psudo-random number generator. The location (x,y,z) and color of the spheres are based on the sequence of random numbers (Pickover, 1995). An ideal RNG shpere should have no discernable patterns. Note that the logistic-derived random numbers show distinct correlation in the RNG sphere. Excel random number list, and source code (R) are attached at the botom of the page.

RGL sample application: 3d interactive interface to a random number generator sphere.RGL sample application: 3d interactive interface to a random number generator sphere. Random numbers from runif() function in R.

RGL sample application 2: Excel random number visualization: 3d interactive interface to a random number generator sphere. Random numbers from rand() function in MS Excel.RGL sample application 2: Excel random number visualization: 3d interactive interface to a random number generator sphere. Random numbers from rand() function in MS Excel.

RGL sample application 2: Random numbers from the logistic function: 3d interactive interface to a random number generator sphere. Random numbers from the logistic function (see notes), implemented in R.RGL sample application 2: Random numbers from the logistic function: 3d interactive interface to a random number generator sphere. Random numbers from the logistic function (see notes), implemented in R.

 
Random Number Generator (RNG) Sphere Function Definition

 # simple function for
rng_sphere <- function(d, type='rgl')
{

n <- length(d)
nn <- n - 3

# init our x,y,z coordinate arrays
x <- array(dim=nn)
y <- array(dim=nn)
z <- array(dim=nn)

# init red,green,blue color component arrays
cr <- array(dim=nn)
cg <- array(dim=nn)
cb <- array(dim=nn)


# convert lagged random numbers from d into spherical coordinates
# then convert to cartesian x,y,z coordinates for simple display
for (i in 1:nn)
{
theta <- 2*pi*d[i]
phi <- pi*d[i+1]
r <- sqrt(d[i+2])

x[i] <- r * sin(theta) * cos(phi)
y[i] <- r * sin(theta) * sin(phi)
z[i] <- r * cos(theta)

# give each location a color based on some rules
cr[i] <- d[i] / max(d)
cg[i] <- d[i+1] / max(d)
cb[i] <- d[i+2] / max(d)

} # end function


if( type == 'rgl')
{
# setup rgl environment:
zscale <- 1
 
# clear scene:
clear3d("all")
 
# setup env:
bg3d(color="white")
light3d()
 
# draw shperes in an rgl window
spheres3d(x, y, z, radius=0.025, color=rgb(cr,cg,cb))
}

if(type == '2d')
{
# optional scatterplot in 2D
scatterplot3d(x,y,z, pch=16, cex.symbols=0.25, color=rgb(cr,cg,cb), axis=FALSE )
}


# optionally return results
# list(x=x, y=y, z=z, red=cr, green=cg, blue=cb)

}

 
Sample

 # load required packages
require(scatterplot3d)
require(rgl)

# random number with runif():
d <- runif(2500)

# plot rng sphere with rgl:
rng_sphere(d, type='rgl')

# save results of the rgl window
rgl.snapshot('testing.png')

# plot rng sphere with scatterplot3d:
rng_sphere(d, type='2d')

# save results
dev.copy2eps()

# 2500 excel random numbers
# =rand()
dd <- as.vector(unlist(read.csv('excel_rand.csv')))
rng_sphere(dd, type='rgl')
rgl.snapshot('testing-excel.png')


# 1000 random numbers from the logistic function:

# init an array
ddd <- array(dim=1000)

# seed
ddd[1] <- 0.38273487234

# compute for the next 999 iterations
for (i in 1:999) { ddd[i+1] <- 4 * 1 * ddd[i] * (1 - ddd[i]) }

Making Soil Property vs. Depth Plots

Example with randomly generated data

 
Generate some data

## generate some profile depths: 0 - 150, in 10 cm increments
depth <- seq(0,150, by=10)

## generate some property: random numbers in this case
prop <- rnorm(n=length(depth), mean=15, sd=2)

## since the 0 is not a depth, and we would like the graph to start from 0
## make the first property row (associated with depth 0) the same as the second
## property row
prop[1] <- prop[2]

## combine into a table: data read in from a spread sheet would already be in this format
soil <- data.frame(depth=depth, prop=prop)

 
The dataframe 'soil' looks like this:

   depth     prop
1      0 13.80257  ** note that these are the same
2     12 13.80257  ** note that these are the same
3     24 18.40298
4     36 13.37446
5     48 13.27973
6     60 14.65288
7     72 16.07339
8     84 15.97451
9     96 16.29970
10   108 16.32155
11   120 14.63699
12   132 13.26486
13   144 13.81730

 
Plot the data:

## note the reversal of the y-axis with ylim=c(150,0)
plot(depth ~ prop, data=soil, ylim=c(150,0), type='s', ylab='Depth', xlab='Property', main='Property vs. Depth Plot')

Additional Example Using Lattice Graphics

Examples with Some Real Data

 
Notes:

  • See attached files at bottom of page
  • Helper function could use some generalization. Until then, your data will need to have the columns:
    1. pedon_id
    2. top
    3. bottom
  • These examples require a recent version of R and Lattice (>= 2.5.1)

 
Helper Function (copy this into your R session first)

## function to add a repeat top horizon
## for correct step-plot
## assumes that there are columns named pedon_id, bottom, top
profile_fix <- function(d)
{
## init some vars
p <- levels(d$pedon_id)
idx <- array()
i <- 1

## loop over pedon ids
for(p.id in p)
{
## extract one at a time
p.row <- subset(d, subset=pedon_id == p.id)

## make a list of the positions where bottom horizons occur
idx[i] <- which(d$top == min(p.row$top) & d$pedon_id == p.id)

## increment counter
i <- i+1
}

## extract out bottom horizons
d.temp <- d[idx,]

## set the top of these to the bottom boundary
d.temp$bottom <- d.temp$top

## add duplicate bottom horizon records, with top set to bottom
d.new <- rbind(d, d.temp)

return(d.new)
}

 
Load Data and Packages

## load libs
library(lattice)
## read in the first example
x <- read.csv('psa.csv')
## convert pedon_id to a factor:
x <- transform(x, pedon_id = factor(pedon_id))
## add extra top horizon
x.new <- profile_fix(x)
##
##
## read in the second example
y <- read.csv('example_data.csv')
##  add the extra top horizon
y.new <- profile_fix(y)

 
Example 1

## plot using step function
## note special syntax: horizontal=TRUE
xyplot(bottom ~ sand + silt + clay | pedon_id, horizontal=TRUE,
data=x.new, ylim=c(160,-5), type='s', auto.key=TRUE,
col=c('Orange', 'RoyalBlue', 'Dark Green'), lty=c(2,2,1), lwd=c(1,1,2),
ylab='Depth (cm)', xlab='Percent Sand, Silt, Clay',
key=list(
lines=list(col=c('Orange', 'RoyalBlue', 'Dark Green'), lwd=c(1,1,2), lty=c(2,2,1)
),
text=list(
c('Sand', 'Silt', 'Clay')
)
)
)

Depth Profile Example 1: sand, silt, and clay vs. depth for three pedonsDepth Profile Example 1: sand, silt, and clay vs. depth for three pedons

 
Example 2

## plot with step function
xyplot(bottom ~ field_pct_clay | pedon_id, horizontal=TRUE,
data=y.new, ylim=c(250,-5), type='s', as.table=TRUE,
ylab='Depth (cm)', xlab='Percent Clay', lwd=2, col='black'
)

Depth Profile Example 2: clay vs. depth for 9 pedonsDepth Profile Example 2: clay vs. depth for 9 pedons

New R Package 'aqp': Algorithms for Quantitative Pedology [updates]

 
Soils are routinely sampled and characterized according to genetic horizons (layers), resulting in data that are associated with principal dimensions: location (x,y), depth (z), and property space (p). The high dimensionality and grouped nature of this type of data can complicate standard analysis, summarization, and visualization. The aqp package was developed to address some of these issues, as well as provide a useful framework for the advancement of quantitative studies in soil genesis, geography, and classification.

The Algorithms for Quantitative Pedology (AQP) project was started as a place to put some of the functions I used on a daily basis. It slowly grew into a collection of algorithms that support studies related to soil genesis, taxonomy, and mapping. The short list of functionality includes:

  • classes for storing and manipulating soil profile data
  • functions for plotting soil profile sketches
  • color conversion functions (Munsell to RGB)
  • plotting functions for groups of soil profiles
  • plotting functions for depth-functions + uncertainty estimates
  • functions for re-sampling genetic horizons to regular sequences
  • profile aggregation functions
  • an implementation of between-profile dissimilarity calculation
  • functions for re-sampling XRD patterns to a common basis
  • functions for performing full-pattern matching of XRD patterns

Obviously, this project is far from an exhaustive implementation of what the name might otherwise suggest.

My reasoning for putting this package together is based on the premise that methods used by pedologists have grown with our understanding of natural systems and our ability to work with larger and larger datasets-- however, coupling of theory and practice to modern numerical methods is rare outside of academic circles. That is not to say that federal agencies and the soil science community are not catching up. Rather, it seems like the adoption of these new approaches can by hindered by a lack of an implementation that is more accessible than an article in a scientific journal. I am speaking from my experience in the western United States, therefore I apologies if I have not adequately described corresponding conditions in other parts of the world.

The naming of the package. It is currently named "Algorithms for Quantitative Pedology", inspired by Jenny's treatise on the topic. It could be argued that most modern pedologic studies (i.e. on the topic of soil genesis, morphology, classification, and mapping) are by and large a quantitative effort-- so this name may be more of a convenient acronym than descriptive label. In naming
this package I was hoping to encapsulate the idea of how modern numerical methods might be applied to the field of pedology. Well, the word "Pedometrics" does a pretty good job of conveying that idea. In particular, it falls well in line with Alex's initial description of Pedometrics: “the application of mathematical and statistical methods for the study of the distribution and genesis of soils”.

On the other hand, the potential for a diffuse interpretation of the word "Pedometrics" could lead to a package that lacks focus and ultimately critical mass. In that case, an entire suite of related R packages would likely work. There are several cases on CRAN / R-Forge where large efforts are split into several, related packages. In addition, the term 'Pedometrics' means many different things to many different people.

I would like to present a couple ideas on how this project, and extensions of it, might function as a liaison between rapid advances in the realm of Pedometrics and those interested in the application of these methods.

  1. R is an ideal environment for working with soils information and the packaging system is a robust approach to encapsulating code, documentation, and discussion.
  2. A suite of R packages, each geared towards a specific realm of what most soil scientists do (or would like to do) with their data, may be one way in which common work flows can be streamlined and better documented. These packages would be the vehicle by which theory, defined in academic journals, could be translated into well-documented tools for a wider range of practitioners.
  3. Within this suite, a master package (why not call it "pedometrics") could contain classes and utility functions for describing and manipulating objects that represent the complex structure of a soil profile. The companion packages would build on these common data structures and methods, making the implementation of newly discovered or experimental methods much simpler.
  4. The R-Forge system would facilitate concurrent access to the source code and documentation through the use of revision control. In this way, changes would be documented, and versions could be seamlessly forked or merged so that rapid development would not result in an unstable product.

Ultimately, I am interested in getting modern tools into the hands of current pedologists, and making it simpler for the adoption of these tools by new pedologists. If you are interested in these efforts, have comments or criticisms please don't hesitate to contact me. There is a vignette with an extended discussion on the AQP package, included in the package. I would encourage anyone that is interested to check the R-Forge site. A slightly older version is available on CRAN. Give the package a try by installing R, and then running:

# install the package
install.packages("aqp", repos="http://R-Forge.R-project.org", dep=TRUE, type='source')

# load the package
library(aqp)

# open the mini-manual
vignette("aqp-vignette")

Numerical Integration/Differentiation in R: FTIR Spectra

 
Stumbled upon an excellent example of how to perform numerical integration in R. Below is an example of piece-wise linear and spline fits to FTIR data, and the resulting computed area under the curve. With a high density of points, it seems like the linear approximation is most efficient and sufficiently accurate. With very large sequences, it may be necessary to adjust the value passed to the subdivisions argument of integrate(). Strangely, larger values seem to solve problems encountered with large datasets...

FTIR Spectra IntegrationFTIR Spectra Integration

 
Implementation

# numerical integration in R
# example based on: http://tolstoy.newcastle.edu.au/R/help/04/10/6138.html

# sample data: FTIR spectra
x <- read.csv(url('http://casoilresource.lawr.ucdavis.edu/drupal/files/fresh_li_material.CSV'), header=FALSE)[100:400,]
names(x) <- c('wavenumber','intensity')

# fit a piece-wise linear function
fx.linear <- approxfun(x$wavenumber, x$intensity)

# integrate this function, over the original limits of x
Fx.linear <- integrate(fx.linear, min(x$wavenumber), max(x$wavenumber))

# fit a smooth spline, and return a function describing it
fx.spline <- splinefun(x$wavenumber, x$intensity)

# integrate this function, over the original limits of x
Fx.spline <- integrate(fx.spline, min(x$wavenumber), max(x$wavenumber))

# visual check, linear and spline fits shifted up for clarity
par(mar=c(0,0,0,0))
plot(x, type = "p", las=1, axes=FALSE, cex=0.5, ylim=c(0,0.12))
lines(x$wavenumber, fx.linear(x$wavenumber) + 0.01, col=2)
lines(x$wavenumber, fx.spline(x$wavenumber) + 0.02, col=3)
grid(nx=10, col=grey(0.5))
legend(x=615, y=0.11, legend=c('original','linear','spline'), col=1:3, pch=c(1,NA,NA), lty=c(NA, 1, 1), bg='white')

# results are pretty close
data.frame(method=c('linear', 'spline'), area=c(Fx.linear$value, Fx.spline$value), error=c(Fx.linear$abs.error,Fx.spline$abs.error))

  method     area        error
1 linear 27.71536 0.0005727738
2 spline 27.71527 0.0030796529

 
splinefun() can also compute derivatives

par(mar=c(0,0,0,0), mfcol=c(2,1))
plot(x, type = "l", lwd=2, axes=FALSE)
grid(nx=10, col=grey(0.5))
plot(x$wavenumber, fx.spline(x$wavenumber, deriv=1), type='l', axes=FALSE)
lines(x$wavenumber, fx.spline(x$wavenumber, deriv=2), col='red')
grid(nx=10, col=grey(0.5))
abline(h=0, lty=2)
legend('topright', legend=c('1st derivative','2nd derivative'), lty=1, col=1:2, bg='white')

Numerical Estimation of DerivativesNumerical Estimation of Derivatives

Plotting XRD (X-Ray Diffraction) Data

 
Premise:
Some examples on how to prepare and present data collected from an XRD analysis. The clay fraction from seven horizons was analyzed by XRD, using the five common treatments: potassium saturation (K), potassium saturation heated to 350 Deg C (K 350), potassium saturation heated to 550 Deg C (K 550), magnesium saturation (Mg), and magnesium + glycerin saturation (Mg+GLY). These data files have been attached, and can be found near the bottom of the page.

 
Plotting the entire data set with lattice graphics:

## load libs
require(lattice)
require(reshape)

## read the composite data in as a table
## format is 2theta,MG,MG+GLY,K,K350,K550
h1 <- read.csv("tioga1_0-8.csv", header=FALSE)
h2 <- read.csv("tioga1_8-15.csv", header=FALSE)
h3 <- read.csv("tioga1_15-35.csv", header=FALSE)
h4 <- read.csv("tioga1_35-65.csv", header=FALSE)
h5 <- read.csv("tioga1_65-90.csv", header=FALSE)
h6 <- read.csv("tioga1_90-120.csv", header=FALSE)
h7 <- read.csv("tioga1_120-150.csv", header=FALSE)


## load some common d-spacings:
d_spacings <- c(0.33,0.358,0.434,0.482,0.717,1,1.2,1.4,1.8)
d_spacing_labels <- c(".33", ".36", ".43", ".48", ".7","1.0","1.2","1.4","1.8")

## combine horizons, and
xrd <- make.groups(h1, h2, h3, h4, h5, h6, h7)
names(xrd) <- c('x', 'MG', 'MG+GLY', 'K', 'K 350', 'K 550', 'horizon')

## convert data into long format
xrd.long <- melt(data=xrd, id.var=c('x', 'horizon'), measure.var=c('K','K 350', 'K 550', 'MG', 'MG+GLY'), variable_name='treatment')

## set a better ordering of the treatments
xrd.long$treatment <- ordered(xrd.long$treatment, c('MG', 'MG+GLY', 'K', 'K 350', 'K 550'))


## change the strip background colors
##  trellis.par.set(list(strip.background = list(col = grey(c(0.9,0.8)) )))

## plot the data along with some common d-spacings:
xyplot(value ~ x | treatment + horizon , data=xrd.long, as.table=TRUE, ylim=c(0,500), xlab='Deg 2Theta', ylab='Counts', panel=function(x, y, ...) {panel.abline(v=(asin(0.154/(2*d_spacings)) * 180/pi * 2), col=grey(0.9)) ; panel.xyplot(x, y, ..., type='l', col='black')} )

## another approach: labels on the side
xyplot(value ~ x | horizon + treatment , data=xrd.long, as.table=TRUE, ylim=c(0,500), xlab='Deg 2Theta', ylab='Counts', panel=function(x, y, ...) {panel.abline(v=(asin(0.154/(2*d_spacings)) * 180/pi * 2), col=grey(0.9)) ; panel.xyplot(x, y, ..., type='l', col='black')}, strip.left=TRUE, strip=FALSE)

Example XRD plot with lattice graphics: 7 horizons and 5 treatmentsExample XRD plot with lattice graphics: 7 horizons and 5 treatments

Find peaks in an XRD dataset

 
Locating relevant peaks in an X-ray diffractogram is an important step in identifying phyllosilicate minerals in soils. An automated approach to finding peaks in any dataset was presented by Martin Maechler, contributed to the R-Help mailing list Nov 25, 2005. paste these functions into an R session to use them

peaks <- function(series, span = 3, do.pad = TRUE) {
    if((span <- as.integer(span)) %% 2 != 1) stop("'span' must be odd")
    s1 <- 1:1 + (s <- span %/% 2)
    if(span == 1) return(rep.int(TRUE, length(series)))
    z <- embed(series, span)
    v <- apply(z[,s1] > z[, -s1, drop=FALSE], 1, all)
    if(do.pad) {
        pad <- rep.int(FALSE, s)
        c(pad, v, pad)
    } else v
}

peaksign <- function(series, span = 3, do.pad = TRUE)
{
    ## Purpose: return (-1 / 0 / 1) if series[i] is ( trough / "normal" / peak )
    ## ----------------------------------------------------------------------
    ## Author: Martin Maechler, Date: 25 Nov 2005

    if((span <- as.integer(span)) %% 2 != 1 || span == 1)
        stop("'span' must be odd and >= 3")
    s1 <- 1:1 + (s <- span %/% 2)
    z <- embed(series, span)
    d <- z[,s1] - z[, -s1, drop=FALSE]
    ans <- rep.int(0:0, nrow(d))
    ans[apply(d > 0, 1, all)] <- as.integer(1)
    ans[apply(d < 0, 1, all)] <- as.integer(-1)
    if(do.pad) {
        pad <- rep.int(0:0, s)
        c(pad, ans, pad)
    } else ans
}


check.pks <- function(y, span = 3)
    stopifnot(identical(peaks( y, span), peaksign(y, span) ==  1),
              identical(peaks(-y, span), peaksign(y, span) == -1))

for(y in list(1:10, rep(1,10), c(11,2,2,3,4,4,6,6,6))) {
    for(sp in c(3,5,7))
        check.pks(y, span = sp)
    stopifnot(peaksign(y) == 0)
}

 
Commands to find and plot the peaks, based on suggestions by peaks() function author.

## load some sample data
d <- read.csv("tioga1_35-65.csv", header=FALSE)

## name the columns
names(d) <- c('x', 'MG', 'MG+GLY', 'K', 'K 350', 'K 550')

## locate peaks in the 'K' signal
## the second argument is the "sensitivity" of the peak finding algorithm
d.peaks <- peaks(d$K, 35)

## save a vector of the positions in the K signal where the peaks were identified
peak_idx <- which(d.peaks)

## simple plot of raw K signal
plot(K ~ x, data=d, type="l", cex=.25, main="Tioga1 35-65cm\nK Treatment", xlab="Deg. 2 Theta", ylab="Intensity", ylim=c(0,max(d$K) + 50))

## add peaks: note that we are sub0setting the original data by the peak location index
points(K ~ x, data=d[peak_idx, ], col = 2, cex = 1.5)

## compute peak d-spacings
peak_d_spacings <- signif( (0.154/ (sin(d$x[peak_idx] * pi/180))), 3)

## annotate d-spacings / or peak index
## text(d$x[peak_idx], d$K[peak_idx] + 20, peak_d_spacings, col='blue', cex=0.75 )
text(d$x[peak_idx], d$K[peak_idx] + 20, 1:length(peak_d_spacings), col='blue', cex=0.75 )

## print a simple table of d-spacings by index
 data.frame(peak_d_spacings)
   peak_d_spacings
1            1.370
2            1.040
3            0.751
4            0.638
5            0.563
6            0.488
7            0.452
8            0.435
9            0.367
10           0.347

Automatic location of peaks in an XRD dataset with R
Peaks found with two different tolerance settings.





Some ideas on annotating common d-spacings

## note that we need to define this function before we can use it
## put all of the plotting commands into a wrapper function:
plot_xrd <- function(d)
        {
        ## plot the difractograms offset by 200
        plot(MG ~ x, data=d, type="l", cex=.25, main="Tioga1 35-65cm", xlab="2 Theta", ylab="Intensity", xlim=c(2,32), ylim=c(0,1600), xaxt='none')
       
        ## add the other treatments
        lines(MG_GLY + 200 ~ x, data=d)
        lines(K + 400 ~ x, data=d)
        lines(K_350 + 600 ~ x, data=d)
        lines(K_550 + 800 ~ x, data=d)

        ## label the lines
        text(31.5, c(30,230,430,630,830), c("MG","MG+GLY","K25","K350","K550"))
       
        ## plot the zero-line on each graph:
        abline(h=c(0,200,400,600,800), lty=2, col="gray")
       
        ## plot the established boundaries to these common spacings
        abline(v=(asin(0.154/(2*c(0.71,0.73,0.72,0.75,0.99,1.01,1.24,1.28,1.4,1.5,1.77,1.8))) * 180/pi * 2), col="green", lty=2)
       
        ## plot some common d-spacings
        # abline(v=(asin(0.154/(2*c(.715,.73,1,1.2,1.4,1.8))) * 180/pi * 2), col="blue")
       
        ## annotate the lines, recall that d-spacing is in reverse order with respect to Two_theta
        text((asin(0.154/(2*c(.715,.73,1,1.2,1.4,1.8))) * 180/pi * 2), c(1300,1400,1400,1400,1400,1400), c(".715",".73","1.0","1.2","1.4","1.8"), col=1, cex=1)
       
        ## add the axis
        axis(1, 1:30)
        }


##
##
##
## load some sample data
d <- read.csv("tioga1_35-65.csv", header=FALSE)

## name the columns
names(d) <- c('x', 'MG', 'MG_GLY', 'K', 'K_350', 'K_550')

## run the wrapper function to plot the Tioga1 0-8cm data:
plot_xrd(d)

Example XRD plot 2: illustrating common d-spacingsExample XRD plot 2: illustrating common d-spacings

Two-page display of XRD data for an entire soil profile

Multi-horizon XRD sample plot 1
Sample 1
Multi-horizon XRD sample plot 2
Sample 2

Summarizing Grouped Data in R

A colleague of mine recently asked about computing basic summary statistics from grouped data in R. These are a couple examples that I suggested. Additional documentation for the plyr package can be found here.

 
Code Snippets

# load libraries
library(lattice) # nice looking plots
library(plyr) # advanced aggregation functions

# generate 100 random obs
set.seed(1)
x <- rnorm(100)

# generate treatment labels
treatment <- rep(letters[1:5], each=4)

# generate depth labels
depth <- rep(c('0-10', '10-20'), 50)

# combine into a single dataframe
d <- data.frame(x, treatment, depth)

# check out the dataframe:
str(d)
head(d)


# visually check data with box-whisker plot
bwplot(x ~ treatment:depth, data=d, scales=list(y=list(tick.number=10, cex=0.75), x=list(rot=45, cex=0.75)), ylab='Measured Variable', xlab='Treatment / Group')


# calculate median by treatment and depth
aggregate(d$x, by=list(d$treatment, d$depth), median)

# Group.1 is the treatment
# Group.2 is the depth
# x is the median
   Group.1 Group.2            x
1        a    0-10  0.382152173
2        b    0-10  0.347044867
3        c    0-10  0.384062345
4        d    0-10  0.499198983
5        e    0-10 -0.191705870
6        a   10-20 -0.005618922
7        b   10-20  0.066331780
8        c   10-20  0.328471014
9        d   10-20 -0.049369325
10       e   10-20 -0.097184000


# another approach using ddply()
# compute a summary by treatment X depth
# returning the result as a nice data frame
ddply(d, .(treatment, depth), function(i) summary(i$x))

# result looks like this:
   treatment depth    Min.  1st Qu.    Median      Mean 3rd Qu.   Max.
1          a  0-10 -0.8356 -0.46760  0.382200  0.376500  0.8635 2.4020
2          b  0-10 -1.8050 -0.55550  0.347000  0.006561  0.5673 1.0630
3          c  0-10 -0.5425 -0.04595  0.384100  0.371000  0.5507 1.5120
4          d  0-10 -1.3770 -0.38070  0.499200  0.339300  1.1520 1.5870
5          e  0-10 -1.2770 -0.43100 -0.191700 -0.115700  0.4459 1.1000
6          a 10-20 -1.9890 -0.22380 -0.005619 -0.079500  0.4634 1.5950
7          b 10-20 -1.4710 -0.60670  0.066330  0.013510  0.6370 1.4660
8          c 10-20 -0.7099 -0.25470  0.328500  0.360600  0.7653 2.1730
9          d 10-20 -2.2150 -0.80430 -0.049370 -0.126100  0.4917 1.9800
10         e 10-20 -1.0440 -0.54830 -0.097180 -0.057270  0.4457 0.9438

Using lm() and predict() to apply a standard curve to Analytical Data

R: Multi-figure plot of Carlo-Erba DataR: Multi-figure plot of Carlo-Erba Data

 
Load input data (see attached files at bottom of this page)

#first the sample data
#note that field sep might be different based on pre-formatting
cn <- read.table("deb_pinn_C_N-raw.final.txt", sep=" ", header=TRUE)

#then the standards:
cn_std <- read.table("deb_pinn_C_N-standards.final.txt", sep="\t", header=TRUE)

# comput simple linear models from standards
# "mg_nitrogen as modeled by area under curve"
lm.N <- lm(mg_N ~ area_N, data=cn_std)
lm.C <- lm(mg_C ~ area_C, data=cn_std)

# check std curve stats:
summary(lm.N)
 Multiple R-Squared: 0.9999,     Adjusted R-squared: 0.9999
summary(lm.C)
 Multiple R-Squared:     1,      Adjusted R-squared:     1

 
Apply the standard curve to the raw measurements

# note that the predict method is looking for column names that where originally
# used in the creation of the lm object
# i.e. area_N for lm.N  and area_C for lm.C
# therefore it is possible to pass the original data matrix with both
# values to predict(), while specifiying the lm object
cn$mg_N <- predict(lm.N, cn)
cn$mg_C <- predict(lm.C, cn)

 
Merge sample mass to calculate percent C/N by mass

#read in the initial mass data, note that by default string data will be read in as a factor
# i.e. factors are like treatments, and this data type will not work in some functions
cn.mass <- read.table("all_samples.masses.txt", header=TRUE, sep="\t")

#take a look at how the mass data was read in by read.table()
str(cn.mass)
'data.frame':   75 obs. of  5 variables:
 $ id         : <b>Factor</b> w/ 26 levels</b> "004K","007K",..: 15 16 17 18 19 20 21 22 23 24 ...
 $ pedon_id   : <b>Factor</b> w/ 18 levels "004K","007K",..: 15 15 15 18 18 18 17 17 17 16 ...
 $ horizon_num: int  2 5 7 2 4 6 2 4 5 2 ...
 $ sample_id  : <b>Factor</b> w/ 75 levels "A1","A10","A11",..: 23 24 14 15 16 25 29 30 31 32 ...
 $ sample_mg  : num  24.6 27.5 33.3 25.9 25.8 ...


# use the merge() function to join the two dataframes based on the cell_id column
#merge() does not work with columns of type "level"
# convert them to characters in upper case, and append them to the original dataframe:
# note that merge is case sensitive !!!
cn$cell_id <- toupper(as.character(cn$sample_id))
cn.mass$cell_id <- toupper(as.character(cn.mass$sample_id))

#only keep our pedon data, leave behind the checks
cn.complete <- merge(x=cn, y=cn.mass, by.x="cell_id", by.y="cell_id", sort=FALSE, all.y=TRUE)

##calculate percent N and C, appending to the cn.complete dataframe
cn.complete$pct_N <- (cn.complete$mg_N / cn.complete$sample_mg) * 100
cn.complete$pct_C <- (cn.complete$mg_C / cn.complete$sample_mg) * 100

#look at the results:
str(cn.complete)
'data.frame':   75 obs. of  13 variables:
 $ cell_id    : chr  "B8" "B9" "B10" "B11" ...
 $ sample_id.x: Factor w/ 81 levels "A1","A10","A11",..: 24 25 15 16 17 26 30 31 32 33 ...
 $ area_N     : num  2225431  208028  341264 1377688  168328 ...
 $ area_C     : num  85307240  8296664 14624760 50879560  6690868 ...
 $ mg_N       : num  0.09261 0.01096 0.01635 0.05830 0.00935 ...
 $ mg_C       : num  1.2609 0.1204 0.2141 0.7510 0.0967 ...
 $ id         : Factor w/ 26 levels "004K","007K",..: 15 16 17 18 19 20 21 22 23 24 ...
 $ pedon_id   : Factor w/ 18 levels "004K","007K",..: 15 15 15 18 18 18 17 17 17 16 ...
 $ horizon_num: int  2 5 7 2 4 6 2 4 5 2 ...
 $ sample_id.y: Factor w/ 75 levels "A1","A10","A11",..: 23 24 14 15 16 25 29 30 31 32 ...
 $ sample_mg  : num  24.6 27.5 33.3 25.9 25.8 ...
 $ pct_N      : num  0.3759 0.0398 0.0491 0.2254 0.0363 ...
 $ pct_C      : num  5.117 0.438 0.643 2.903 0.375 ...

#save the data for further processing:
write.table(cn.complete, file="cn.complete.table", col.names=TRUE, row.names=FALSE)

 
Measure the accuracy of the sensor in the machine with simple correlation

### get a measure of how accurate the sensor was, based on our checks:
#just the first 5 columns, in case there is extra
cn.checks <- cn[c(13,26,39,52,65,78),][1:5]

#make a list of the mg of ACE in each check
checks.mg_ACE <- c(0.798, 1.588, 1.288, 1.574, 1.338, 1.191)

#make a column of the REAL mg_N based on the percent N in ACE
cn.checks$real_mg_N  <- checks.mg_ACE * 0.104

#make a column of the REAL mg_C based on the percent C in ACE
cn.checks$real_mg_C  <- checks.mg_ACE * 0.711

# check with cor()

 
Create a mutli-figure diagnostic plot

layout(mat=matrix(c(1, 4, 2, 3), nc = 2, nr = 2), width=c(1,1), height=c(1,2))
#first the std curves
par(mar = c(4,4,2,2))

#Nitrogen
plot(mg_N ~ area_N, data=cn_std, xlab="Area Counts", ylab="mg", main="Std Curve for N", cex=0.7, pch=16, cex.axis=0.6)
rug(cn$area_N, ticksize=0.02, col="gray")
rug(cn$mg_N, ticksize=0.02, col="gray", side=2)
abline(lm.N, col="gray", lty=2)
points(cn$area_N, cn$mg_N, col="blue", cex=0.2, pch=16)

#Carbon
plot(mg_C ~ area_C, data=cn_std, xlab="Area Counts", ylab="mg", main="Std Curve for C", cex=0.7, pch=16, cex.axis=0.6)
rug(cn$area_C, ticksize=0.02, col="gray")
rug(cn$mg_C, ticksize=0.02, col="gray", side=2)
abline(lm.C, col="gray", lty=2)
points(cn$area_C, cn$mg_C, col="blue", cex=0.2, pch=16)
#possible problems
points(cn$area_C[which(cn$area_C > 1.0e+08)], cn$mg_C[which(cn$area_C > 1.0e+08)] , col="red")

#sample plot of carbon distributions within each pedon:
#note that las=2 makes axis labels perpendicular to axis
par(mar = c(7,4,4,2))

#boxplot illustrating the within-pedon variation of Carbon
boxplot(cn.complete$pct_C ~ cn.complete$pedon_id , cex.axis=0.6, boxwex=0.2, las=2, main="Percent Total Carbon", ylab="% C", xlab="Pedon ID", cex=0.4)

#boxplot illustrating the within-pedon variation of Nitrogen
boxplot(cn.complete$pct_N ~ cn.complete$pedon_id , cex.axis=0.6, boxwex=0.2, las=2, main="Percent Total Nitrogen", ylab="% N", xlab="Pedon ID", cex=0.4)

Visualization of Grouped Data with Lattice Graphics (in R)

 
Overview
The lattice package for R provides several convenient functions for plotting data which has some kind of internal structure, usually in the form of groups. Lattice plotting functions create common visualizations of data (scatter plots, box-and-whisker plots, etc.), within in grid of panels defined by one or more grouping variables. See the manual page for xyplot for documentation and examples. The author of the lattice package has posted some excellent examples with code snippets from the upcoming definitive book on the topic (Lattice: Multivariate Data Visualization with R, by Deepayan Sarkar).

 
Complex Plots with Lattice
With the convenience of lattice graphics comes a price- complex plots cannot be generated element by element (as is the case using base graphics). Instead, one of several panel functions must be used or a customized panel function must be written. There is extensive documentation on this, but not nearly enough for the special case of wanting a graph which includes both lines and point symbols. In addition, lattice functions require that all data to be plotted occur in the same dataframe.This example presents one possible solution to plotting grouped data (via lattice) which consists of both different symbolization (lines and points) and source dataframes.

 
Example
The following example was created to illustrate the changes in the shape of the logistic function that occur with 3 possible 'slope' terms (b), and 3 possible 'intercept' terms (a). Each combination of slope and intercept are used as grouping variables, such that the resulting figure will contain 9 panels- one for each combination of slope/intercept. The panels are labeled with the respective slope (green panel title) and intercept (yellow panel title). To demonstrate plotting of mixed symbol types, an unrelated set of binary data was generated and split into the same 9 groupings. See R code below for the full story.

Lattice Plot Example: Panels illustrate the effect of different slope and intercept terms for the logistic function.Lattice Plot Example: Panels illustrate the effect of different slope and intercept terms for the logistic function.

 
Generate 9 Possible Versions of the Logistic Function

## load required libs
library(lattice)

##
## the logistic function
## a is the intercept, b the slope, and x the value
f <- function(a, b, x) {p <- (exp(a + b*x)) / (1 + exp(a + b*x))}


## a data vector
x.seq <- seq(-5, 5, by=0.5)
x <- rep(x.seq, 9)

## generate some
## slope and intercept possibilities
a.seq <- c(-2, 0, 2)
b.seq <- c(-1, -0.5, -0.25)

## create data  frame of all possible combinations
## i.e. all slope/intercept combinations
d <- expand.grid(a=a.seq, b=b.seq, x=x.seq, KEEP.OUT.ATTRS=FALSE)

## add the probability values back to the main DF
d$p <- f(d$a,d$b,d$x)

 
Generate Some Fake Binary Data

## generate some data on the predictor scale:
rx <- runif(min=-5, max=5, n=20)

## using the same groupings from the above example
## combine into a DF
rd <- expand.grid(a=a.seq, b=b.seq, x=rx, KEEP.OUT.ATTRS=FALSE)

## add pretend binary data
rd$rp <- rbinom(n=length(rd$x), 1, 0.5)

 
Merge the Two Dataframes

## add a dummy column to the original dataframe for the binary data
d$rp <- NA

## add dummy col for real probability
rd$p <- NA

## combine the first set of data with the pretend probabilities
dd <- make.groups(d, rd)

 
Plot Lines and Points with xyplot

xyplot(p + rp ~ x | factor(a) + factor(b),
data=dd,
ylab='Probability',
xlab='Predictor Variable',
panel=panel.superpose,
distribute.type=TRUE,
col=c(1,2), lwd=c(2,1), type=c('l','p'), pch=c(NA, '|')
)

 
Conclusion
The trick to plotting multiple symbol types can be summarized with:

  1. include all response variables to be plotted in the left-hand side of the plotting formula : p + rp ~ .... Note that they must share a common predictor variable, in this case the column "x" was used.
  2. use the panel function panel=panel.superpose and its argument distribute.type=TRUE to allow for more than one plotting style
  3. specify plotting styles (line style, symbol type, line width, color, etc.) as a vector which contains as many elements as response variables from the plotting formula: col=c(1,2), lwd=c(2,1), type=c('l','p'), pch=c(NA, '|')

With this approach in mind, it is possible to generate complicated plots using lattice graphics when data is of multiple type (line vs. point) and comes from multiple source dataframes. A common example of this scenario might involve plotting the continuous predictions from a linear model and the original points used to create the model.

Lattice Plot Example 2: Data from a logistic regression model, including fitted response, standard error, and original data points.Lattice Plot Example 2: Data from a logistic regression model, including fitted response, standard error, and original data points.

Working with Spatial Data

A collection of notes, examples, references, and thoughts on working with spatial data.

Converting Alpha-Shapes into SP Objects

Just read about a new R package called alphahull (paper) that sounds like it might be a good candidate for addressing this request regarding concave hulls. Below are some notes on computing alpha-shapes and alpha-hulls from spatial data and converting the results returned by ashape() and ahull() into SP-class objects. Note that the functions are attached at the bottom of the page. Be sure to read the license for the alphahull package if you plan to use it in your work.

Alpha-Shape ExampleAlpha-Shape Example

## not widely tested!

# need these
library(sp)
library(spgrass6)
library(alphahull)

source('alpha-functions.R')

# read point vector in from GRASS
x <- readVECT6('rtk_pts_5_1')

# extract coordinates
x.coords <- coordinates(x)

# alpha-shape: 100 meter threshold
x.as <- ashape(x.coords[,1], x.coords[,2], alpha=100)

# alpha-hull: 30 meter threshold
x.ah <- ahull(x.coords[,1], x.coords[,2], alpha=30)


plot(x.as, cex=0.5, pch=4, xlab='Easting (m)', ylab='Northing (m)', main=expression(paste('100m ', alpha, '-Shape')), asp=1)

plot(x.ah, cex=0.5, pch=4, xlab='Easting (m)', ylab='Northing (m)', main=expression(paste('30m ', alpha, '-Hull')), asp=1)



## convert into SP objects

# alpha-shape
x.as.spldf <- ashape_to_SPLDF(x.as, proj4string=x@proj4string)

# alpha-hull
x.ah.spldf <- ahull_to_SPLDF(x.ah, proj4string=x@proj4string)

# check: OK
pdf(file='ashape_ahull_demo.pdf', width=6, height=6)
par(mar=c(1,1,1,1))
plot(x.as.spldf)
lines(x.ah.spldf, col='red')
points(x, cex=0.5, pch=4, col='blue')
legend('bottomright', legend=c(expression(paste('100m ', alpha, '-Shape')), expression(paste('30m ', alpha, '-Hull')), 'Observation'), lty=c(1,1,NA), pch=c(NA,NA,4), col=c('black', 'red', 'blue'), bty='n')
dev.off()


# save back to GRASS: OK
writeVECT6(x.as.spldf, 'rtk_ashape')

# save back to GRASS: OK
writeVECT6(x.ah.spldf, 'rtk_ahull')

Customizing Maps in R: spplot() and latticeExtra functions

I recently noticed the new latticeExtra page on R-forge, which contains many very interesting demos of new lattice-related functionality. There are strong opinions about the "best" graphics system in R (base graphics, grid graphics, lattice, ggplot, etc.)-- I tend to use base graphics for simple figures and lattice for depicting multivariate or structured data. The sp package defines classes for storing spatial data in R, and contains several useful plotting methods such as the lattice-based spplot(). This function, and back-end helper functions, provide a generalized framework for plotting many kinds of spatial data. However, sometimes with great abstraction comes great ambiguity-- many of the arguments that would otherwise allow fine tuning of the figure are buried in documentation for lattice functions. Examples are more fun than links to documentation, so I put together a couple of them below. They describe several strategies for placing and adjusting map legends-- either automatically, or manually added with the update() function. The last example demonstrates an approach for over-plotting 2 rasters. All of the examples are based on the meuse data set, from the gstat package.

Extended spplot() examplesExtended spplot() examples

 
Examples

# setup environment
library(gstat)
library(latticeExtra)
library(grid)

# load example data
data(meuse.grid)
data(meuse)
data(meuse.alt)

# convert to SpatialPointsDataFrame
coordinates(meuse.grid) <- ~ x + y
coordinates(meuse) <- ~ x + y
coordinates(meuse.alt) <- ~ x + y

# converto SpatialPixelsDataFram
gridded(meuse.grid) <- TRUE

# convert 'soil' to factor and re-label
meuse.grid$soil <- factor(meuse.grid$soil, labels=c('A','B','C'))
meuse$soil <- factor(meuse$soil, levels=c('1','2','3'), labels=c('A','B','C'))


#
# example 1
#

# setup color scheme
cols <- brewer.pal(n=3, 'Set1')
p.pch <- c(2,3,4)

# generate list of trellis settings
tps <- list(regions=list(col=cols), superpose.polygon=list(col=cols), superpose.symbol=list(col='black', pch=p.pch))

# init list of overlays
spl <- list('sp.points', meuse, cex=0.75, pch=p.pch[meuse$soil], col='black')

# setup trellis options
trellis.par.set(tps)

# initial plot, missing key
p1 <- spplot(meuse.grid, 'soil', sp.layout=spl, colorkey=FALSE, col.regions=cols, cuts=length(cols)-1)

# add a key at the top + space for key
p1 <- update(p1, key=simpleKey(levels(meuse.grid$soil), points=FALSE, lines=FALSE, rect=TRUE, regions=TRUE, columns=3, title='Class', cex=0.75))

# add a key on the right + space for key
p1 <- update(p1, key=simpleKey(levels(meuse$soil), points=TRUE, columns=1, title='Class', cex=0.75, space='right', ))

p1




#
# example 2
#

# generate list of trellis settings
tps <- list(regions=custom.theme()$regions, superpose.symbol=list(col='black', pch=p.pch), fontsize=list(text=16))

trellis.par.set(tps)
p2 <- spplot(meuse.grid, 'dist', sp.layout=spl, colorkey=list(space='bottom', title='Distance'), scales=list(draw=TRUE, cex=0.5))

p2 <- update(p2, key=simpleKey(levels(meuse$soil), points=TRUE, columns=1, space='right'))

p2


#
# example 3
# more colorkey tweaking and...
# overlay 2 grids with layer()
#



sp.grid <- function (obj, col = 1, alpha = 1, ...)
{
    if (is.character(obj))
        obj = get(obj)
    xy = coordinates(obj)
    if (length(col) != 1 && length(col) != nrow(xy)) {
    }
    gt = as(getGridTopology(obj), "data.frame")
    grid.rect(x = xy[, 1], y = xy[, 2], width = gt$cellsize[1],
        height = gt$cellsize[2], default.units = "native", gp = gpar(fill = col, col = NA, alpha = alpha))
}



trellis.par.set(regions=custom.theme()$regions, superpose.polygon=list(col='black', alpha=0.25))

# first grid covers entire extent
p3 <- spplot(meuse.grid, 'dist', colorkey=list(space='bottom', width=1, height=0.5, tick.number=3))

# overlay partially transparent, kind of a hack...
p3 <- p3 + layer(sp.grid(meuse.grid[meuse.grid$soil == 'A', ], col='black', alpha=0.25))

p3 <- update(p3, key=simpleKey('Shaded Region', points=FALSE, lines=FALSE, rect=TRUE, columns=1, space='top'))

p3



#
# example 4: merge all three together
#

# order matters
p4 <- c(p3,p2,p1, layout=c(3,1))
p4 <- update(p4, key=simpleKey(levels(meuse$soil), points=TRUE, columns=1, space='right'))

p4


# save to file: note that we have to reset the 'regions' colors
png(file='spplot_examples.png', width=700, height=350)
trellis.par.set(regions=custom.theme()$regions)
print(p4)
dev.off()

Generation of Sample Site Locations [sp package for R]

 
Premise
Setting up sampling designs is a non-trivial aspect to any field experiment that includes a spatial component. The sp package for R provides a simple framework for generating point sampling schemes based on region-defining features (lines or polygons) and a sampling type (regular spacing, non-aligned, random, random-stratified, hexagonal grid, etc.). The rgdal package provides a set of functions for importing/exporting common vector data formats. This example demonstrates simple import/export, iterating over sp objects, and reconstituting new objects from lists of objects. A more complex sampling scheme is demonstrated here.

  1. Setup the environment, load some sample polygons, and tryout the spsample() function.
    # load required packages
    library(sp)
    library(rgdal)

    # read data:
    # note the funky syntax
    # note that this should have a REAL projection defined
    # an incorrect definition may result in an error from readOGR
    x <- readOGR('polys/polys.shp', layer='polys')

    # spsample will not sample each polygon, rather it works on the union of polys
    # try it:
    plot(x) ; points(spsample(x, n=100, type='random'), col='red', pch=3, cex=0.5)
  2. Sampling with spsample example 1Sampling with spsample example 1

  3. Iterate through all polygons in our original dataset, generating approximately 100 sample points within each polygon. Note that we use sapply() it iterate through the list of polygons, and do.call('rbind', ...) to 'stack' the list elements back into a single SpatialPoints object.
    # hexagonal grid from lower-left corner
    s <- sapply(slot(x, 'polygons'), function(i) spsample(i, n=100, type='hexagonal', offset=c(0,0)))

    # we now have a list of SpatialPoints objects, one entry per polygon of original data
    plot(x) ; points(s[[4]], col='red', pch=3, cex=.5)

    # stack into a single SpatialPoints object
    s.merged <- do.call('rbind', s)
  4. Sampling with spsample example 2Sampling with spsample example 2

  5. Now that the sample points for each polygon have been merged into a single SpatialPoints object, we need to attach a dataframe with the ID associating each sample point with its parent polygon. Attaching this data will "promote" the SpatialPoints object to a SpatialPointsDataFrame object.
    # add an id, based on source polygon:
    #
    # extract the original IDs
    ids <- sapply(slot(x, 'polygons'), function(i) slot(i, 'ID'))

    # determine the number of ACTUAL sample points generated for each poly
    npts <- sapply(s, function(i) nrow(i@coords))

    # generate a reconstituted vector of point IDs
    pt_id <- rep(ids, npts)

    # promote to SpatialPointsDataFrame
    s.final <- SpatialPointsDataFrame(s.merged, data=data.frame(poly_id=pt_id))

    # check:
    plot(x) ; points(s.final, col=s.final$poly_id, pch=3, cex=0.5)
  6. Sampling with spsample example 3Sampling with spsample example 3

  7. Copy over the spatial reference system data from the polygons object, and save sample points to a new shapefile. Note that you can only write to a shapefile if the object in question is a SpatialPointsDataFrame object.
    # copy source data spatial reference system to new object
    s.final@proj4string <- x@proj4string

    # write out to new file
    writeOGR(s.final, dsn='polys/', layer='rnd_pts', driver='ESRI Shapefile')

Ordinary Kriging Example: GRASS-R Bindings

 
Update: 2012-02-13
Many of the examples used in this demonstration are now somewhat dated, probably inefficient, and in need of revision. I'll spend some time on an updated version for the GRASS wiki soon.

 
Overview:
A simple example of how to use GRASS+R to perform interpolation with ordinary kriging, using data from the spearfish sample dataset. This example makes use of the gstat library for R.

 
Helpful References:

  • Issaks, E.H. & Srivastava, R.M. An Introduction to Applied Geostatistics Oxford University Press, 1989
  • GSTAT Manual
  • GSTAT Examples

Elevation Data and Sample Points: 300 randomly placed points where elevation data was sampled.Elevation Data and Sample Points: 300 randomly placed points where elevation data was sampled.

 
Data Prep:
As a contrived example, we will generate 300 random points within the current region, and sample an elevation raster at each of these points.

# set region:
g.region rast=elevation.dem

# extract some random points from an elevation dataset
v.random out=rs n=300

# create attribute table:
v.db.addtable map=rs columns="elev double"

# extract raster data at points
v.what.rast vect=rs rast=elevation.dem column=elev

# simple display:
d.rast elevation.dem
d.vect rs size=4

# start R
R

 
Load GRASS Data into R:
Remember that you will need to install these R packages onto your computer.

##load libraries
library(gstat)
library(spgrass6)

## read in vector dataset from above
G <- gmeta6()
x.has.na <- readVECT6('rs')

# remove records with NA
x <- x.has.na[-which(is.na(x.has.na@data$elev)),]

## create a grid wihch will define the interpolation
## note that it is based on the current GRASS region settings
grd <- gmeta2grd()

## make a new grid of (1)s
## be sure to use original data's proj data...
## doesn't work with the stuff stored in G...
new_data <- SpatialGridDataFrame(grd, data=data.frame(k=rep(1,G$cols*G$rows)), proj4string=CRS(x@proj4string@projargs))

## optionally we can use another raster of 1's as our interpolation mask
mask <- as(readRAST6("rushmore"), "SpatialPixelsDataFrame")
## need to manually set the coordinate system information:
mask@proj4string <- x@proj4string
## this new object could then be used in the 'newdata' argument to predict(), i.e.
## x.pred_OK <- predict(g, id='elev', newdata=mask)

 
Variogram Modeling:
A very simple example, using default parameters for a non-directional variogram is presented below. Modeling the variogram for an actual spatial problem requires knowlege of both your dataset (distribution, collection methods, etc.), the natural processes involved (stationarity vs. anisotropy ?), and a bit about the assumptions of geostatistics.

## init our gstat object, with the model formula
## note that coordinates are auto-identified from the GRASS object
g <- gstat(id="elev", formula=elev ~ 1, data=x)

## view a variogram with specified parameters
plot(variogram(g, width=250, cutoff=10000, map=FALSE))

## optionally make a variogram map, and plot semi-variance for 10-point pairs or greater:
plot(variogram(g, width=250, cutoff=10000, map=TRUE), threshold=10)

## fit a linear variogram model- as this looks appropriate
## ... using default parameters
v.fit <- fit.variogram(variogram(g) ,vgm(model="Lin") )
plot(variogram(g, width=250, cutoff=10000, map=FALSE), model=v.fit)

## update gstat object with fitted variogram model
g <- gstat(g, id="elev", model=v.fit )

Variogram and Fitted Model: A Linear variogram model was fitted to the elevation data.Variogram and Fit Model: A Linear variogram model was fit to the elevation data.

 
Interpolation by Ordinary Kriging:
The prediction is done for every instance of a '1' in the object passed to the newdata= argument.

## interpolate with ord. kriging
x.pred_OK <- predict(g, id='elev', newdata=new_data)

 
Send Results Back to GRASS:

## write raster back to GRASS: interpolation and kriging variance:
## system('g.remove rast=elev.ok')
writeRAST6(x.pred_OK, 'elev.ok', zcol='elev.pred')
writeRAST6(x.pred_OK, 'elev.ok_var', zcol='elev.var')

## quit:
q()

 
Viewing Results in GRASS:

# reset colors to match original data:
r.colors map=elev.ok rast=elevation.dem

# give the kriging variance a grey color map
r.colors map=elev.ok_var color=rules <<EOF
0% white
100% black
EOF

#
# display the kriged interpolation, with intensity | saturation based on variance
d.his s=elev.ok_var h=elev.ok
# optional method:
# d.his i=elev.ok_var h=elev.ok
d.vect rs size=4

Interpolated Elevation Data via Ordinary Kriging: Hue is interpolated elevation value, saturation is based on the kriging variance.Interpolated Elevation Data via Ordinary Kriging: Hue is interpolated elevation value, saturation is based on the kriging variance.

 
Simple Comparison with RST:
RST (regularized splines with tension) and OK (ordinary kriging) are two common interpolation methods. Computing the RMSE (root-mean-square-error) between the interpolated raster and the original raster provides a simple quantitative measure of how well the interpolation performed, at least in terms mean magnitude of error. A spatial description of interpolation error can be generated by subtracting the new raster from the original. Note that the steps involve cell-wise computation of the square-error (SE), region-wise computation of the mean-square-error (MSE); the square root of MSE gives the root-mean-square-error or RMSE.

#
# compare with RST - tension of 60, same points
#
v.surf.rst in=rs elev=elev.rst zcol=elev tension=60
r.colors map=elev.rst rast=elevation.dem

# compute SE between kriged map and original
r.mapcalc "d = (elev.ok - elevation.dem)^2 "
r.colors map=d color=rainbow
d.rast d
d.vect rs size=4

# compute SE between RST map and original
r.mapcalc "d2 = (elev.rst - elevation.dem)^2"
r.colors map=d2 color=rainbow
d.rast d2
d.vect rs size=4

#
# compare results:
#

# compute RMSE for OK [sqrt(MSE)]
r.univar d

# compute RMSE for RST [sqrt(MSE)]
r.univar d2
# see table below:

 
Root-mean-square-error Comparison:
Looks like the RSME are pretty close...

Method OK RST
RMSE 61 meters 64 meters

Ordinary Kriging Example: R via text file

 
Overview:
A simple example of how to use R to perform interpolation with ordinary kriging, using data from a text file. This example makes use of the gstat library for R. Additional examples of how to use the following gstat functions are included:

  • variogram maps
  • directional variogram plots
  • ploting the interpolated surface directly from R

Note that this example is not meant to be an authoritative guide on variogram selection, or proper modeling of anisotropy-- just an example. The Kansas Geological Survey has an interesting set of reports that illustrate selection of a directional variogram in the presence of a strong, regional trend.

Elevation Data and Sample Points: 300 randomly placed points where elevation data was sampled.Original elevation data and sample points: 300 randomly placed points where elevation data was sampled.

 
Data Prep:
Export the coordinates and elevation values from the previous example. See attached file elev.txt.

# two new columns to the random point vector from the previous example
v.db.addcol map=rs columns="x double, y double"
# upload coordinates
v.to.db option=coor column=x,y map=rs
# export to text file
db.select rs fs="," > elev.csv

 
Start R:
Load in the text file, and coerce to format that gstat can use.

## load some libraries first:
library(gstat)
## load data
d <- read.csv('elev.csv')

## gstat does not like missing data, subset original data:
e <- na.omit(d)

## convert simple data frame into a spatial data frame object:
coordinates(e) <- ~ x+y

## test result with simple bubble plot:
bubble(e, zcol='elev', fill=FALSE, do.sqrt=FALSE, maxsize=2)

## create a grid onto which we will interpolate:
## first get the range in data
x.range <- as.integer(range(e@coords[,1]))
y.range <- as.integer(range(e@coords[,2]))

## now expand to a grid with 500 meter spacing:
grd <- expand.grid(x=seq(from=x.range[1], to=x.range[2], by=500), y=seq(from=y.range[1], to=y.range[2], by=500) )

## convert to SpatialPixel class
coordinates(grd) <- ~ x+y
gridded(grd) <- TRUE

## test it out:
plot(grd, cex=0.5)
points(e, pch=1, col='red', cex=0.7)
title("Interpolation Grid and Sample Points")

Interpolation GridInterpolation Grid

 
Create GSTAT Objects:
Make some diagnostic plots, model variogram, check for anisotropy, etc.

## make gstat object:
g <- gstat(id="elev", formula=elev ~ 1, data=e)

## the original data had a large north-south trend, check with a variogram map
plot(variogram(g, map=TRUE, cutoff=4000, width=200), threshold=10)

## another approach:
# create directional variograms at 0, 45, 90, 135 degrees from north (y-axis)
v <- variogram(g, alpha=c(0,45,90,135))

## 0 and 45 deg. look good. lets fit a linear variogram model:
## an un-bounded variogram suggests additional source of anisotropy... oh well.
v.fit <- fit.variogram(v, model=vgm(model='Lin' , anis=c(0, 0.5)))

## plot results:
plot(v, model=v.fit, as.table=TRUE)

## update the gstat object:
g <- gstat(g, id="elev", model=v.fit )

Variogram MapVariogram Map

Directional Variogram PlotsDirectional Variogram Plots

 
Perform OK and View Results:
Examples using standard and lattice graphics.

## perform ordinary kriging prediction:
p <- predict(g, model=v.fit, newdata=grd)

## visualize it:

## base graphics
par(mar=c(2,2,2,2))
image(p, col=terrain.colors(20))
contour(p, add=TRUE, drawlabels=FALSE, col='brown')
points(e, pch=4, cex=0.5)
title('OK Prediction')

## lattice graphics: thanks for R. Bivand's advice on this
##
## alternatively plot quantiles with
## ... col.regions=terrain.colors(6), cuts=quantile(p$elev.pred) ...
##
pts <- list("sp.points", e, pch = 4, col = "black", cex=0.5)
spplot(p, zcol="elev.pred", col.regions=terrain.colors(20), cuts=19, sp.layout=list(pts), contour=TRUE, labels=FALSE, pretty=TRUE, col='brown', main='OK Prediction')

## plot the kriging variance as well
spplot(p, zcol='elev.var', col.regions=heat.colors(100), cuts=99, main='OK Variance',sp.layout=list(pts) )

## quit and convert saved EPS files to PNG:
## for i in *.eps ; do convert $i `basename $i .eps`.png ; done

OK Prediction: created with the spplot() functionOK Prediction: created with the spplot() function

OK VarianceOK Variance

Point-process modelling with the sp and spatstat packages

 
Some simple examples of importing spatial data from text files, converting between R datatype, creation of a point process model and evaluating the model. Input data sources are: soil pit locations with mollic and argillic diagnostic horizons (mollic-pedons.txt and argillic-pedons.txt), and a simplified outline of Pinnacles National Monument (pinn.txt). The outline polygon is used to define a window in which all operations are conducted.

The 'sp' package for R contains the function spsample(), can be used to create a sampling plan for a given region of interest: i.e. the creation of n points within that region based on several algorithms. This example illustrates the creation of 50 sampling points within Pinnacles, according to the following criteria: regular (points are located on a regular grid), nonaligned (points are located on a non-aligned grid-like pattern), random (points are located at random), stratified (collectively exhaustive, see details here).

The 'spatstat' package for R contains several functions for creating point-process models: models describing the distribution of point events: i.e. the distribution of tree species within a forest. If covariate data is present (i.e. gridded precipitation, soil moisture, aspect, etc.) these covariates can be incorporated into the point-process model. Without covariate data, the model is based on an spatial distribution estimator function. Note that the development of such models is complicated by factors such as edge-effects, degree of stochasticity, spatial connectivity, and stationarity. These are rather contrived examples, so please remember to read up on any functions you plan to use for your own research. An excellent tutorial on Analyzing spatial point patterns in R was recently published.

 
Helpful links
Spatstat Quick Reference
Print Version with Links

R: sampling design using the sp packageFour sampling designs

R: spatial density analysis with spatstat packageSpatial density of each pedon type

R: spatial density analysis with spatstat package 2Spatial density of the four sampling designs

R: Example point-process model of mollic soilsExample point-process model of mollic soils

R: Diagnostics of a simple point-process modelDiagnostics of a simple point-process model

 
Note: This code should be updated to use the slot() function instead of the '@' syntax for accessing slots!

 
Load required packages and input data (see attached files at bottom of this page)

# load required packages
library(sp)
library(spatstat)
 
# read in pinnacles boundary polygon: x,y coordinates
# use the GRASS vector, as it should be topologically correct
# v.out.ascii format=standard in=pinn_bnd > pinn.txt ... edit out extra crud
pinn <- read.table('pinn.txt')
 
# read in mollic and argillic pedons
# see ogrinfo hack
m <- read.table('mollic-pedons.txt', col.names=c('x','y'))
a <- read.table('argillic-pedons.txt', col.names=c('x','y'))
 
# add a flag for the type of pedon
m$type <- factor('M')
a$type <- factor('A')
 
#combine into a single dataframe
pedons <- rbind.data.frame(a,m)

 
Using the functions from the 'sp' package create a polygon object from the pinn.txt coordinates

# create a spatial polygon class object
pinn.poly <- SpatialPolygons(list(Polygons(list(Polygon( pinn )), "x")))
 
# inspect this new object with str()
# total area of all polygons
pinn.poly@polygons[[1]]@area
 
# coordinates of first polygon: this is rough syntax!
pinn.poly@polygons[[1]]@Polygons[[1]]@coords

 
Generate a sampling plan for 50 sites using regular grid, non-aligned grid, random, and random stratified approaches

# generate random points within the pinnacled boundary
p.regular <- spsample(pinn.poly, n = 50, "regular")
p.nalign <- spsample(pinn.poly, n = 50, "nonaligned")
p.random <- spsample(pinn.poly, n = 50, "random")
p.stratified <- spsample(pinn.poly, n = 50, "stratified")
 
# setup plot environment
par(mfrow=c(2,2))
 
# each of the sampling designs
plot(pinn.poly)
title("Regular")
points(p.regular, pch=16, col='red', cex=0.3)
 
plot(pinn.poly)
title("Nonaligned")
points(p.nalign, pch=16, col='red', cex=0.3)
 
plot(pinn.poly)
title("Random")
points(p.random, pch=16, col='red', cex=0.3)
 
plot(pinn.poly)
title("Stratified")
points(p.stratified, pch=16, col='red', cex=0.3)

 
Convert 'sp' class objects to 'spatstat' analogues note the use of 'slots'

# pinn boundary:
# extract coordinates: and get a length - 1 value
p.temp <- pinn.poly@polygons[[1]]@Polygons[[1]]@coords
n <- length(p.temp[,1]) - 1
 
# create two vectors: x and y
# these will contain the reversed vertices, minus the last point
# in order to adhere to the spatstat specs: no repeating points, in counter-clockwise order
x <- rev(p.temp[,1][1:n])
y <- rev(p.temp[,2][1:n])
 
# make a list of coordinates: note that we are removing the last vertex
p.list <- list(x=x,y=y)
 
# make a spatstat window object from the polygon vertices
W <- owin(poly=p.list)
 
# pedons with their 'marks' i.e. pedon type, and the pinn boundary as the 'window'
pedons.ppp <- ppp(pedons$x, pedons$y, xrange=c(min(pedons$x), max(pedons$x)), yrange=c(min(pedons$y), max(pedons$y)) , window=W, marks=pedons$type)

 
Plot and summarize the new combined set of pedon data

# plot and summarize the pedons data:
# note the method used to subset the two 'marks'
par(mfrow=c(1,2))
plot(density.ppp(pedons.ppp[pedons.ppp$marks == 'M']), main="Mollic Point Density")
points(pedons.ppp[pedons.ppp$marks == 'M'], cex=0.2, pch=16)
 
plot(density.ppp(pedons.ppp[pedons.ppp$marks == 'A']), main="Argillic Point Density")
points(pedons.ppp[pedons.ppp$marks == 'A'], cex=0.2, pch=16)
 
summary(pedons.ppp)

Marked planar point pattern: 151 points
Average intensity 1.38e-06 points per unit area
Marks:
frequency proportion intensity
A 62 0.411 5.67e-07
M 89 0.589 8.14e-07
 
Window: polygonal boundary
single connected closed polygon with 309 vertices
enclosing rectangle: [ 657228.3125 , 670093.8125 ] x [ 4030772.75 , 4047986.25 ]
Window area = 109337135.585938

 
Convert the sampling design points (from above) to 'spatstat' objects and plot their density

# convert the random datasets: using the same window:
ppp.regular <- ppp(p.regular@coords[,1], p.regular@coords[,2], window=W)
ppp.nalign <- ppp(p.nalign@coords[,1], p.nalign@coords[,2], window=W)
ppp.random <- ppp(p.random@coords[,1], p.random@coords[,2], window=W)
ppp.stratified <- ppp(p.stratified@coords[,1], p.stratified@coords[,2], window=W)
 
# visually check density of random points:
par(mfrow=c(2,2))
plot(density.ppp(ppp.regular), main="Regular Sampling")
points(ppp.regular, pch=16, cex=0.2)
 
plot(density.ppp(ppp.nalign), main="Non-Aligned Sampling")
points(ppp.nalign, pch=16, cex=0.2)
 
plot(density.ppp(ppp.random), main="Random Sampling")
points(ppp.random, pch=16, cex=0.2)
 
plot(density.ppp(ppp.stratified), main="Stratified Sampling")
points(ppp.stratified, pch=16, cex=0.2)

 
Simple, and probably flawed attempt to use a point-process model for the pedon data Third order polynomial model for the distribution of pedons with a mollic epipedon. See manula page for ppm() for detailed examples.

# model the spatial occurance of Mollic epipedons with a 3rd-order polynomial, using the Poisson Process Theory
fit <- ppm( unmark(pedons.ppp[pedons.ppp$marks == 'M']), ~polynom(x,y,3), Poisson())
 
# view the fitted model
par(mfcol=c(2,2))
plot(unmark(pedons.ppp[pedons.ppp$marks == 'M']), main="Mollic Pedons")
plot(fit)
 
# plot some diagnostics on the fitted model: Pearson residuals (see references)
diagnose.ppm(fit, type="pearson")

 
#another example using a buil-in dataset: the Lansing Forest
# fit nonstationary marked Poisson process
# with different log-cubic trend for each species
data(lansing)
fit <- ppm(lansing, ~ marks * polynom(x,y,3), Poisson())
plot(fit)

 
Point-process model diagnostic references from the spatstat manual
Baddeley, A., Turner, R., Moller, J. and Hazelton, M. (2005) Residual analysis for spatial point processes. Journal of the Royal Statistical Society, Series B 67, 617–666.
Stoyan, D. and Grabarnik, P. (1991) Second-order characteristics for stochastic structures connected with Gibbs point processes. Mathematische Nachrichten, 151:95–100.

Simple Map Creation

library(maps)

map('county', 'ca', interior=TRUE)
map.scale()
map.axes()

## add some user data in lon/lat format:
points(x, pch=4, cex=0.5, col=1)
points(y, pch=4, cex=0.5, col=2)
points(z, pch=4, cex=0.5, col=3)

Example MapExample Map

Some Ideas on Interpolation of Categorical Data

 

Premise

Wanted to make something akin to an interpolated surface for some spatially auto-correlated categorical data (presence/absence). I quickly generated some fake spatially auto-correlated data to work with using r.surf.fractal in GRASS. These data were converted into a binary map using an arbitrary threshold that looked about right-- splitting the data into something that looked 'spatially clumpy'.

Categorical Interpolation 1: Simulated auto-correlated, categorical variable, with sampling points and derived voronoi polygons.Fig. 1: Simulated auto-correlated, categorical variable, with sampling points and derived voronoi polygons.

I had used voronoi polygons in the past to display connectivity of categorical data recorded at points, even though sparsely sampled areas tend to be over emphasized. Figure 1 shows the fake spatially auto-correlated data (grey = presence /white = not present), sample points (yellow boxes), and voronoi polygons. The polygons with thicker, red boundaries represent the "voronoi interpolation" of the categorical feature.

 

Interpolation by RST

Wanting something a little more interesting, I tried interpolating the presence/absence data by via RST. Numerical interpolation of categorical data is usually not preferred as it creates a continuum between discreet classes-- i.e. values that do not have a sensible interpretation. Throwing that aside for the sake of making a neat map, a color scheme was selected to emphasize the categorical nature of the interpolated surface (Figure 2).

Categorical Interpolation 2: RST interpolation of 0-1 continuum: red=1, blue=0.Fig. 2: RST interpolation of 0-1 continuum: red=1, blue=0.

 

Conditional Indicator Simulation

Finally, I gave conditional indicator simulation a try-- this required two steps: 1) fitting a model variogram, 2) simulation. This approach generates different output on each simulation, however, the output represents the original spatial pattern and variability. A more interesting map could be generated by running 1000 simulations and converting them into a single probability map.

Indicator Variogram: Empirical semi-variogram for indicator=1, with spherical model fit.Indicator Variogram: Empirical semi-variogram for indicator=1, with spherical model fit.

Categorical Interpolation 3: Single conditional indicator simulation.Categorical Interpolation 3: Single conditional indicator simulation.

 

Comparison

Categorical Interpolation 1: Simulated auto-correlated, categorical variable, with sampling points and derived voronoi polygons. Categorical Interpolation 2: RST interpolation of 0-1 continuum: red=1, blue=0. Categorical Interpolation 3: Single conditional indicator simulation.

 

Code Snippets

 
Generate Some Data in GRASS

# set a reasonable resolution
g.region res=10 -ap

# simulate some spatially auto-correlated data
# and convert to categorical map
r.surf.fractal --o dimension=2.5 out=fractal
r.mapcalc "fractal.bin = if(fractal > 0, 1, 0)"
r.colors fractal.bin color=rules <<EOF
0 white
1 grey
EOF

v.random --o out=v n=100
v.db.addtable map=v
v.db.addcol map=v column="fractal double, class integer"
v.what.rast vect=v rast=fractal column=fractal
v.what.rast vect=v rast=fractal.bin column=class

# simplest approach
v.voronoi --o in=v out=v_vor

# try interpolation of classes...
v.surf.rst --o in=v zcol=class elev=v.interp
r.colors map=v.interp color=rules <<EOF
0% blue
0.5 white
100% red
EOF

 
Perform Indicator Simulation in R

# indicator simulation
library(spgrass6)
library(gstat)

# read vector
d <- readVECT6('v')

# convert class to factor
d@data <- transform(d@data, class=factor(class))

# inspect variogram of x$class == 1
plot(v <- variogram(I(class == 1) ~ 1, data = d))

# fit a spherical variogram with nugget
# not sure about the syntax
v.fit <- fit.variogram(v, vgm(psill=1, model='Sph', range='250', 1))

# png(file='indicator_variogram.png', width=400, height=400, bg='white')
plot(v, model=v.fit)
# dev.off()

# make a grid to predict onto
G <- gmeta6()
grd <- gmeta2grd()

# new grid
new_data <- SpatialGridDataFrame(grd, data=data.frame(k=rep(1,G$cols*G$rows)), proj4string=CRS(d@proj4string@projargs))


# conditional indicator simulation:
# need to study this syntax
# make more simulations for an estimated probabilitry
p <- krige(I(class == 1) ~ 1, d, new_data, v.fit, nsim=1, indicators=TRUE, nmax=40)

# write one back to GRASS
writeRAST6(p, 'indicator.sim', zcol='sim1')

 
Make Some Maps in GRASS

# fix colors of the simulated map
r.colors map=indicator.sim color=rules << EOF
0 white
1 grey
EOF


# simple maps
d.erase
d.rast v.interp
d.vect v icon=basic/box  size=7 fcol=yellow
d.vect v_vor type=area fcol=none where=class=0
d.vect v_vor type=area fcol=none width=2 where=class=1
d.out.file --o out=example1

d.erase
d.rast fractal.bin
d.vect v icon=basic/box  size=7 fcol=yellow
d.vect v_vor type=area fcol=none where=class=0
d.vect v_vor type=area fcol=none col=red width=2 where=class=1
d.out.file --o out=example2

d.erase
d.vect v_vor type=area fcol=white where=class=0
d.vect v_vor type=area fcol=grey where=class=1
d.vect v icon=basic/box  size=7 fcol=yellow
d.out.file --o out=example3

d.erase
d.rast indicator.sim
d.vect v icon=basic/box size=7 fcol=yellow
d.out.file --o out=example4

Target Practice and Spatial Point Process Models

 
Overview:
Simple application of spatial point-process models to spread patterns after some backyard target practice. Note that only a cereal box and 2 sheets of graph paper were injured in this exercise. Data files are attached at the bottom of this page; all distance units are in cm.

A simple experiment was conducted, solely for the purpose of collecting semi-random coordinates on a plane, where a target was hit with 21 shots from a distance of 15 and 30 feet. The ppm() function (spatstat package) in R was used to create point density maps, along with a statistical description of the likelihood of where each target would be hit were the experiment to be conducted again (via point-process modeling). While normally used to model the occurrence of natural phenomena or biological entities, point-process models can be used to analyze one's relative accuracy at set target distances. One more way in which remote sensing or GIS techniques can be applied to smaller, non-georeferenced coordinate systems.

Density ComparisonDensity ComparisonPattern densities from the two experiments: 30 and 15 feet from target.

 
Load Data and Compute Density Maps:

### load some libraries
library(spatstat)
library(RColorBrewer)

## read in the data
t_30 <- read.csv('target_30.csv')
t_15 <- read.csv('target_15.csv')

## an initial plot
plot(t_30, xlim=c(0,35), ylim=c(0,50))
points(t_15, col='red')

## convert to spatstat objects
t_30.ppp <- ppp(t_30$x, t_30$y, xrange=c(0,35), yrange=c(0,50) )
t_15.ppp <- ppp(t_15$x, t_15$y, xrange=c(0,35), yrange=c(0,50) )

## check via plot
plot(t_30.ppp)
points(t_15.ppp, col='red')

 
Fit Point-Process Models:

## fit point-process model
t_30_fit <- ppm(t_30.ppp, ~polynom(x,y,3), Poisson())
t_15_fit <- ppm(t_15.ppp, ~polynom(x,y,3), Poisson())

## plot density comparisons between two ranges
par(mfcol=c(1,2))
plot( density(t_30.ppp), col=brewer.pal('Blues', n=9), main="30 Feet")
points(t_30.ppp, pch=4, cex=1)

plot( density(t_15.ppp), col=brewer.pal('Oranges', n=9), main="15 Feet")
points(t_15.ppp, pch=4, cex=1)


##
## plot a fit of the 30 foot pattern
##
par(mfcol=c(2,2))
plot( density(t_30.ppp), col=brewer.pal('Blues', n=9), main="30 Feet")
points(t_30.ppp, pch=4, cex=1)

plot(t_30_fit, col=brewer.pal('Blues', n=9), trend=TRUE, cif=FALSE, pause=FALSE, how="image")
plot(t_30_fit, trend=TRUE, cif=FALSE, pause=FALSE, how="contour")
plot(t_30_fit, colmap=brewer.pal('Blues', n=9), trend=TRUE, cif=FALSE, pause=FALSE, how="persp", theta=0, phi=45)


##
## plot a fit of the 15 foot pattern
##
par(mfcol=c(2,2))
plot( density(t_15.ppp), col=brewer.pal('Oranges', n=9), main="15 Feet")
points(t_15.ppp, pch=4, cex=1)

plot(t_15_fit, col=brewer.pal('Oranges', n=9), trend=TRUE, cif=FALSE, pause=FALSE, how="image")
plot(t_15_fit, trend=TRUE, cif=FALSE, pause=FALSE, how="contour")
plot(t_15_fit, colmap=brewer.pal('Oranges', n=9), trend=TRUE, cif=FALSE, pause=FALSE, how="persp", theta=0, phi=45)

30 Foot PPM30 Foot PPM

15 Foot PPM15 Foot PPM

 
Tidy-up:

##
## convert to png:
for i in *.pdf ; do convert -density 300 +antialias $i `basename $i .pdf`.png ; done
for i in *.png ; do mogrify -reisize 25% $i ; done

Tips on Using the 'sp' classes in R

 
Premise
The sp package provides several useful classes and methods for working with spatial data in R. The documentation is complete for the main set of classes and methods, however there are few listing of extensive examples that may cover some of the less frequently encountered situations. Some examples of where the complexity of the sp classes can cause confusion include:

  • dealing with NA values
  • preserving NA values during import/export
  • replacing data attached to an sp class object
  • plotting several spatial data at once

In addition, there are still some minor bugs to be worked out in terms of reading/writing vector data using methods in the rgdal package. This is currently being worked on by members of the sp and GRASS development community.

 
Joining New Data to an Existing sp Object

# use to read in some vector data
library(rgdal)

# read something in, rows are identified by a column called 'id'
spatial.data <- readOGR(...)

# read in some tabular data, rows are identified by a column called 'id'
new_table <- read.csv(...)

# 'join' the new data with merge()
# all.x=TRUE is used to ensure we have the same number of rows after the join
# in case that the new table has fewer
merged <- merge(x=spatial.data@data, y=new_table, by.x='id', by.y='id', all.x=TRUE)

# generate a vector that represents the original ordering of rows in the sp object
correct.ordering <- match(spatial.data@data$id, merged$id)

# overwrite the original dataframe with the new merged dataframe, in the correct order
spatial.data@data <- merged[correct.ordering, ]

# check the ordering of the merged data, with the original spatial data
cbind(spatial.data@data$id, merged$id[correct.ordering])

 
Correctly Write 'NA' Values to Shapefile [bug in writeOGR()]

# libraries we need
require(rgdal)
require(foreign)

# pass 1: write the shapefile
writeOGR(spatial.data, dsn='new_folder', driver='ESRI Shapefile', layer='new_layer')

# re-make the DBF:
write.dbf(spatial.data@data, file='new_folder/new_layer.dbf')

 
Handle NA in Several Raster Images (when all maps share the same NA spatial distribution)

# need this to read in GRASS data
require(spgrass6)

# read in several rasters into a new sp object:
x <- readRAST6(c('r1','r2','r3','r4'))

# we want to do some analysis that will not work when all rasters contain NA
# save a vector of positions where all of the rasters have a value of NA
x.nas <- which(is.na(x@data$r1) & is.na(x@data$r2) & is.na(x@data$r3) & is.na(x@data$r4))

# save a vector of positions that have no NA in any raster
x.vals <- which( !is.na(x@data$r1) & !is.na(x@data$r2) & !is.na(x@data$r3) & !is.na(x@data$r4))

# extract the records that only have complete data (no NA)
x.no.na <- x@data[x.vals, ]

# do something with the data
require(cluster)
x.clara <- clara(x.no.na, k=6, stand=TRUE)

# add a new column to the original sp object to contain the results of the analysis
x@data$terrain_group <- NA

# fill in the locations where there were complete records with the results
# note that we use the original locations of the non-NA values
x@data$terrain_group[x.vals] <- x.clara$clustering

# save result back to GRASS
writeRAST6(x, zcol='cluster', vname='x_groups')

 
Handle NA in Several Raster Images (when maps have different NA spatial distribution)

# this one is from R. Bivand (thanks)
# assume that we are working with the 'data' slot on an sp opject:
# generate some data; sprinkle in some NA ; convert to DF
x <- rnorm(100)
x[sample(1:100, 5)] <- NA
x.mat <- round(matrix(x, ncol=5), 2)
x.df <- as.data.frame(x.mat)

# extract an NA mask
x.na_mask <- complete.cases(x.df)
 
# just the 'rows' with complete data (across bands)
 x.df[x.na_mask,]

      V1    V2    V3    V4    V5
1   1.32 -1.27  0.00 -0.87  1.52
2  -0.44  1.33  0.78 -1.27  0.00
5   2.70 -0.08 -0.77  0.38  0.32
6  -0.52 -0.07  0.13 -0.43  1.12
7  -0.63  2.02  0.45  0.48 -0.59
8  -0.52 -0.78 -0.59 -0.07 -0.08
11 -1.37 -1.15  0.23  0.73  0.07
12 -0.11 -0.66  0.50 -1.14  0.71
13 -0.38 -0.08  1.00 -0.88 -0.19
14 -1.14 -0.45 -1.37 -0.43 -2.18
15  1.92  0.46 -0.72  0.12  0.27
16  0.51  0.22 -0.39 -1.54 -1.04
17  1.14 -1.11  1.04  0.00 -1.11
18  0.53  0.62 -0.10 -0.17  0.15
19  0.84  0.27 -1.01  0.87  0.31
20  0.82  0.87  0.46  1.35  0.13


# all the data
x.df
      V1    V2    V3    V4    V5
1   1.32 -1.27  0.00 -0.87  1.52
2  -0.44  1.33  0.78 -1.27  0.00
3  -0.31 -0.13 -1.14  2.08    NA
4   0.76    NA -1.20 -0.54 -0.85
5   2.70 -0.08 -0.77  0.38  0.32
6  -0.52 -0.07  0.13 -0.43  1.12
7  -0.63  2.02  0.45  0.48 -0.59
8  -0.52 -0.78 -0.59 -0.07 -0.08
9  -0.38 -0.51  0.44    NA -0.81
10 -1.38    NA -0.65 -0.50    NA
11 -1.37 -1.15  0.23  0.73  0.07
12 -0.11 -0.66  0.50 -1.14  0.71
13 -0.38 -0.08  1.00 -0.88 -0.19
14 -1.14 -0.45 -1.37 -0.43 -2.18
15  1.92  0.46 -0.72  0.12  0.27
16  0.51  0.22 -0.39 -1.54 -1.04
17  1.14 -1.11  1.04  0.00 -1.11
18  0.53  0.62 -0.10 -0.17  0.15
19  0.84  0.27 -1.01  0.87  0.31
20  0.82  0.87  0.46  1.35  0.13

# generate something cool from complete data
library(MASS)
p <- princomp(~ . , data=x.df[x.na_mask,])

# assign back to original data
x.df$pca_1[x.na_mask] <-  predict(p)[,1]

Visual Interpretation of Principal Coordinates (of) Neighbor Matrices (PCNM)

Principal Coordinates (of) Neighbor Matrices (PCNM) is an interesting algorithm, developed by P. Borcard and P. Legendre at the University of Montreal, for the multi-scale analysis of spatial structure. This algorithm is typically applied to a distance matrix, computed from the coordinates where some environmental data were collected. The resulting "PCNM vectors" are commonly used to describe variable degrees of possible spatial structure and its contribution to variability in other measured parameters (soil properties, species distribution, etc.)-- essentially a spectral decomposition spatial connectivity. This algorithm has been recently updated by and released as part of the PCNM package for R. Several other implementations of the algorithm exist, however this seems to be the most up-to-date.

 
Related Presentations and Papers on PCNM

  • http://biol09.biol.umontreal.ca/ESA_SS/Borcard_&_PL_talk.pdf
  • Borcard, D. and Legendre, P. 2002. All-scale spatial analysis of ecological data by means of principal coordinates of neighbour matrices. Ecological Modelling 153: 51-68.
  • Borcard, D., P. Legendre, Avois-Jacquet, C. & Tuomisto, H. 2004. Dissecting the spatial structures of ecologial data at all scales. Ecology 85(7): 1826-1832.

I was interested in using PCNM vectors for soils-related studies, however I encountered some in difficulty interpreting what they really meant when applied to irregularly-spaced site locations. As a demonstration, I generated several (25 to be exact) PCNM vectors from a regular grid of points. Using an example from the PCNM manual page, I have plotted the values of the PCNM vectors at the grid nodes (below). The interpretation of the PCNM vectors derived from a 2D, regular grid is fairly simple: lower order vectors represent regional-scale groupings, higher order vectors represent more local-scale groupings. One thing to keep in mind is that these vectors give us a multi-scale metric for grouping sites, and are not computed by any properties that may have been measured at the sites. The size of the symbols are proportional to the PCNM vectors and the color represents the sign.

PCNM - Regular GridPCNM - Regular Grid

Soil survey operations are rarely conducted on a regular grid, so I re-computed PCNM vectors from the same simulated grid, but after randomly perturbing each site. The resulting map of PCNM vectors is presented below. The patterns are a little complex, but quickly decipherable with guidance from the PCNM vectors derived from a regular grid. Neat!

PCNM - Randomly Perturbed Regular GridPCNM - Randomly Perturbed Regular Grid

 
R code used to make figures

library(ade4)
library(PCNM)

# fake data
g <- expand.grid(x=1:10, y=1:10)
x.coords <- data.frame(x=g$x, y=g$y)

# PCNM
x.sub.dist <- dist(x.coords[,c('x','y')])
x.sub.pcnm <- PCNM(x.sub.dist, dbMEM=TRUE)

# plot first 25 PCNM vectors
pdf(file='PCNM-grid-example.pdf', width=10, height=10)

par(mfrow=c(5,5))
for(i in 1:25)
        s.value(x.coords[,c('x','y')], x.sub.pcnm$vectors[,i], clegend=0, sub=paste("PCNM", i), csub=1.5, addaxes=FALSE, origin=c(1,1))

dev.off()


# jitter the same input and try again
x.coords <- data.frame(x=jitter(g$x, factor=2), y=jitter(g$y, factor=2))
x.sub.dist <- dist(x.coords[,c('x','y')])
x.sub.pcnm <- PCNM(x.sub.dist, dbMEM=TRUE)

# plot first 25 PCNM vectors
pdf(file='PCNM-jittered_grid-example.pdf', width=10, height=10)

par(mfrow=c(5,5))
for(i in 1:25)
        s.value(x.coords[,c('x','y')], x.sub.pcnm$vectors[,i], clegend=0, sub=paste("PCNM", i), csub=1.5, addaxes=FALSE, origin=c(1,1))

dev.off()

Visualizing Random Fields and Select Components of Spatial Autocorrelation

 
Premise
I have always had a hard time thinking about various parameters associated with random fields and empirical semi-variograms. The gstat package for R has an interesting interface for simulating random fields, based on a semi-variogram model. It is possible to quickly visualize the effect of altering semi-variogram parameters, by "seeding" the random number generator with the same value at each iteration. Of primary interest were visualization of principal axis of anisotropy, semi-variogram sill, and semi-variogram range. The code used to produce the images is included below. For more information on the R implementation of gstat, see the R-sig-GEO mailing list.

 
Setup

# load libraries
library(gstat)

# setup a grid
xy <- expand.grid(1:100, 1:100)
names(xy) <- c("x","y")

Demonstration of Anisotropy DirectionDemonstration of Anisotropy Direction

 
Demonstrate Anisotropy Direction

var.model <- vgm(psill=1, model="Exp", range=15)
set.seed(1)
sim <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata = xy, nsim = 1)

var.model <- vgm(psill=1, model="Exp", range=15, anis=c(0, 0.5))
set.seed(1)
sim$sim2 <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata=xy, nsim=1)$sim1

var.model <- vgm(psill=1, model="Exp", range=15, anis=c(45, 0.5))
set.seed(1)
sim$sim3 <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata=xy, nsim=1)$sim1

var.model <- vgm(psill=1, model="Exp", range=15, anis=c(90, 0.5))
set.seed(1)
sim$sim4 <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata=xy, nsim=1)$sim1

var.model <- vgm(psill=1, model="Exp", range=15, anis=c(135, 0.5))
set.seed(1)
sim$sim5 <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata=xy, nsim=1)$sim1

# promote to SP class object
gridded(sim) = ~x+y

new.names <- c('iso', 'aniso 0 deg', 'aniso 45 deg', 'aniso 90 deg', 'aniso 135 deg')
p1 <- spplot(sim, names.attr=new.names, col.regions=topo.colors(100), as.table=TRUE, main="Demonstration of Anisotropy")

Demonstration of Range ParameterDemonstration of Range Parameter

 
Demonstrate Range Parameter

var.model <- vgm(psill=1, model="Exp", range=1)
set.seed(1)
sim <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata = xy, nsim = 1)

var.model <- vgm(psill=1, model="Exp", range=5)
set.seed(1)
sim$sim2 <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata=xy, nsim=1)$sim1

var.model <- vgm(psill=1, model="Exp", range=15)
set.seed(1)
sim$sim3 <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata=xy, nsim=1)$sim1

var.model <- vgm(psill=1, model="Exp", range=30)
set.seed(1)
sim$sim4 <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata=xy, nsim=1)$sim1

# promote to SP class object
gridded(sim) = ~x+y

new.names <- c('range = 1', 'range = 5', 'range = 10', 'range = 30')
p2 <- spplot(sim, names.attr=new.names, col.regions=topo.colors(100), as.table=TRUE, main="Demonstration of Range Parameter")

Demonstration of Sill ParameterDemonstration of Sill Parameter

 
Demonstrate Sill Parameter

var.model <- vgm(psill=0.5, model="Exp", range=15)
set.seed(1)
sim <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata = xy, nsim = 1)


var.model <- vgm(psill=1, model="Exp", range=15)
set.seed(1)
sim$sim2 <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata=xy, nsim=1)$sim1

var.model <- vgm(psill=2, model="Exp", range=15)
set.seed(1)
sim$sim3 <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata=xy, nsim=1)$sim1

var.model <- vgm(psill=4, model="Exp", range=15)
set.seed(1)
sim$sim4 <- predict(gstat(formula=z~1, locations= ~x+y, dummy=TRUE, beta=0, model=var.model, nmax=20), newdata=xy, nsim=1)$sim1

# promote to SP class object
gridded(sim) = ~x+y

new.names <- c('sill = 0.5', 'sill = 1', 'sill = 2', 'sill = 4')
p3 <- spplot(sim, names.attr=new.names, col.regions=topo.colors(100), as.table=TRUE, main="Demonstration of Sill Parameter")

Comparison of PSA Results: Pipette vs. Laser Granulometer

 
Soil texture data was collected via pipette and laser granulometer, each horizon from three pedons. This example illustrates a simple approach to comparing the two methods with both standard XY-style scatter plot and on a soil textural triangle. This example uses code in the plotrix package for R, but you could also use this python approach.

The data referenced in these examples are attached at the bottom of this page. The code boxes below represent what a user would type into the R console. Lines prefixed with a '#' are interpreted by R as a comment, and thus ignored. Further visualization examples, using a larger dataset, can be accessed by clicking on the link at the bottom of this page. The goals of this example are:

  • import data into R
  • plot data
  • perform a simple linear regression
  • plot sand, silt, clay data on a textural triangle

Example commands can be directly pasted into the R console, or typed by hand. I would recommend copyinf a single line of example code at a time into the R console, then press the ENTER key. In this way the results of each command will be visible. Remember that the str() function will give you information about an object. Note that in order to load the sample data, you will need to set your working directory in R to the same folder in which you downloaded the sample data. For example: if you downloaded the sample data to your Desktop, you would set your working directory with:

  • on a mac: setwd('~/Desktop')
  • on windows: setwd('C:\path_to_your_desktop') where 'path_to_your_desktop' is the path to the desktop folder

Optionally, you can use the file.choose() command to bring up a standard file selection box. The result of this function can then be pasted into the read.table('....') function, replacing the '...' with the data returned from file.choose().

Sample Plot: Pipette vs. GranulometerSample Plot: Pipette vs. Granulometer

Sample Plot: Soil Textural Triangle: pipette values are in red, granulometer values are in blue.Sample Plot: Soil Textural Triangle: pipette values are in red, granulometer values are in blue. Line segments connect corresponding observations.

 
Load Required Packages and Input Data

# the package 'plotrix' can be installed with:
# install.packages('plotrix', repos='http://cran.r-project.org', dependencies=TRUE)
# note 1: you can accomplish this through the R-GUI in windows / mac os
# note 2: on UNIX-like systems you will need to start R as superuser to install packages
# load required package
require(plotrix)

#read in text data: note that they are TAB-DELIMITED: <b>sep='\t'</b>
p <- read.table('psa-pipette.txt', header=T, sep="\t")

#note that the granulometer data is whitesdpace delimeted: i.e. no 'sep=' argument
g <- read.table('gran-psa.txt', header=T)

 
Initial Comparison of Clay Values See Figure 1

#do some initial comparisons:
plot(p$clay ~ g$clay)

#re-plot with custom settings:
# annotated axis, 0-100% range, plot symbols scaled by 0.8
plot(p$clay ~ g$clay, ylab="Pct. Clay: Pipette Method", xlab="Pct. Clay: Granulometer Method", main="Simple Plot", xlim=c(0,100), ylim=c(0,100), cex=0.8)

#add silt fraction to the SAME plot:
points(p$silt ~ g$silt,  cex=0.8, pch=2)
# add sand fraction to the SAME plot:
points(p$sand ~ g$sand,  cex=0.8, pch=3)

#add a legend:
legend(x=2.7, y=94, legend=c('Clay','Silt','Sand'), pch=1:3, cex=0.8)

#simple linear modeling: add trend lines
abline(lm(p$clay ~ g$clay), col="gray")
abline(lm(p$silt ~ g$silt), col="gray")
abline(lm(p$sand ~ g$sand), col="gray")

 
Simple Linear Model

#create a formular object
f <- p$clay ~ g$clay
#create a model object
m <- lm( f )
#return the details on the new model:
summary( m )

#the following is the output:

Call:
lm(formula = p$clay ~ g$clay)

Residuals:
     Min       1Q   Median       3Q      Max
-13.6120  -6.1412   0.4438   4.8047  19.3997

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    3.761      6.477   0.581   0.5707  
g$clay         3.052      1.093   2.793   0.0144 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.664 on 14 degrees of freedom
Multiple R-Squared: 0.3577,     Adjusted R-squared: 0.3119
F-statistic: 7.798 on 1 and 14 DF,  p-value: 0.01439

 
Sample soil texture data plotted on the texture triangle See Figure 2

#subset sand, silt, clay information for texture triangle plot:
p.ssc <- data.frame(sand=p$sand,silt=p$silt,clay=p$clay)
g.ssc <- data.frame(sand=g$sand,silt=g$silt,clay=g$clay)

#plot a texture triangle from the pipette data
p.tri <- soil.texture(p.ssc,show.lines=T, show.names=T, tick.labels=seq(10,90,10), col.symbol='red', pch=16, cex=0.8, main="Soil Texture Triangle")
#add points from the granulometer data
g.tri  <- triax.points(g.ssc, col.symbol='blue', pch=16, cex=0.8)

#plot segments connecting (also see 'arrows' function)
segments(p.tri$x, p.tri$y, g.tri$x, g.tri$y, col='black', lty=1)

Extended Visualization Ideas and an Expanded Dataset

 
See attached file 'pipette_vs_granulometer.csv_.txt' at the bottom of this page.

 
Load Required Packages and Input Data

#the package 'plotrix' can be installed with:
# install.packages('plotrix', repos='http://cran.r-project.org', dependencies=TRUE)
# note 1: you can accomplish this through the R-GUI in windows / mac os
# note 2: on UNIX-like systems you will need to start R as superuser to
# install packages
#load required package
require(plotrix)

#read in text data: this is a CSV file: sep=","
x <- read.table('pipette_vs_granulometer.csv_.txt', header=T, sep=",")

#subset the data to include only sand, silt, clay columns
p <- data.frame(sand=x$p_sand, silt=x$p_silt, clay=x$p_clay)
g <- data.frame(sand=x$g_sand, silt=x$g_silt, clay=x$g_clay)

Sample 2D Plot: comparison between pipette and laser granulometer sand, silt, and clay valuesSample 2D Plot: comparison between pipette and laser granulometer sand, silt, and clay values

 
Simple 2D plot of corelation between pipette and granulometer: for sand, silt, clay

#plot with custom settings:
# annotated axis, 0-100% range, plot symbols scaled by 0.8
plot(p$clay ~ g$clay, ylab="Pct. Clay: Pipette Method", xlab="Pct. Clay: Granulometer Method", main="Simple Plot", xlim=c(0,100), ylim=c(0,100), cex=0.6, pch=16, col=1)

#add a 1:1 corrospondance line:
abline(0,1, col="gray", lty=2)

#add silt fraction to the SAME plot:
points(p$silt ~ g$silt,  cex=0.6, pch=16, col=2)
# add sand fraction to the SAME plot:
points(p$sand ~ g$sand,  cex=0.6, pch=16, col=3)

#add a legend:
legend(x=2.7, y=94, legend=c('Clay','Silt','Sand'), col=1:3, pch=16, cex=0.6)

#add locally smoothing estimator lines: lowess(x,y)
lines(lowess(g$clay, p$clay), col=1, lwd=2)
lines(lowess(g$silt, p$silt), col=2, lwd=2)
lines(lowess(g$sand, p$sand), col=3, lwd=2)

Soil Texture Triangle 2: visualization of differences between the two methodsSoil Texture Triangle 2: visualization of differences between the two methods

 
Compare the two datasets on the textural triangle

#plot soil textures on triangle
p.tri <- soil.texture(p,show.lines=T, show.names=T, col.symbol='red', pch=16, cex=0.7, main="Soil
Texture Triangle"
)
g.tri <- triax.points(g, col.symbol='blue', pch=16, cex=0.7)

#create dataframe of segment midpoints
mid <- data.frame(x=(p.tri$x + g.tri$x) / 2, y=(p.tri$y + g.tri$y) / 2 )

#plot a lowess function along the midpoints, ordered by x-coordinate
#what does this really mean?
lines( lowess(mid[order(mid$x), ]) , lwd=2, lty=2, col='green')

# plot overall shift: average p.x,p.y ---> avg g.x,g.y
arrows(mean(p.tri$x), mean(p.tri$y), mean(g.tri$x), mean(g.tri$y), len=.1, lwd=2)

#add a legend:
legend(.68,.79, legend=c("pipette","granulometer"), pch=16, cex=0.7, col=c('red','blue'))

# optionally: add clutter
#plot segments connecting related measurements
segments(p.tri$x, p.tri$y, g.tri$x, g.tri$y, col='gray', lty=1)

GRASS GIS: raster, vector, and imagery analysis

GRASS Example Image: Flow Accumulation

Hydrologic Modeling
GRASS Example Image: Surface Curvatures
Surface Curvatures
GRASS Example Image: 3D Visualization of Data
3D Visualization

A geographic information system (GIS) can be useful tool for describing, modeling, and interpreting data with a spatial reference. Unlike commerical GIS software GRASS is free, and its source code is available for review, modification, or updating. Soil scientists interested in using digital soil survey information, or 3D visualization of soils data may find that GRASS will fit their needs. In addition, GRASS contains hundreds of raster-based operations, such as DEM creation, hydrologic modeling, and solar radiation modeling.

If you have a Macintosh, installation of GRASS is quite simple. Please see Lorenzo Moretti's great website with instructions and an installer: here.

If you have a UNIX machine it is possible to install a pre-compiled binary version of GRASS. However, not every source has an updated version. For the latest updates, I would suggest compiling GRASS from source code.



Articles:
QGIS GRASS Recipes
GRASS6 in a nutshell by Markus Neteler
 
More:
Links from OSS for Soil Scientists Class

Compiling from Source Code (notes)

grass_logo
The GRASS Logo

 
The official instructions from the GRASS Wiki can be found here. My notes on compiling (specific to my requirements) are listed below. Note that you will need to adjust the flags in the configure steps according to your system.

 
Special Note when compiling GDAL with the Lizzard Tech DSDK:

  • be sure to use the full path to the DSDK, i.e. don't use shell expansion shortcuts (thanks Hobu)
  • A simple modification to the DSDK source must be performed for a successful build (see notes...)

 

  1. compile the source (see notes below)
  2. on debian this isn't too hard, as most of the dependancies can be met with an 'aptitude install xxx-dev'

  3. get a sample dataset either from the GRASS website, or I would be happy to provide you with some stuff from yolo county (which can of course be obtained for free, from http://gis.ca.gov)
  4. the GRASS Book is a helpfull reference, but it is based on the older branch of GRASS, version 5.3|5.4 ... and thus the vector model is a little out of date. The GRASS mailing list also a great place to get help.

Here are some of my notes on installing it from source:

  1. Install some required packages: these are the names for Debian-based systems
    • compiler and build tools: gcc g77 g++ flex bison m4 make libcfitsio-dev
    • database includes: libsqlite3-dev libmysqlclient14-dev libwrap0-dev (postgresql-includes)
    • GDAL-specific:libhdf4g-dev python-numeric
    • GRASS-specific: libncurses5-dev zlib1g-dev libreadline5-dev libfreetype6-dev libtiff4-dev libpng3-dev tcl8.3-dev tk8.3-dev (glutg3-dev libgle3-dev libgl1-mesa-dev) or (nvidia-glx-dev) fftw-dev
    • X.org requires: libxmu-headers libxmu-dev libxxf86vm-dev libxmuu-dev libxmu-dev
  2. Get and install PROJ4 from source (simple compile)
    • Don't forget the proj-datumgrid-1.4.zip file: unzip in the 'nad' folder before the ./configure step.
  3. Get and install the latest GEOS from source (simple compile)
  4. Get and install the latest GDAL and GDAL-GRASS plugin (relatively simple compile)
    ## 32-bit mode i7:
    export CHOST="i686-pc-linux-gnu"
    export CFLAGS="-O2 -pipe -march=native -msse4 -fomit-frame-pointer"
    export CXXFLAGS="${CFLAGS}" 
    
    ## 64-bit mode PowerPC
    #optimization from powerpc64:
    export CFLAGS="-mpowerpc64 -O2" 
    export CXXFLAGS="-mpowerpc64 -O2"
    export FCFLAGS="-O2 -mpowerpc64"
    export OBJCFLAGS="-O2 -mpowerpc64"
    export FFLAGS="-O2 -mpowerpc64"
    

     
    Special Note when compiling GDAL with NetCDF Support:

     
    I usually use the following HDF4 configure script:

    ./configure --disable-netcdf --disable-fortran --prefix=/usr/local

     
    I usually use the following GDAL configure script:

    ./configure --with-png=internal --with-jpeg=internal \
    --with-gif=internal --with-libtiff=internal --with-geotiff=internal \
    --with-netcdf --with-hdf4 \
    --without-ogdi --with-python --with-sqlite3 --with-spatialite \
    --with-threads=yes --without-grass \
    --with-mrsid=/home/dylan/src/Geo_DSDK-6.0.7.1407/ --with-opencl --with-curl

     
    Note that you will need to leave the Lizard Tech DSDK folder in the exact same location for subsequent compilation of GRASS.

  5. Checkout the latest development version of GRASS 6 via Subversion:
  6. # check out the software:
    svn co http://svn.osgeo.org/grass/grass/branches/develbranch_6 grass6_dev/
    
    # or update a local copy:
    svn up
    

     
    I usually use the following configure script:

    ./configure --with-tcltk-includes=/usr/include/tcl8.4 \
    --with-postgres --without-odbc --with-mysql \
    --with-mysql-includes=/usr/include/mysql/ \
    --with-freetype --with-freetype-includes=/usr/include/freetype2 \
    --with-readline --with-cxx --enable-largefile \
    --with-postgres-includes=/usr/local/pgsql/include/ \
    --with-postgres-libs=/usr/local/pgsql/lib/ --with-sqlite --with-python \
    --with-proj-share=/usr/local/share/proj/ \
    --with-cairo --with-pthread --with-wxwidgets
  7. Get and install the GDAL-GRASS plugin: GRASS Wiki Page on this topic
  8. Here are the steps that I use:

    #link GRASS libs to the /usr/local/lib/ folder
    cd /usr/local/lib/ ; sudo ln -s /usr/local/grass-6.5.svn/lib/*.so .
    #return to the gdal-grass plugin source directory:
    ./configure --with-grass=/usr/local/grass-6.5.svn/
    #make
    make clean && make
    #create a gdal-plug-ins folder
    sudo mkdir /usr/local/lib/gdalplugins
    #copy the compiled plugins to the gdalplugins folder
    sudo cp *.so /usr/local/lib/gdalplugins/
    #done
  9. Finally, you might need to update your /etc/ld.so.conf to include GRASS libs
  10. Here is my /etc/ld.so.conf Don't forget to run ldconfig after changing this file!

    ---------/etc/ld.so.conf---------------
    ...
    /usr/local/lib
    /usr/local/grass-6.5.svn/lib
    ...
    -----------------------------------------

GRASS and POVRAY

Adding GRASS vectors to POVRAY Scenes: v.out.pov
GRASS Vectors
povray clipped ssurgo data
SSURGO cutaway
povray slope class cutaway image
Slope class cutaway
PINN Topo-map with POVRAY
Topomap
McCabe Canyon and Temblor Formation
Panchromatic
View of Mt. Defiance from Bear Valley
Near infrared
Over PINN: Geomorphic Features
Geomorphic features.
View of Mt. Defiance from Bear Valley. 2 meter true color
View of Mt. Defiance from Bear Valley, Pinnacles National Monument. Pan Chromatic 2m res.

Thanks to Markus Neteler for initial insight on how to build an appropriate POVRAY script file.

 
GRASS pre-processing

 # C/O  Markus Neteler
# modified by Dylan Beaudette for UTM projections
# 8/22/2005

#GRASS procedures:

# Note that elevation and overlay must be the same size (dx, dy)

# set the region:
g.region res=4

# dem export cubic-resample from 10m to 4m resolution:
r.resamp.interp method=cubic in=elev_meters out=elev.4m

# reset resolution to match resampled DEM
# this might be required if your DEM does not extend beyond the extent of the resampling step
# r.resamp.interp will automatically extend the region when performing resampling, if the raster is large enough
# g.region rast=elev.4m
r.out.pov map=elev.4m tga=elev.4m.tga

# overlay export
r.out.png in=ikonos.rgb out=ikonos.4m.png

# Render with POVRAY:
povray +Ipinn.pov +Opinn.png +D +P +W800 +H800 +A

 
pinn.pov

 # include "shapes.inc"
# include "colors.inc"
# include "textures.inc"
# include "skies.inc"

# declare Sun = color rgb <0.8, 0.8, 0.8>;
# declare Scale = 1;


light_source { < 667167, 70000, 4040564 > color White }

fog {
 fog_type 2
 fog_offset 50
 fog_alt 65
 color rgb <0.80, 0.88, 0.82>
 distance 80
 up y }

camera {
   //looking at Mt. Defiance from Bear Valley
   location <667101.25310945, 600, 4042753.78420398 > //easting, elev, northing
   look_at < 664250.03420398, 618, 4037343.47947761 > //easting, elev, northing
   
   //looking at McCabe Canyon
   //location < 667259.00497512, 2000, 4036823.48258706 > //easting, elev, northing
   //look_at < 665225.75870647, 300, 4041018.51368159> //easting, elev, northing
   
   angle  60 // vertical camera aperture
   
}


height_field  {
   tga "elev.4m.tga"
   smooth
   texture {
      pigment {
          image_map { // image is always projected from -z, with front facing +z, top to +Y                    
             png "ikonos.4m.png"        //our overlay image    
//           png "tci.4m.png"        
             once      
          }
          rotate x*90 // align map to height_field
      }
   }
   finish {
          ambient 0.2         // Very dark shadows
          diffuse 0.8         // Whiten the whites
          phong 0.2          // shiny    
          phong_size 100.0    // with tight highlights
          specular 0.5
          roughness 0.05
   }
   scale < 9270, 65535, 9395 > // dx,dz,dy of extent in meters
   translate  < 660575, 0, 4036730> // west_corner, 0, south_corner
}



sky_sphere{
    pigment {planar colour_map{[0,rgb <0.9,0.9,1>][1, rgb <0.1,0.2,1>]}}
    pigment{P_Cloud4}
    scale 2 //avoid repetion
    translate 300  //dx, dz, dy
}

Using QGIS as an alternate GUI for GRASS

QGIS and GRASS Data
QGIS directly accessing GRASS data
QGIS Attribute Table
Accessing attribute data in QGIS
QGIS GRASS Toolkit
The QGIS GRASS Toolkit

 
My Notes for an Install on Debian/Linux
Note that any old gdal-include files or libraries will cause problems in the make step.

 
Getting Started

  • install the latest GEOS for the latest build of QGIS.
  • install the xorg-dev and xserver-xorg-dev packages

 
Related Links:

 
My build method: (you might want a different branch)

# checkout source
svn co https://svn.osgeo.org/qgis/trunk/qgis qgis-unstable
cd qgis-unstable/

# configure to use GRASS and a custom install of PostgreSQL
# don't build bindings... something weird with debian/unstable
cmake -D GRASS_PREFIX=/usr/local/grass-6.5.svn/ -D WITH_BINDINGS=OFF -D POSTGRES_CONFIG=/usr/local/pgsql/bin/pg_config .

# if you have a multiprocessor machine, use a parallel build
make -j4
sudo make install

 
Some troubleshooting tips:

  • Run ldd /usr/local/bin/qgis | grep "not found" to test for missing libs.. if there are any missing libs, try running sudo ldconfig to fix things.
  • To add a custom SRID to your QGIS install, add the folowing to the srs Sqlite db:
    INSERT INTO "tbl_srs" VALUES(9001,'CaSoilResource','aea','GRS80','+proj=aea +x_0=0.0 +y_0=0 +lon_0=-96 +lat_0=40.0 +lat_1=20 +lat_2=60.0 +datum=NAD83 +ellps=GRS80 +units=m +no_defs',100000,9001,0);
  • do NOT compile the source code in the root working copy directory

Importing Various Types of Vector and Raster Data

Generic data import and export commands rely on the GDAL/OGR library to be installed correctly. If you have installed GRASS via a binary package, chances are that these commands will work fine. If you have compiled GRASS and GDAL from source, and have not re-compiled GDAL with GRASS support then raster export via r.out.gdal will not work. See this page for details.

The following examples assume that all of the input data has been projected to a consistant coordinate system, and that an appropriate GRASS location has been created. Command names are in boldface type, input files are red, and the resulting GRASS data are in green. Figures are provided for additional context, and are referenced by input/output file name in the example code.

Vector Import

  • Import a SSURGO survey boundary polygon shape file from the current directory:
  • v.in.ogr dsn=. layer=soilsa_a_xxx output=boundary


Raster Import

  • Import a DOQQ file from the current directory:
  • r.in.gdal input=doqq.tif output=doqq


Text File Import

  • Import an ascii file of x,y,z records from a GPS:
  • #input_file: meters_x | meters_y | meters_z
    665015|4036139|100
    659998|4039460|123
    659651|4039686|244
    659466|4039621|400
    #-----
    v.in.ascii -z in=input_file output=gps_points
  • Import an ascii file with variable delimiters
  • # reformat input text file to single-space delimeted
    awk '{print $1, $2, $3}' classes/summer_2005/SSC105/maps/pit_locations.xy | v.in.ascii fs=" " x=1 y=2 out=ssc105s
    # export to ESRI shapefile with projection stored in .prj file
    v.out.ogr -e --v type=point in=ssc105s dsn=ssc105s.shp


GPS Import

  • Import track data from a Garmin Serial GPS. Note that gpsbabel will download points in lon/lat WGS84 format. (gpsbabel must be installed). More ideas here...

    v.in.gpsbabel -t in=/dev/ttyS0 out=gps_tracks format=garmin

Vector Operations

A nice introduction to the GRASS vector model can be found here.

Importing and Exporting from/to a Garmin GPS

Importing and exporting GRASS vector data from/to GPS unit can be done in many ways. The commands v.in.garmin and v.in.gpsbabel work well for importing track and waypoint data from a GPS. Note however that v.in.garmin requires points and track data to be stored in decimal degree format. A simpler approach for dealing with both import and export of vector data from GRASS cab be done with a little bit of awk and the program gpstrans. A few common examples are listed below. Command names are in boldface type, input files are red, and the resulting GRASS data are in green.

 

Gpsbabel

 
From GRASS

# garmin-specific notes here
http://www.gpsbabel.org/htmldoc-1.3.2/fmt_garmin.html

1. dowload data vis gpsbabel to GPX:
 - import with v.in.gpsbabel
 - convert directly to shp with gps2shp:
   http://gpx2shp.sourceforge.jp/  (source)

 
Samples from the command line.

# download waypoints from garmin to GPX file
gpsbabel -i garmin -f /dev/ttyS0 -o gpx -F garmin_points.gpx

# download track from garmin to GPX file
gpsbabel -t -i garmin -f /dev/ttyS0 -o gpx -F garmin_track.gpx

# upload to garmin GPS
gpsbabel -i gpx -f garmin_points.gpx -o garmin -F /dev/ttyUSB0

Gpstrans

 
Import all way points with labels starting with "0":

#download all waypoints from gps
gpstrans -dw > all_wpts_5-3.ascii
#output from gpstrans looks like:

Format: UTM  UTC Offset:  -7.00 hrs  Datum[066]: NAD83
W       001     01-MAY-06 22:58                                 12/31/1989 -7:00:00     10      S       0662251 4039089
W       002     01-MAY-06 23:20                                 12/31/1989 -7:00:00     10      S       0661915 4039282
W       003     01-MAY-06 23:36                                 12/31/1989 -7:00:00     10      S       0662224 4039287
W       004     02-MAY-06 00:06                                 12/31/1989 -7:00:00     10      S       0662002 4039768

 
#import only those which start with "0" into GRASS
awk '{OFS="|"} $2 ~ /^0.?/{print $9,$10,$2}' all_wpts_5-3.ascii | v.in.ascii out=new_vector columns="x int, y int, id varchar(3)"

 
Export from points to Garmin GPS:

#export with a "P" prefix to point names
v.out.ascii spring3 dp=0 | \
awk -F"|" '
BEGIN{
OFS="\t"; print "Format: UTM  UTC Offset:  -7.00 hrs  Datum[066]: NAD83"
}
{
print "W\tP"$3"     \t28-MAR-07 20:35                         \t12/31/1989 -7:00:00\t10\tS\t0"$1"\t"$2
}
' > garmin_points.waypts
#upload new waypoints to gps
gpstrans -uw garmin_points.waypts

 
Export new route to Garmin GPS:

##now make a route file, using points P1 --> P6
v.out.ascii pygmy_route_points \
| awk -F"|" 'BEGIN{OFS="\t"; print "Format: UTM UTC Offset: -7.00 hrs Datum[066]: NAD83\nR\t0\t________________"}
{split($1,easting,"."); split($2,northing,".");
print "W","P"$3,"02-MAY-06 17:19","12/31/1989 -7:00:00","10","S","0"easting[1],northing[1]}' > pygmy_route_points.route
 
#upload new waypoints to gps
gpstrans -ur pygmy_route_points.route

Raster profile along arbitrary line segments

 
Overview
A simple experiment illustrating the calculation of elevation profiles along arbitrary line segments using a combination of GRASS and R. Fractal dimension was calculated for each profile by spectral decomposition (FFT), as suggested in (Davis, 2002). Note that the elevation profiles are calculated in a rather non-standard way, as elevation is plotted against 3D distance along the original line segment. This means that resulting transect length calculations are sensitive to the resolution of the raster (or distance between sample points), across a rough surface: i.e. smaller grid spacing will result in a longer effective (3D) transect length.

 
References

  1. Davis, J.C. Statistics and Data Analysis in Geology John Wiley and Sons, 2002

Sample profile creation I: Vector representation of 2 transects performed by an adventurous soil scientist near McCabe Canyon, Pinnacles National Monument.Sample profile creation I: Vector representation of 2 transects performed by an adventurous soil scientist near McCabe Canyon, Pinnacles National Monument.

 
Create 2 profiles in GRASS

 # digitize some new lines in an area of interest
v.digit -n map=valley bgcmd="d.rast elev_meters"
v.digit -n map=mtns bgcmd="d.rast elev_meters"

# convert to points
v.to.points -i -v -t in=valley out=valley_pts type=line dmax=50
v.to.points -i -v -t in=mtns out=mtns_pts type=line dmax=50

# set resolution to match DEM
g.region -pa res=4

# extract elevation at each point
v.drape in=valley_pts out=valley_pts_3d type=point rast=elev_4m method=cubic
v.drape in=mtns_pts out=mtns_pts_3d type=point rast=elev_4m method=cubic

# export to text files
v.out.ascii valley_pts_3d > valley_pts_3d.txt
v.out.ascii mtns_pts_3d > mtns_pts_3d.txt

# start R
R


Sample profile creation II: Elevation profiles for the paths taken by the soil scientist here.Sample profile creation II: Elevation profiles for the paths taken by the soil scientist here.

 
Process profile data, and plot in R

# read in the data, note that there is no header, and the columns are pipe delimited
valley <- read.table('valley_pts_3d.txt', header=F, sep="|", col.names=c('easting','northing','elev','f'))
mtns <- read.table('mtns_pts_3d.txt', header=F, sep="|", col.names=c('easting','northing','elev','f'))


# compute the length minus 1 ; i.e. the number of points - 1
# use for the lagged distance calculation
s.valley <- length(valley$easting) - 1
s.mtns <- length(mtns$easting) - 1

# calculate the lagged distance along each component of the 3D vector
dx.valley <- valley$easting[2:(s.valley+1)] - valley$easting[1:s.valley]
dy.valley <- valley$northing[2:(s.valley+1)] - valley$northing[1:s.valley]
dz.valley <- valley$elev[2:(s.valley+1)] - valley$elev[1:s.valley]

dx.mtns <- mtns$easting[2:(s.mtns+1)] - mtns$easting[1:s.mtns]
dy.mtns <- mtns$northing[2:(s.mtns+1)] - mtns$northing[1:s.mtns]
dz.mtns <- mtns$elev[2:(s.mtns+1)] - mtns$elev[1:s.mtns]

# combine the vector components
# 3D distance formula
c.valley <- sqrt( dx.valley^2 + dy.valley^2 + dz.valley^2)
c.mtns <- sqrt( dx.mtns^2 + dy.mtns^2 + dz.mtns^2)

# create the cumulative distance along line:
dist.valley <- cumsum(c.valley[1:s.valley])
dist.mtns <- cumsum(c.mtns[1:s.mtns])

# plot the results
par(mfrow=c(2,1))
plot(dist.valley, valley$elev[1:s.valley], type='l', xlab='Distance from start (m)', ylab='Elevation (m)', main='Profile')
plot(dist.mtns, mtns$elev[1:s.mtns], type='l', xlab='Distance from start (m)', ylab='Elevation (m)', main='Profile')



Sample profile creation III: Spectral decomposition (by FFT) for each profile.Sample profile creation III: Spectral decomposition (by FFT) for each profile.

 
Compute fractal dimension of each transect by FFT

 #compute the fractal dimension via [1]
s.valley <- spectrum(valley$elev[1:s.valley])
s.mtns <- spectrum(mtns$elev[1:s.mtns])

# compute linear model of log(power) ~ log(1/frequency)
# use the slope estimate in [2]

# first define the model formuals: for plotting and for lm()
fm.valley <- log(s.valley$spec) ~ log(1/s.valley$freq)
fm.mtns <- log(s.mtns$spec) ~ log(1/s.mtns$freq)

# next compute the least-squares regression lines via lm()
lm.valley <- lm(fm.valley)
lm.mtns <- lm(fm.mtns)

# extract the slope of each regression line:
slope.valley <- coef(lm.valley)[2]
slope.mtns <- coef(lm.mtns)[2]

# compute Df - the fractal dimension from [3]
Df.valley <- (5 - slope.valley) / 2
Df.mtns <- (5 - slope.mtns) / 2

# plot the figures
plot(fm.valley, main=paste("Valley Profile\nFractal Dimension: ", round(Df.valley,3), sep=''), xlab='Log(1/frequency)', ylab='Log(Power)' )
abline(lm.valley)

plot(fm.mtns, main=paste("Mountain Profile\nFractal Dimension: ", round(Df.mtns,3), sep=''