General Purpose Programming with Scripting Languages
Some initial hints of use of various scripting languages.
Python
Ruby
Python Image Module Example: How much ink on that page?
Premise:
Thought it would be fun to compute how much ink a given poster requires, per unit area of paper, when sending to the department large-format printer. The Python Imaging Library provides several modules suitable for low-level operation on image data. A simple (and probably very inefficient) script was developed to compute the white/black percentage of an image. A script like this could be used to adjust a per-poster "ink cost", which would hopefully prevent people from wasting ink. Obviously, this computation is scale-dependent, so standardized rasterization parameters would have to be set in order for the "ink cost" calculation to be fair. More generalized or efficient approaches are always welcomed.
Implementation: (when copying/pasting note whitespace in blocks)
#!/usr/bin/env python
# invoke like this: ./white.py image.png
# idea from http://www.halfbakery.com/user/lurch
import sys
import Image
# cheap and non-robust way to access first argument
file = sys.argv[1]
# open the image, load into memory
pic = Image.open( file )
imgdata = pic.load()
# extract the dimensions of the image
xsize, ysize = pic.size
# init some counters
black = 0
white = 0
# define some colors (R,G,B)
b_color = (0,0,0)
w_color = (255,255,255)
# loop over columns
for x in xrange(xsize):
# loop over rows
for y in xrange(ysize):
if imgdata[x,y] == w_color:
white+=1
if imgdata[x,y] == b_color:
black+=1
# compute percentages
total_pixels = xsize * ysize
pct_white = (float(white) / float(total_pixels)) * 100
pct_black = (float(black) / float(total_pixels)) * 100
# formatted printing
print "%.3f%% of image is white!" % (pct_white)
print "%.3f%% of image is black!" % (pct_black)
Simple Python Interface to USGS TNM Elevation Service
Premise
Wanted a simpler way to access the USGS seamless elevation look-up service. Python seemed like a logical start. Note that the response from the USGS webservice is not correctly identified as valid XML by the python XML-parser. Therefore there is a small amount of scrubbing used to coerce the response into valid XML. Comments on why this is, or is not, a good idea are welcome. Here is a link to an interactive version.
Update It looks like the USGS service does not accept POST-style requests. I have made some small changes to the script below.
Example Listing
# nice CLI interface
import sys
from optparse import OptionParser
# url access
import urllib, urllib2
# simple XML parsing
from xml.dom import minidom
# CSV file parsing
import csv
# define service URL
base_url = 'http://gisdata.usgs.gov/XMLWebServices/TNM_Elevation_Service.asmx/getElevation?X_Value=%f&Y_Value=%f&Elevation_Units=meters&Source_Layer=-1&Elevation_Only=1'
#process command line (optparse)
parser = OptionParser()
parser.add_option("-f", "--file", dest="infile", help="input csv file containing WGS84 (lon,lat,site_id) record", metavar="FILE")
# process args
(options, args) = parser.parse_args()
#require an input file
if not options.infile:
print "ERROR: must supply an input file!"
sys.exit(1)
#open input file
try:
infile = options.infile
reader = csv.reader(open(infile, "rb"))
except:
print "ERROR: Cannot open: " + infile
sys.exit(1)
# read the csv file
for row in reader:
lon = float(row[0])
lat = float(row[1])
site_id = row[2]
# the base URL of the web service
url = 'http://gisdata.usgs.gov/XMLWebServices/TNM_Elevation_Service.asmx/getElevation'
# url GET args
values = {'X_Value' : lon,
'Y_Value' : lat,
'Elevation_Units' : 'meters',
'Source_Layer' : '-1',
'Elevation_Only' : '1', }
# make some fake headers, with a user-agent that will
# not be rejected by bone-headed servers
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent' : user_agent}
# encode the GET arguments
data = urllib.urlencode(values)
# make the URL into a qualified GET statement:
get_url = url + '?' + data
# make the request: note that by ommitting the url arguments
# we force a GET request, instead of a POST
req = urllib2.Request(url=get_url, headers=headers)
response = urllib2.urlopen(req)
the_page = response.read()
# convert the HTML back into plain XML
for entity, char in (('lt', '<'), ('gt', '>'), ('amp', '&')):
the_page = the_page.replace('&%s;' % entity, char)
# clean some cruft... XML won't parse with this stuff in there...
the_page = the_page.replace('<string xmlns="http://gisdata.usgs.gov/XMLWebServices/">', '')
the_page = the_page.replace('<?xml version="1.0" encoding="utf-8"?>\r\n', '')
the_page = the_page.replace('</string>', '')
the_page = the_page.replace('<!-- Elevation Values of -1.79769313486231E+308 (Negative Exponential Value) may mean the data source does not have values at that point. --> <USGS_Elevation_Web_Service_Query>', '')
# parse the cleaned XML
dom = minidom.parseString(the_page)
children = dom.getElementsByTagName('Elevation_Query')[0]
# extract the interesting parts
elev = float(children.getElementsByTagName('Elevation')[0].firstChild.data)
data_source = children.getElementsByTagName('Data_Source')[0].firstChild.data
# print to stdout
print "%f,%f,%s,%f,%s" % (lon, lat, site_id, elev, data_source)
Incantation
python elev_lookup.py -f ca_cities.txt
Input File CA cities in WGS84 (lon,lat,id) records
-122.32,37.78,Alameda NAS
-120.53,41.48,Alturas
-124.1,40.98,Arcata
-119.05,35.43,Bakersfield
-121.45,39.13,Beale AFB
-116.95,33.93,Beaumont
-116.62,35.28,Bicycle Lk
-116.68,34.27,Big Bear Apt
-118.6,37.60,Bishop
-120.7,39.28,Blue Canyon
-114.72,33.62,Blythe
-118.37,34.20,Burbank
Output
-122.320000,37.780000,Alameda NAS,2.123548,NED Contiguous U. S. 1/3W arc second elevation data
-120.530000,41.480000,Alturas,1331.056396,NED Contiguous U. S. 1/3W arc second elevation data
-124.100000,40.980000,Arcata,63.749836,NED Contiguous U. S. 1/3W arc second elevation data
-119.050000,35.430000,Bakersfield,148.473618,NED Contiguous U. S. 1/3W arc second elevation data
-121.450000,39.130000,Beale AFB,29.950783,NED Contiguous U. S. 1/3W arc second elevation data
-116.950000,33.930000,Beaumont,792.840881,NED Contiguous U. S. 1/3W arc second elevation data
-116.620000,35.280000,Bicycle Lk,711.765869,NED Contiguous U. S. 1/3W arc second elevation data
-116.680000,34.270000,Big Bear Apt,1697.037720,NED Contiguous U. S. 1/3W arc second elevation data
-118.600000,37.600000,Bishop,2209.536865,NED Contiguous U. S. 1/3W arc second elevation data
-120.700000,39.280000,Blue Canyon,1603.887573,NED Contiguous U. S. 1/3W arc second elevation data
-114.720000,33.620000,Blythe,120.626007,NED Contiguous U. S. 1/3W arc second elevation data
-118.370000,34.200000,Burbank,224.287033,NED Contiguous U. S. 1/3W arc second elevation data
-117.350000,33.300000,Camp Pendlet,21.895588,NED Contiguous U. S. 1/3W arc second elevation data
Text processing with the UNIX toolbox
Working with DBF files
- Extract the text contents of a DBF file with dbfxtrct, select out the 3rd and 2nd columns with awk, keep only unique rows with sort and uniq, and finally saving to a new file called 'ca_fips-name_list.txt' with shell redirection.
# dbfxtrct utility
dbfxtrct -ica_counties_aea83.dbf | awk -F"," '{print $3","$2}' | sort -g | uniq > ca_fips-name_list.txt
-
Other DBF tools include:
- dbf2mysql
- dbf2pg
- dbfadd
- dbfcreate
- dbfdump
- dbfutil1
- dbfxtrct
LaTeX Tips and Tricks
A collection of links to documents which I have found helpful in working with the LaTeX document preparation system. An internal link to all documents related to LaTeX can be accessed here.
Getting Started
CV Creation
Journal Submission
Collaborative Writing Ideas
Customizing BibTeX
Premise:
See attached files for a bibliography style (.bst) file that is compatible with the SSSAJ (Soil Science Society of America Journal) style guidelines.
General Notes:
- when citing named chapters in a book, use the incollection entry type
- when using makebst get the style as close as you can to the target, then tweak the resulting .bst file
Documentation:
http://www.andy-roberts.net/misc/latex/latextutorial3.html
A Custom Bib Style:
latex makebst
Latex to ODT conversion tips
Conversion of a .tex document to open document format (RTF)
latex2rtf file.tex
Conversion of a .tex document to open document format (ODT)
mk4ht oolatex file.tex
Some notes:
- disable hyperref package to prevent odd bug in citation listings
- input document should be latex-safe, i.e. eps figures
- need a working java installation
- commands like \onehalfspacing or begin{singlespace}...end{singlespace} may not work (setspace package)
Conversion of PDF, EPS, or Postscript figures to EMF (enhanced metafile format) for windows apps.
Note that these instructions are for unix-like systems. EMF support can be added to pstoedit via libEMF. The libEMF library contains several windows-specific coding practices, along with some sloppy use of include files. Instructions for compiling with a modern version of GCC (4.x) can be found here.
- Some related pages
- http://www.physik.tu-cottbus.de/~george/unix/emf4unix.html
- http://boshoff.za.net/linux/archives/2003_08.html
- download, hack, and compile the libEMF code (details)
- download, configure, and install pstoedit, note that you may have to manually define where the libEMF header and lib files are:
./configure --with-libemf-include=/usr/local/include/libEMF/ \
--with-libemf-lib=/usr/local/lib/
- run ldconfig as root to refresh the library cache
- [Finally] Convert a PDF file into an OpenOffice-compatible EMF file (well sort of...)
pstoedit -pta -f emf:"-OO -p -drawbb" file.pdf file.emf
- It seems like figures are clipped to smaller-than-expected bounding boxes unless the special flag -drawbb is used. One side-effect of this however is a box drawn around your figure. Also, the -pta flag is needed to ensure quality text placement. Even though the resulting EMF files may look odd on a Linux machine, they appear to work as expected in Windows.
Making better looking tables with \ctable
Premise:
The standard Latex table environment can be difficult to extend, especially when one wants to use modern constructs such as table foot notes, etc. The ctable package is a convenient approach to solving this problem. A complete worked example of a table typset using the \ctable command is presented below. Note that \ctable is a command, and therefore does not allow blank newline characters. One way to maintain readability is to trick Latex by adding a comment character to all blank lines. Link to PDF manual for ctable.
Example of table produced with ctable
Code used to produce the example above. This example was produced with TexLive.
%
% start the table: note that we cannot have extra newline characters in the ctable defs
\ctable[
cap = {logistic regression parameters},
%
caption = {Logistic regression model (M3) parameters. Coefficients, standard error, z-values, and p-values are included for each term used in the model. A separate slope and intercept term was fit to each geologic class.},
%
label = {aspect_effect:table:glm_model_properties},
%
%
]{lccc}{
%
\tnote[$\ast$]{Marginal p-values are used to determine whether each term is significantly different than 0.}
%
}{ \FL
%
Model Term & Value & Std. Error & Marginal p-values\tmark[$\ast$] \ML
%
Intercept & & & \\ \cmidrule(r){1-1}
%
\;\; andesite & 1.146e+01 & 6.920e+00 & 0.09781 \\
\;\; clastic\_volcanic & 3.471e+00 & 2.234e+00 & 0.12024 \\
\;\; coarse\_sedimentary & 4.446e+00 & 2.390e+00 & 0.06285 \\
\;\; felsic\_intrusive & 5.384e+00 & 2.714e+00 & 0.04725 \\
\;\; fine\_sedimentary & 8.527e+00 & 4.013e+00 & 0.03361 \\
\;\; rhyolite & 1.075e+01 & 4.071e+00 & 0.00828 \\
\;\; tuff & 1.657e+01 & 9.116e+03 & 0.99855 \\
%
%
\\
Slope & & & \\ \cmidrule(r){1-1}
%
\;\; andesite & -1.356e-03 & 8.448e-04 & 0.10858 \\
\;\; clastic\_volcanic & -3.588e-04 & 2.851e-04 & 0.20825 \\
\;\; coarse\_sedimentary & -8.165e-04 & 3.336e-04 & 0.01438 \\
\;\; felsic\_intrusive & -6.352e-04 & 3.503e-04 & 0.06977 \\
\;\; fine\_sedimentary & -1.122e-03 & 5.139e-04 & 0.02897 \\
\;\; rhyolite & -1.361e-03 & 4.873e-04 & 0.00522 \\
\;\; tuff & 6.155e-14 & 1.200e+00 & 1.00000
%
%
\LL}
PostGIS: Spatially enabled Relational Database Sytem
PostGIS is a set of extensions to the relational database management system PostgreSQL, which provide access to spatial constructs, operators, and functions. In other words, PostGIS is a spatially-aware version of Postgresql. Paul Ramsey of Refractions Research has written up a nice summary of how were are using PostGIS.

Tuning Tips:
- Adjust kernel resources to allow larger memory allocation: (documentation)
- Increase the shared_buffers to a larger number (16000)
- See attached PDF at the bottom of this page for excellent ideas from Kevin Neufeld
PostGIS Syntax Examples:
- PostGIS Spatial SQL cheat-sheet
- Official Documentation
- PostGIS Wiki
PostGIS Presentations:
- Spatial Analysis with PostGIS presentation at PGCon2009 (Leo Hsu and Regina Obe)
- Tips for the PostGIS Power User (Kevin Neufeld, Refractions Research)
- PostGIS Spatial Database: Introduction and Case Studies (Paul Ramsey)
- Transitioning Low Earth Orbit Satellite Archive Data from Informix (Geodetic DataBlade) to PostgreSQL (PostGIS) (Churngwei Chu, NASA/SSAI)
- Intro to PostgreSQL/PostGIS and MapServer (Arnulf Christ)
Importing and Exporting
Tips
- Attribute data stored with a shapefile (the .dbf file) can only use column names shorter than 14 characters. If you try and export a postGIS table that has column names longer than 14 characters, they will be truncated when converted to a shapefile.
- If a column in postGIS is defined as numeric, ogr2ogr will not correctly interpret the values in this column: i.e. floating point values less than 1 will be truncated to 0. In order to avoid this problem always define your colums as double precision or float. For dynamically generated tables, cast affected columns to a defined numeric precision: select colum_a::numeric(7,3) where '7' refers to the total number of digits in the column, and '3' refers to the number of decimal places.
Geometry and attribute data to GIS file format
GDAL/OGR tools This approach allows simultaneous conversion of coordinate systems, but is less flexible with respect to generation of new tables in PostGIS.
- Import
ogr2ogr -f "PostgreSQL" PG:'dbname=ssurgo_combined user=xxxx password=xxxx host=postgis.server.edu' input_file.shp
- Export
ogr2ogr output_file.shp PG:'dbname=ssurgo_combined user=xxxx password=xxxx host=postgis.server.edu' tablename
Note that tables must be correctly 'registered' in the geometry_columns table for this to work:
INSERT INTO geometry_columns VALUES ('','public','tablename','wkb_geometry',2,SRID,'geomtype');
PostGIS Loader/Dumper This approach is the simplest, but does not allow on-the-fly conversion of coordinate systems.
- Import
shp2pgsql -s SRID -c -g wkb_geometry -I shapefile.shp schema.table | psql -U username -h host database
Note that SRID is the PostGIS 'spatial ref. sys. id' (see the spatial_ref_sys table). See the manual page for shp2pgsql for a complete list of arguments and their meanings.
- Export
pgsql2shp -f shapefile.shp -h host -u username -P password -k -g wkb_geometry database schema.table
See the manual page for pgsql2shp for a complete list of arguments and their meanings.
Where tablename is your newly created table, SRID is the SRID (spatial reference ID) for the geometry in this table, and geomtype is the type of geometry: POINT, LINE, POLYGON, etc.
Attribute data to text format
CSV format, from within the psql client
\copy tablename TO 'filename.csv' CSV HEADER
CSV format, via psql client
echo "select column_list from table_list " | psql --tuples --no-align -F "," database > file.csv
Tabular data to HTML format, via psql client See output below:
echo "select column_list from table_list " | psql --html database > file.html
HTML output from psql
| area |
compname |
| 132472.230854819 |
Hilmar variant |
| 322819.967391312 |
Oneil |
| 362729.418301135 |
Carranza |
| 431948.171760353 |
Tuff rockland |
| 448784.927049035 |
Gravel pits |
| 500763.225267798 |
Snelling variant |
| 518860.954990617 |
Foster |
| 571640.132661382 |
Alamo |
| 648973.748756059 |
Toomes |
| 924327.631201791 |
Dumps |
(10 rows)
Example Spatial SQL Operations on Point, Line and Polygon Geometry Types
- Points
- Lines
- Polygons
- Mixed
Affine Transformation Operations in PostGIS
Overview
The ST_Affine() function from PostGIS is useful for manipulating geometries, but requires the elements of a transformation matrix. This page documents progress on automating the computation of the transformation matrix by least-squares (Bruce Rindahl) via SQL. This would allow a pure PostGIS solution to computing and applying affine transformations to geometry data.
An open-source algorithm for computing the transformation matrix
Example code from GRASS (v.transform) was used as a template.
Approach
Compute transformation matrix based on a table of control points, stored as numbers.
Evaluation of results
Comparable to output from a similar analysis done in R, and the original algorithm as implemented in v.transform (GRASS).
Case Study: Fixing Bad TIGER Line data with R and PostGIS
Example of bad Tiger data in Stanislaus County: Red lines are the original road network, green lines are the corrected road network.
The Problem
The US Census does a nice job of collecting all sorts of geographic and demographic information every 10 years. This data is available free of charge in the rather complex and soon to be replaced TIGER/LINE format. While this data covers the entire US down to the local road level, there are numerous errors and even extreme cases of coordinate-shift. Here is an example from Stanislaus County, California. The original TIGER data (red lines) are offset several hundred meters from the imagery. While it is not clear what may have caused the problem, it can be fixed without much effort using an affine transformation. We do not have the transformation matrix, however it can be 'fit' to a set of control points by several methods. The general form of the affine transform can be conveniently represented in homogeneous coordinates as:
Affine Matrix Form in homogeneous coordinates:New coordinates on the left-hand side, old coordinates on the right-hand side. The transformation matrix is the 3x3 matrix in the center.
The Solution
We first need a set of control points, good and bad coordinates. This can be accomplished in several ways, we used the d.where command in GRASS:
# load road subset:
v.in.ogr dsn=PG:'dbname=tiger2005se user=xxx password=xxx host=xxx' layer=stan_roads out=s
#
# pick up some control points: note that they will be interleaved with this approach
# odd line numbers are "bad" coordinates
# even line numbers are "good" coordinates
d.where > stan.c_points
#
# we can separate the "bad" and "good" points in R...
Computing the transformation matrix can be done with a simple regression between 'good' and 'bad' coordinates in R. Note that this approach was suggested by Prof. Brian Ripley on the R-help mailing list.
Compute the Affine Transformation Matrix in R
## read in control points
## the good and bad points are interleaved
## odd numbers are 'bad' points
## s <- read.table('stan.c_points')
## s.bad <- s[seq(from=1, to=nrow(s), by=2), ]
## s.good <- s[seq(from=2, to=nrow(s), by=2), ]
##
## make a composite dataframe
## x,y = bad data
## nx,ny = good data
## c <- data.frame(x=s.bad$V1, y=s.bad$V2, nx=s.good$V1, ny=s.good$V2)
##
## save a local copy in the format that v.transform uses
## write.table(c, file='grass_control_pts.txt', row.names=F, col.names=F)
##
## from now on just use the simplified format:
c <- read.table('grass_control_pts.txt')
names(c) <- c('x','y','nx','ny')
## compute transformation matrix, and check model fit:
l <- lm(cbind(nx,ny) ~ x + y, data=c)
##
## see at bottom of page -->
summary(l)
## convert to affine transform matrix to the form needed by PostGIS:
## transpose the regression coefficient matrix:
t(coef(l))
(Intercept) x y
nx 5017.082 1.00231639 0.00918962
ny -28138.395 -0.01352029 0.99740077
## check graphically:
x <- seq(-2080000, -2040000, by=1000)
y <- seq(-20000, 5000, by=2000)
d <- expand.grid(x=x, y=y)
p <- predict(l, d)
par(mfcol=c(1,2), pty='s')
plot(y ~ x, data=c, main='Control Points', cex=.4, pch=16)
points(ny ~ nx, data=c, col='red', cex=.4, pch=16)
## arrows(c$x, c$y, c$nx, c$ny, len=0.05, col='gray')
plot(d, cex=0.2, main='Shifted Grid')
## arrows(d$x, d$y, p[,1], p[,2], len=0.05, col='gray')
points(p, col='red', cex=.2)
## sample transformation
print(predict(l, data.frame(x=-2078417.35893968, y=-14810.57043808)), digits=10)
nx ny
-2078350.804 -14809.67334
Establishing the transformation based on control points: Red points represent where the coordinates should be. Black points are the original and incorect coordinates.
Check Affine Transform Results in PostGIS
-- matrix format:
-- R:
-- | xoff a b |
-- | yoff d e |
-- postgis wants:
-- ST_Affine(geom, a, b, d, e, xoff, yoff)
-- as a query:
SELECT asText(
ST_Affine(
'POINT(-2078417.35893968 -14810.57043808)',
1.00231639, 0.00918962, -0.01352029, 0.99740077, 5017.082, -28138.395
)
) ;
astext
--------------------------------------------
POINT(-2078350.80564006 -14809.6639251817)
-- This is the same as what R says!
Perform Affine Transformation in PostGIS
--
-- Step 1. Alter the geometry of the bad data in-place
-- note that we have a backup in 'stan_roads'
--
UPDATE roads SET wkb_geometry = ST_Affine(wkb_geometry, 1.00231639, 0.00918962, -0.01352029, 0.99740077, 5017.082, -28138.395 ) WHERE module = 'TGR06099' ;
Regression Diagnostics
Response nx :
Call:
lm(formula = nx ~ x + y, data = d)
Residuals:
Min 1Q Median 3Q Max
-207.088 -23.856 8.614 21.245 161.610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.017e+03 1.369e+03 3.666 0.00079 ***
x 1.002e+00 6.654e-04 1506.386 < 2e-16 ***
y 9.190e-03 9.419e-04 9.756 1.20e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 52.25 on 36 degrees of freedom
Multiple R-Squared: 1, Adjusted R-squared: 1
F-statistic: 1.322e+06 on 2 and 36 DF, p-value: < 2.2e-16
Response ny :
Call:
lm(formula = ny ~ x + y, data = d)
Residuals:
Min 1Q Median 3Q Max
-39.835 -18.459 -4.556 15.311 94.226
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.814e+04 7.409e+02 -37.98 <2e-16 ***
x -1.352e-02 3.602e-04 -37.54 <2e-16 ***
y 9.974e-01 5.099e-04 1956.22 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 28.28 on 36 degrees of freedom
Multiple R-Squared: 1, Adjusted R-squared: 1
F-statistic: 2.187e+06 on 2 and 36 DF, p-value: < 2.2e-16
Comparision with output from v.transform
First the output from R:
Looking at the residuals from the regression model used to map bad coordinates (x,y) to good coordinates (nx,ny):
## we made the linear model object 'l' above
## extract as dataframe
## residuals are computed for x and y separately
r <- as.data.frame(resid(l))
## compute vector product of the (x,y) residuals:
c$resid <- sqrt(r$nx^2 + r$ny^2)
print(c)
x y nx ny resid
1 -2078417 -14810.570 -2078314 -14838.378 46.617600
2 -2078743 -16057.955 -2078636 -16081.790 62.041274
3 -2077261 -16435.348 -2077170 -16463.156 40.905132
4 -2076709 -14405.369 -2076606 -14433.177 29.406399
5 -2074179 -15830.248 -2074084 -15901.558 33.111981
6 -2073850 -15707.435 -2073763 -15798.554 37.362736
7 -2073450 -13873.171 -2073359 -13920.712 21.623235
8 -2072359 -15204.613 -2072276 -15323.138 38.997678
9 -2072545 -14402.596 -2072450 -14513.219 32.918889
10 -2072189 -16022.434 -2072098 -16129.106 33.834074
11 -2071991 -16856.058 -2071928 -16942.976 6.277554
12 -2068407 -12999.396 -2068296 -13133.170 6.579285
13 -2069870 -12613.813 -2069764 -12731.848 2.357631
14 -2067635 -13188.253 -2067517 -13337.765 11.604519
15 -2066931 -13377.110 -2066809 -13518.753 22.719625
16 -2067411 -15084.692 -2067313 -15190.924 41.907273
17 -2066795 -18714.093 -2066741 -18846.019 14.541358
18 -2066384 -17080.538 -2066299 -17212.464 26.717495
19 -2068634 -19742.339 -2068580 -19835.464 27.483654
20 -2053326 -16930.710 -2053276 -17226.351 65.746074
21 -2051797 -17321.500 -2051899 -17579.762 227.516944
22 -2068307 2826.921 -2068066 2638.276 12.587853
23 -2067543 2648.205 -2067328 2449.631 37.729747
24 -2067126 4276.510 -2066904 4081.246 46.774630
25 -2066748 4170.604 -2066527 4001.816 59.509843
26 -2066068 2292.295 -2065860 2094.699 46.681553
27 -2065337 2107.872 -2065126 1900.397 43.386956
28 -2064606 1913.570 -2064378 1692.922 26.460389
29 -2064199 3558.561 -2063961 3356.401 47.742696
30 -2037464 6512.455 -2037076 5864.398 50.762994
31 -2036722 6825.682 -2036338 6199.227 22.699467
32 -2036876 6366.642 -2036498 5742.888 22.120176
33 -2040225 7150.180 -2039706 6575.029 161.631199
34 -2041064 7144.779 -2040732 6569.629 26.657582
35 -2044702 -15548.033 -2044564 -16024.903 23.844817
36 -2043992 -15723.521 -2043824 -16223.282 48.063840
37 -2043790 -14907.119 -2043611 -15383.990 34.844851
38 -2040616 -14820.445 -2040453 -15349.974 21.196233
39 -2039595 -15081.427 -2039485 -15588.263 47.263287
The Root-Mean-Square-Error (RMSE) for the fitted transform (in meters) is:
apply( (fitted(l) - c[, c('nx', 'ny')])^2 , 2, function(x) sqrt(mean(x)) )
nx ny
50.19954 27.17333
#converting to the vector sum of RMSE (as reported in v.transform):
sqrt(50.19954^2 + 27.17333^2)
57.08225
The output from v.transform on the same set of control points:
Transformation Matrix
| xoff a b |
| yoff d e |
-------------------------------------------
5301.399323 1.002469 0.009172
-28155.882288 -0.013530 0.997547
-------------------------------------------
full output including the residuals:
CHECK MAP RESIDUALS
Current Map New Map
POINT X coord Y coord | X coord Y coord | residuals
1. -2078417.36 -14810.57 | -2078314.07 -14838.38 | 46.81
2. -2078743.11 -16057.95 | -2078635.85 -16081.79 | 62.22
3. -2077261.34 -16435.35 | -2077169.97 -16463.16 | 41.05
4. -2076709.16 -14405.37 | -2076605.87 -14433.18 | 29.59
5. -2074178.76 -15830.25 | -2074083.67 -15901.56 | 33.21
6. -2073849.93 -15707.44 | -2073762.78 -15798.55 | 37.42
7. -2073449.80 -13873.17 | -2073358.68 -13920.71 | 21.62
8. -2072358.86 -15204.61 | -2072275.89 -15323.14 | 39.02
9. -2072544.55 -14402.60 | -2072449.73 -14513.22 | 32.97
10. -2072188.97 -16022.43 | -2072098.11 -16129.11 | 33.87
11. -2071991.43 -16856.06 | -2071928.22 -16942.98 | 6.27
12. -2068406.55 -12999.40 | -2068296.38 -13133.17 | 6.60
13. -2069870.19 -12613.81 | -2069763.96 -12731.85 | 2.33
14. -2067635.38 -13188.25 | -2067517.35 -13337.76 | 11.63
15. -2066931.10 -13377.11 | -2066809.13 -13518.75 | 22.74
16. -2067411.11 -15084.69 | -2067312.75 -15190.92 | 41.93
17. -2066795.16 -18714.09 | -2066740.84 -18846.02 | 14.64
18. -2066383.87 -17080.54 | -2066298.50 -17212.46 | 26.74
19. -2068634.37 -19742.34 | -2068580.05 -19835.46 | 27.53
20. -2053326.48 -16930.71 | -2053275.51 -17226.35 | 66.09
21. -2051797.30 -17321.50 | -2051899.25 -17579.76 | 227.91
22. -2068307.24 2826.92 | -2068065.64 2638.28 | 12.41
23. -2067542.73 2648.21 | -2067327.61 2449.63 | 37.44
24. -2067125.72 4276.51 | -2066903.98 4081.25 | 46.40
25. -2066748.43 4170.60 | -2066526.69 4001.82 | 59.12
26. -2066067.79 2292.29 | -2065860.31 2094.70 | 46.35
27. -2065336.69 2107.87 | -2065125.92 1900.40 | 43.07
28. -2064605.58 1913.57 | -2064378.35 1692.92 | 26.16
29. -2064199.15 3558.56 | -2063961.13 3356.40 | 47.43
30. -2037464.39 6512.45 | -2037075.56 5864.40 | 50.66
31. -2036721.82 6825.68 | -2036338.39 6199.23 | 22.54
32. -2036875.74 6366.64 | -2036497.71 5742.89 | 21.95
33. -2040224.67 7150.18 | -2039706.23 6575.03 | 161.54
34. -2041064.45 7144.78 | -2040732.32 6569.63 | 26.74
35. -2044701.68 -15548.03 | -2044564.34 -16024.90 | 23.64
36. -2043992.10 -15723.52 | -2043824.24 -16223.28 | 47.60
37. -2043789.90 -14907.12 | -2043610.60 -15383.99 | 34.35
38. -2040615.94 -14820.44 | -2040453.30 -15349.97 | 20.77
39. -2039594.70 -15081.43 | -2039485.02 -15588.26 | 47.85
Number of points: 39
Residual mean average: 57.082951
Higher Order Transformations
... Continuing from the initial example session in R ...
An affine transformation is based on a linear mapping between two coordinate-spaces. Testing for non-linearity (i.e. higher order model terms) can be a useful diagnostic in choosing the simpler affine transform.
Compute the difference between the good and bad coordinates
## compute vector difference and save back to original data frame:
c$sqdiff <- with(c, sqrt((nx-x)^2 + (ny-y)^2))
Generate two linear models:
## generate simple model
l.simple <- lm(sqdiff ~ nx + ny, data=c)
## model with higher order terms (up to 3rd-order):
## note to self: poly() computes orthagonal polynomials
l.poly <- lm(sqdiff ~ poly(nx, 3) + poly(ny, 3), data=c)
## summarize the complex model:
summary(l.poly)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 302.38 3.97 76.174 < 2e-16 ***
poly(nx, 3)1 1091.17 35.82 30.466 < 2e-16 ***
poly(nx, 3)2 165.59 32.78 5.051 1.71e-05 ***
poly(nx, 3)3 -51.21 28.74 -1.782 0.0843 .
poly(ny, 3)1 417.35 31.97 13.056 2.30e-14 ***
poly(ny, 3)2 -18.50 30.04 -0.616 0.5425
poly(ny, 3)3 19.41 35.49 0.547 0.5882
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Is one model significantly better than the other?
## check via RMSE:
## simple model
sqrt( mean((fitted(l.simple) - c$sqdiff)^2) )
36.54747
## model with 3rd-order polynomial
sqrt( mean((fitted(l.poly) - c$sqdiff)^2) )
22.45530
## the anova function can compare nested models:
anova(l.simple, l.poly)
Analysis of Variance Table
Model 1: sqdiff ~ nx + ny
Model 2: sqdiff ~ poly(nx, 3) + poly(ny, 3)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 36 52093
2 32 19665 4 32428 13.192 1.865e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Check visually:
## re-make grid of points to predict on:
## and save two versions to 'stack-up'
x <- seq(-2080000, -2040000, by=1000)
y <- seq(-20000, 5000, by=2000)
d.simple <- expand.grid(x=x, y=y)
d.poly <- expand.grid(x=x, y=y)
## and predict the trend in differences:
d.simple$diff <- predict(l.simple, data.frame(nx=d$x, ny=d$y))
d.poly$diff <- predict(l.poly, data.frame(nx=d$x, ny=d$y))
## stack for grouped plotting
library(lattice)
library(RColorBrewer)
d.combined <- make.groups(simple=d.simple, poly=d.poly)
## plot:
wireframe(diff ~ x * y | which, data=d.combined, scales=list(arrows=TRUE), drape=TRUE, col.regions=brewer.pal(n=9, name='Oranges'), cuts=8, col='white')
Testing for linearity: Two visualizations of the deviance between coordinates positions, at the control point locations.
Conclusions:
It seems that a second order term was only warranted along the x-direction. The more complex model based on 3rd-order polynomials results in a significantly lower RMSE (about 10 meters lower), and is shown to be a better descriptor of variance in the test of nested models.
At the map scale in which the corrected data will be presented, the extra accuracy suggested by the more complex model (coordinate transformation function) is negligible. This allows for the simpler model, which can be directly used by the convenient ST_Affine() function in PostGIS for the heavy-lifting.
Transformation parameters query
The following is a proof of concept query showing how a PostgreSQL query could give the transformation parameters required for an affine transformation. The source of the procedure is from GRASS at transform.c.
Note that the v.transform code expects the input file (control points) to be in the form ax ay bx by, where 'a' is the starting coordinate system (the bad coordinates in the previous example) and 'b' is the target system (the good coordinates). The
SQL version is as follows:
Sample Session
-
CREATE TABLE link (
gid integer NOT NULL,
a_x double precision,
a_y double precision,
b_x double precision,
b_y double precision
);
-- load data into table
-- see attached file at bottom of page
-- possibly use the file 'grass_control_pts.csv' ???
COPY link FROM 'control_points.txt' ;
-
CREATE TEMPORARY TABLE cc AS
SELECT
(SELECT count(b_x) FROM link) AS cc00,
(SELECT sum(b_x) FROM link) AS cc01,
(SELECT sum(b_y) FROM link) AS cc02,
sum(l1.b_x * l2.b_x) AS cc11,
sum(l1.b_x * l2.b_y) AS cc12,
sum(l1.b_y * l2.b_y) AS cc22
FROM link l1 , link l2 WHERE l1.gid = l2.gid;
-
CREATE TEMPORARY TABLE aa AS
SELECT
(SELECT sum(a_y) FROM link) AS aa0,
sum(l1.a_y * l2.b_x) AS aa1,
sum(l1.a_y * l2.b_y) AS aa2
FROM link l1 , link l2 WHERE l1.gid = l2.gid;
-
CREATE TEMPORARY TABLE bb AS
SELECT
(SELECT sum(a_x) FROM link) AS bb0,
sum(l1.a_x * l2.b_x) AS bb1,
sum(l1.a_x * l2.b_y) AS bb2
FROM link l1 , link l2 WHERE l1.gid = l2.gid;
-
CREATE TEMPORARY TABLE det AS
SELECT
cc00*cc11*cc22 +
cc01*cc12*cc02 +
cc02*cc01*cc12 -
cc00*cc12*cc12 -
cc01*cc01*cc22 -
cc02*cc11*cc02 AS det FROM cc;
-
CREATE TEMPORARY TABLE inv_cc AS
SELECT
(cc11*cc22-cc12*cc12)/det AS cc00,
(cc12*cc02-cc01*cc22)/det AS cc01,
(cc01*cc12-cc11*cc02)/det AS cc02,
(cc00*cc22-cc02*cc02)/det AS cc11,
(cc01*cc02-cc00*cc12)/det AS cc12,
(cc00*cc11-cc01*cc01)/det AS cc22
FROM cc,det;
-
SELECT
(aa0*cc00+aa1*cc01+aa2*cc02) AS b0,
(aa0*cc01+aa1*cc11+aa2*cc12) AS b1,
(aa0*cc02+aa1*cc12+aa2*cc22) AS b2,
(bb0*cc00+bb1*cc01+bb2*cc02) AS b3,
(bb0*cc01+bb1*cc11+bb2*cc12) AS b4,
(bb0*cc02+bb1*cc12+bb2*cc22) AS b5
FROM aa,bb,inv_cc;
This query requires a table called link with the following fields - gid (primary key), a_x, a_y , b_x, b_y. The 'a' values are the 'from' coordinates and the 'b' values are the 'to' coordinates. Using the attached control points the result of this query is:
b0 | b1 | b2 | b3 | b4 | b5
------------------+---------------------+-------------------+------------------+------------------+---------------------
-28138.394850347 | -0.0135202854235867 | 0.997400773420259 | 5017.08164289594 | 1.00231638907948 | 0.00918961946271679
These are the exact results from 'R'
User defined Function in PL/pgSQL to compute the transformation parameters
Based on the results of the proof of concept example developed previously, a single function was developed in the procedural language for the PostgreSQL database system called PL/pgSQL. The only input parameter to this procedure is a text string that results in a table in the following format:
gid (integer), from_x (float), from_y (float), to_x (float), to_y (float)
Note the table must have the above fields but they can be in any order and can have additional fields. The gid field must be unique for each record.
The use of a SQL query found to be the simplest way to avoid the difficulties in programing the procedure without the need for temporary tables. An added benefit is the control point data can be in almost any format as long as it can be arranged in the format specified above. For example if the "from" points are in a geometry column (the_geom) in a table called from_pts and the corresponding "to" points are in a similar table called to_pts with a common attribute called "link_id", then the query would be:
SELECT from_pts.link_id AS gid, x(from_pts.the_geom) AS from_x, y(from_pts.the_geom) AS from_y, x(to_pts.the_geom) AS to_x, y(to_pts.the_geom) AS to_y FROM from_pts, to_pts WHERE from_pts.link_id = to_pts.link_id
Other table layouts and queries are possible depending on the manner in which the control points are collected.
The following is the SQL code to add a new procedure called trans_param() into a PostGIS database:
-- Function: trans_param(IN sql text, OUT a double precision, OUT b double precision, OUT d double precision, OUT e double precision, OUT xoff double precision, OUT yoff double precision)
-- DROP FUNCTION trans_param(IN sql text, OUT a double precision, OUT b double precision, OUT d double precision, OUT e double precision, OUT xoff double precision, OUT yoff double precision);
CREATE OR REPLACE FUNCTION trans_param(IN sql text, OUT a double precision, OUT b double precision, OUT d double precision, OUT e double precision, OUT xoff double precision, OUT yoff double precision) AS
$BODY$
DECLARE
cc_row record;
cc_det float;
inv_cc00 float;
inv_cc01 float;
inv_cc02 float;
inv_cc11 float;
inv_cc12 float;
inv_cc22 float;
BEGIN
EXECUTE 'SELECT
count(a.to_x) as cc00,
sum(a.to_x) as cc01,
sum(a.to_y) as cc02,
sum(a.to_x * b.to_x) as cc11,
sum(a.to_x * b.to_y) as cc12,
sum(a.to_y * b.to_y) as cc22,
sum(a.from_y) as aa0,
sum(a.from_y * b.to_x) as aa1,
sum(a.from_y * b.to_y) as aa2,
sum(a.from_x) as bb0,
sum(a.from_x * b.to_x) as bb1,
sum(a.from_x * b.to_y) as bb2
from (' || sql || ') a,(' || sql|| ') b WHERE a.gid = b.gid' INTO cc_row;
SELECT INTO cc_det
cc_row.cc00*cc_row.cc11*cc_row.cc22 +
cc_row.cc01*cc_row.cc12*cc_row.cc02 +
cc_row.cc02*cc_row.cc01*cc_row.cc12 -
cc_row.cc00*cc_row.cc12*cc_row.cc12 -
cc_row.cc01*cc_row.cc01*cc_row.cc22 -
cc_row.cc02*cc_row.cc11*cc_row.cc02;
SELECT INTO inv_cc00 (cc_row.cc11*cc_row.cc22-cc_row.cc12*cc_row.cc12)/cc_det;
SELECT INTO inv_cc01 (cc_row.cc12*cc_row.cc02-cc_row.cc01*cc_row.cc22)/cc_det;
SELECT INTO inv_cc02 (cc_row.cc01*cc_row.cc12-cc_row.cc11*cc_row.cc02)/cc_det;
SELECT INTO inv_cc11 (cc_row.cc00*cc_row.cc22-cc_row.cc02*cc_row.cc02)/cc_det;
SELECT INTO inv_cc12 (cc_row.cc01*cc_row.cc02-cc_row.cc00*cc_row.cc12)/cc_det;
SELECT INTO inv_cc22 (cc_row.cc00*cc_row.cc11-cc_row.cc01*cc_row.cc01)/cc_det;
SELECT INTO a cc_row.bb0*inv_cc01+cc_row.bb1*inv_cc11+cc_row.bb2*inv_cc12;
SELECT INTO b cc_row.bb0*inv_cc02+cc_row.bb1*inv_cc12+cc_row.bb2*inv_cc22;
SELECT INTO d cc_row.aa0*inv_cc01+cc_row.aa1*inv_cc11+cc_row.aa2*inv_cc12;
SELECT INTO e cc_row.aa0*inv_cc02+cc_row.aa1*inv_cc12+cc_row.aa2*inv_cc22;
SELECT INTO xoff cc_row.bb0*inv_cc00+cc_row.bb1*inv_cc01+cc_row.bb2*inv_cc02;
SELECT INTO yoff cc_row.aa0*inv_cc00+cc_row.aa1*inv_cc01+cc_row.aa2*inv_cc02;
END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;
ALTER FUNCTION trans_param(IN sql text, OUT a double precision, OUT b double precision, OUT d double precision, OUT e double precision, OUT xoff double precision, OUT yoff double precision) OWNER TO postgres;
Currently there is no error checking in the code if the determinant is zero.
To use the procedure simply use: SELECT trans_param('my SQL text')
The data from this series of articles is stored in a table called link. The id of the points is gid and the "from" values are b_x and b_y. The "to" values are a_x and a_y. Thus the query is:
SELECT gid,b_x AS from_x, b_y AS from_y, a_x AS to_x, a_y AS to_y FROM link
The result of the procedure is:
SELECT * FROM trans_param(
'SELECT gid,b_x as from_x, b_y as from_y, a_x as to_x, a_y as to_y from link')
0.997546509279282;-0.00917177514909895;0.0135300872142122;1.00246938174737;-5301.39933295548;28155.8822879205
Which matches the results returned from GRASS and R. Additional queries will be developed to give a table of residuals, the RMS error and the actual transformation of geometry.
Control Points Table Format
This is a start for discussions to create a series of function to perform an affine transformation of a PostGIS data set using a table of control points.
The first step is what is the format or layout of the control points?
I don't think point geometries is necessarily a good idea. The points must have an exact 1 to 1 relationship. To assure this you either have to maintain absolute integrity on the keys between two tables or have two geometries in one table. Both would be a hassle. An interleaved table format also would give me problems because the queries get really difficult and what if one version used "good" points in the odd rows and the next one put them in the even ones? In addition, if you use a seguential id and add one row "good" then add a "bad" one, then realize the bad one is really bad and delete it then the gid's will not be odd/even anymore in relation to "good" vs "bad". I think ESRI has the right idea here (God did I actually say that?). Look at the link table interface when you are georeferencing an image. It gives the id, XSource, YScource, Xmap, YMap, and the residual error. This could easily be done with the code - just one more query.
Given a table in this format a query could be done to give the RMS error. Also the table could be returned with the residual errors. When the user is happy then the table and the geometry could be input to transform the geometry.
Bruce
Analysis of SSURGO Data in PostGIS: An Overview
News
It looks like the NCSS has constructed a web-based, SQL interface to their main database. This new tool was recently highlighted in issue 40 of the NCSS Newsletter and looks like a promising new tool with good documentation. A listing "web services" offered by the NCSS is listed on this page.
Overview
The analysis of SSURGO data is complicated by a series of hierarchical, one-to-many tabular relationships between spatial and attribute data. Understanding the structure of the SSURGO database is critical for correct interpretation and aggregation of soil properties. Before undertaking any SSURGO-based analysis please take some time to become familliar with the details on the product, including intended uses, minimum mapping units sizes, and other important details. In addition, becoming familliar with the SSURGO metadata is a must. The metadata page includes detailed descriptions of table structure, column names and units, as well as important information on the sources of much of the tabular data included in a SSURGO database.
SSURGO data can be downloaded by survey area from The Soil Datamart, with spatial data delivered in shapefile format and attribute data delivered as plain text. Unfortunately, an M.S. Access database template is required to utilize SSURGO attribute data as delivered from the Soil Datamart. For assistance with this procedure please see the NRCS SSURGO page. Using this approach, most analysis of SSURGO must be done survey-by-survey within a GIS environment. For general instructions see this document.
Several online applications allow for simple interaction with the SSURGO database without the need for a GIS or RDBMS knowledge. Some examples include:
An Open Source Approach to SSURGO
We have developed an alternate approach to working with SSURGO data using PostGIS, a spatially-enabled version of the popular and open source relational database management system Postgresql, to store all spatial and tabular data for 138 survey areas. This approach facillitates rapid access, analysis, and aggregation of over half a million soil polygons. SQL (structured query language) is used to directly interact with both soil spatial and attribute data. If other forms of spatial data are imported into PostGIS (such as landcover, climatic data, etc.) nearly all spatial and attribute analysis can be done entirely from PostGIS. A series of examples illustrating common tasks will be presented in the following pages.
General Approach to Working with SSURGO (also outlined in this document) (more ideas on map unit composition)
- Identify the soil properties that are to be included in some analysis
- Decide on the appropriate form of aggregation to be used to summarize horizons
- depth weighted (i.e. to calculate average percent clay through horizons)
- top 1m (i.e. to summarize shrink-swell capacity of the topsoil)
- top horizon (i.e. to summarize surface organic carbon)
- profile sum (i.e. to calculate total water holding capacity)
- most limiting (i.e. to calculate the most limiting hydraulic conductivity within the soil profile)
- Aggregate horizon data
- Decide on the appropriate form of aggregation to be used to summarize components
- component percent weighted (i.e. this will include information from each component, weighted by their estimated percentage of the entire map unit)
- largest component (i.e. this usually result in the selection of the 'dominant' component, however when there are multuple components with the same estimated percentage ties will occur)
- major component flag (i.e. components flagged as a 'major component' by the NRCS represent dominant soil types within a map unit. note that there are sometimes multiple components flagged as 'major components'.)
- dominant condition (i.e. this is usually used for categorical data like hydric conditions. aggregation is performed by selecting the most frequent condition within a map unit)
- Aggregation of component data
- join the above aggregated data (2x aggregation process for horizon data) to the map unit polygons
- See the diagram at the bottom of this page for a graphical summary
SSURGO discontinuity example: Boundary between Glenn and Colusa counties, illustrating differences in soil survey vintage.
Notes on the Format of SSURGO
- Adjacent surveys may have been composed by different individuals, and may be of widely different vintages. Any given survey must comply with basic standards, but older surveys reflect a more generalized approach than more modern surveys. The figure to the right illustrates such differences.
- Polygons represent a repeating pattern of legend entries: groups of map-able soil concepts called map units
- Map unit data is stored in the mapunit table, and is referenced by the field mukey
- Pre-aggregated map unit data is stored in the muaggatt table, and is referenced by the field mukey
- Map units are comprised of multiple, unmapped soil types called 'components'
- Component data is stored in the component table, and is referenced by the field cokey
- Soil components (or soil type) are associated with multiple horizons
- Horizon data is stored in the chorizon table, and is referenced by the field cokey
- Since there is a 1:many:many (mapunit:component:horizon) relationship between spatial and horizon-level soil property data two aggregation steps are required in order to produce a thematic map
- This diagram illustrates the hierarchy of scales involved in soil survey information.
SSURGO Database Structure Diagram
Logistics: Getting Connected and Executing Queries
Building and Saving SQL code snippets
A simple text editor is the best environment for working on (and saving!) your SQL queries. Linux users are encouraged to use either 'Kate' or 'Kwrite'. Windows users should look into Notetab Light or SciTE. Mac users should check out TextWrangler.
Connecting to the database
In general, the simplest way to interact with our composite soil survey database is by connecting to the best with an SSH client. Mac/Linux users have this functionality built-in. Windows users will need to use something like Putty, or Xming. Once an ssh connection with the beast has been setup, you can connect to the database with the following:
psql -U soil ssurgo_combined
where psql is the postgresql client program, -U soil means connect as the user called "soil", and ssurgo_combined is the name of the actual database.
You can quit from the database shell using \q (followed by enter). Typing \? (followed by enter) will give a list of commands available in the psql shell.
Query structure and SQL
Numerous resources exisit for learning about SQL. See the attached PDF presentation at the bottom of this page for some SSURGO-related examples. I would recommend looking at the first couple chapters from the PostgreSQL book by Douglas and Douglas (the big purple book on the shelf). For an interactive learning approach SQL Zoo seems like a good start. In general most queries will have the format:
SELECT column_x, column_y, column_z
FROM table_x
WHERE column_x = 'something'
-- optional
GROUP BY column_x
ORDER BY column_x ; -- semi-colon denotes end of SQL statement
A simple query: column selection and filtering
Once connected it is possible to issue queries to the database. The results of a query are normally returned to the screen. For example, asking for the horizon boundaries and horizon names from the componenent identified with the given cokey (467038:646635) might look like this:
-- the query
SELECT cokey, hzname, hzdept_r, hzdepb_r
FROM chorizon
WHERE cokey = '467038:646635' ;
-- the results
cokey | hzname | hzdept_r | hzdepb_r
---------------+--------+----------+----------
467038:646635 | Ap | 0 | 18
467038:646635 | Bw | 18 | 41
467038:646635 | Bk1 | 41 | 69
467038:646635 | Bk2 | 69 | 109
467038:646635 | Bk3 | 109 | 145
467038:646635 | Bk4 | 145 | 183
A more complicated query: column calculation and aggregation
Compute the total water holding capacity for a given component, identified by (467038:646635). Note that several operations are being performed on the data:
- compute the available water holding capacity (in centemeters) for each horizon [green text]
- compute the profile depth (in centemeters) for each horizon [green text]
- aggregate the profile data by summing AHC and depth of each horizon [red text]
-- the query
select cokey, sum( (hzdepb_r - hzdept_r) * awc_r) as component_whc, sum( (hzdepb_r - hzdept_r) ) as depth
from chorizon
where cokey = '467038:646635'
group by cokey ;
-- the results
cokey | component_whc | depth
---------------+---------------+-------
467038:646635 | 27.91 | 183
Note that it is also possible to save the results of a query into a new table. This is the usual strategy when performing any analysis that returns geometry (soil polygons, etc.) as geometry data cannot be visualized in a text display. Tables created in this manner can be exported from the database for map creation or further analysis. Spatial tables can be viewed directly by applications like QGIS or Mapserver. A simple method of accessing PostGIS data is not yet possible with ArcGIS.
Checking Type Locations
Just Checking
-- NAD27 to NAD83
echo 119d7\'4\"W 36d23\'13\"N | cs2cs +proj=latlong +datum=NAD27 +to +proj=latlong +datum=NAD83 -f "%.6f"
-119.118718 36.386894 0.000037
-- the relevent SRIDs
srid | proj4text
------+----------------------------------------------------
4267 | +proj=longlat +ellps=clrk66 +datum=NAD27 +no_defs
4269 | +proj=longlat +ellps=GRS80 +datum=NAD83 +no_defs
4326 | +proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs
-- NAD27 type location point
\SET pt Transform(SetSRID(ST_MakePoint(-119.1177777, 36.38694444), 4267), 9001)
-- find intersecting map unit, and the NAD83 coords
SELECT AsText(Transform(:pt, 4269)) AS NAD83_pt, mapunit_poly.mukey, muname
FROM mapunit_poly
JOIN mapunit
ON mapunit_poly.mukey = mapunit.mukey
AND st_intersects(wkb_geometry, :pt) ;
nad83_pt | mukey | muname
-------------------------------------------+--------+--------------------------------------------------------
POINT(-119.118718055326 36.3868941884307) | 467194 | Grangeville sandy loam, drained, 0 TO 2 percent slopes
-- 250 m buffer around NAD27 type location coords
\SET box st_buffer(Transform(SetSRID(ST_MakePoint(-119.1177777, 36.38694444), 4267), 9001), 250)
-- select overlapping map units and their associated percent overlap
SELECT mapunit_poly.mukey, muname,
round((ST_Area(ST_Intersection(wkb_geometry, :box)) / ST_Area(:box)) * 100) AS area_pct
FROM mapunit_poly
JOIN mapunit
ON mapunit_poly.mukey = mapunit.mukey
AND st_intersects(wkb_geometry, :box)
ORDER BY area_pct DESC ;
mukey | muname | area_pct
--------+--------------------------------------------------------+----------
467194 | Grangeville sandy loam, drained, 0 TO 2 percent slopes | 70
463596 | Grangeville silt loam, drained | 12
467206 | Riverwash | 11
467210 | Tujunga loamy sand, 0 TO 2 percent slopes | 7
Learning by Example: Common Queries
Selection, Filtering and Sorting
- Selecting a summary of horizon thickness, sand, silt, clay and AWC for a given component (chorizon table).
SELECT hzname AS name, hzdepb_r - hzdept_r AS thickness, sandtotal_r AS sand, silttotal_r AS silt, claytotal_r AS clay, awc_r AS awc
FROM chorizon
WHERE cokey = '467038:646635'
ORDER BY hzdept_r;
Query Results
name | thickness | sand | silt | clay | awc
--------+-----------+------+------+------+------
Ap | 18 | 35 | 37 | 28 | 0.16
Bw | 23 | 35 | 40 | 25 | 0.15
Bk1 | 28 | 35 | 40 | 25 | 0.16
Bk2 | 40 | 35 | 40 | 25 | 0.16
Bk3 | 36 | 35 | 40 | 25 | 0.16
Bk4 | 38 | 60 | 25 | 15 | 0.13
- Select some common fields from the component table:
SELECT cokey, majcompflag, comppct_r, wei, weg, tfact
FROM component
WHERE mukey = '467038'
ORDER BY comppct_r DESC;
Query Results
cokey | majcompflag | comppct_r | wei | weg | tfact
---------------+-------------+-----------+-----+-----+-------
467038:646635 | Yes | 85 | 48 | 6 | 5
467038:646636 | No | 5 | | |
467038:646638 | No | 4 | | |
467038:646640 | No | 2 | | |
467038:646639 | No | 2 | | |
467038:646637 | No | 2 | | |
- Create a new table 'yolo_comp' using the component table, for a specific survey area (Yolo County), with components sorted by their component percentages.
CREATE TABLE yolo_comp AS
SELECT * FROM component
WHERE areasymbol = 'ca113'
ORDER BY mukey, comppct_r DESC ;
The resulting table can be copied to a CSV file (in the current working directory) like this: \copy yolo_comp TO 'yolo_comp.csv' CSV HEADER
If you are done with the table, remove it with the following SQL: DROP TABLE yolo_comp ;
Identifying the Largest Components
Overview:
Two methods for the selection of the largest components (based on the comppct_r column) within map units. This approach to selecting a single component per map unit is appropriate in situations where a single representative feature or property is sought. The partitioning of components within a map unit is closely related to the map unit type: complex, association, consociation or an "undifferentiated group". A breakdown of the number of components per each map unit type is summarized by the following query (26400 map units / 45971 major components):
SELECT mapunit.mukind, round(avg(n_components)) AS avg, min(n_components), max(n_components)
FROM mapunit JOIN
(
SELECT mapunit.mukey, count(component.mukey) AS n_components
FROM
mapunit JOIN component
ON mapunit.mukey = component.mukey
WHERE component.majcompflag = 'Yes' AND mapunit.mukind IS NOT NULL
GROUP BY mapunit.mukey
) AS a
ON a.mukey = mapunit.mukey
GROUP BY mapunit.mukind ;
Query Results
mukind | avg | min | max
------------------------+-----+-----+-----
Consociation | 1 | 1 | 3
Complex | 2 | 1 | 4
Association | 3 | 1 | 4
Undifferentiated group | 2 | 1 | 4
Method 1: Filtering component percentages with the DISTINCT keyword.
- Use of the SELECT DISTINCT ON operator. Note that the ordering is done before the DISTINCT filter is applied, resulting in the largest (by component percentage) component. Note that results from ties are unpredictable.
-
SELECT DISTINCT ON (mukey) mukey, cokey, comppct_r, compname, taxorder
FROM component
WHERE areasymbol = 'ca654' AND majcompflag = 'Yes'
ORDER BY mukey, comppct_r DESC ;
Query Results
mukey | cokey | comppct_r | compname | taxorder
---------+----------------+-----------+------------+-----------
1487066 | 1487066:638252 | 80 | Auberry | Alfisols
1487067 | 1487067:638274 | 80 | Auberry | Alfisols
1487068 | 1487068:638383 | 85 | Crouch | Mollisols
1487069 | 1487069:632318 | 85 | Cajon | Entisols
1487070 | 1487070:632372 | 85 | Excelsior | Entisols
1487071 | 1487071:632519 | 85 | Kimberlina | Entisols
1487072 | 1487072:632642 | 85 | Nord | Mollisols
1487076 | 1487076:632832 | 85 | Wasco | Entisols
1487077 | 1487077:632866 | 85 | Whitewolf | Entisols
464171 | 464171:640646 | 85 | Academy | Alfisols
- We can inspect possible ties or other sources of error by looking at the smallest components.
SELECT * FROM
(
SELECT DISTINCT ON (mukey) mukey, cokey, comppct_r, compname, taxorder
FROM component
WHERE areasymbol = 'ca654' AND majcompflag = 'Yes'
ORDER BY mukey, comppct_r DESC
) AS a
ORDER BY a.comppct_r ASC ;
Query Results
mukey | cokey | comppct_r | compname | taxorder
---------+----------------+-----------+---------------------+-------------
464432 | 464432:1380237 | 10 | Ramona | Alfisols
464184 | 464184:640686 | 30 | Ahwahnee | Alfisols
464211 | 464211:640756 | 30 | Auberry | Alfisols
464210 | 464210:640752 | 35 | Auberry | Alfisols
464529 | 464529:641600 | 35 | Vista | Inceptisols
464530 | 464530:641604 | 35 | Vista | Inceptisols
464531 | 464531:641609 | 35 | Rock outcrop |
Finding the largest component (based on comppct_r) per mukey
- Identify how many map units there are in a given region or survey area. In this case San Joaquin County; areasymbol = 'ca077'.
SELECT count(mukey) FROM mapunit WHERE areasymbol = 'ca113';
Query Results
count
-------
186
- Identify which map units contain components where a conventional, select largest component percentage type approach will not work. This commonly happens when there are several major components with the same comppct_r value, within the same map unit. Any returned mukey values need to be evaluated.
SELECT component.mukey, count(component.mukey)
FROM component
JOIN
(
SELECT mukey, max(comppct_r) AS comppct FROM component WHERE majcompflag = 'Yes' AND areasymbol = 'ca077' GROUP BY mukey
) AS a
ON component.comppct_r = a.comppct AND component.mukey = a.mukey
GROUP BY component.mukey
HAVING count(component.mukey) > 1 ;
Query Results
mukey | count
--------+-------
462043 | 2
462068 | 2
- Evaluation of the two "trouble" map units identified above. Note the use of the where mukey in ('462043', '462068') syntax.
SELECT mukey, cokey, comppct_r, compname, taxorder
FROM component
WHERE majcompflag = 'Yes' AND mukey IN ('462043', '462068')
ORDER BY mukey, comppct_r DESC ;
Query Results
mukey | cokey | comppct_r | compname | taxorder
--------+---------------+-----------+------------+-----------
462043 | 462043:634006 | 45 | Dumps |
462043 | 462043:634007 | 45 | Tailings |
462068 | 462068:634159 | 30 | Honker | Alfisols
462068 | 462068:634160 | 30 | Vallecitos | Alfisols
462068 | 462068:634161 | 25 | Gonzaga | Mollisols
Profile water storage as calculated from SSURGO
Overview
This example illustrates a component-percent weighted mean water storage, based on total profile storage values. The comppct_r column is used as a weight to adjust the influence of the profile water storage values for each component: larger weights (larger components) result in more influence. When performing weighted means, use care in the selection of an appropriate weighting parameter. A selection of a weighting variable not direclty related to the property which is being averaged can result in an effect known as Simpson's paradox.
Some background on Soil Water Holding Capacity and Irrigation Management.
The general formula for computing profile water storage is defined as follows. Note that the awc_r column in the SSURGO database (chorizon table) is pre-adjusted to compensate for coarse fragments (see relevant SSURGO metadata below).
Calculate the water storage within a given component
component_whc = profile_sum_awc = sum( (hzdepb_r - hzdept_r) * awc_r )
Calculate the weighted mean value of the profile water storage, for a given map unit
mapunit_whc = sum( (comppct_r/100) * component_whc ) / sum( comppct_r/100 )
From the SSURGO 2.1 table definitions:
Column Label: AWC - Representative Value (awc_r)
The amount of water that an increment of soil depth, inclusive of fragments, can store that is available to plants. AWC is expressed as a volume fraction, and is commonly estimated as the difference between the water contents at 1/10 or 1/3 bar (field capacity) and 15 bars
(permanent wilting point) tension and adjusted for salinity, and fragments.
-
Reduce to 1:1 relationship with cokey by aggrigating AWC horizon data in the chorizon table, and linking to the component table.
SELECT mukey, compname, comppct_r, a.* FROM component
JOIN
(
SELECT cokey, sum( (hzdepb_r - hzdept_r) * awc_r) AS component_whc, sum((hzdepb_r - hzdept_r)) AS depth
FROM chorizon WHERE areasymbol = 'ca113'
GROUP BY cokey
) AS a
ON component.cokey = a.cokey
WHERE component.areasymbol = 'ca113'
ORDER BY mukey ;
Query Results
mukey | compname | comppct_r | cokey | component_whc | depth
--------+---------------------+-----------+----------------+---------------+-------
459225 | Millsholm | 30 | 459225:624008 | 6.82 | 58
459225 | Dibble | 55 | 459225:624007 | 12.36 | 86
459226 | Millsholm | 40 | 459226:624013 | 6.82 | 58
459226 | Dibble | 45 | 459226:624012 | 12.36 | 86
459227 | Millsholm | 30 | 459227:624019 | 6.82 | 58
459227 | Dibble | 60 | 459227:624018 | 12.36 | 99
...
- Reduce to 1:1 relationship with mukey by aggreigating based on component percent:
SELECT component.mukey, round( (sum((component.comppct_r::float/100) * a.component_whc) / sum((component.comppct_r::float/100)))::float ) AS comppct_weighted_whc_cm
FROM component
JOIN
(
SELECT cokey, sum( (hzdepb_r - hzdept_r) * awc_r) AS component_whc,
sum((hzdepb_r - hzdept_r)) AS depth
FROM chorizon
WHERE areasymbol = 'ca113'
GROUP BY cokey
) AS a
ON component.cokey = a.cokey AND component.areasymbol = 'ca113'
GROUP BY component.mukey ;
Query Results
mukey | comppct_weighted_whc_cm
--------+-------------------------
459225 | 10
459226 | 10
459227 | 11
...
-
Link to polygons and setup access permissions:
-- create the new table with both geometry and attributes
CREATE TABLE yolo_whc AS
SELECT mapunit_poly.ogc_fid, mapunit_poly.wkb_geometry AS wkb_geometry, b.mukey, b.comppct_weighted_whc_cm
FROM mapunit_poly
-- use a LEFT JOIN to include non-soil polygons in the result set
-- use a JOIN to ignore non-soil polygons
LEFT JOIN
(
SELECT component.mukey, round( (sum((component.comppct_r::float/100) * a.component_whc) / sum((component.comppct_r::float/100)))::float ) AS comppct_weighted_whc_cm
FROM component
JOIN
(
SELECT cokey, sum( (hzdepb_r - hzdept_r) * awc_r) AS component_whc,
sum((hzdepb_r - hzdept_r)) AS depth
FROM chorizon
WHERE areasymbol = 'ca113'
GROUP BY cokey
) AS a
ON component.cokey = a.cokey AND component.areasymbol = 'ca113'
GROUP BY component.mukey
) AS b
ON mapunit_poly.mukey = b.mukey AND b.mukey IS NOT NULL
WHERE mapunit_poly.areasymbol = 'ca113' ;
Create indexes and register the new geometry:
-- create attribute and spatial index:
CREATE UNIQUE INDEX yolo_whc_idx ON yolo_whc (ogc_fid) ;
CREATE INDEX yolo_whc_spatial_idx ON yolo_whc USING gist (wkb_geometry gist_geometry_ops);
-- register in geometry_columns table:
INSERT INTO geometry_columns VALUES ('','public','yolo_whc','wkb_geometry',2,9001,'POLYGON');
-- optional:
-- fix permissions
-- GRANT SELECT on table yolo_whc to [user] ;
-- cleanup
vacuum analyze yolo_whc ;
Yolo County Water Holding Capacity Map: Profile total water holding capacity as computed by component percentage weighted average.
-
Dump to shapefile with ogr2ogr:
ogr2ogr ywhc.shp PG:"dbname=ssurgo_combined user=xxxx password=xxxx" yolo_whc
More examples of exporting data can be found here.
Soil Texture Example
Premise
Compute a series of weighted-average soil texture fractions (sand, silt, clay), for every component, of every map unit in Yolo County. These values will be further weighted by the spatial distribution of each map unit.
CREATE TABLE yolo_wt_mean_texture AS
-- join with polygons, and compute areas weights
SELECT mapunit_poly.mukey,
sum(ST_Area(wkb_geometry)) / (SELECT ST_Area(wkb_geometry) FROM mapunit_bound_poly WHERE areasymbol = 'ca113') AS area_wt,
sand, silt, clay
FROM
mapunit_poly
JOIN
(
-- compute component percent weighted mean
SELECT mukey,
sum(comppct_r * sand) / sum(comppct_r) AS sand,
sum(comppct_r * silt) / sum(comppct_r) AS silt,
sum(comppct_r * clay) / sum(comppct_r) AS clay
FROM
component
JOIN
(
-- compute hz thickness weighted mean
SELECT cokey,
sum((hzdepb_r - hzdept_r) * sandtotal_r) / sum(hzdepb_r - hzdept_r) AS sand,
sum((hzdepb_r - hzdept_r) * silttotal_r) / sum(hzdepb_r - hzdept_r) AS silt,
sum((hzdepb_r - hzdept_r) * claytotal_r) / sum(hzdepb_r - hzdept_r) AS clay
FROM chorizon
WHERE sandtotal_r IS NOT NULL
AND silttotal_r IS NOT NULL
AND claytotal_r IS NOT NULL
AND areasymbol = 'ca113'
GROUP BY cokey
) AS co_agg
ON component.cokey = co_agg.cokey
GROUP BY component.mukey
) AS mu_agg
ON mapunit_poly.mukey = mu_agg.mukey
GROUP BY mapunit_poly.mukey, sand, silt, clay;
Simple Visualization with R
# dump the data:
# echo 'select * from yolo_wt_mean_texture' | psql -A -F "," -U xxx -h xxx dbname > temp/yolo_texture.csv
#
# remember to trim off the last line!
# load some libs:
library(plotrix)
# read in the data
x <- read.csv('yolo_texture.csv')
# simple soil texture, with symbol size weighted by area weight
soil.texture(x[,3:5], cex=sqrt(50*x$area_wt), pch=16, col.symbol=rgb(65,105,225, alpha=100, max=255),
show.lines=T, show.names=T, col.lines='black', col.names='black', main='Yolo County Soil Textures')
triax.points(cbind(weighted.mean(x$sand, x$area_wt), weighted.mean(x$silt, x$area_wt), weighted.mean(x$clay, x$area_wt)),
col.symbols='orange', pch=16, cex=2)
Yolo County Common Soil Textures: Blue symbols mark common soil textures and their relative size denotes abundance. The orange symbol marks the weighted average soil texture for all of Yolo County.
Identification of Dated Surfaces via Soil Series
East Side Alluvial Formations
Overview
A simple association between dated landforms and soil series name [1] was used to extract soil polygons from a composite soil survey database.
| Soil Series |
Associated Formation |
Approximate Age (1000 yrs ago) |
| Redding |
Laguna |
1600 - 2000 kya |
| Corning |
Laguna |
1600 - 2000 kya |
| Keyes |
Laguna |
1600 - 2000 kya |
| Whitney |
Turlock Lake |
500 - 700 kya |
| Montpellier |
Turlock Lake |
500 - 700 kya |
| Rocklin |
Turlock Lake |
500 - 700 kya |
| Snelling |
Riverbank |
100 - 300 kya |
| San Joaquin |
Riverbank |
100 - 300 kya |
| Exiter |
Riverbank |
100 - 300 kya |
| Madera |
Riverbank |
100 - 300 kya |
| Hanford |
Modesto |
10 - 40 kya |
| Grangeville |
Holocene |
< 10 kya |
-
Create a soil series - dated landform lookup table:
-- create a lookup table
CREATE TABLE dated_landforms (
soil_series varchar(20),
formation varchar(20),
approx_age varchar(30)
);
-- populate table
INSERT INTO dated_landforms VALUES ('Redding','Laguna','1600 - 2000 kya') ;
INSERT INTO dated_landforms VALUES ('Corning','Laguna','1600 - 2000 kya') ;
INSERT INTO dated_landforms VALUES ('Keyes','Laguna','1600 - 2000 kya') ;
INSERT INTO dated_landforms VALUES ('Whitney','Turlock Lake','500 - 700 kya') ;
INSERT INTO dated_landforms VALUES ('Montpellier','Turlock Lake','500 - 700 kya') ;
INSERT INTO dated_landforms VALUES ('Rocklin','Turlock Lake','500 - 700 kya') ;
INSERT INTO dated_landforms VALUES ('Snelling','Riverbank','100 - 300 kya') ;
INSERT INTO dated_landforms VALUES ('San Joaquin','Riverbank','100 - 300 kya') ;
INSERT INTO dated_landforms VALUES ('Exiter','Riverbank','100 - 300 kya') ;
INSERT INTO dated_landforms VALUES ('Madera','Riverbank','100 - 300 kya') ;
INSERT INTO dated_landforms VALUES ('Hanford','Modesto','10 - 40 kya') ;
INSERT INTO dated_landforms VALUES ('Grangeville','Holocene','< 10 kya') ;
-
Select map units (mukey) by suitible series concepts, associated with major components. Note that the DISTINCT ON (mukey) ... ORDER BY mukey, comppct_r DESC pattern can be used to select the largest component for each map unit. Pattern matching is used to safely join variants with the soil series names in our look-up table: i.e. 'San Joaquin variant' will match 'San Joaquin'.
-- keep only the largest formation (by total component percent) within each map unit key
SELECT DISTINCT ON (mukey) mukey, formation, sum(comppct_r) AS formation_pct
FROM component
JOIN
dated_landforms
-- use pattern matching to also include variants
ON compname ~~* lower(soil_series || '%') = 't'
-- subset to select survey areas on the east side of the valley
AND component.areasymbol IN ('ca654', 'ca651', 'ca649', 'ca644', 'ca648', 'ca077')
AND majcompflag = 'Yes'
-- combine components of the same formation
GROUP BY mukey, formation
-- use in conjunction with the DISTINCT statement to keep only the largest
ORDER BY mukey, formation_pct DESC;
-
Create a new classified table called east_side_all. This query involves 6 survey areas, 11423 polygons, and requires about 11 seconds to complete.
CREATE TABLE east_side_all AS
-- select geom column and the feature id
-- along with our formation names
SELECT wkb_geometry AS wkb_geometry, ogc_fid, a.*
FROM mapunit_poly
JOIN
(
SELECT DISTINCT ON (mukey) mukey, formation, sum(comppct_r) AS formation_pct
FROM component
JOIN
dated_landforms
ON compname ~~* lower(soil_series || '%') = 't'
AND component.areasymbol IN ('ca654', 'ca651', 'ca649', 'ca644', 'ca648', 'ca077')
AND majcompflag = 'Yes'
GROUP BY mukey, formation
ORDER BY mukey, formation_pct DESC
) AS a
ON mapunit_poly.mukey = a.mukey
-- limit polygon selection by survey area
AND mapunit_poly.areasymbol IN ('ca654', 'ca651', 'ca649', 'ca644', 'ca648', 'ca077') ;
-
Index, register geometry, and setup permissions
-- indexing
CREATE INDEX east_side_all_fid_idx ON east_side_all (ogc_fid) ;
CREATE INDEX east_side_all_gis_idx ON east_side_all USING gist (wkb_geometry gist_geometry_ops) ;
-- register geometry
INSERT INTO geometry_columns VALUES ('','public','east_side_all','wkb_geometry',2,9001,'POLYGON');
-- permissions
GRANT SELECT ON TABLE east_side_all TO soil ;
-
Tabulate acreage for each formation. (approximately 1/3 second to complete)
SELECT * FROM
(
SELECT round(sum(area(wkb_geometry)) * 0.000247) AS area_ac, formation
FROM east_side_all
GROUP BY formation
) AS a
ORDER BY a.area_ac DESC;
| area_ac |
formation |
| 295868 |
Riverbank |
| 253424 |
Modesto |
| 151149 |
Turlock Lake |
| 121085 |
Laguna |
| 96981 |
Holocene |
(5 rows)
-
Dump to local file in SHP format
ogr2ogr east_side_all.shp PG:"dbname=ssurgo_combined user=xxxx password=xxxx host=xxxx" east_side_all
References:
-
Smith, D.W. & Verrill, W.L. Witham, C.; Bauder, E.; Belk, D.; Ferren Jr., W. & Ornduff, R. (ed.) Vernal Pool-Soil-Landform Relationships in the Central Valley, California 1998, 15-23
Seasonally Wet Soils and Shrink-Swell Potential
Example 1 map
Example 2 map
Example 1: The location of seasonaly wet soils via two methods: hydricrating and USDA Soil Taxonomy interpretation.
-- optionally link polygons here...
-- compute the percent of each map unit that contains components that may be seasonally wet
SELECT mukey, wet_flag, sum(comppct_r) AS pct_mu
FROM
(
SELECT mukey, cokey, comppct_r,
-- slightly less restrictive than hydricrating alone
CASE WHEN ((taxclname ~~* '%aqu%') = 't') OR (hydricrating = 'Yes') then 1 ELSE 0 END AS wet_flag
FROM component
-- subset to Yolo County
WHERE areasymbol = 'ca113'
) AS yolo_wet_components
-- only keep map unit with some component that meets the criteria
WHERE wet_flag = 1
-- aggregate by map unit, wet_flag
GROUP BY yolo_wet_components.mukey, yolo_wet_components.wet_flag
ORDER BY mukey;
Example 1.1 Extract a list of wet components, and sum area based on component (series) name
-- compute estimated, total acreage of seasonally wet map units in Yolo Co.
SELECT mapunit.musym, mapunit.mukey, mapunit.muname,
round(sum(ST_Area(wkb_geometry) * (yolo_mu_data.pct_mu::numeric/100.0) * 0.000247)::numeric, 2) AS wet_area_ac
FROM
mapunit_poly
JOIN
(
-- compute the percent of each map unit that contains components that may be seasonally wet
SELECT mukey, wet_flag, sum(comppct_r) AS pct_mu
FROM
(
SELECT mukey, cokey, comppct_r,
-- slightly less restrictive than hydricrating alone
CASE WHEN ((taxclname ~~* '%aqu%') = 't') OR (hydricrating = 'Yes') then 1 ELSE 0 END AS wet_flag
FROM component
-- subset to Yolo County
WHERE areasymbol = 'ca113'
) AS yolo_wet_components
-- only keep map unit with some component that meets the criteria
WHERE wet_flag = 1
-- aggregate by map unit, wet_flag
GROUP BY yolo_wet_components.mukey, yolo_wet_components.wet_flag
) AS yolo_mu_data
ON mapunit_poly.mukey = yolo_mu_data.mukey
JOIN mapunit
ON mapunit.mukey = yolo_mu_data.mukey
-- aggregate by map unit
GROUP BY mapunit.mukey, mapunit.muname, mapunit.musym
ORDER BY wet_area_ac DESC;
musym | mukey | muname | wet_area_ac
-------+--------+----------------------------------------------------------------+-------------
Sc | 459268 | Sacramento clay | 35710.90
Mf | 459244 | Marvin silty clay loam | 18877.18
Sg | 459272 | Sacramento soils, flooded | 12273.51
Cn | 459219 | Clear Lake soils, flooded | 11665.84
Cc | 459216 | Capay soils, flooded | 11029.83
Sv | 459288 | Sycamore complex, drained | 8410.41
St | 459286 | Sycamore silty clay loam, drained | 6900.77
Ck | 459218 | Clear Lake clay | 6737.88
Sa | 459266 | Sacramento silty clay loam | 6014.48
Sp | 459283 | Sycamore silt loam, drained | 5537.18
Sw | 459289 | Sycamore complex, flooded | 5041.96
Wb | 459301 | Willows clay | 4950.34
Ss | 459285 | Sycamore silty clay loam | 4859.59
Pb | 459254 | Pescadero silty clay, saline-alkali | 4700.77
So | 459282 | Sycamore silt loam | 3953.42
Rh | 459262 | Riverwash | 3698.88
Pc | 459255 | Pescadero soils, flooded | 3589.01
Tb | 459292 | Tyndall very fine sandy loam | 3299.03
Ob | 459252 | Omni silty clay | 3174.44
[...]
Example 2: Classify soils according to shrink-swell capacity of the top 1 meter of soil, weighted by horizon thickness and component percent.
-- set a lower boundary for the query
\SET lwr_bdy 100
-- add a class label, based on NRCS guidelines
SELECT mapunit.musym, mapunit.muname, mapunit.muacres,
round(mu_wt_lep::numeric, 2) AS lep,
CASE WHEN mu_wt_lep < 3 THEN 'Low'
WHEN mu_wt_lep >= 3 AND mu_wt_lep < 6 THEN 'Moderate'
WHEN mu_wt_lep >= 6 AND mu_wt_lep < 9 THEN 'High'
WHEN mu_wt_lep >= 9 THEN 'Very High'
END AS lep_class
FROM
(
-- compute map unit lep, weighted by component percent, to set depth
SELECT component.mukey,
sum(component.comppct_r * co_wt_mean_lep) / sum(component.comppct_r) AS mu_wt_lep
FROM
(
-- compute a horizon-thickness weighted mean lep to a set depth
SELECT cokey, sum(thick * lep_r) / sum(thick) AS co_wt_mean_lep
FROM
(
-- compute horizon thickness, but only to a set depth
SELECT cokey, hzdept_r, hzdepb_r, lep_r,
CASE WHEN hzdepb_r > :lwr_bdy THEN (:lwr_bdy - hzdept_r)
ELSE (hzdepb_r - hzdept_r) END AS thick
FROM chorizon
WHERE areasymbol = 'ca113'
AND lep_r IS NOT NULL
AND hzdept_r <= :lwr_bdy
) AS hz_lep
GROUP BY cokey
) AS co_lep
JOIN component
ON co_lep.cokey = component.cokey
GROUP BY mukey
) AS mu_lep
JOIN mapunit
ON mu_lep.mukey = mapunit.mukey
ORDER BY muacres DESC, lep DESC;
musym | muname | muacres | lep | lep_class
-------+------------------------------------------------------------------+---------+------+-----------
Ya | Yolo silt loam | 39698 | 2.52 | Low
Sc | Sacramento clay | 34886 | 7.50 | High
Ca | Capay silty clay | 33465 | 7.50 | High
MrG2 | Millsholm rocky loam, 15 to 75 percent slopes, eroded | 30118 | 1.50 | Low
Rg | Rincon silty clay loam | 24580 | 6.36 | High
BrA | Brentwood silty clay loam, 0 to 2 percent slopes | 23045 | 7.50 | High
CtD2 | Corning gravelly loam, 2 to 15 percent slopes, eroded | 22080 | 5.34 | Moderate
Mf | Marvin silty clay loam | 20970 | 6.60 | High
DaF2 | Dibble clay loam, 30 to 50 percent slopes, eroded | 18612 | 7.11 | High
SmE2 | Sehorn-Balcom complex, 15 to 30 percent slopes, eroded | 17794 | 6.17 | High
TaA | Tehama loam, 0 to 2 percent slopes | 16622 | 3.75 | Moderate
BdF2 | Balcom-Dibble complex, 30 to 50 percent slopes, eroded | 16405 | 5.73 | Moderate
SmD | Sehorn-Balcom complex, 2 to 15 percent slopes | 16117 | 6.50 | High
BaF2 | Balcom silty clay loam, 30 to 50 percent slopes, eroded | 12637 | 4.50 | Moderate
Sg | Sacramento soils, flooded | 12258 | 6.27 | High
Cn | Clear Lake soils, flooded | 11666 | 6.92 | High
SmF2 | Sehorn-Balcom complex, 30 to 50 percent slopes, eroded | 11226 | 6.33 | High
Cc | Capay soils, flooded | 11030 | 7.50 | High
Sv | Sycamore complex, drained | 9241 | 4.18 | Moderate
Ms | Myers clay | 8938 | 7.50 | High
PfF2 | Positas gravelly loam, 30 to 50 percent slopes, eroded | 7920 | 5.34 | Moderate
St | Sycamore silty clay loam, drained | 7839 | 4.50 | Moderate
Ck | Clear Lake clay | 6946 | 7.50 | High
Ra | Reiff very fine sandy loam | 6847 | 1.50 | Low
[...]
Making Database Diagrams with PostgreSQL Autodoc
http://www.rbt.ca/autodoc/index.html
postgresql_autodoc -d db_name
see dot output.
dot -Tpng -o output.png db_name.dot
GDAL and OGR: geodata conversion and re-projection tools
Overview
Operations on geographic data are most efficient when the input files have identical spatial parameters: i.e. coordinates system and datum / ellispoid. The GDAL/OGR tools and libraries can be used to "reproject" downloaded geographic data from one coordinate system to another, both for vector and raster data. More information on this software can be found here: http://www.remotesensing.org/gdal/index.html The following examples illustrate some of the ways in which the GDAL/OGR tools can be used to manipulate the spatial parameters of some common types of geographic data: SSURGO (ESRI Shapefile format) and aerial imagery (GeoTiff format).
How Does it Compare with ArcToolbox?
A SSURGO boundary polygon was used in a quick test to see how similar the results of a vector re-projection done by ArcMap and OGR were. The input vector (GCS NAD83) was projected to UTM zone 11 NAD83 by both OGR and ArcMap 9.0, and then imported into GRASS. Both vectors were then exported as ascii, for simple access to the line vertex coordinates. A simple AWK script was used to numerically compare the differences in both X and Y coordinates:
paste arcmap.ascii ogr.ascii | \
awk '
/\./{print ($1-$3), ($2-$4) ; x+=($1-$3) ; y+=($2-$4); n++ }
END{print "avg dx: "(x/n), "avg dy: "(y/n), "n: "n}
'
The resulting average dx and dy values were: dx: 2.98838e-06 m and dy: -1.23948e-07 m with 5035 samples: looks good to me. Next time, a datum shift may be more telling of any subtle differences to be aware of. Update: NAD83 -> NAD27 Datum shift experiment looks like datum shifts are comparible as well.
Examples:
Command names are in boldface type, input files are red, and output files are in green. Figures are provided for additional context, and are referenced by input/output file name in the example code.
ssurgo_utm.shp
ssurgo_ScA.shp
Vector Operations
- Get information about a shapefile:
ogrinfo -al ssurgo_geo.shp | less
- Reproject SSURGO mapunit data from geographic coordinates (Lat/Lon) to UTM zone 10:
ogr2ogr -t_srs "+proj=utm +zone=10 +datum=NAD83" ssurgo_utm.shp ssurgo_geo.shp
- Extract all polygons from a SSURGO shapefile where the mapunit symbol is 'ScA' :
ogr2ogr -where "musym = 'ScA' " ssurgo_ScA.shp ssurgo_utm.shp
- Convert a shapefile into Google-friendly KML:
ogr2ogr -f "KML" -t_srs EPSG:4326 trails.kml trails.shp

884084.tif

884084-utm.tif
Raster Operations [raster tutorial]
- Get information about a raster dataset
gdalinfo rasterfile.tiff
- Reproject an aerial photo in CA State Plane Zone 4 (Lambert Conformal Conic projection, units = feet) to UTM Zone 10 (units = meters), and rescale to 1 meter output resolution, and use thin-plate-spline resampling (-tps):
gdalwarp -tps -t_srs '+proj=utm +zone=10 +datum=NAD83 +units=m' \
-s_srs '+proj=lcc +lat_1=36.0000 +lat_2=37.2500 +lat_0=35.333333333333336 \
+lon_0=-119.0000 +x_0=2000000 +y_0=500000 +datum=NAD83 +units=ft' \
-srcnodata 255 -dstnodata 255 -tr 1 1 884084.tif 884084-utm.tif
- Note that it is usually a good idea to "optimise" the resulting image with gdal_translate.
- Convert Multi-band GeoTiff file to JPEG:
gdal_translate -of JPEG 884084-utm.tif 884084-utm.jpg
Reading MrSid format files details
- Add the DSDK path to configure: --with-mrsid=/Geo_DSDK-6.0.7.1407/
- There is a bug in Geo_DSDK-6.0.7.1407//include/base/lti_sceneBuffer.h line 356:
bool LTISceneBuffer::inWindow(lt_uint32 x, lt_uint32 y) const; // <--- line 356
Comment out this line, and re-compile. Thanks to Mateusz Loskot on the GDAL IRC channel for pointing this out.
-
gdal_translate file.sid file.tif
R: advanced statistical package
About R
R is a general-purpose, command-line based, environment for working with data. R is based on a functional approach to working with vectors and matrices of data. R is a convenient environment for processing, analyzing, and plotting data. The statistical tests and model building facilities of R are rivaled by no other single application.
R in the News
Getting Started
Searching for Information
R with Geographic Data
Misc. Articles
Access Data Stored in a Postgresql Database
Overview
Perform some temporal aggregation (by day and by week) of the amount of data entry completed in Postgresql, and plot the results in R. See resulting figure at the bottom of the page. Note that this requires the Rdbi and RdbiPgSQL packages. Hints on installing these packages can be found on this page...
Weekly Aggregation hints from the psql manual page
SELECT week, count(week) AS entered
FROM
(
SELECT pedon_id, creation_date, extract( week FROM creation_date) AS week
FROM description
ORDER BY creation_date ASC
) AS a
GROUP BY a.week
ORDER BY week;
Daily Aggregation hints from the psql manual page
SELECT doy, count(doy) AS entered
FROM
(
SELECT pedon_id, creation_date, extract( doy FROM creation_date) AS doy
FROM description
ORDER BY creation_date ASC
) AS a
GROUP BY a.doy
ORDER BY doy;
R Example
##### load the samme data in from PgSQL #####
library(Rdbi)
library(RdbiPgSQL)
# conn becomes an object which contains the DB connection:
conn <- dbConnect(PgSQL(), host="localhost", dbname="xxx", user="xxx", password="xxx")
# see if the connection works (should report the list of table(s) if table(s) are existing):
# dbListTables(conn)
## create an object which contains the SQL query:
query <- dbSendQuery(conn, "select pedon_id, hz_number, name, top, bottom, ((bottom - top)/2 + top) as avgdepth, matrix_wet_color_hue, matrix_wet_color_value, matrix_wet_color_chroma, matrix_dry_color_hue, matrix_dry_color_value, matrix_dry_color_chroma from horizon order by pedon_id, hz_number ")
# fetch data according to query:
x <- dbGetResult(query)
# create an object which contains the SQL query:
query <- dbSendQuery(conn, "select doy, count(doy) as entered from (select pedon_id, creation_date, extract( doy from creation_date) as doy from description order by creation_date asc ) as a group by a.doy order by doy;")
# fetch data according to query:
y <- dbGetResult(query)
# setup plot environment
par(mfrow=c(2,1))
# plot cumulative progress, by week
plot(x$week, cumsum(x$entered), type='b', xlab='Week', ylab='Pedon Forms Completed', main='Weekly Progress')
# plot cumulative progress, by day
plot(y$doy, cumsum(y$entered), type='b', xlab='Day of Year', ylab='Pedon Forms Completed', main='Daily Progress')
Pedon entry progress
Additive Time Series Decomposition in R: Soil Moisture and Temperature Data

Decagon Sensors: EC-5 (moisture) and ECT (temperature)
Premise
Simple demonstration of working with time-series data collected from Decagon Devices soil moisture and temperature sensors. These sensors were installed in a potted plant, that was semi-regularly watered, and data were collected for about 80 days on an hourly basis. Several basic operations in R are demonstrated:
- reading raw data in CSV format
- converting date-time values to R's date-time format
- applying a calibration curve to raw sensor values
- initialization of R time series objects
- seasonal decomposition of additive time series (trend extraction)
- plotting of decomposed time series, ACF, and cross-ACF
Process the raw sensor values with standard calibrations
## data from office plant: in potting soil
# raw data dump -- need to convert datetime + values:
x1 <- read.csv('office_plant_2.csv', head=FALSE)
# datetime is seconds from jan 1st 2000
t_0 <- as.POSIXlt(strptime('2000-01-01 00:00:00', format='%Y-%m-%d %H:%M:%S'))
# calibration for potting soil
raw_to_vwc <- function(d) {vwc <- (d * 0.00119) - 0.401 ; vwc }
# calibration for deg C
raw_to_temp <- function(d) {t <- log( (4095/d) - 1 ) ; t_c <- 25.02 + t * (-22.84 + t * (1.532 + (-0.08372 * t))) ; t_c}
# convert values
y1 <- data.frame(date=t_0 + x1$V1, m=raw_to_vwc(x1$V2), t=raw_to_temp(x1$V5))
# make a nice time axis
d.range <- range(y1$date)
d.list <- seq(d.range[1], d.range[2], by='week')
# note that there are several tricks here:
# stacking two plots that share an axis
# customized x-axis
# and manually adding a title with mtext()
par(mar = c(0.5, 4, 0, 1), oma = c(3, 0, 4, 0), mfcol = c(2,1))
plot(m ~ date, data=y1, type='l', ylab='VWC (EC-5 Sensor)', xaxt='n', las=2, cex.axis=0.75)
plot(t ~ date, data=y1, type='l', ylab='Deg. C (EC-T Sensor)', xaxt='n', las=2, cex.axis=0.75)
axis.POSIXct(at=d.list, side=1, format="%b-%d", cex.axis=0.75)
mtext('Potted Plant Experiment', outer=TRUE, line=2, font=2)
# save copy of raw data
dev.copy2pdf(file='raw_data.pdf')
Decompose each time series into additive components
# look at components of time series:
# we recorded measurements once and hour, so lets consider these data a on a daily-cycle
temp.ts <- ts(y1$t, freq=24)
vwc.ts <- ts(y1$m, freq=24)
# decompose additive time series with STL
# (Seasonal Decomposition of Time Series by Loess)
temp.stl <- stl(temp.ts, s.window=24)
vwc.stl <- stl(vwc.ts, s.window=24)
# these are referenced by day, so we need a new index for
# plotting meaningful dates on the x-axis
# generate the difference in days, from the first observations, at each date label
date.day_idx <- as.numeric((d.list - d.range[1]) / 60 / 60 / 24)
# note special syntax
par(mar = c(0, 4, 0, 3), oma = c(5, 0, 4, 0), mfcol = c(4,1), xaxt='n')
plot(temp.stl , main='Temperature (deg C)')
mtext(at=date.day_idx, text=format(d.list, "%b-%d"), side=1, cex=0.75)
dev.copy2pdf(file='temperature-ts_plot.pdf')
# note special syntax
par(mar = c(0, 4, 0, 3), oma = c(5, 0, 4, 0), mfcol = c(4,1), xaxt='n')
plot(vwc.stl , main='Volumetric Water Content')
mtext(at=date.day_idx, text=format(d.list, "%b-%d"), side=1, cex=0.75)
dev.copy2pdf(file='vwc-ts_plot.pdf')
Additive Time Series Decomposition: Temperature
Additive Time Series Decomposition: Volumetric Water Content
Auto-Correlation Function (ACF)
# look at ACF: ind. time series, and cross-ACF
acf( ts.union(temp.ts, vwc.ts) )
# extract seasonal components from each sensor, union, and plot together
temp_vwc.ts <- ts.union(Temperature=temp.stl$time.series[,1], VWC=vwc.stl$time.series[,1])
plot(temp_vwc.ts, main='Seasonal Components', mar.multi= c(1, 5.1, 1, 1))
Soil Moisture and Temperature ACF: Auto-correlation function of each time series, and cross-ACF.
Interesting Results
Variation in temperature with time dominated by diurnal fluctuations superposed over underlying fluctuations caused by building heating/cooling system. The magnitude of the diurnal cycle appears to be related to the moisture content- as expected due to high heat capacity of water. Diurnal variation in moisture values appears to account for less than < 2% absolute change in volumetric water content.
Aggregating SSURGO Data in R
Premise
SSURGO is a digital, high-resolution (1:24,000), soil survey database produced by the USDA-NRCS. It is one of the largest and most complete spatial databases in the world; and is available for nearly the entire USA at no cost. These data are distributed as a combination of geographic and text data, representing soil map units and their associated properties. Unfortunately the text files do not come with column headers, so a template is required to make sense of the data. Alternatively, one can use an MS Access template to attach column names, generate reports, and other such tasks. CSV file can be exported from the MS Access database for further use. A follow-up post with text file headers, and complete PostgreSQL database schema will contain details on implementing a SSURGO database without using MS Access.
If you happen to have some of the SSURGO tabular data that includes column names, the following R code may be of general interest for resolving the 1:many:many hierarchy of relationships required to make a thematic map.
This is the format we want the data to be in
mukey clay silt sand water_storage
458581 20.93750 20.832237 20.861842 14.460000
458584 43.11513 30.184868 26.700000 23.490000
458593 50.00000 27.900000 22.100000 22.800000
458595 34.04605 14.867763 11.776974 18.900000
So we can make a map like this

Loading Data Into R
# need this for ddply()
library(plyr)
# load horizon and component data
chorizon <- read.csv('chorizon_table.csv')
# only keep some of the columns from the component table
component <- read.csv('component_table.csv')[,c('mukey','cokey','comppct_r')]
Function Definitions
# custom function for calculating a weighted mean
# values passed in should be vectors of equal length
wt_mean <- function(property, weights)
{
# compute thickness weighted mean, but only when we have enough data
# in that case return NA
# save indices of data that is there
property.that.is.na <- which( is.na(property) )
property.that.is.not.na <- which( !is.na(property) )
if( length(property) - length(property.that.is.na) >= 1)
prop.aggregated <- sum(weights[property.that.is.not.na] * property[property.that.is.not.na], na.rm=TRUE) / sum(weights[property.that.is.not.na], na.rm=TRUE)
else
prop.aggregated <- NA
return(prop.aggregated)
}
profile_total <- function(property, thickness)
{
# compute profile total
# in that case return NA
# save indices of data that is there
property.that.is.na <- which( is.na(property) )
property.that.is.not.na <- which( !is.na(property) )
if( length(property) - length(property.that.is.na) >= 1)
prop.aggregated <- sum(thickness[property.that.is.not.na] * property[property.that.is.not.na], na.rm=TRUE)
else
prop.aggregated <- NA
return(prop.aggregated)
}
# define a function to perfom hz-thickness weighted aggregtion
component_level_aggregation <- function(i)
{
# horizon thickness is our weighting vector
hz_thick <- i$hzdepb_r - i$hzdept_r
# compute wt.mean aggregate values
clay <- wt_mean(i$claytotal_r, hz_thick)
silt <- wt_mean(i$silttotal_r, hz_thick)
sand <- wt_mean(i$sandtotal_r, hz_thick)
# compute profile sum values
water_storage <- profile_total(i$awc_r, hz_thick)
# make a new dataframe out of the aggregate values
d <- data.frame(cokey=unique(i$cokey), clay=clay, silt=silt, sand=sand, water_storage=water_storage)
return(d)
}
mapunit_level_aggregation <- function(i)
{
# component percentage is our weighting vector
comppct <- i$comppct_r
# wt. mean by component percent
clay <- wt_mean(i$clay, comppct)
silt <- wt_mean(i$silt, comppct)
sand <- wt_mean(i$sand, comppct)
water_storage <- wt_mean(i$water_storage, comppct)
# make a new dataframe out of the aggregate values
d <- data.frame(mukey=unique(i$mukey), clay=clay, silt=silt, sand=sand, water_storage=water_storage)
return(d)
}
Performing the Aggregation
# aggregate horizon data to the component level
chorizon.agg <- ddply(chorizon, .(cokey), .fun=component_level_aggregation, .progress='text')
# join up the aggregate chorizon data to the component table
comp.merged <- merge(component, chorizon.agg, by='cokey')
# aggregate component data to the map unit level
component.agg <- ddply(comp.merged, .(mukey), .fun=mapunit_level_aggregation, .progress='text')
# save data back to CSV
write.csv(component.agg, file='something.csv', row.names=FALSE)
Cluster Analysis 1: finding groups in a randomly generated 2-dimensional dataset
Figure 1. Two class example
Figure 2: Four class example
Figure 3: 2-way fuzzy membership
Examples based on a random data set (see example code below), illustrating some of the differences between the K-means and C-means clustering methods as implemented in R. Next time an example with soil profile data collected from the Pinnacles National Monument soil survey efforts. An online version of the PINN soil survey will be available soon here.
Articles:
Example in R:
## load required packages:
require(cluster)
require(e1071)
## make a dateset with 5 populations
x <- matrix( c(
rnorm(50, mean=.3, sd=.5),
rnorm(50, mean=.16, sd=.1),
rnorm(50, mean=.4, sd=.3),
rnorm(50, mean=.6, sd=.2),
rnorm(50, mean=.2, sd=.2)
), ncol=2)
## load function membership() : see attached file at bottom of page
source('cluster_demo_function.R')
## run an example with 2, then 4 classes: See Figures 1 and 2
membership(x,2)
membership(x,4)
## two-way fuzzy membership illustrated with color: See Figure 3
## display 2-way fuzzy membership
plot(x, main="C-means: 2-way Fuzzy Membership", type="n", xlab="Variable 1", ylab="Variable 2")
points(cc$centers, col = c("red", "blue"), pch = 8, cex=2)
points(x, col = rgb(cc$membership[,1], 0 ,cc$membership[,2]) , cex=0.5, pch=16)
Color Functions
Sample functions and ideas for accessing the R built-in colors. Further examples on converting soil colors to RGB triplets, or for the selection of optimal colors for a thematic map please see the examples linked at the bottom of this page. An excellent discussion on the use of color for presenting scientific data is presented in this paper by Zeileis, Achim and Hornik, Kurt.
R Color Selection: Simple figure illustrating the layout() function to create a plot of the built-in R colors palettes.
Simple Color Display
#make a color wheel
pie(rep(1,12), col=rainbow(12))
#make a list of the common color palettes:
demo.pal <- function(n, border = if (n<32) "light gray" else NA,
main = paste("color palettes; n=",n),
ch.col = c("rainbow(n, start=.7, end=.1)", "heat.colors(n)",
"terrain.colors(n)", "topo.colors(n)", "cm.colors(n)"))
{
nt <- length(ch.col)
i <- 1:n; j <- n / nt; d <- j/6; dy <- 2*d
plot(i,i+d, type="n", yaxt="n", ylab="", main=main)
for (k in 1:nt) {
rect(i-.5, (k-1)*j+ dy, i+.4, k*j,
col = eval(parse(text=ch.col[k])), border = border)
text(2*j, k * j +dy/4, ch.col[k])
}
}
n <- if(.Device == "postscript") 64 else 16
# Since for screen, larger n may give color allocation problem
demo.pal(n)
A Queryable color picker (as suggested by Gabor Grothendieck on the R-help mailing list)
#make a color lookup function
getColorName <- function(colorNumber) colors()[colorNumber]
# pch = NA means no points plotted. pch = 20 plots small dots.
# n is the number of points to identify interactively with mouse
printColorSampler <- function(n = 0, pch = NA, bg = "white") {
i <- seq(colors())
k <- ceiling(sqrt(length(i)))
xy <- cbind(floor(i/k)*2, i %% k)
opar <- par(bg = bg)
on.exit(par = opar)
plot(xy, axes = FALSE, xlab = "", ylab = "", pch=pch, col=colors())
text(xy[,1]+.5, xy[,2]+.2, i, col = colors(), cex = 0.7)
if (n > 0)
colors()[identify(xy, n = n, labels = colors(), plot = FALSE)]
}
# test
printColorSampler(0)
printColorSampler(1)
printColorSampler(pch=20, bg="black")
Setup the plot layout, and plot both examples
#setup the layout, and print pane boundaries:
nf <- layout(matrix(c(1,1,2,2), 2, 2, byrow=FALSE), respect=TRUE, widths=c(1,2)) ; layout.show(nf)
#plot the pie chart:
pie(rep(1,12), col=rainbow(12))
#plot the palette chart:
demo.pal(n)
#save a copy to an EPS file:
dev.copy2eps()
Convert Munsell colors to computer-friendly RGB triplets
Figure 1: Munsell color chips.
Figure 2: Common soil colors.
Figure 3: Commom soil colors in RGB.
Figure 4: Soil colors in multiple color spaces
Figure 5: Soil profile colors.
The Munsell color system was designed as a series of discrete color chips which closely approximation to the color sensitivity of the human eye. The description of color via three variables tied to perceptible properties (hue, value, and chroma) under a standardized illuminant (sunlight on a clear day) makes the Munsell system a good choice for recording and interpreting soil color data. However, numerical analysis of colors encoded in the Munsell system is difficult because they are from a discrete set of color chips and referenced by values that include both letters and numbers. Rossel et. al. (2006) give a good background on various color space models and their relative usefulness in the realm of soil science. The conversion of Munsell soil colors to RGB triplets, suitable for displaying on a computer screen or printing, is made complicated by the numerous operations involved in converting between color spaces. Figure 1 shows all possible (both real and unreal) Munsell color chips in the L*U*V color space. Figure 2 shows some of the common soil color chips in the same color space. Figures 2 through 5 depict common soil colors in the RGB color space, visualized both in R and POVRAY. Example R code on the conversion is given below.
Munsell color data can be downloaded here.
Color conversion equations here.
References:
- Rossel, R.A.V.; Minasny, B.; Roudier, P. & McBratney, A.B. Colour space models for soil science Geoderma, 2006, 133, 320-337.
Manual Conversion in R
Setup environment and load lookup table data
## load some libs
library(plotrix)
library(colorspace)
## munsell data comes with a lookup table in xyY colorspace
## url: http://www.cis.rit.edu/mcsl/online/munsell.php
## note:
## Munsell chroma, CIE x, y, and Y. The chromaticity coordinates were calculated using illuminant C and the CIE 1931 2 degree observer.
all <- read.table("munsell-all.dat", header=T)
Convert xyY to XYZ [Equation Reference]
## x and y are approx (0,1)
## Y is approx (0,100)
## need manually rescale Y to (0,1)
all$Y <- all$Y/100.0
## do the conversion
X <- (all$x * all$Y ) / all$y
Y <- all$Y
Z <- ( (1- all$x - all$y) * all$Y ) / all$y
## combine to form matrix for simple manipulation
mun_XYZ_C <- matrix(c(X,Y,Z), ncol=3)
## test for y == 0
## X,Y,Z should then be set to 0
mun_XYZ_C[which(all$y==0),] <- c(0,0,0)
Perform Chromatic Adaption Functions in the colorspace package, and sRGB profiles assume a D65 illuminant [Reference]
## conversion matrix, from reference above
## this has been revised as of Jan, 2008
M_adapt_C_to_D65 <- matrix(c(0.990448, -0.012371, -0.003564, -0.007168, 1.015594, 0.006770, -0.011615, -0.002928, 0.918157), ncol=3, byrow=TRUE)
## perform the chromatic adaption: convert from C -> D65 using Bradford method
mun_XYZ_D65 <- mun_XYZ_C %*% M_adapt_C_to_D65
## how different are the two?
summary( (mun_XYZ_D65 - mun_XYZ_C) )
Convert XYZ (D65) to sRGB (D65), step 1 this assumes that XYZ is scaled to (0,1) [Reference Primaries for sRGB]
## first get the reference primaries transformation matrix from above
##
## sRGB profile transformation:
M_XYZ_to_sRGB_D65 <- matrix(c(3.24071, -0.969258, 0.0556352, -1.53726, 1.87599, -0.203996, -0.498571, 0.0415557, 1.05707), ncol=3, byrow=TRUE)
## apply the conversion matrix
mun_sRGB_D65 <- mun_XYZ_D65 %*% M_XYZ_to_sRGB_D65
Convert XYZ (D65) to sRGB (D65), step 2 (sRGB, gamma = 2.4) [Conversion Function to sRGB]
## define the transformation functions:
## these are applied on a conditional basis:
fun1 <- function(col_comp) { 1.055 * ( col_comp ^ ( 1 / 2.4 ) ) - 0.055 }
fun2 <- function(col_comp) { 12.92 * col_comp }
## the specific function is contingent on the absolute value of r,g,b components
R <- ifelse(mun_sRGB_D65[,1] > 0.0031308, fun1(mun_sRGB_D65[,1]), fun2(mun_sRGB_D65[,1]))
G <- ifelse(mun_sRGB_D65[,2] > 0.0031308, fun1(mun_sRGB_D65[,2]), fun2(mun_sRGB_D65[,2]))
B <- ifelse(mun_sRGB_D65[,3] > 0.0031308, fun1(mun_sRGB_D65[,3]), fun2(mun_sRGB_D65[,3]))
##clip values to range {0,1}
R_clip <- ifelse(R < 0, 0, R)
G_clip <- ifelse(G < 0, 0, G)
B_clip <- ifelse(B < 0, 0, B)
R_clip <- ifelse(R > 1, 1, R_clip)
G_clip <- ifelse(G > 1, 1, G_clip)
B_clip <- ifelse(B > 1, 1, B_clip)
## add these back to the original table:
all$R <- R_clip
all$G <- G_clip
all$B <- B_clip
## done with the conversion
## the manually converted data
plot( as(RGB(R_clip,G_clip,B_clip), 'LUV'), cex=0.5)
Using ColorBrewer to assist with thematic map color selection
RColorBrewer color combinations
Figure 2: 3 colors per combination
Figure 3: 9 colors per combination
Choosing the right colors for classes in a thematic map can be a difficult task. The ColorBrewer website provides an interactive tool for browsing numerous color combinations. Each of the color combinations presented on the ColorBrewer website are the culmination of numerous color interpretation studies. In addition, there is a list of special color combinations suitible for audiences which may include color blind individuals.
The R package RColorBrewer adds the color brewer color combinations as well as functions for generating new combinations to R. Figure 1 demonstrates the available color combinations, as returned by the function display.brewer.all.
An example R session:
#load the RColorBrewer package [must be installed first with install.packages()]
library(RColorBrewer)
#display the "sequential" color combinations, with 3 colors per combination
#See Figure 2
display.brewer.all(n=3,type="seq",exact.n=TRUE)
title("Sequential Color Combinations: 3 Colors per Combination")
#display the "sequential" color combinations, with 9 colors per combination
#See Figure 3
display.brewer.all(n=9,type="seq",exact.n=TRUE)
title("Sequential Color Combinations: 9 Colors per Combination")
# convert R colors into RGB triplets;
col2rgb( brewer.pal("Accent", n=5) )
Comparison of Slope and Intercept Terms for Multi-Level Model
Premise
When the relationship between two variable is (potentially) dependent on a third, categorical variable ANCOVA (analysis of covariance), or some variant, is commonly used. There are several approaches to testing for differences in slope/intercepts (in the case of a simple linear model) between levels of the stratifying variable. In R the following formula notation is usually used to test for interaction between levels of a factor (f) and the relationship between two continuous variables x and y: y ~ x * f. A simple graphical exploration of this type of model can be done through examination of confidence intervals computed for slope and intercept terms, for each level of our grouping factor (f). An example of a fictitious dataset is presented below. Note that this a rough approximation for testing differences in slope/intercept within a multi-level model. A more robust approach would take into account that we are trying to make several pair-wise comparisons, i.e. something akin to Tukey's HSD. Something like this can be done with the multcomp package. For any real data set you should always consult a real statistician.
Example Multi-Level Model: each panel represents a model fit to y ~ x, for group f
Example Multi-Level Data
# need this for xyplot()
library(lattice)
# make some fake data:
x <- rnorm(100, mean=3, sd=6)
y <- x * runif(100, min=1, max=7) + runif(100, min=1.8, max=5)
d <- data.frame(x, y, f=rep(letters[1:10], each=10))
# check it out
xyplot(y ~ x | f, data=d, type=c('p','r'))
Implementation
Example Multi-Level Model: Confidence Intervals: parameter estimates along with 95% confidence interval, within each level of our grouping factor (f).
Automated Plotting of Parameter Confidence Intervals
# split by factor
d.l <- split(d, d$f)
# fit model for each level of factor
fits <- lapply(d.l, function(d_i) {lm(y ~ x, data=d_i)})
# extract coefs
est <- lapply(fits, coef)
# compute confints
ci <- lapply(fits, confint)
ci.mat <- do.call('rbind', ci)
est.mat <- do.call('rbind', est)
ci.df <- data.frame(f=rep(colnames(sapply(ci, '[')), each=2))
ci.df$lower <- ci.mat[,1]
ci.df$upper <- ci.mat[,2]
# re-attach estimate label
ci.df$which <- row.names(ci.mat)
# add dummy column for estimate
ci.df$estimate <- NA
# make a data frame for the estimates
est.df <- data.frame(which=rep(colnames(est.mat), each=nrow(est.mat)))
est.df$estimate <- as.vector(c(est.mat[,1], est.mat[,2]))
est.df$f <- rep(row.names(est.mat), 2)
# add dummy columns for upper and lower conf ints
est.df$upper <- NA
est.df$lower <- NA
# combine estimate with confints
combined <- rbind(est.df, ci.df)
# combined plot of estimate +/- confint
dotplot(f ~ estimate + lower + upper | which, data=combined, scales=list(relation='free'), xlab="Estimate", ylab="Group", auto.key=list(columns=3),
par.settings=list(superpose.symbol=list(col=c(1), pch=c(16,1,1), cex=c(1,0.75,0.75))))
Formal Evaluation with lm()
The first two lines in the output below are testing the hypothesis that the slope and intercept term for level 'a' are not different than 0. Subsequent hypothesis tests are relative to the first 'level' in the dataset. In this case we are testing the hypothesis that intercept and slope terms for levels 'b' through 'j' are not different than the corresponding terms for level 'a'. From the output below we can see that none of the intercept terms (levels 'b' through 'j') are different than for 'a', and that the slope term for level 'd' is only marginally "different" (p=0.0625) than the slope term for 'a'.
Testing Model Terms
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.99570 4.10909 1.216 0.2276
x 4.40546 0.68230 6.457 7.68e-09 ***
fb -4.66364 7.28233 -0.640 0.5237
fc 1.10173 6.52890 0.169 0.8664
fd 1.51033 6.20212 0.244 0.8082
fe -5.28549 6.62921 -0.797 0.4276
ff -1.37673 6.39280 -0.215 0.8300
fg -7.69480 5.93011 -1.298 0.1982
fh -2.34349 5.70703 -0.411 0.6824
fi 1.14558 6.84805 0.167 0.8676
fj -1.12319 7.87523 -0.143 0.8869
x:fb 0.92661 0.94257 0.983 0.3285
x:fc 0.43454 1.04819 0.415 0.6796
x:fd -1.75956 0.93137 -1.889 0.0625 .
x:fe -0.08193 0.96216 -0.085 0.9323
x:ff -0.42669 0.99172 -0.430 0.6682
x:fg 0.57531 0.99279 0.579 0.5639
x:fh 1.63650 1.02319 1.599 0.1137
x:fi -0.38424 0.97753 -0.393 0.6953
x:fj -0.89373 1.14337 -0.782 0.4367
Comparison of Slope and Intercept Terms for Multi-Level Model II: Using Contrasts
Premise
Small update to a similar thread from last week, on the comparison of slope and intercept terms fit to a multi-level model. I finally figured out (thanks R-Help mailing list!) how to efficiently use contrasts in R. The C() function can be called within a model formula, to reset the base level of an un-ordered factor. The UCLA Stats Library has an extensive description of this topic here. This approach can be used to sequentially test for differences between slope and intercept terms from a multi-level model, by re-setting the base level of a factor. See example data and figure below.
Note that the multcomp package has a much more robust approach to this type of operation. Details below.
Example Multi-Level Data
# need these
library(lattice)
library(Design)
# replicate an important experimental dataset
set.seed(10101010)
x <- rnorm(100)
y1 <- x[1:25] * 2 + rnorm(25, mean=1)
y2 <- x[26:50] * 2.6 + rnorm(25, mean=1.5)
y3 <- x[51:75] * 2.9 + rnorm(25, mean=5)
y4 <- x[76:100] * 3.5 + rnorm(25, mean=5.5)
d <- data.frame(x=x, y=c(y1,y2,y3,y4), f=factor(rep(letters[1:4], each=25)))
# plot
xyplot(y ~ x, groups=f, data=d,
auto.key=list(columns=4, title='Beard Type', lines=TRUE, points=FALSE, cex=0.75),
type=c('p','r'), ylab='Number of Pirates', xlab='Distance from Land')
Example Multi-Level Model II
Default Contrasts (contr.treatment for regular factors, contr.poly for ordered factors)
# standard comparison to base level of f
summary(lm(y ~ x * f, data=d))
# output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0747 0.1889 5.689 1.51e-07 ***
x 1.9654 0.1799 10.927 < 2e-16 ***
fb 0.3673 0.2724 1.348 0.1808
fc 4.1310 0.2714 15.221 < 2e-16 ***
fd 4.4309 0.2731 16.223 < 2e-16 ***
x:fb 0.5951 0.2559 2.326 0.0222 *
x:fc 1.0914 0.2449 4.456 2.35e-05 ***
x:fd 1.3813 0.2613 5.286 8.38e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Setting the "base level" in the Model Formula This allows us to compare all slope and intercept terms to the slope and intercept from level 4 of our factor ('d' in our example).
# compare to level 4 of f
summary(lm(y ~ x * C(f, base=4), data=d))
# output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.5055 0.1972 27.911 < 2e-16 ***
x 3.3467 0.1896 17.653 < 2e-16 ***
C(f, base = 4)1 -4.4309 0.2731 -16.223 < 2e-16 ***
C(f, base = 4)2 -4.0635 0.2783 -14.603 < 2e-16 ***
C(f, base = 4)3 -0.2999 0.2773 -1.081 0.28230
x:C(f, base = 4)1 -1.3813 0.2613 -5.286 8.38e-07 ***
x:C(f, base = 4)2 -0.7862 0.2628 -2.992 0.00356 **
x:C(f, base = 4)3 -0.2899 0.2521 -1.150 0.25327
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Testing with Multcomp Package using data from above example
# need these
library(multcomp)
library(sandwich)
# open this vignette, lots of good information
vignette("generalsiminf", package = "multcomp")
# fit two models
l.1 <- lm(y ~ x + f, data=d)
l.2 <- lm(y ~ x * f, data=d)
# note that: tests are AGAINST the null hypothesis
summary(glht(l.1))
# see the plotting methods:
plot(glht(l.1))
plot(glht(l.2))
# pair-wise comparisons
summary(glht(l.1, linfct=mcp(f='Tukey')))
# pair-wise comparisons
# may not be appropriate for model with interaction
summary(glht(l.2, linfct=mcp(f='Tukey')))
# when variance is not homogenous between groups:
summary(glht(l.1, linfct=mcp(f='Tukey'), vcov=sandwich))
Computing Statistics from Poorly Formatted Data (plyr and reshape packages for R)
Premise
I was recently asked to verify the coefficients of a linear model fit to sets of data, where each row of the input file was a "site" and each column contained the dependent variable through time (i.e. column 1 = time step 1, column 2 = time step 2, etc.). This format is cumbersome in that it cannot be directly fed into the R lm() function for linear model fitting. Furthermore, we needed the output formatted with columns containing slope, intercept, and R-squared values for each site (rows). All of the re-formatting, and model fitting can be done by hand, using basic R functions, however this task seemed like a good case study for the use of the reshape and plyr packages for R. The reshape package can be used to convert between "wide" and "long" format-- the first step in the example presented below. The plyr package can be used to split a data set into subsets (based on a grouping factor), apply an arbitrary function to the subset, and finally return the combined results in several possible formats. The original input data, desired output, and R code are listed below.
Input
2.521 2.312 2.720 2.254 * 2.922 * * 2.291 2.038 * * 1.151
1.675 1.646 1.860 2.517 * 1.986 * * 3.279 3.420 * * 3.059
1.734 1.305 1.774 2.366 * 2.909 * * 2.863 2.958 * * 2.973
1.637 1.632 2.040 1.807 * 1.889 * * 2.081 2.267 * * 2.655
1.967 8.307 8.331 8.698 * 8.236 * * 7.990 8.255 * * 8.041
1.670 1.744 1.982 2.029 * 2.159 * * 3.330 2.945 * * 3.301
1.668 1.816 1.832 2.100 * 2.289 * * 2.745 2.703 * * 3.216
2.304 2.413 2.749 2.827 * 2.978 * * 3.011 3.244 * * 4.494
1.505 2.827 3.375 1.923 * 4.250 * * 1.542 3.094 * * 1.480
Output
site intercept slope Rsq
1 1 2.8123 -0.0894 0.5115
2 2 1.5229 0.1512 0.7682
3 3 1.5499 0.1351 0.7445
4 4 1.5581 0.0738 0.8453
5 5 6.1738 0.2174 0.1727
6 6 1.4787 0.1527 0.9026
7 7 1.5340 0.1270 0.9871
8 8 2.1403 0.1437 0.8224
9 9 2.7546 -0.0425 0.0306
Add required libraries and load example data files
# these may need to be installed with install.packages()
library(plyr)
library(reshape)
# this one comes with the base install of R
library(lattice)
# read in the data as pasted from original format
d1 <- read.table('d1.txt', na.strings='*')
d2 <- read.table('d2.txt')
Reshape data
# transpose and convert to long format
d1.long <- melt(t(d1))
d2.long <- melt(t(d2))
# give resonable names
names(d1.long) <- c('obs','site','value')
names(d2.long) <- c('obs','site','value')
# add time variable
# 1:n obs * m sites
d1.long$time <- rep(1:13, 9)
d2.long$time <- rep(1:8, 7)
Visually check patterns
xyplot(value ~ time | factor(site), type=c('p','r'), data=d1.long, as.table=TRUE)
xyplot(value ~ time | factor(site), type=c('p','r'), data=d2.long, as.table=TRUE)
Extract linear model terms and R-squared for each subset
fit.summary <- function(i)
{
# fit linear model to this set of the data
l <- lm(value ~ time, data=i)
# extract model terms
l.coef <- coef(l)
# extract R-squared
l.rsq <- summary(l)$r.squared
# combine model details into a single vector
l.details <- c(l.coef, l.rsq)
# rename elements of the vector
names(l.details) <- c('intercept', 'slope', 'Rsq')
# return rounded values to the calling function
return(round(l.details, 4))
}
# compute lm details by site
ddply(d1.long, .(site), fit.summary)
ddply(d2.long, .(site), fit.summary)
Creating a Custom Panel Function (R - Lattice Graphics)
The Experiment
It was necessary (for the purposes of this exercise) to generate some grouped data worthy of a creative panel function. An experiment was designed to test the coordination of 4 individuals (each a panel in the figure below), as a function of "clarity of mind" (symbol color in the figure below). The actual details of the experiment can be deduced from the attached data file, and careful inspection of the resulting plot. A similar experiment was conducted some time ago to demonstrate the Spatstat package in R.
A Customized Panel Function for Lattice Graphics -- "panel.bulls_eye()"
Lattice graphics are one of several possible visualization methods in available in R that are most useful when working with grouped data. Plots are generated via a formula interface, often in the format of y ~ x | f -- where y is the dependent variable, x is the independent variable, and f is a grouping factor. Have a look at the attached file (bottom of page) for an example of data in this format. Each panel in the plot is generated by a panel function, using a subset of the original data as defined by the grouping variable. In most situations the standard panel functions, such as panel.xyplot, are sufficient. However, when working with more "interesting" data, a customized panel function is the way to go.
In order to try the sample code out, you will need to:
- install the required packages
- copy and paste the panel.bulls_eye function source into an R session
- download the sample data file
- run the code listed in the sample R session
Since panel functions are made to be generic, any data source that is similar in nature to the sample can be directly plotted using this code-- i.e. if the experiment were to be repeated using 8 subjects instead of 4. Enjoy.
Panel Function Source
panel.bulls_eye <- function(x, y, groups, subscripts, ...)
{
# setup the initial plot, and add the raw data
panel.xyplot(jitter(x), jitter(y), groups=groups, subscripts=subscripts, cex=1.25, pch=3, col=c(1,2,3), ...)
# add the bull's eye
panel.points(0, 0, pch=16, cex=0.25, col='grey')
panel.points(0, 0, pch=1, cex=1.75, col='grey')
panel.points(0, 0, pch=1, cex=4, col='grey')
panel.points(0, 0, pch=1, cex=7, col='grey')
# compute the mean cartesian distance from the bull's eye to all points
z <- signif(mean(sqrt(x^2 + y^2)), 3)
z.text <- paste(z, 'cm')
# compute the mean angle between all points and bull's eye
theta <- circ.mean(atan2(y, x))
theta.text <- paste(signif(theta* 180/pi, 2), 'deg')
# generate a displacement vector
x_prime <- z * cos(theta)
y_prime <- z * sin(theta)
# add the vector to the plot
panel.segments(0, 0, 3, 0, col='grey')
panel.arrows(0, 0, x_prime, y_prime, length=0.1, col='black')
# annotate with accuracy and displacement angle
grid.text(label = z.text, gp=gpar(fontsize=16), just='left',
x = unit(0.05, "npc"),
y = unit(0.95, "npc"))
grid.text(label = theta.text, gp=gpar(fontsize=16), just='right',
x = unit(0.90, "npc"),
y = unit(0.95, "npc"))
}
Example Session (note that several packages are required)
# load required libraries
library(spatstat)
library(lattice)
library(grid)
library(CircStats)
# read in our data (see attached file)
x <- read.csv('beer_battle.csv')
# plot the data, as stratified by person
xyplot(y ~ x | person, groups=beer, data=x, panel=panel.bulls_eye,
key=list(points=list(col=c(1,2,3), pch=c(3,3,3)), text=list(c('0 beers', '1 beer', '3 beers')), columns=3),
main='Beer Battle 1'
)
Results: example output from the panel.bulls_eye() function used with xyplot().
Customized Scatterplot Ideas
Panel function for visualizing univariate statistics
panel.dist_summary <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 3) )
#hist(x, probability=T, add=T)
#default color and line style for density plot
density.col = 'gray'
density.lty = 3
# is this a normally distributed dataset?
# if so, change the color of the density plot
# The test rejects the null hypothesis if W is too small.
s.W <- shapiro.test(x)$statistic
if( (s.W > 0.91) == TRUE)
{
density.col = 'gray'
density.lty = 1
}
# compute and plot density
d <- density(x)
dy <- d$y / max(d$y) * .5
lines(d$x, dy, col=density.col, lty=density.lty)
# get a small increment to use in the next tests:
delta <- abs(min(x) - max(x)) / 100
y_mean <- dy[d$x < mean(x) + delta & d$x > mean(x) - delta][1]
y_median <- dy[d$x < median(x) + delta & d$x > median(x) - delta][1]
debug
#print(y_median)
#add points on the density plot for the mean and median
points( c(mean(x), median(x)), c(y_mean, y_median), col=c('red', 'orange'), pch=16)
#add a boxplot
boxplot(x, horizontal=TRUE, boxwex=0.3, add=T)
#debugging
#print(s.W)
}
Panel function for printing joint correlation statistic
panel.cor <- function(x, y, digits=2, prefix="", cex.cor, cor.method="pearson")
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, method=cor.method))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex * r, col='gray')
# might be interesting to use ks.test
# http://www.physics.csbsju.edu/stats/KS-test.html
}
Example usage with built-in datasets
# enforce square plotting area
par(pty='s')
pairs(USJudgeRatings[1:5], upper.panel=panel.cor, lower.panel=function(...) panel.smooth(..., col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
 
pairs(iris[1:4], upper.panel=panel.cor, lower.panel=function(...) panel.smooth(..., col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
# use spearman correlation calculation instead of default person:
pairs(iris[1:4], upper.panel=function(...) panel.cor(..., cor.method="spearman"), lower.panel=function(...) panel.smooth(..., col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
# color iris specis: note location of 'col=' argument
pairs(iris[1:4], upper.panel=panel.cor, lower.panel=function(...) panel.smooth(..., col.smooth=gray(.5), lty=1, pch=16, col=c("red4", "green3", "blue4")[unclass(iris$Species)]), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2 )
pairs(trees, upper.panel=panel.cor, lower.panel=function(...) panel.smooth(..., col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
pairs(swiss, upper.panel=panel.cor, lower.panel=function(...) panel.smooth(...,
col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
# using formula notation:
pairs( ~ Fertility + Education + Catholic, data=swiss, upper.panel=panel.cor, lower.panel=function(...) panel.smooth(..., col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
pairs(longley, upper.panel=panel.cor, lower.panel=function(...) panel.smooth(...,
col.smooth=gray(.5), lty=1), diag.panel=panel.dist_summary, cex.labels = 2, font.labels=2)
Exploration of Multivariate Data
library(gclus)
library(car)
library(MASS)
library(cluster)
library(lattice)
library(TeachingDemos)
data(wine)
# chernoff faces
faces(aggregate(wine, list(wine$Class), FUN=mean)[,-c(1,2)], ncol=3, nrow=1)
# LDA on wine data
l <- lda(Class ~ . , data=wine)
plot(l, col=wine$Class)
## Some soils data from the car package
# LDA: all horizons
l <- lda(Contour ~ pH + N + Dens + P + Ca + Ca + Mg + K + Na, data=Soils)
plot(l, col=as.numeric(Soils$Contour))
# just the top horizon
l <- lda(Contour ~ pH + N + Dens + P + Ca + Ca + Mg + K + Na, data=Soils, subset=Depth=='0-10')
plot(l, col=as.numeric(Soils$Contour[Soils$Depth == '0-10']))
Interactive 3D plots with the rgl package
Overview
Sample application of the RGL package. This package allows for the creation of interactive, 3D figures, complete with lighting and material effects. Try demo(rgl) for an idea of what is possible.
A random number generator sphere (RNG sphere) was created based on the suggestions in Keys to Infinity by Clifford A. Pickover, pp. 237-239. The RNG sphere can be used to test the robustness of a random number generator. Three random number generators were tested: runif() from R, rand from Excel, and a logistic-derived psudo-random number generator. The location (x,y,z) and color of the spheres are based on the sequence of random numbers (Pickover, 1995). An ideal RNG shpere should have no discernable patterns. Note that the logistic-derived random numbers show distinct correlation in the RNG sphere. Excel random number list, and source code (R) are attached at the botom of the page.
RGL sample application: 3d interactive interface to a random number generator sphere. Random numbers from runif() function in R.
RGL sample application 2: Excel random number visualization: 3d interactive interface to a random number generator sphere. Random numbers from rand() function in MS Excel.
RGL sample application 2: Random numbers from the logistic function: 3d interactive interface to a random number generator sphere. Random numbers from the logistic function (see notes), implemented in R.
Random Number Generator (RNG) Sphere Function Definition
# simple function for
rng_sphere <- function(d, type='rgl')
{
n <- length(d)
nn <- n - 3
# init our x,y,z coordinate arrays
x <- array(dim=nn)
y <- array(dim=nn)
z <- array(dim=nn)
# init red,green,blue color component arrays
cr <- array(dim=nn)
cg <- array(dim=nn)
cb <- array(dim=nn)
# convert lagged random numbers from d into spherical coordinates
# then convert to cartesian x,y,z coordinates for simple display
for (i in 1:nn)
{
theta <- 2*pi*d[i]
phi <- pi*d[i+1]
r <- sqrt(d[i+2])
x[i] <- r * sin(theta) * cos(phi)
y[i] <- r * sin(theta) * sin(phi)
z[i] <- r * cos(theta)
# give each location a color based on some rules
cr[i] <- d[i] / max(d)
cg[i] <- d[i+1] / max(d)
cb[i] <- d[i+2] / max(d)
} # end function
if( type == 'rgl')
{
# setup rgl environment:
zscale <- 1
# clear scene:
clear3d("all")
# setup env:
bg3d(color="white")
light3d()
# draw shperes in an rgl window
spheres3d(x, y, z, radius=0.025, color=rgb(cr,cg,cb))
}
if(type == '2d')
{
# optional scatterplot in 2D
scatterplot3d(x,y,z, pch=16, cex.symbols=0.25, color=rgb(cr,cg,cb), axis=FALSE )
}
# optionally return results
# list(x=x, y=y, z=z, red=cr, green=cg, blue=cb)
}
Sample
# load required packages
require(scatterplot3d)
require(rgl)
# random number with runif():
d <- runif(2500)
# plot rng sphere with rgl:
rng_sphere(d, type='rgl')
# save results of the rgl window
rgl.snapshot('testing.png')
# plot rng sphere with scatterplot3d:
rng_sphere(d, type='2d')
# save results
dev.copy2eps()
# 2500 excel random numbers
# =rand()
dd <- as.vector(unlist(read.csv('excel_rand.csv')))
rng_sphere(dd, type='rgl')
rgl.snapshot('testing-excel.png')
# 1000 random numbers from the logistic function:
# init an array
ddd <- array(dim=1000)
# seed
ddd[1] <- 0.38273487234
# compute for the next 999 iterations
for (i in 1:999) { ddd[i+1] <- 4 * 1 * ddd[i] * (1 - ddd[i]) }
Making Soil Property vs. Depth Plots
Example with randomly generated data
Generate some data
## generate some profile depths: 0 - 150, in 10 cm increments
depth <- seq<