PostGIS: Spatially enabled Relational Database Sytem

PostGIS is a set of extensions to the relational database management system PostgreSQL, which provide access to spatial constructs, operators, and functions. In other words, PostGIS is a spatially-aware version of Postgresql. Paul Ramsey of Refractions Research has written up a nice summary of how were are using PostGIS.

 
Tuning Tips:

  1. Adjust kernel resources to allow larger memory allocation: (documentation)
  2. Increase the shared_buffers to a larger number (16000)
  3. See attached PDF at the bottom of this page for excellent ideas from Kevin Neufeld

 
PostGIS Syntax Examples:

  1. PostGIS Spatial SQL cheat-sheet
  2. Official Documentation
  3. PostGIS Wiki

 
PostGIS Presentations:

  1. Tips for the PostGIS Power User (Kevin Neufeld, Refractions Research)
  2. PostGIS Spatial Database: Introduction and Case Studies (Paul Ramsey)
  3. Transitioning Low Earth Orbit Satellite Archive Data from Informix (Geodetic DataBlade) to PostgreSQL (PostGIS) (Churngwei Chu, NASA/SSAI)
  4. Intro to PostgreSQL/PostGIS and MapServer (Arnulf Christ)

Importing and Exporting

 

Tips

  • Attribute data stored with a shapefile (the .dbf file) can only use column names shorter than 14 characters. If you try and export a postGIS table that has column names longer than 14 characters, they will be truncated when converted to a shapefile.
  • If a column in postGIS is defined as numeric, ogr2ogr will not correctly interpret the values in this column: i.e. floating point values less than 1 will be truncated to 0. In order to avoid this problem always define your colums as double precision or float. For dynamically generated tables, cast affected columns to a defined numeric precision: select colum_a::numeric(7,3) where '7' refers to the total number of digits in the column, and '3' refers to the number of decimal places.

 

Geometry and attribute data to GIS file format

 
GDAL/OGR tools This approach allows simultaneous conversion of coordinate systems, but is less flexible with respect to generation of new tables in PostGIS.

  • Import
     
    ogr2ogr -f "PostgreSQL" PG:'dbname=ssurgo_combined user=xxxx password=xxxx host=postgis.server.edu' input_file.shp
  • Export
     
    ogr2ogr output_file.shp PG:'dbname=ssurgo_combined user=xxxx password=xxxx host=postgis.server.edu' tablename

     
    Note that tables must be correctly 'registered' in the geometry_columns table for this to work:
    INSERT INTO geometry_columns VALUES ('','public','tablename','wkb_geometry',2,SRID,'geomtype');

 
PostGIS Loader/Dumper This approach is the simplest, but does not allow on-the-fly conversion of coordinate systems.

  • Import
     
    shp2pgsql -s SRID -c -g wkb_geometry -I shapefile.shp schema.table | psql -U username -h host database

     
    Note that SRID is the PostGIS 'spatial ref. sys. id' (see the spatial_ref_sys table). See the manual page for shp2pgsql for a complete list of arguments and their meanings.

  • Export
     
    pgsql2shp -f shapefile.shp -h host -u username -P password -k -g wkb_geometry database schema.table

     
    See the manual page for pgsql2shp for a complete list of arguments and their meanings.

 
Where tablename is your newly created table, SRID is the SRID (spatial reference ID) for the geometry in this table, and geomtype is the type of geometry: POINT, LINE, POLYGON, etc.

 

Attribute data to text format

 
CSV format, from within the psql client
\copy tablename TO 'filename.csv' CSV HEADER

 
CSV format, via psql client
echo "select column_list from table_list " | psql --tuples --no-align  -F ","  database > file.csv

 
Tabular data to HTML format, via psql client See output below:
echo "select column_list from table_list " | psql --html database > file.html

 
HTML output from psql

area compname
132472.230854819 Hilmar variant
322819.967391312 Oneil
362729.418301135 Carranza
431948.171760353 Tuff rockland
448784.927049035 Gravel pits
500763.225267798 Snelling variant
518860.954990617 Foster
571640.132661382 Alamo
648973.748756059 Toomes
924327.631201791 Dumps

(10 rows)

Example Spatial SQL Operations on Point, Line and Polygon Geometry Types

  • Points
    • Example
    • Example
    • Example
  • Lines
    • Example
    • Example
    • Example
  • Polygons
    • Example
    • Example
    • Example
  • Mixed
    • Example
    • Example
    • Example

Affine Transformation Operations in PostGIS

 
Overview
The ST_Affine() function from PostGIS is useful for manipulating geometries, but requires the elements of a transformation matrix. This page documents progress on automating the computation of the transformation matrix by least-squares (Bruce Rindahl) via SQL. This would allow a pure PostGIS solution to computing and applying affine transformations to geometry data.

 
An open-source algorithm for computing the transformation matrix
Example code from GRASS (v.transform) was used as a template.

 
Approach
Compute transformation matrix based on a table of control points, stored as numbers.

 
Evaluation of results
Comparable to output from a similar analysis done in R, and the original algorithm as implemented in v.transform (GRASS).

Case Study: Fixing Bad TIGER Line data with R and PostGIS

Example of bad Tiger data in Stanislaus County: Red lines are the original road network, green lines are the corrected road network.Example of bad Tiger data in Stanislaus County: Red lines are the original road network, green lines are the corrected road network.

 
The Problem
The US Census does a nice job of collecting all sorts of geographic and demographic information every 10 years. This data is available free of charge in the rather complex and soon to be replaced TIGER/LINE format. While this data covers the entire US down to the local road level, there are numerous errors and even extreme cases of coordinate-shift. Here is an example from Stanislaus County, California. The original TIGER data (red lines) are offset several hundred meters from the imagery. While it is not clear what may have caused the problem, it can be fixed without much effort using an affine transformation. We do not have the transformation matrix, however it can be 'fit' to a set of control points by several methods. The general form of the affine transform can be conveniently represented in homogeneous coordinates as:


Affine Matrix Form: new coordinates on the left-hand side, old coordinates on the right-hand sideAffine Matrix Form in homogeneous coordinates:New coordinates on the left-hand side, old coordinates on the right-hand side. The transformation matrix is the 3x3 matrix in the center.

 
The Solution

  • transform using GRASS and a set of control points with v.transform. This is the natural choice if the data is already in GRASS.
  • Using the same control points, compute the transformation matrix in R, then apply in PostGIS with the Affine() function. This is the ideal approach for our scenario. Note that the format of the 'fitted' transformation matrix returned by R is in a slightly different format:


    Transformation Matrix from RTransformation Matrix from R

We first need a set of control points, good and bad coordinates. This can be accomplished in several ways, we used the d.where command in GRASS:

# load road subset:
v.in.ogr dsn=PG:'dbname=tiger2005se user=xxx password=xxx host=xxx' layer=stan_roads out=s
#
# pick up some control points: note that they will be interleaved with this approach
# odd line numbers are "bad" coordinates
# even line numbers are "good" coordinates
d.where > stan.c_points
#
# we can separate the "bad" and "good" points in R...

Computing the transformation matrix can be done with a simple regression between 'good' and 'bad' coordinates in R. Note that this approach was suggested by Prof. Brian Ripley on the R-help mailing list.

 
Compute the Affine Transformation Matrix in R

## read in control points
## the good and bad points are interleaved
## odd numbers are 'bad' points
## s <- read.table('stan.c_points')
&nbsp;
## s.bad <- s[seq(from=1, to=nrow(s), by=2), ]
## s.good <- s[seq(from=2, to=nrow(s), by=2), ]
##
## make a composite dataframe
## x,y = bad data
## nx,ny = good data
## c <- data.frame(x=s.bad$V1, y=s.bad$V2, nx=s.good$V1, ny=s.good$V2)
##
## save a local copy in the format that v.transform uses
## write.table(c, file='grass_control_pts.txt', row.names=F, col.names=F)
##
&nbsp;
## from now on just use the simplified format:
c <- read.table('grass_control_pts.txt')
names(c) <- c('x','y','nx','ny')

&nbsp;
## compute transformation matrix, and check model fit:
l <- lm(cbind(nx,ny) ~ x + y, data=c)
##
## see at bottom of page -->
summary(l)

&nbsp;
## convert to affine transform matrix to the form needed by PostGIS:
## transpose the regression coefficient matrix:
t(coef(l))
 (Intercept)           x          y
nx    5017.082  1.00231639 0.00918962
ny  -28138.395 -0.01352029 0.99740077


&nbsp;
## check graphically:
x <- seq(-2080000, -2040000, by=1000)
y <- seq(-20000, 5000, by=2000)
d <- expand.grid(x=x, y=y)
p <- predict(l, d)

&nbsp;
par(mfcol=c(1,2), pty='s')
plot(y ~ x, data=c, main='Control Points', cex=.4, pch=16)
points(ny ~ nx, data=c, col='red', cex=.4, pch=16)
## arrows(c$x, c$y, c$nx, c$ny, len=0.05, col='gray')

&nbsp;
plot(d, cex=0.2, main='Shifted Grid')
## arrows(d$x, d$y, p[,1], p[,2], len=0.05, col='gray')
points(p, col='red', cex=.2)

&nbsp;
## sample transformation
print(predict(l, data.frame(x=-2078417.35893968, y=-14810.57043808)), digits=10)
            nx           ny
 -2078350.804 -14809.67334

Establishing the transformation based on control points: Red points represent where the coordinates should be. Black points are the original and incorect coordinates.Establishing the transformation based on control points: Red points represent where the coordinates should be. Black points are the original and incorect coordinates.

 
Check Affine Transform Results in PostGIS

-- matrix format:

-- R:
-- | xoff a b |
-- | yoff d e |

-- postgis wants:
-- ST_Affine(geom, a, b, d, e, xoff, yoff)

-- as a query:
SELECT asText(
ST_Affine(
'POINT(-2078417.35893968 -14810.57043808)',
1.00231639, 0.00918962, -0.01352029, 0.99740077, 5017.082, -28138.395
)
) ;

                   astext                  
--------------------------------------------
 POINT(-2078350.80564006 -14809.6639251817)

-- This is the same as what R says!

 
Perform Affine Transformation in PostGIS

--
-- Step 1. Alter the geometry of the bad data in-place
-- note that we have a backup in 'stan_roads'
--
UPDATE roads SET wkb_geometry =  ST_Affine(wkb_geometry, 1.00231639, 0.00918962, -0.01352029, 0.99740077, 5017.082, -28138.395 ) WHERE module = 'TGR06099' ;

 
Regression Diagnostics

Response nx :

Call:
lm(formula = nx ~ x + y, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-207.088  -23.856    8.614   21.245  161.610 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 5.017e+03  1.369e+03    3.666  0.00079 ***
x           1.002e+00  6.654e-04 1506.386  < 2e-16 ***
y           9.190e-03  9.419e-04    9.756 1.20e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 52.25 on 36 degrees of freedom
Multiple R-Squared:     1,      Adjusted R-squared:     1 
F-statistic: 1.322e+06 on 2 and 36 DF,  p-value: < 2.2e-16 


Response ny :

Call:
lm(formula = ny ~ x + y, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.835 -18.459  -4.556  15.311  94.226 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.814e+04  7.409e+02  -37.98   <2e-16 ***
x           -1.352e-02  3.602e-04  -37.54   <2e-16 ***
y            9.974e-01  5.099e-04 1956.22   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 28.28 on 36 degrees of freedom
Multiple R-Squared:     1,      Adjusted R-squared:     1 
F-statistic: 2.187e+06 on 2 and 36 DF,  p-value: < 2.2e-16

Comparision with output from v.transform

 
First the output from R:
Looking at the residuals from the regression model used to map bad coordinates (x,y) to good coordinates (nx,ny):

## we made the linear model object 'l' above
## extract as dataframe
## residuals are computed for x and y separately
r <- as.data.frame(resid(l))

&nbsp;
## compute vector product of the (x,y) residuals:
c$resid <- sqrt(r$nx^2 + r$ny^2)
print(c)

          x          y       nx         ny      resid
1  -2078417 -14810.570 -2078314 -14838.378  46.617600
2  -2078743 -16057.955 -2078636 -16081.790  62.041274
3  -2077261 -16435.348 -2077170 -16463.156  40.905132
4  -2076709 -14405.369 -2076606 -14433.177  29.406399
5  -2074179 -15830.248 -2074084 -15901.558  33.111981
6  -2073850 -15707.435 -2073763 -15798.554  37.362736
7  -2073450 -13873.171 -2073359 -13920.712  21.623235
8  -2072359 -15204.613 -2072276 -15323.138  38.997678
9  -2072545 -14402.596 -2072450 -14513.219  32.918889
10 -2072189 -16022.434 -2072098 -16129.106  33.834074
11 -2071991 -16856.058 -2071928 -16942.976   6.277554
12 -2068407 -12999.396 -2068296 -13133.170   6.579285
13 -2069870 -12613.813 -2069764 -12731.848   2.357631
14 -2067635 -13188.253 -2067517 -13337.765  11.604519
15 -2066931 -13377.110 -2066809 -13518.753  22.719625
16 -2067411 -15084.692 -2067313 -15190.924  41.907273
17 -2066795 -18714.093 -2066741 -18846.019  14.541358
18 -2066384 -17080.538 -2066299 -17212.464  26.717495
19 -2068634 -19742.339 -2068580 -19835.464  27.483654
20 -2053326 -16930.710 -2053276 -17226.351  65.746074
21 -2051797 -17321.500 -2051899 -17579.762 227.516944
22 -2068307   2826.921 -2068066   2638.276  12.587853
23 -2067543   2648.205 -2067328   2449.631  37.729747
24 -2067126   4276.510 -2066904   4081.246  46.774630
25 -2066748   4170.604 -2066527   4001.816  59.509843
26 -2066068   2292.295 -2065860   2094.699  46.681553
27 -2065337   2107.872 -2065126   1900.397  43.386956
28 -2064606   1913.570 -2064378   1692.922  26.460389
29 -2064199   3558.561 -2063961   3356.401  47.742696
30 -2037464   6512.455 -2037076   5864.398  50.762994
31 -2036722   6825.682 -2036338   6199.227  22.699467
32 -2036876   6366.642 -2036498   5742.888  22.120176
33 -2040225   7150.180 -2039706   6575.029 161.631199
34 -2041064   7144.779 -2040732   6569.629  26.657582
35 -2044702 -15548.033 -2044564 -16024.903  23.844817
36 -2043992 -15723.521 -2043824 -16223.282  48.063840
37 -2043790 -14907.119 -2043611 -15383.990  34.844851
38 -2040616 -14820.445 -2040453 -15349.974  21.196233
39 -2039595 -15081.427 -2039485 -15588.263  47.263287

 
The Root-Mean-Square-Error (RMSE) for the fitted transform (in meters) is:

apply( (fitted(l) - c[, c('nx', 'ny')])^2  , 2, function(x) sqrt(mean(x)) )
      nx       ny
50.19954 27.17333

&nbsp;
#converting to the vector sum of RMSE (as reported in v.transform):
sqrt(50.19954^2 + 27.17333^2)
57.08225

 
The output from v.transform on the same set of control points:

Transformation Matrix
| xoff a b |
| yoff d e |
-------------------------------------------
5301.399323 1.002469 0.009172 
-28155.882288 -0.013530 0.997547 
-------------------------------------------

 
full output including the residuals:

               CHECK MAP RESIDUALS
                Current Map                  New Map
 POINT      X coord    Y coord  |        X coord   Y coord    |      residuals

  1.   -2078417.36    -14810.57 |  -2078314.07      -14838.38 |        46.81
  2.   -2078743.11    -16057.95 |  -2078635.85      -16081.79 |        62.22
  3.   -2077261.34    -16435.35 |  -2077169.97      -16463.16 |        41.05
  4.   -2076709.16    -14405.37 |  -2076605.87      -14433.18 |        29.59
  5.   -2074178.76    -15830.25 |  -2074083.67      -15901.56 |        33.21
  6.   -2073849.93    -15707.44 |  -2073762.78      -15798.55 |        37.42
  7.   -2073449.80    -13873.17 |  -2073358.68      -13920.71 |        21.62
  8.   -2072358.86    -15204.61 |  -2072275.89      -15323.14 |        39.02
  9.   -2072544.55    -14402.60 |  -2072449.73      -14513.22 |        32.97
 10.   -2072188.97    -16022.43 |  -2072098.11      -16129.11 |        33.87
 11.   -2071991.43    -16856.06 |  -2071928.22      -16942.98 |         6.27
 12.   -2068406.55    -12999.40 |  -2068296.38      -13133.17 |         6.60
 13.   -2069870.19    -12613.81 |  -2069763.96      -12731.85 |         2.33
 14.   -2067635.38    -13188.25 |  -2067517.35      -13337.76 |        11.63
 15.   -2066931.10    -13377.11 |  -2066809.13      -13518.75 |        22.74
 16.   -2067411.11    -15084.69 |  -2067312.75      -15190.92 |        41.93
 17.   -2066795.16    -18714.09 |  -2066740.84      -18846.02 |        14.64
 18.   -2066383.87    -17080.54 |  -2066298.50      -17212.46 |        26.74
 19.   -2068634.37    -19742.34 |  -2068580.05      -19835.46 |        27.53
 20.   -2053326.48    -16930.71 |  -2053275.51      -17226.35 |        66.09
 21.   -2051797.30    -17321.50 |  -2051899.25      -17579.76 |       227.91
 22.   -2068307.24      2826.92 |  -2068065.64        2638.28 |        12.41
 23.   -2067542.73      2648.21 |  -2067327.61        2449.63 |        37.44
 24.   -2067125.72      4276.51 |  -2066903.98        4081.25 |        46.40
 25.   -2066748.43      4170.60 |  -2066526.69        4001.82 |        59.12
 26.   -2066067.79      2292.29 |  -2065860.31        2094.70 |        46.35
 27.   -2065336.69      2107.87 |  -2065125.92        1900.40 |        43.07
 28.   -2064605.58      1913.57 |  -2064378.35        1692.92 |        26.16
 29.   -2064199.15      3558.56 |  -2063961.13        3356.40 |        47.43
 30.   -2037464.39      6512.45 |  -2037075.56        5864.40 |        50.66
 31.   -2036721.82      6825.68 |  -2036338.39        6199.23 |        22.54
 32.   -2036875.74      6366.64 |  -2036497.71        5742.89 |        21.95
 33.   -2040224.67      7150.18 |  -2039706.23        6575.03 |       161.54
 34.   -2041064.45      7144.78 |  -2040732.32        6569.63 |        26.74
 35.   -2044701.68    -15548.03 |  -2044564.34      -16024.90 |        23.64
 36.   -2043992.10    -15723.52 |  -2043824.24      -16223.28 |        47.60
 37.   -2043789.90    -14907.12 |  -2043610.60      -15383.99 |        34.35
 38.   -2040615.94    -14820.44 |  -2040453.30      -15349.97 |        20.77
 39.   -2039594.70    -15081.43 |  -2039485.02      -15588.26 |        47.85


  Number of points: 39
  Residual mean average: 57.082951

Higher Order Transformations

 
... Continuing from the initial example session in R ...

An affine transformation is based on a linear mapping between two coordinate-spaces. Testing for non-linearity (i.e. higher order model terms) can be a useful diagnostic in choosing the simpler affine transform.

 
Compute the difference between the good and bad coordinates

## compute vector difference and save back to original data frame:
c$sqdiff <- with(c, sqrt((nx-x)^2 + (ny-y)^2))

 
Generate two linear models:

## generate simple model
l.simple <- lm(sqdiff ~ nx + ny, data=c)
## model with higher order terms (up to 3rd-order):
## note to self: poly() computes orthagonal polynomials
l.poly <- lm(sqdiff ~ poly(nx, 3) + poly(ny, 3), data=c)

&nbsp;
## summarize the complex model:
summary(l.poly)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    302.38       3.97  76.174  < 2e-16 ***

poly(nx, 3)1  1091.17      35.82  30.466  < 2e-16 ***
poly(nx, 3)2   165.59      32.78   5.051 1.71e-05 ***
poly(nx, 3)3   -51.21      28.74  -1.782   0.0843 .  

poly(ny, 3)1   417.35      31.97  13.056 2.30e-14 ***
poly(ny, 3)2   -18.50      30.04  -0.616   0.5425    
poly(ny, 3)3    19.41      35.49   0.547   0.5882    

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

 
Is one model significantly better than the other?

## check via RMSE:
## simple model
sqrt( mean((fitted(l.simple) - c$sqdiff)^2) )
36.54747

&nbsp;
## model with 3rd-order polynomial
sqrt( mean((fitted(l.poly) - c$sqdiff)^2) )
22.45530

&nbsp;
## the anova function can compare nested models:
anova(l.simple, l.poly)

Analysis of Variance Table

Model 1: sqdiff ~ nx + ny
Model 2: sqdiff ~ poly(nx, 3) + poly(ny, 3)
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1     36 52093                                  
2     32 19665  4     32428 13.192 1.865e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 
Check visually:

## re-make grid of points to predict on:
## and save two versions to 'stack-up'
x <- seq(-2080000, -2040000, by=1000)
y <- seq(-20000, 5000, by=2000)
d.simple <- expand.grid(x=x, y=y)
d.poly <- expand.grid(x=x, y=y)

&nbsp;
## and predict the trend in differences:
d.simple$diff <- predict(l.simple, data.frame(nx=d$x, ny=d$y))
d.poly$diff <- predict(l.poly, data.frame(nx=d$x, ny=d$y))

&nbsp;
## stack for grouped plotting
library(lattice)
library(RColorBrewer)
d.combined <- make.groups(simple=d.simple, poly=d.poly)

&nbsp;
## plot:
wireframe(diff ~ x * y | which, data=d.combined, scales=list(arrows=TRUE\), drape=TRUE\, col.regions=brewer.pal(n=9, name='Oranges'), cuts=8, col='white')

Testing for linearity: Two visualizations of the deviance between coordinates positions, at the control point locations.Testing for linearity: Two visualizations of the deviance between coordinates positions, at the control point locations.

 
Conclusions:
It seems that a second order term was only warranted along the x-direction. The more complex model based on 3rd-order polynomials results in a significantly lower RMSE (about 10 meters lower), and is shown to be a better descriptor of variance in the test of nested models.

At the map scale in which the corrected data will be presented, the extra accuracy suggested by the more complex model (coordinate transformation function) is negligible. This allows for the simpler model, which can be directly used by the convenient ST_Affine() function in PostGIS for the heavy-lifting.

Transformation parameters query

The following is a proof of concept query showing how a PostgreSQL query could give the transformation parameters required for an affine transformation. The source of the procedure is from GRASS at transform.c.

Note that the v.transform code expects the input file (control points) to be in the form ax ay bx by, where 'a' is the starting coordinate system (the bad coordinates in the previous example) and 'b' is the target system (the good coordinates). The

 
SQL version is as follows:

 
Sample Session

  1. CREATE TABLE link (
        gid integer NOT NULL,
        a_x double precision,
        a_y double precision,
        b_x double precision,
        b_y double precision
    );
    -- load data into table
    -- see attached file at bottom of page
    -- possibly use the file 'grass_control_pts.csv' ???
    COPY link FROM 'control_points.txt' ;
  2. CREATE TEMPORARY TABLE cc AS
    SELECT
    (SELECT count(b_x) FROM link) AS cc00,
    (SELECT sum(b_x) FROM link) AS cc01,
    (SELECT sum(b_y) FROM link) AS cc02,
    sum(l1.b_x * l2.b_x) AS cc11,
    sum(l1.b_x * l2.b_y) AS cc12,
    sum(l1.b_y * l2.b_y) AS cc22
     FROM link l1 , link l2 WHERE l1.gid = l2.gid;
  3. CREATE TEMPORARY TABLE aa AS
    SELECT
    (SELECT sum(a_y) FROM link) AS aa0,
    sum(l1.a_y * l2.b_x) AS aa1,
    sum(l1.a_y * l2.b_y) AS aa2
     FROM link l1 , link l2 WHERE l1.gid = l2.gid;
  4. CREATE TEMPORARY TABLE bb AS
    SELECT
    (SELECT sum(a_x) FROM link) AS bb0,
    sum(l1.a_x * l2.b_x) AS bb1,
    sum(l1.a_x * l2.b_y) AS bb2
     FROM link l1 , link l2 WHERE l1.gid = l2.gid;
  5. CREATE TEMPORARY TABLE det AS
    SELECT
    cc00*cc11*cc22 +
    cc01*cc12*cc02 +
    cc02*cc01*cc12 -
    cc00*cc12*cc12 -
    cc01*cc01*cc22 -
    cc02*cc11*cc02 AS det FROM cc;
  6. CREATE TEMPORARY TABLE inv_cc AS
    SELECT
    (cc11*cc22-cc12*cc12)/det AS cc00,
    (cc12*cc02-cc01*cc22)/det AS cc01,
    (cc01*cc12-cc11*cc02)/det AS cc02,
    (cc00*cc22-cc02*cc02)/det AS cc11,
    (cc01*cc02-cc00*cc12)/det AS cc12,
    (cc00*cc11-cc01*cc01)/det AS cc22
    FROM cc,det;
  7. SELECT
    (aa0*cc00+aa1*cc01+aa2*cc02) AS b0,
    (aa0*cc01+aa1*cc11+aa2*cc12) AS b1,
    (aa0*cc02+aa1*cc12+aa2*cc22) AS b2,

    (bb0*cc00+bb1*cc01+bb2*cc02) AS b3,
    (bb0*cc01+bb1*cc11+bb2*cc12) AS b4,
    (bb0*cc02+bb1*cc12+bb2*cc22) AS b5
    FROM aa,bb,inv_cc;

This query requires a table called link with the following fields - gid (primary key), a_x, a_y , b_x, b_y. The 'a' values are the 'from' coordinates and the 'b' values are the 'to' coordinates. Using the attached control points the result of this query is:

        b0        |         b1          |        b2         |        b3        |        b4        |         b5          
------------------+---------------------+-------------------+------------------+------------------+---------------------
 -28138.394850347 | -0.0135202854235867 | 0.997400773420259 | 5017.08164289594 | 1.00231638907948 | 0.00918961946271679

 
These are the exact results from 'R'

User defined Function in PL/pgSQL to compute the transformation parameters

Based on the results of the proof of concept example developed previously, a single function was developed in the procedural language for the PostgreSQL database system called PL/pgSQL. The only input parameter to this procedure is a text string that results in a table in the following format:

gid (integer), from_x (float), from_y (float), to_x (float), to_y (float)

Note the table must have the above fields but they can be in any order and can have additional fields. The gid field must be unique for each record.

The use of a SQL query found to be the simplest way to avoid the difficulties in programing the procedure without the need for temporary tables. An added benefit is the control point data can be in almost any format as long as it can be arranged in the format specified above. For example if the "from" points are in a geometry column (the_geom) in a table called from_pts and the corresponding "to" points are in a similar table called to_pts with a common attribute called "link_id", then the query would be:

SELECT from_pts.link_id AS gid, x(from_pts.the_geom) AS from_x, y(from_pts.the_geom) AS from_y, x(to_pts.the_geom) AS to_x, y(to_pts.the_geom) AS to_y FROM from_pts, to_pts WHERE from_pts.link_id = to_pts.link_id

 
Other table layouts and queries are possible depending on the manner in which the control points are collected.

 
The following is the SQL code to add a new procedure called trans_param() into a PostGIS database:

-- Function: trans_param(IN sql text, OUT a double precision, OUT b double precision, OUT d double precision, OUT e double precision, OUT xoff double precision, OUT yoff double precision)

-- DROP FUNCTION trans_param(IN sql text, OUT a double precision, OUT b double precision, OUT d double precision, OUT e double precision, OUT xoff double precision, OUT yoff double precision);

CREATE OR REPLACE FUNCTION trans_param(IN sql text, OUT a double precision, OUT b double precision, OUT d double precision, OUT e double precision, OUT xoff double precision, OUT yoff double precision) AS
$BODY$
DECLARE
        cc_row record;

        cc_det float;

        inv_cc00 float;
        inv_cc01 float;
        inv_cc02 float;
        inv_cc11 float;
        inv_cc12 float;
        inv_cc22 float;

BEGIN
        EXECUTE 'SELECT
        count(a.to_x) as cc00,
        sum(a.to_x) as cc01,
        sum(a.to_y) as cc02,
        sum(a.to_x * b.to_x) as cc11,
        sum(a.to_x * b.to_y) as cc12,
        sum(a.to_y * b.to_y) as cc22,
        sum(a.from_y) as aa0,
        sum(a.from_y * b.to_x) as aa1,
        sum(a.from_y * b.to_y) as aa2,
        sum(a.from_x) as bb0,
        sum(a.from_x * b.to_x) as bb1,
        sum(a.from_x * b.to_y) as bb2
        from ('
|| sql || ') a,(' || sql|| ') b WHERE a.gid = b.gid' INTO cc_row;


        SELECT INTO cc_det
                cc_row.cc00*cc_row.cc11*cc_row.cc22 +
                cc_row.cc01*cc_row.cc12*cc_row.cc02 +
                cc_row.cc02*cc_row.cc01*cc_row.cc12 -
                cc_row.cc00*cc_row.cc12*cc_row.cc12 -
                cc_row.cc01*cc_row.cc01*cc_row.cc22 -
                cc_row.cc02*cc_row.cc11*cc_row.cc02;

        SELECT INTO inv_cc00 (cc_row.cc11*cc_row.cc22-cc_row.cc12*cc_row.cc12)/cc_det;
        SELECT INTO inv_cc01 (cc_row.cc12*cc_row.cc02-cc_row.cc01*cc_row.cc22)/cc_det;
        SELECT INTO inv_cc02 (cc_row.cc01*cc_row.cc12-cc_row.cc11*cc_row.cc02)/cc_det;
        SELECT INTO inv_cc11 (cc_row.cc00*cc_row.cc22-cc_row.cc02*cc_row.cc02)/cc_det;
        SELECT INTO inv_cc12 (cc_row.cc01*cc_row.cc02-cc_row.cc00*cc_row.cc12)/cc_det;
        SELECT INTO inv_cc22 (cc_row.cc00*cc_row.cc11-cc_row.cc01*cc_row.cc01)/cc_det;

        SELECT INTO a cc_row.bb0*inv_cc01+cc_row.bb1*inv_cc11+cc_row.bb2*inv_cc12;
        SELECT INTO b cc_row.bb0*inv_cc02+cc_row.bb1*inv_cc12+cc_row.bb2*inv_cc22;
        SELECT INTO d cc_row.aa0*inv_cc01+cc_row.aa1*inv_cc11+cc_row.aa2*inv_cc12;
        SELECT INTO e cc_row.aa0*inv_cc02+cc_row.aa1*inv_cc12+cc_row.aa2*inv_cc22;
        SELECT INTO xoff cc_row.bb0*inv_cc00+cc_row.bb1*inv_cc01+cc_row.bb2*inv_cc02;
        SELECT INTO yoff cc_row.aa0*inv_cc00+cc_row.aa1*inv_cc01+cc_row.aa2*inv_cc02;

END;

$BODY$
  LANGUAGE 'plpgsql' VOLATILE;
ALTER FUNCTION trans_param(IN sql text, OUT a double precision, OUT b double precision, OUT d double precision, OUT e double precision, OUT xoff double precision, OUT yoff double precision) OWNER TO postgres;

 
Currently there is no error checking in the code if the determinant is zero.

 
To use the procedure simply use: SELECT trans_param('my SQL text')

 
The data from this series of articles is stored in a table called link. The id of the points is gid and the "from" values are b_x and b_y. The "to" values are a_x and a_y. Thus the query is:

SELECT gid,b_x AS from_x, b_y AS from_y, a_x AS to_x, a_y AS to_y FROM link

 
The result of the procedure is:

SELECT * FROM trans_param(
'SELECT gid,b_x as from_x, b_y as from_y, a_x as to_x, a_y as to_y from link')

0.997546509279282;-0.00917177514909895;0.0135300872142122;1.00246938174737;-5301.39933295548;28155.8822879205

 
Which matches the results returned from GRASS and R. Additional queries will be developed to give a table of residuals, the RMS error and the actual transformation of geometry.

Control Points Table Format

This is a start for discussions to create a series of function to perform an affine transformation of a PostGIS data set using a table of control points.

The first step is what is the format or layout of the control points?

I don't think point geometries is necessarily a good idea. The points must have an exact 1 to 1 relationship. To assure this you either have to maintain absolute integrity on the keys between two tables or have two geometries in one table. Both would be a hassle. An interleaved table format also would give me problems because the queries get really difficult and what if one version used "good" points in the odd rows and the next one put them in the even ones? In addition, if you use a seguential id and add one row "good" then add a "bad" one, then realize the bad one is really bad and delete it then the gid's will not be odd/even anymore in relation to "good" vs "bad". I think ESRI has the right idea here (God did I actually say that?). Look at the link table interface when you are georeferencing an image. It gives the id, XSource, YScource, Xmap, YMap, and the residual error. This could easily be done with the code - just one more query.

Given a table in this format a query could be done to give the RMS error. Also the table could be returned with the residual errors. When the user is happy then the table and the geometry could be input to transform the geometry.

Bruce

Analysis of SSURGO Data in PostGIS: An Overview

 
Note
Several of the pages documenting soil survey related queries have been taken down for a couple of days while I update the syntax.

 
News
It looks like the NCSS has constructed a web-based, SQL interface to their main database. This new tool was recently highlighted in issue 40 of the NCSS Newsletter and looks like a promising new tool with good documentation. A listing "web services" offered by the NCSS is listed on this page.

 
Overview
The analysis of SSURGO data is complicated by a series of hierarchical, one-to-many tabular relationships between spatial and attribute data. Understanding the structure of the SSURGO database is critical for correct interpretation and aggregation of soil properties. Before undertaking any SSURGO-based analysis please take some time to become familliar with the details on the product, including intended uses, minimum mapping units sizes, and other important details. In addition, becoming familliar with the SSURGO metadata is a must. The metadata page includes detailed descriptions of table structure, column names and units, as well as important information on the sources of much of the tabular data included in a SSURGO database.

SSURGO data can be downloaded by survey area from The Soil Datamart, with spatial data delivered in shapefile format and attribute data delivered as plain text. Unfortunately, an M.S. Access database template is required to utilize SSURGO attribute data as delivered from the Soil Datamart. For assistance with this procedure please see the NRCS SSURGO page. Using this approach, most analysis of SSURGO must be done survey-by-survey within a GIS environment. For general instructions see this document.

Several online applications allow for simple interaction with the SSURGO database without the need for a GIS or RDBMS knowledge. Some examples include:

 
An Open Source Approach to SSURGO
We have developed an alternate approach to working with SSURGO data using PostGIS, a spatially-enabled version of the popular and open source relational database management system Postgresql, to store all spatial and tabular data for 138 survey areas. This approach facillitates rapid access, analysis, and aggregation of over half a million soil polygons. SQL (structured query language) is used to directly interact with both soil spatial and attribute data. If other forms of spatial data are imported into PostGIS (such as landcover, climatic data, etc.) nearly all spatial and attribute analysis can be done entirely from PostGIS. A series of examples illustrating common tasks will be presented in the following pages.

 
General Approach to Working with SSURGO (also outlined in this document) (more ideas on map unit composition)

  1. Identify the soil properties that are to be included in some analysis
  2. Decide on the appropriate form of aggregation to be used to summarize horizons
    • depth weighted (i.e. to calculate average percent clay through horizons)
    • top 1m (i.e. to summarize shrink-swell capacity of the topsoil)
    • top horizon (i.e. to summarize surface organic carbon)
    • profile sum (i.e. to calculate total water holding capacity)
    • most limiting (i.e. to calculate the most limiting hydraulic conductivity within the soil profile)
  3. Aggregate horizon data
  4. Decide on the appropriate form of aggregation to be used to summarize components
    • component percent weighted (i.e. this will include information from each component, weighted by their estimated percentage of the entire map unit)
    • largest component (i.e. this usually result in the selection of the 'dominant' component, however when there are multuple components with the same estimated percentage ties will occur)
    • major component flag (i.e. components flagged as a 'major component' by the NRCS represent dominant soil types within a map unit. note that there are sometimes multiple components flagged as 'major components'.)
    • dominant condition (i.e. this is usually used for categorical data like hydric conditions. aggregation is performed by selecting the most frequent condition within a map unit)
  5. Aggregation of component data
  6. join the above aggregated data (2x aggregation process for horizon data) to the map unit polygons
  7. See the diagram at the bottom of this page for a graphical summary

SSURGO discontinuity example: Boundary between Glenn and Colusa counties, illustrating differences in soil survey vintage.SSURGO discontinuity example: Boundary between Glenn and Colusa counties, illustrating differences in soil survey vintage.

 
Notes on the Format of SSURGO

  • Adjacent surveys may have been composed by different individuals, and may be of widely different vintages. Any given survey must comply with basic standards, but older surveys reflect a more generalized approach than more modern surveys. The figure to the right illustrates such differences.
  • Polygons represent a repeating pattern of legend entries: groups of map-able soil concepts called map units
  • Map unit data is stored in the mapunit table, and is referenced by the field mukey
  • Pre-aggregated map unit data is stored in the muaggatt table, and is referenced by the field mukey
  • Map units are comprised of multiple, unmapped soil types called 'components'
  • Component data is stored in the component table, and is referenced by the field cokey
  • Soil components (or soil type) are associated with multiple horizons
  • Horizon data is stored in the chorizon table, and is referenced by the field cokey
  • Since there is a 1:many:many (mapunit:component:horizon) relationship between spatial and horizon-level soil property data two aggregation steps are required in order to produce a thematic map
  • This diagram illustrates the hierarchy of scales involved in soil survey information.

SSURGO Database Structure DiagramSSURGO Database Structure Diagram

Logistics: Getting Connected and Executing Queries

 
Building and Saving SQL code snippets
A simple text editor is the best environment for working on (and saving!) your SQL queries. Linux users are encouraged to use either 'Kate' or 'Kwrite'. Windows users should look into Notetab Light or SciTE. Mac users should check out TextWrangler.

 
Connecting to the database
In general, the simplest way to interact with our composite soil survey database is by connecting to the best with an SSH client. Mac/Linux users have this functionality built-in. Windows users will need to use something like Putty, or Xming. Once an ssh connection with the beast has been setup, you can connect to the database with the following:

psql -U soil ssurgo_combined

 
where psql is the postgresql client program, -U soil means connect as the user called "soil", and ssurgo_combined is the name of the actual database.

 
You can quit from the database shell using \q (followed by enter). Typing \? (followed by enter) will give a list of commands available in the psql shell.

 
Query structure and SQL
Numerous resources exisit for learning about SQL. I would recommend looking at the first couple chapters from the PostgreSQL book by Douglas and Douglas (the big purple book on the shelf). For an interactive learning approach SQL Zoo seems like a good start. In general most queries will have the format:

SELECT column_x, column_y, column_z
FROM table_x
WHERE column_x = 'something'
-- optional
GROUP BY column_x
ORDER BY column_x ; -- semi-colon denotes end of SQL statement

 
A simple query: column selection and filtering
Once connected it is possible to issue queries to the database. The results of a query are normally returned to the screen. For example, asking for the horizon boundaries and horizon names from the componenent identified with the given cokey (467038:646635) might look like this:

-- the query
SELECT cokey, hzname, hzdept_r, hzdepb_r
FROM chorizon
WHERE cokey = '467038:646635' ;

-- the results
     cokey     | hzname | hzdept_r | hzdepb_r
---------------+--------+----------+----------
 467038:646635 | Ap     |        0 |       18
 467038:646635 | Bw     |       18 |       41
 467038:646635 | Bk1    |       41 |       69
 467038:646635 | Bk2    |       69 |      109
 467038:646635 | Bk3    |      109 |      145
 467038:646635 | Bk4    |      145 |      183

 
A more complicated query: column calculation and aggregation
Compute the total water holding capacity for a given component, identified by (467038:646635). Note that several operations are being performed on the data:

  1. compute the available water holding capacity (in centemeters) for each horizon [green text]
  2. compute the profile depth (in centemeters) for each horizon [green text]
  3. aggregate the profile data by summing AHC and depth of each horizon [red text]
-- the query
select cokey, sum( (hzdepb_r - hzdept_r) * awc_r) as component_whc, sum( (hzdepb_r - hzdept_r)  ) as depth 
from chorizon 
where cokey = '467038:646635'
group by cokey ;

-- the results
     cokey     | component_whc | depth 
---------------+---------------+-------
 467038:646635 |         27.91 |   183

 
Note that it is also possible to save the results of a query into a new table. This is the usual strategy when performing any analysis that returns geometry (soil polygons, etc.) as geometry data cannot be visualized in a text display. Tables created in this manner can be exported from the database for map creation or further analysis. Spatial tables can be viewed directly by applications like QGIS or Mapserver. A simple method of accessing PostGIS data is not yet possible with ArcGIS.

Learning by Example: Common Queries

 
Selection, Filtering and Sorting

  • Selecting a summary of horizon thickness, sand, silt, clay and AWC for a given component (chorizon table).
    SELECT  hzname AS name, hzdepb_r - hzdept_r AS thickness, sandtotal_r AS sand, silttotal_r AS silt, claytotal_r AS clay, awc_r AS awc
    FROM chorizon
    WHERE cokey = '467038:646635'
    ORDER BY hzdept_r;

    Query Results
    name | thickness | sand | silt | clay | awc
    --------+-----------+------+------+------+------
    Ap | 18 | 35 | 37 | 28 | 0.16
    Bw | 23 | 35 | 40 | 25 | 0.15
    Bk1 | 28 | 35 | 40 | 25 | 0.16
    Bk2 | 40 | 35 | 40 | 25 | 0.16
    Bk3 | 36 | 35 | 40 | 25 | 0.16
    Bk4 | 38 | 60 | 25 | 15 | 0.13
  • Select some common fields from the component table:
     SELECT cokey, majcompflag, comppct_r, wei, weg, tfact
    FROM component
    WHERE mukey = '467038'
    ORDER BY comppct_r DESC;

    Query Results
    cokey | majcompflag | comppct_r | wei | weg | tfact
    ---------------+-------------+-----------+-----+-----+-------
    467038:646635 | Yes | 85 | 48 | 6 | 5
    467038:646636 | No | 5 | | |
    467038:646638 | No | 4 | | |
    467038:646640 | No | 2 | | |
    467038:646639 | No | 2 | | |
    467038:646637 | No | 2 | | |
  • Create a new table 'yolo_comp' using the component table, for a specific survey area (Yolo County), with components sorted by their component percentages.
    CREATE TABLE yolo_comp AS
    SELECT * FROM component
    WHERE areasymbol = 'ca113'
    ORDER BY mukey, comppct_r DESC ;

     
    The resulting table can be copied to a CSV file (in the current working directory) like this: \copy yolo_comp TO 'yolo_comp.csv' CSV HEADER

     
    If you are done with the table, remove it with the following SQL: DROP TABLE yolo_comp ;

Identifying the Largest Components

 
Overview:
Two methods for the selection of the largest components (based on the comppct_r column) within map units. This approach to selecting a single component per map unit is appropriate in situations where a single representative feature or property is sought. The partitioning of components within a map unit is closely related to the map unit type: complex, association, consociation or an "undifferentiated group". A breakdown of the number of components per each map unit type is summarized by the following query (26400 map units / 45971 major components):

SELECT mapunit.mukind, round(avg(n_components)) AS avg, min(n_components), max(n_components)
FROM mapunit JOIN
        (
        SELECT mapunit.mukey, count(component.mukey) AS n_components
        FROM
        mapunit JOIN component
        ON mapunit.mukey = component.mukey
        WHERE component.majcompflag = 'Yes' AND mapunit.mukind IS NOT NULL
        GROUP BY mapunit.mukey
        ) AS a
ON a.mukey = mapunit.mukey
GROUP BY mapunit.mukind ;

Query Results
mukind | avg | min | max
------------------------+-----+-----+-----
Consociation | 1 | 1 | 3
Complex | 2 | 1 | 4
Association | 3 | 1 | 4
Undifferentiated group | 2 | 1 | 4

 
Method 1: Filtering component percentages with the DISTINCT keyword.

  1. Use of the SELECT DISTINCT ON operator. Note that the ordering is done before the DISTINCT filter is applied, resulting in the largest (by component percentage) component. Note that results from ties are unpredictable.
    • SELECT DISTINCT ON (mukey) mukey, cokey, comppct_r, compname, taxorder
      FROM component
      WHERE areasymbol = 'ca654' AND majcompflag = 'Yes'
      ORDER BY mukey, comppct_r DESC ;

      Query Results
      mukey | cokey | comppct_r | compname | taxorder
      ---------+----------------+-----------+------------+-----------
      1487066 | 1487066:638252 | 80 | Auberry | Alfisols
      1487067 | 1487067:638274 | 80 | Auberry | Alfisols
      1487068 | 1487068:638383 | 85 | Crouch | Mollisols
      1487069 | 1487069:632318 | 85 | Cajon | Entisols
      1487070 | 1487070:632372 | 85 | Excelsior | Entisols
      1487071 | 1487071:632519 | 85 | Kimberlina | Entisols
      1487072 | 1487072:632642 | 85 | Nord | Mollisols
      1487076 | 1487076:632832 | 85 | Wasco | Entisols
      1487077 | 1487077:632866 | 85 | Whitewolf | Entisols
      464171 | 464171:640646 | 85 | Academy | Alfisols
    • We can inspect possible ties or other sources of error by looking at the smallest components.

      SELECT * FROM
              (
              SELECT DISTINCT ON (mukey) mukey, cokey, comppct_r, compname, taxorder
              FROM component
              WHERE areasymbol = 'ca654' AND majcompflag = 'Yes'
              ORDER BY mukey, comppct_r DESC
              ) AS a
      ORDER BY a.comppct_r ASC ;

      Query Results
      mukey | cokey | comppct_r | compname | taxorder
      ---------+----------------+-----------+---------------------+-------------
      464432 | 464432:1380237 | 10 | Ramona | Alfisols
      464184 | 464184:640686 | 30 | Ahwahnee | Alfisols
      464211 | 464211:640756 | 30 | Auberry | Alfisols
      464210 | 464210:640752 | 35 | Auberry | Alfisols
      464529 | 464529:641600 | 35 | Vista | Inceptisols
      464530 | 464530:641604 | 35 | Vista | Inceptisols
      464531 | 464531:641609 | 35 | Rock outcrop |

 
Finding the largest component (based on comppct_r) per mukey

  1. Identify how many map units there are in a given region or survey area. In this case San Joaquin County; areasymbol = 'ca077'.
    SELECT count(mukey) FROM mapunit WHERE areasymbol = 'ca113';

    Query Results
    count
    -------
    186
  2. Identify which map units contain components where a conventional, select largest component percentage type approach will not work. This commonly happens when there are several major components with the same comppct_r value, within the same map unit. Any returned mukey values need to be evaluated.
    SELECT component.mukey, count(component.mukey)
    FROM component
            JOIN
            (
            SELECT mukey, max(comppct_r) AS comppct FROM component WHERE majcompflag = 'Yes' AND areasymbol = 'ca077' GROUP BY mukey
            ) AS a
    ON component.comppct_r = a.comppct AND component.mukey = a.mukey
    GROUP BY component.mukey
    HAVING count(component.mukey) > 1 ;

    Query Results
    mukey | count
    --------+-------
    462043 | 2
    462068 | 2
  3. Evaluation of the two "trouble" map units identified above. Note the use of the where mukey in ('462043', '462068') syntax.
    SELECT mukey, cokey, comppct_r, compname, taxorder
    FROM component
    WHERE majcompflag = 'Yes' AND mukey IN ('462043', '462068')
    ORDER BY mukey, comppct_r DESC ;

    Query Results
    mukey | cokey | comppct_r | compname | taxorder
    --------+---------------+-----------+------------+-----------
    462043 | 462043:634006 | 45 | Dumps |
    462043 | 462043:634007 | 45 | Tailings |
    462068 | 462068:634159 | 30 | Honker | Alfisols
    462068 | 462068:634160 | 30 | Vallecitos | Alfisols
    462068 | 462068:634161 | 25 | Gonzaga | Mollisols

Soil Texture Example

 
Premise

Compute a series of weighted-average soil texture fractions (sand, silt, clay), for every component, of every map unit in Yolo County. These values will be further weighted by the spatial distribution of each map unit.