Case Study: Fixing Bad TIGER Line data with R and PostGIS

Example of bad Tiger data in Stanislaus County: Red lines are the original road network, green lines are the corrected road network.Example of bad Tiger data in Stanislaus County: Red lines are the original road network, green lines are the corrected road network.

 
The Problem
The US Census does a nice job of collecting all sorts of geographic and demographic information every 10 years. This data is available free of charge in the rather complex and soon to be replaced TIGER/LINE format. While this data covers the entire US down to the local road level, there are numerous errors and even extreme cases of coordinate-shift. Here is an example from Stanislaus County, California. The original TIGER data (red lines) are offset several hundred meters from the imagery. While it is not clear what may have caused the problem, it can be fixed without much effort using an affine transformation. We do not have the transformation matrix, however it can be 'fit' to a set of control points by several methods. The general form of the affine transform can be conveniently represented in homogeneous coordinates as:


Affine Matrix Form: new coordinates on the left-hand side, old coordinates on the right-hand sideAffine Matrix Form in homogeneous coordinates:New coordinates on the left-hand side, old coordinates on the right-hand side. The transformation matrix is the 3x3 matrix in the center.

 
The Solution

  • transform using GRASS and a set of control points with v.transform. This is the natural choice if the data is already in GRASS.
  • Using the same control points, compute the transformation matrix in R, then apply in PostGIS with the Affine() function. This is the ideal approach for our scenario. Note that the format of the 'fitted' transformation matrix returned by R is in a slightly different format:


    Transformation Matrix from RTransformation Matrix from R

We first need a set of control points, good and bad coordinates. This can be accomplished in several ways, we used the d.where command in GRASS:

# load road subset:
v.in.ogr dsn=PG:'dbname=tiger2005se user=xxx password=xxx host=xxx' layer=stan_roads out=s
#
# pick up some control points: note that they will be interleaved with this approach
# odd line numbers are "bad" coordinates
# even line numbers are "good" coordinates
d.where > stan.c_points
#
# we can separate the "bad" and "good" points in R...

Computing the transformation matrix can be done with a simple regression between 'good' and 'bad' coordinates in R. Note that this approach was suggested by Prof. Brian Ripley on the R-help mailing list.

 
Compute the Affine Transformation Matrix in R

## read in control points
## the good and bad points are interleaved
## odd numbers are 'bad' points
## s <- read.table('stan.c_points')
&nbsp;
## s.bad <- s[seq(from=1, to=nrow(s), by=2), ]
## s.good <- s[seq(from=2, to=nrow(s), by=2), ]
##
## make a composite dataframe
## x,y = bad data
## nx,ny = good data
## c <- data.frame(x=s.bad$V1, y=s.bad$V2, nx=s.good$V1, ny=s.good$V2)
##
## save a local copy in the format that v.transform uses
## write.table(c, file='grass_control_pts.txt', row.names=F, col.names=F)
##
&nbsp;
## from now on just use the simplified format:
c <- read.table('grass_control_pts.txt')
names(c) <- c('x','y','nx','ny')

&nbsp;
## compute transformation matrix, and check model fit:
l <- lm(cbind(nx,ny) ~ x + y, data=c)
##
## see at bottom of page -->
summary(l)

&nbsp;
## convert to affine transform matrix to the form needed by PostGIS:
## transpose the regression coefficient matrix:
t(coef(l))
 (Intercept)           x          y
nx    5017.082  1.00231639 0.00918962
ny  -28138.395 -0.01352029 0.99740077


&nbsp;
## check graphically:
x <- seq(-2080000, -2040000, by=1000)
y <- seq(-20000, 5000, by=2000)
d <- expand.grid(x=x, y=y)
p <- predict(l, d)

&nbsp;
par(mfcol=c(1,2), pty='s')
plot(y ~ x, data=c, main='Control Points', cex=.4, pch=16)
points(ny ~ nx, data=c, col='red', cex=.4, pch=16)
## arrows(c$x, c$y, c$nx, c$ny, len=0.05, col='gray')

&nbsp;
plot(d, cex=0.2, main='Shifted Grid')
## arrows(d$x, d$y, p[,1], p[,2], len=0.05, col='gray')
points(p, col='red', cex=.2)

&nbsp;
## sample transformation
print(predict(l, data.frame(x=-2078417.35893968, y=-14810.57043808)), digits=10)
            nx           ny
 -2078350.804 -14809.67334

Establishing the transformation based on control points: Red points represent where the coordinates should be. Black points are the original and incorect coordinates.Establishing the transformation based on control points: Red points represent where the coordinates should be. Black points are the original and incorect coordinates.

 
Check Affine Transform Results in PostGIS

-- matrix format:

-- R:
-- | xoff a b |
-- | yoff d e |

-- postgis wants:
-- ST_Affine(geom, a, b, d, e, xoff, yoff)

-- as a query:
SELECT asText(
ST_Affine(
'POINT(-2078417.35893968 -14810.57043808)',
1.00231639, 0.00918962, -0.01352029, 0.99740077, 5017.082, -28138.395
)
) ;

                   astext                  
--------------------------------------------
 POINT(-2078350.80564006 -14809.6639251817)

-- This is the same as what R says!

 
Perform Affine Transformation in PostGIS

--
-- Step 1. Alter the geometry of the bad data in-place
-- note that we have a backup in 'stan_roads'
--
UPDATE roads SET wkb_geometry =  ST_Affine(wkb_geometry, 1.00231639, 0.00918962, -0.01352029, 0.99740077, 5017.082, -28138.395 ) WHERE module = 'TGR06099' ;

 
Regression Diagnostics

Response nx :

Call:
lm(formula = nx ~ x + y, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-207.088  -23.856    8.614   21.245  161.610 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 5.017e+03  1.369e+03    3.666  0.00079 ***
x           1.002e+00  6.654e-04 1506.386  < 2e-16 ***
y           9.190e-03  9.419e-04    9.756 1.20e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 52.25 on 36 degrees of freedom
Multiple R-Squared:     1,      Adjusted R-squared:     1 
F-statistic: 1.322e+06 on 2 and 36 DF,  p-value: < 2.2e-16 


Response ny :

Call:
lm(formula = ny ~ x + y, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.835 -18.459  -4.556  15.311  94.226 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.814e+04  7.409e+02  -37.98   <2e-16 ***
x           -1.352e-02  3.602e-04  -37.54   <2e-16 ***
y            9.974e-01  5.099e-04 1956.22   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 28.28 on 36 degrees of freedom
Multiple R-Squared:     1,      Adjusted R-squared:     1 
F-statistic: 2.187e+06 on 2 and 36 DF,  p-value: < 2.2e-16

Comparision with output from v.transform

 
First the output from R:
Looking at the residuals from the regression model used to map bad coordinates (x,y) to good coordinates (nx,ny):

## we made the linear model object 'l' above
## extract as dataframe
## residuals are computed for x and y separately
r <- as.data.frame(resid(l))

&nbsp;
## compute vector product of the (x,y) residuals:
c$resid <- sqrt(r$nx^2 + r$ny^2)
print(c)

          x          y       nx         ny      resid
1  -2078417 -14810.570 -2078314 -14838.378  46.617600
2  -2078743 -16057.955 -2078636 -16081.790  62.041274
3  -2077261 -16435.348 -2077170 -16463.156  40.905132
4  -2076709 -14405.369 -2076606 -14433.177  29.406399
5  -2074179 -15830.248 -2074084 -15901.558  33.111981
6  -2073850 -15707.435 -2073763 -15798.554  37.362736
7  -2073450 -13873.171 -2073359 -13920.712  21.623235
8  -2072359 -15204.613 -2072276 -15323.138  38.997678
9  -2072545 -14402.596 -2072450 -14513.219  32.918889
10 -2072189 -16022.434 -2072098 -16129.106  33.834074
11 -2071991 -16856.058 -2071928 -16942.976   6.277554
12 -2068407 -12999.396 -2068296 -13133.170   6.579285
13 -2069870 -12613.813 -2069764 -12731.848   2.357631
14 -2067635 -13188.253 -2067517 -13337.765  11.604519
15 -2066931 -13377.110 -2066809 -13518.753  22.719625
16 -2067411 -15084.692 -2067313 -15190.924  41.907273
17 -2066795 -18714.093 -2066741 -18846.019  14.541358
18 -2066384 -17080.538 -2066299 -17212.464  26.717495
19 -2068634 -19742.339 -2068580 -19835.464  27.483654
20 -2053326 -16930.710 -2053276 -17226.351  65.746074
21 -2051797 -17321.500 -2051899 -17579.762 227.516944
22 -2068307   2826.921 -2068066   2638.276  12.587853
23 -2067543   2648.205 -2067328   2449.631  37.729747
24 -2067126   4276.510 -2066904   4081.246  46.774630
25 -2066748   4170.604 -2066527   4001.816  59.509843
26 -2066068   2292.295 -2065860   2094.699  46.681553
27 -2065337   2107.872 -2065126   1900.397  43.386956
28 -2064606   1913.570 -2064378   1692.922  26.460389
29 -2064199   3558.561 -2063961   3356.401  47.742696
30 -2037464   6512.455 -2037076   5864.398  50.762994
31 -2036722   6825.682 -2036338   6199.227  22.699467
32 -2036876   6366.642 -2036498   5742.888  22.120176
33 -2040225   7150.180 -2039706   6575.029 161.631199
34 -2041064   7144.779 -2040732   6569.629  26.657582
35 -2044702 -15548.033 -2044564 -16024.903  23.844817
36 -2043992 -15723.521 -2043824 -16223.282  48.063840
37 -2043790 -14907.119 -2043611 -15383.990  34.844851
38 -2040616 -14820.445 -2040453 -15349.974  21.196233
39 -2039595 -15081.427 -2039485 -15588.263  47.263287

 
The Root-Mean-Square-Error (RMSE) for the fitted transform (in meters) is:

apply( (fitted(l) - c[, c('nx', 'ny')])^2  , 2, function(x) sqrt(mean(x)) )
      nx       ny
50.19954 27.17333

&nbsp;
#converting to the vector sum of RMSE (as reported in v.transform):
sqrt(50.19954^2 + 27.17333^2)
57.08225

 
The output from v.transform on the same set of control points:

Transformation Matrix
| xoff a b |
| yoff d e |
-------------------------------------------
5301.399323 1.002469 0.009172 
-28155.882288 -0.013530 0.997547 
-------------------------------------------

 
full output including the residuals:

               CHECK MAP RESIDUALS
                Current Map                  New Map
 POINT      X coord    Y coord  |        X coord   Y coord    |      residuals

  1.   -2078417.36    -14810.57 |  -2078314.07      -14838.38 |        46.81
  2.   -2078743.11    -16057.95 |  -2078635.85      -16081.79 |        62.22
  3.   -2077261.34    -16435.35 |  -2077169.97      -16463.16 |        41.05
  4.   -2076709.16    -14405.37 |  -2076605.87      -14433.18 |        29.59
  5.   -2074178.76    -15830.25 |  -2074083.67      -15901.56 |        33.21
  6.   -2073849.93    -15707.44 |  -2073762.78      -15798.55 |        37.42
  7.   -2073449.80    -13873.17 |  -2073358.68      -13920.71 |        21.62
  8.   -2072358.86    -15204.61 |  -2072275.89      -15323.14 |        39.02
  9.   -2072544.55    -14402.60 |  -2072449.73      -14513.22 |        32.97
 10.   -2072188.97    -16022.43 |  -2072098.11      -16129.11 |        33.87
 11.   -2071991.43    -16856.06 |  -2071928.22      -16942.98 |         6.27
 12.   -2068406.55    -12999.40 |  -2068296.38      -13133.17 |         6.60
 13.   -2069870.19    -12613.81 |  -2069763.96      -12731.85 |         2.33
 14.   -2067635.38    -13188.25 |  -2067517.35      -13337.76 |        11.63
 15.   -2066931.10    -13377.11 |  -2066809.13      -13518.75 |        22.74
 16.   -2067411.11    -15084.69 |  -2067312.75      -15190.92 |        41.93
 17.   -2066795.16    -18714.09 |  -2066740.84      -18846.02 |        14.64
 18.   -2066383.87    -17080.54 |  -2066298.50      -17212.46 |        26.74
 19.   -2068634.37    -19742.34 |  -2068580.05      -19835.46 |        27.53
 20.   -2053326.48    -16930.71 |  -2053275.51      -17226.35 |        66.09
 21.   -2051797.30    -17321.50 |  -2051899.25      -17579.76 |       227.91
 22.   -2068307.24      2826.92 |  -2068065.64        2638.28 |        12.41
 23.   -2067542.73      2648.21 |  -2067327.61        2449.63 |        37.44
 24.   -2067125.72      4276.51 |  -2066903.98        4081.25 |        46.40
 25.   -2066748.43      4170.60 |  -2066526.69        4001.82 |        59.12
 26.   -2066067.79      2292.29 |  -2065860.31        2094.70 |        46.35
 27.   -2065336.69      2107.87 |  -2065125.92        1900.40 |        43.07
 28.   -2064605.58      1913.57 |  -2064378.35        1692.92 |        26.16
 29.   -2064199.15      3558.56 |  -2063961.13        3356.40 |        47.43
 30.   -2037464.39      6512.45 |  -2037075.56        5864.40 |        50.66
 31.   -2036721.82      6825.68 |  -2036338.39        6199.23 |        22.54
 32.   -2036875.74      6366.64 |  -2036497.71        5742.89 |        21.95
 33.   -2040224.67      7150.18 |  -2039706.23        6575.03 |       161.54
 34.   -2041064.45      7144.78 |  -2040732.32        6569.63 |        26.74
 35.   -2044701.68    -15548.03 |  -2044564.34      -16024.90 |        23.64
 36.   -2043992.10    -15723.52 |  -2043824.24      -16223.28 |        47.60
 37.   -2043789.90    -14907.12 |  -2043610.60      -15383.99 |        34.35
 38.   -2040615.94    -14820.44 |  -2040453.30      -15349.97 |        20.77
 39.   -2039594.70    -15081.43 |  -2039485.02      -15588.26 |        47.85


  Number of points: 39
  Residual mean average: 57.082951

Higher Order Transformations

 
... Continuing from the initial example session in R ...

An affine transformation is based on a linear mapping between two coordinate-spaces. Testing for non-linearity (i.e. higher order model terms) can be a useful diagnostic in choosing the simpler affine transform.

 
Compute the difference between the good and bad coordinates

## compute vector difference and save back to original data frame:
c$sqdiff <- with(c, sqrt((nx-x)^2 + (ny-y)^2))

 
Generate two linear models:

## generate simple model
l.simple <- lm(sqdiff ~ nx + ny, data=c)
## model with higher order terms (up to 3rd-order):
## note to self: poly() computes orthagonal polynomials
l.poly <- lm(sqdiff ~ poly(nx, 3) + poly(ny, 3), data=c)

&nbsp;
## summarize the complex model:
summary(l.poly)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    302.38       3.97  76.174  < 2e-16 ***

poly(nx, 3)1  1091.17      35.82  30.466  < 2e-16 ***
poly(nx, 3)2   165.59      32.78   5.051 1.71e-05 ***
poly(nx, 3)3   -51.21      28.74  -1.782   0.0843 .  

poly(ny, 3)1   417.35      31.97  13.056 2.30e-14 ***
poly(ny, 3)2   -18.50      30.04  -0.616   0.5425    
poly(ny, 3)3    19.41      35.49   0.547   0.5882    

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

 
Is one model significantly better than the other?

## check via RMSE:
## simple model
sqrt( mean((fitted(l.simple) - c$sqdiff)^2) )
36.54747

&nbsp;
## model with 3rd-order polynomial
sqrt( mean((fitted(l.poly) - c$sqdiff)^2) )
22.45530

&nbsp;
## the anova function can compare nested models:
anova(l.simple, l.poly)

Analysis of Variance Table

Model 1: sqdiff ~ nx + ny
Model 2: sqdiff ~ poly(nx, 3) + poly(ny, 3)
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1     36 52093                                  
2     32 19665  4     32428 13.192 1.865e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 
Check visually:

## re-make grid of points to predict on:
## and save two versions to 'stack-up'
x <- seq(-2080000, -2040000, by=1000)
y <- seq(-20000, 5000, by=2000)
d.simple <- expand.grid(x=x, y=y)
d.poly <- expand.grid(x=x, y=y)

&nbsp;
## and predict the trend in differences:
d.simple$diff <- predict(l.simple, data.frame(nx=d$x, ny=d$y))
d.poly$diff <- predict(l.poly, data.frame(nx=d$x, ny=d$y))

&nbsp;
## stack for grouped plotting
library(lattice)
library(RColorBrewer)
d.combined <- make.groups(simple=d.simple, poly=d.poly)

&nbsp;
## plot:
wireframe(diff ~ x * y | which, data=d.combined, scales=list(arrows=TRUE\), drape=TRUE\, col.regions=brewer.pal(n=9, name='Oranges'), cuts=8, col='white')

Testing for linearity: Two visualizations of the deviance between coordinates positions, at the control point locations.Testing for linearity: Two visualizations of the deviance between coordinates positions, at the control point locations.

 
Conclusions:
It seems that a second order term was only warranted along the x-direction. The more complex model based on 3rd-order polynomials results in a significantly lower RMSE (about 10 meters lower), and is shown to be a better descriptor of variance in the test of nested models.

At the map scale in which the corrected data will be presented, the extra accuracy suggested by the more complex model (coordinate transformation function) is negligible. This allows for the simpler model, which can be directly used by the convenient ST_Affine() function in PostGIS for the heavy-lifting.