[Geodata] [Tiger] A few interesting observations on the Tiger2007fe data

Tue Jul 8 14:14:56 EDT 2008

On Mon, 30 Jun 2008
Stephen Woodbridge <woodbri at swoodbridge.com> said
in part, and with some editing on my part:

> It would be great if someone from the US Census would monitor
> this list.  I'll have to see if I can find anyone that might be
> interested.

I got a nice reply in response to my offer to have someone at
TIGER/Line monitor this list and participate in the design of
a "collected corrections" database.

    I'll pass the link below on to anyone here in the Census
    Bureau's Geography Division who is involved with the
    technical development of the TIGER/Line shapefiles.  While
    I cannot guarantee that anyone will visit or monitor it, I
    will let them know that it appears to contain information
    and comments which could help us design future product
    releases.  As for the set-up of the database and the offer
    to provide us with comments collected via it, I will also
    forward this on to the TIGER/Line shapefiles technical
    development staff to see if they are interested.  If they
    are, I will ask them to get in touch with you directly via
    your email at:  jpl at research.att.com.

So the ball is in their court if they want to participate.
If I hear anything else from them, I'll share it here
(unless they ask me not to).

> One problem in interpreting the data for geocoding will be how to
> decide what the various address ranges mean, i.e., do they overlap
> each other, are they stacked like |-- range 1 --+--- range 2 --|,
> or are they just random house numbers that sit on the street
> given some other range along the street, etc.

Given that there is no explicit order on ranges when multiple
ranges apply to the same side of an edge, and given that there
is no way to express "all addresses from 1 through 100" as a
single range, I think we just interpret ranges in the way that
"makes most sense".

I apologize in advance for all the detail that follows.
I want to give examples of problems and possible solutions.
I found it clumsy to have to consult different files to get all
the information I wanted, so I built a summary file containing
much of the information needed for geocoding (and providing
examples).

    state abbreviation
    place name(s)
    FULLNAME
    address range information
    TLID
    from and to endpoints

e.g.

NJ|Absecon,Pleasantville|N Main St|1500;1514;08232;R+1501;1515;08232;L|202486579|39.413502,-74.503823 39.414227,-74.503362
NJ|McGuire AFB|W Spaatz Dr|3901;39329;08641;L+3914;3946;08641;R|134042726|40.048055,-74.588625 40.044116,-74.582959

Place names were obtained via the PLACEFP field from the Topological
Faces Relationship file, pushed through the state-level Current Place
shapefile.  If the names differ on opposite sides, both are shown,
separated by commas.

Address range is ;-separated from, to, zip and side from the
Address Ranges relationship file.  Multiple ranges are
separated by + signs.
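
A minimal Python sketch of a reader for one line of this summary
file may make the layout concrete.  The names (AddressRange, Edge,
parse_edge) are made up for illustration, and the sketch assumes each
record sits on a single unwrapped line and that no field contains a
"|" character.

```python
from typing import NamedTuple

class AddressRange(NamedTuple):
    from_addr: str   # kept as text: some addresses are not purely numeric
    to_addr: str
    zip_code: str
    side: str        # "L" or "R"

class Edge(NamedTuple):
    state: str
    places: list     # one or two place names, empty if none
    fullname: str
    ranges: list     # list of AddressRange, empty if none
    tlid: int
    start: tuple     # (lat, lon) of the first point
    end: tuple       # (lat, lon) of the last point

def parse_point(text):
    lat, lon = text.split(",")
    return (float(lat), float(lon))

def parse_edge(line):
    state, places, fullname, ranges, tlid, endpoints = line.strip().split("|")
    parsed = []
    for chunk in (ranges.split("+") if ranges else []):
        frm, to, zip_code, side = chunk.split(";")
        parsed.append(AddressRange(frm, to, zip_code, side))
    start, end = endpoints.split()
    return Edge(state, places.split(",") if places else [], fullname,
                parsed, int(tlid), parse_point(start), parse_point(end))

edge = parse_edge("NJ|Absecon,Pleasantville|N Main St|"
                  "1500;1514;08232;R+1501;1515;08232;L|202486579|"
                  "39.413502,-74.503823 39.414227,-74.503362")
print(edge.fullname, edge.tlid, edge.ranges[0])
```

Addresses are left as strings here because a small fraction of the
ranges are not purely numeric.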

Endpoints are the first and last points from the edges shapefile,
in a format that can be cut-and-pasted into google maps.
I can search through the file to see who shares endpoints.
For example, the W Spaatz Dr endpoints are shared by

NJ|McGuire AFB|W Scott St|3901;39171;08641;L+3916;3946;08641;R|134042732|40.048515,-74.586238 40.048055,-74.588625
NJ|McGuire AFB|W Scott St||134080919|40.047608,-74.590805 40.048055,-74.588625

and

NJ|McGuire AFB|S Bolling Blvd|3801;3819;08641;L+3900;3914;08641;R+3802;3814;08641;L|134042733|40.048515,-74.586238 40.044116,-74.582959
NJ|McGuire AFB|S Bolling Blvd||134042780|40.044116,-74.582959 40.043389,-74.582686

from which we can conclude that there are no other edges for Spaatz
adjacent to this one.  If you enter any of the endpoints into google
maps, you can see how things are connected.  (The alignment of endpoints
with google maps is vastly improved over the older flat-file distributions).
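
The search for shared endpoints can be sketched in a few lines of
Python.  This is only an illustration: the function names are made
up, and it assumes that edges which meet carry byte-for-byte identical
coordinate strings, which is true of the examples above but is an
assumption in general.

```python
from collections import defaultdict

def endpoints(line):
    """Return the two endpoint strings from the last '|' field of a record."""
    return line.strip().rsplit("|", 1)[1].split()

def index_by_endpoint(lines):
    """Map each endpoint string to the records that start or end there."""
    index = defaultdict(list)
    for line in lines:
        for point in endpoints(line):
            index[point].append(line)
    return index

def neighbors(line, index):
    """Records sharing an endpoint with `line`, excluding `line` itself."""
    found = []
    for point in endpoints(line):
        for other in index[point]:
            if other != line and other not in found:
                found.append(other)
    return found

def street_name(line):
    """The FULLNAME field (third '|' field) of a record."""
    return line.split("|")[2]
```

If none of the neighbors of an edge has the same FULLNAME as the edge
itself, the street has no adjacent edges, which is the conclusion
drawn for W Spaatz Dr above.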

I have some statistical breakdowns of points per edge and
address ranges per edge from other runs, for example

    Edges from Bergen county NJ: 58300
    Points per edge
    58300 samples, min 2, max 293, mean 4.270, stdv 5.987
    Points per edge, roads only
    43927 samples, min 2, max 119, mean 3.555, stdv 3.486

    138582 address ranges, 5874 mixed, 0 non-numeric
    35027 road edges have address ranges
    2 non-road edges have address ranges
    Left side ranges
    32924 samples, min 1, max 50, mean 1.054, stdv 0.496
    Right side ranges
    32731 samples, min 1, max 103, mean 1.057, stdv 0.719

and, for all of New Jersey, with additional detail,

    Right side range summary
    326346 samples, min 1, max 103, mean 1.075, stdv 0.667

    range count   number in                      running
    bucket size      bucket  ranges  fraction      total
              1      312325  312325     0.957      0.957
              2       10156   20312     0.031      0.988
              3        2055    6165     0.006      0.994
              4         773    3092     0.002      0.997
              5         340    1700     0.001      0.998
              6         227    1362     0.001      0.999
              7         107     749     0.000      0.999
              8          79     632     0.000      0.999
              9          64     576     0.000      0.999
             10          51     510     0.000      0.999
             11          34     374     0.000      1.000
             12          27     324     0.000      1.000
             13          17     221     0.000      1.000
             14           7      98     0.000      1.000
        15 - 19          37     604     0.000      1.000
        20 - 24          23     486     0.000      1.000
        25 - 49          17     540     0.000      1.000
        50 - 99           5     383     0.000      1.000
      100 - 249           2     206     0.000      1.000

(The left side summary is similar).  The important thing is that
more than 95% of the edges having ranges have only 1 range per
side, so if we can deal with those correctly, we'll have something
useful.  Still, it'd be nice to handle the other 4.3% well.
And I'm not sure how typical New Jersey is.

The other statistic of interest is

    1403918 address ranges, 11671 mixed, 0 non-numeric

"Mixed" means there are both digits and non-digits,
"non-numeric" means there are no digits at all.
The others, the vast preponderance, are all numeric,
which is good, because fancier addresses are harder
to reason about.  (What is the maximum of PH2 and 1PH0,
two addresses that not only occur, but occur as a
single range?)  So I'll mostly restrict comments to
purely numerical addresses.
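
For concreteness, the three categories can be expressed as a small
Python function.  The name classify_address is made up, and treating
only the ASCII characters 0-9 as digits is an assumption.

```python
DIGITS = set("0123456789")

def classify_address(addr):
    """Return 'numeric', 'mixed' or 'non-numeric' for one address string."""
    has_digit = any(ch in DIGITS for ch in addr)
    has_other = any(ch not in DIGITS for ch in addr)
    if has_digit and has_other:
        return "mixed"
    if has_digit:
        return "numeric"
    return "non-numeric"

for addr in ("3914", "3801A", "PH2", "1PH0"):
    print(addr, classify_address(addr))
```

Only the "numeric" addresses can safely be converted to integers and
compared, which is what the rest of this note relies on.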

If geocoders (or hapless drivers, trying to find a given address
number on an unfamiliar street) controlled how addresses could
appear along edges, we'd probably agree on the following:

1) Addresses on one side of an edge should occur in increasing
or decreasing order, but not both (monotonicity).

2) If addresses occur on both sides of an edge, all odd
addresses should occur on one side, even addresses on the other,
(parity) and both should be increasing or both decreasing.

If that were the case, as soon as we saw addresses on the same
side of an edge, we'd know in which direction any address with
the same parity must be, or, if the addresses had different
parity, where any address must be.
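
A rough Python sketch of these two checks for a single edge follows.
The function names are invented, ranges are taken as (from, to)
pairs of integers, and only the direction and parity of the range
limits are examined.  Because the data gives no order among the
ranges on a side, the sketch does not try to check that order.

```python
def direction(frm, to):
    """+1 for an increasing range, -1 for decreasing, 0 for one address."""
    return (to > frm) - (to < frm)

def side_direction(ranges):
    """+1 or -1 if all ranges on a side run the same way, 0 if none has
    a direction, None if increasing and decreasing ranges are mixed."""
    dirs = {direction(f, t) for f, t in ranges} - {0}
    if not dirs:
        return 0
    return dirs.pop() if len(dirs) == 1 else None

def side_parity(ranges):
    """0 if every range limit is even, 1 if every one is odd, else None."""
    parities = {n % 2 for pair in ranges for n in pair}
    return parities.pop() if len(parities) == 1 else None

def check_edge(left, right):
    """Return a list of rule violations for one edge (empty if none)."""
    problems = []
    for name, ranges in (("left", left), ("right", right)):
        if side_direction(ranges) is None:
            problems.append(name + " side is not monotone")
    if left and right:
        lp, rp = side_parity(left), side_parity(right)
        if lp is None or rp is None or lp == rp:
            problems.append("parity is violated")
        ld, rd = side_direction(left), side_direction(right)
        if ld and rd and ld != rd:
            problems.append("sides run in opposite directions")
    return problems

# N Main St, TLID 202486579: 1501-1515 on the left, 1500-1514 on the right.
print(check_edge(left=[(1501, 1515)], right=[(1500, 1514)]))
```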

It would be nice if this could be extended from an edge to an
entire street, but this is complicated by the possibility that a
street may leave a given intersection in more than 2 directions.
For example

NJ|Berkeley Heights|Chaucer Dr|1;339;07922;R+110;324;07922;L|60602636|40.670731,-74.457430 40.670528,-74.457660

has three edges leaving 40.670731,-74.457430, two forming a loop with
addresses on both sides, the third being an edge entering the loop.
It's not so obvious how one would want to constrain address ranges
in such a case beyond the constraints on individual edges.
However, for the (far more common) case where all the edges
for a given street form a simple line, the constraints on parity
and monotonicity are reasonable (and commonly observed) for the
entire street.

A further advantage of monotonicity of address ranges along a street
is that the gaps, if any, between ranges also occur monotonically.
One way to "geocode the gaps" would be to treat them exactly as we
treat address ranges, but that can cause distortions when the gaps
are large relative to the address ranges.  For example

NJ||Columbus Rd|21900;21998;08505;R+2100;2198;08505;R+2101;2199;08505;L+21801;21999;08505;L|134097496|40.079818,-74.773183 40.079385,-74.768259

has a gap between 2198 and 21900.  (This could be dirty data, but
the USPS shows gaps of similar size on Columbus Road in Bordentown, NJ).
The gap is much larger than the ranges it separates, so if we treated
gaps in the same way we treat ranges, the real addresses would be
relegated to the extreme ends of the segment.  It would seem preferable
to "collapse" the entire gap to a single address, say its least address,
so the entire gap would geocode to a single position distinct from the
actual ranges, but in the proper position relative to the ranges.
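
One way to sketch the "collapsed gap" idea in Python is to give
every address in a range one slot along the side, and every gap
exactly one slot no matter how many numbers it spans.  Everything
here is an assumption made for illustration: the function names, the
step of 2 between addresses of the same parity, and the requirement
that the ranges are already (low, high) pairs that do not overlap.
The result is a fraction measured in the direction of increasing
addresses; mapping it onto the edge geometry is a separate step.

```python
def slot_layout(ranges, step=2):
    """Assign slots to sorted, non-overlapping (low, high) ranges.

    Returns (spans, total), where spans holds (low, high, first_slot)
    and total is the number of slots, gaps included.
    """
    spans = []
    next_slot = 0
    previous_high = None
    for low, high in sorted(ranges):
        if previous_high is not None and low > previous_high + step:
            next_slot += 1                  # one slot for the whole gap
        spans.append((low, high, next_slot))
        next_slot += (high - low) // step + 1
        previous_high = high
    return spans, next_slot

def fraction_along(address, ranges, step=2):
    """Fraction (0..1) of the way along the side for one address."""
    spans, total = slot_layout(ranges, step)
    for low, high, first in spans:
        if address < low:                   # in the gap before this range
            slot = first - 1
            break
        if address <= high:                 # inside this range
            slot = first + (address - low) // step
            break
    else:
        slot = total - 1                    # beyond the highest address
    return (max(slot, 0) + 0.5) / total

right_side = [(21900, 21998), (2100, 2198)]   # Columbus Rd, right side
for address in (2100, 2198, 10000, 21900, 21998):
    print(address, round(fraction_along(address, right_side), 3))
```

With the Columbus Rd ranges, each real range gets about half of the
segment and every address in the gap lands at one point between
them.  Plain linear interpolation over 2100-21998 would instead
squeeze 2100-2198 into the first half percent of the segment.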

In summary, if we find address ranges along a simple linear street
that violate parity or monotonicity assumptions, we might want to
adjust parity or range order to enforce them.

In addition to violations of parity or monotonicity,
there is another range "warning flag", a range that is
"suspiciously large".  Note, given the Columbus Rd example
above, that overall range is more likely to raise false
positives than individual ranges.  I somewhat arbitrarily set
the limit on individual range spans to 5000, and came up with
122 New Jersey ranges larger than that.
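
The test itself is trivial; a Python sketch, with the limit of 5000
taken from above and the function name made up:

```python
SPAN_LIMIT = 5000   # the somewhat arbitrary limit on one range's span

def suspicious_ranges(ranges, limit=SPAN_LIMIT):
    """Return the (from, to) pairs whose span exceeds the limit."""
    return [(f, t) for f, t in ranges if abs(t - f) > limit]

# W Spaatz Dr: only the left side range 3901-39329 is flagged.
print(suspicious_ranges([(3901, 39329), (3914, 3946)]))

# Columbus Rd: no individual range is flagged, although the overall
# span from 2100 to 21998 would be.
print(suspicious_ranges([(21900, 21998), (2100, 2198)]))
```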

If I look at the first suspiciously large range to come up,

NJ|Mays Landing|6th St|5998;18;08330;R+5999;1;08330;L|202495920|39.455843,-74.723059 39.456494,-74.725121

we have addresses 5998-18 on the right, and 5999-1 on the left.
If I go off to
  http://www.zillow.com
(one of my favorite sanity check websites) and search for
  5913 6th st     in    08330
and click on the blue diamonds along 6th to see house numbers,
they are all in the range 5900 to 5999.  And if I consult
US Postal Service data (for which they charge, so I can't
release it in bulk), I see addresses from 5900 to 5999,
even numbers with one ZIP+4 addon (2104), odd numbers with
another (2103).  Now, the TIGER/Line documentation,
TGRSHP07.pdf, specifies on the bottom of page 3-72 and top
of 3-74, pages 90-92 of 131, that Tiger includes addresses
that the USPS does not.  In fact, my interest in the Tiger
data started with the wish to extend the USPS data with
such addresses.  But the USPS has information for 6th st,
so it's not a case of general delivery only.  If I had to
bet, I'd bet that the Tiger range was wrong in this case,
and the USPS range should be used instead.

Here's another example
NJ|McGuire AFB|W Spaatz Dr|3901;39329;08641;L+3914;3946;08641;R|134042726|40.048055,-74.588625 40.044116,-74.582959

On the right side, we have the entirely reasonable range 3914-3946,
and, on the left, the totally unreasonable range 3901-39329.
The USPS has 95 different ZIP+4 addons for Spaatz Dr (no W directional).
Some are in the all-numerical ranges 3830-3899 and 4264-4599,
many others in ranges with alphabetical suffixes, like 3801A-3818A.
This being part of an air force base, zillow.com won't have any
house prices (or numbers), but google maps shows what looks like
large apartment complexes lining the street, consistent with the
USPS numbers.  In this case, even without the USPS data, it would
be reasonable to "trim" the left side upper limit to perhaps 3999,
based on the right side range.  If you enter
  3999 W Spaatz Dr, 08641
into google maps, you'll see a
  Placement on map is approximate
message in the balloon text, a sign (to me) that google isn't too
happy with the address ranges they are using.  And they call it
  W Spaatz Dr
(with the W directional), even though the USPS uses no directional.
This is another sign to me that google is using the information
originally derived from TIGER/Line.  I thought perhaps they would
handle even addresses better, but they, too, get the approximate
warning.  Every address I tried did, and all were assigned the
same location.

This raises the question of what the ideal geocoder would do,
assuming the Tiger data were flawless.  At the bottom of page
3-74, page 92 of 131, Tiger notes that they have edited for
address range "overlaps", using full street name, and ZIP
(and, although they don't mention it, probably place name,
since the same ZIP may appear in different cities) to define
the scope where overlap is forbidden.  This squares with the
USPS addresses, which may repeat on a given street and a given
city, if ZIP is ignored.  Under the flawlessness assumption
(which Tiger admits is not valid), there are non-overlapping
blocks of addresses along both sides of the street+ZIP edges.
If we take the minimum and maximum addresses on a side,
then any address less than the minimum or greater than
the maximum cannot really appear on the street.  But we
don't want to "disallow" such addresses.  We'd at least like
to return a point *somewhere* on the street, perhaps
adjacent to the minimum or maximum address points.
And the same applies to addresses between the minimum
and maximum, but in a "gap" not covered by any range.
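
As a sketch of the simplest such fallback, the Python function below
returns the requested address if some range covers it, and otherwise
the nearest range limit, together with a flag saying whether the
match was exact.  The name nearest_covered is invented, parity is
ignored, and at least one range is assumed to be present.

```python
def nearest_covered(address, ranges):
    """Return (address_to_use, exact) for one side of a street.

    `ranges` is a non-empty list of (from, to) integer pairs in
    either order.  `exact` is False when the address had to be moved.
    """
    normalized = [(min(f, t), max(f, t)) for f, t in ranges]
    for low, high in normalized:
        if low <= address <= high:
            return address, True
    limits = [n for pair in normalized for n in pair]
    return min(limits, key=lambda n: abs(n - address)), False

print(nearest_covered(3920, [(3914, 3946)]))   # covered by the range
print(nearest_covered(3999, [(3914, 3946)]))   # moved to a range limit
```

A caller could use the flag to attach a warning of the "placement
is approximate" kind to the result.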

Let's turn to violations of parity and monotonicity, and consider
how we might decide to correct them.  If there are multiple
ranges on one side, and monotonicity is violated, we can use
the other side to correct the violation, if it is monotone.
For example,
NJ|Somers Point|W Maryland Ave|601;699;08244;R+798;600;08244;L+799;701;08244;R|202482289|39.322896,-74.596432 39.324190,-74.599154
The right side has (increasing) range 601-699 and (decreasing)
range 799-701.  But the left side is unambiguously decreasing,
798-600, (and nicely "aligned" with the right side values),
so I would coerce both right side ranges to decrease as well.
I found over 1900 TLIDs in New Jersey where mixed increasing/decreasing
ranges on one side could be "corrected" by the other.  This left about
1400 sides (not TLIDs, since some TLIDs have ambiguous orderings on
both sides) that couldn't be adjusted using just the opposite side.
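
The opposite-side correction might look like this in Python.  It is
a sketch under assumptions: the names are made up, ranges are
(from, to) integer pairs, and "coercing" a range simply means
swapping its from and to values.

```python
def direction(frm, to):
    """+1 for an increasing range, -1 for decreasing, 0 for one address."""
    return (to > frm) - (to < frm)

def consensus(ranges):
    """+1 or -1 if every range with a direction agrees, else None."""
    dirs = {direction(f, t) for f, t in ranges} - {0}
    return dirs.pop() if len(dirs) == 1 else None

def coerce_side(side, other_side):
    """Orient every range on `side` the way `other_side` runs.

    Returns the corrected list, or None when the other side is itself
    ambiguous (or empty) and so cannot settle the question.
    """
    target = consensus(other_side)
    if target is None:
        return None
    fixed = []
    for f, t in side:
        if direction(f, t) == -target:
            f, t = t, f
        fixed.append((f, t))
    return fixed

# W Maryland Ave: the left side 798-600 decides for the right side.
print(coerce_side([(601, 699), (799, 701)], [(798, 600)]))
```

A return value of None corresponds to the sides that cannot be
adjusted from the opposite side alone.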

A more powerful, but more complicated, approach is to look for
adjacent edges, and use them to correct problems.  For example

NJ|Hammonton|Taylor Ave|2;6;08037;R+12;16;08037;R+49;21;08037;L+5;1;08037;L+15;11;08037;L+22;50;08037;R|202477830|39.660700,-74.752214 39.660181,-74.753618
NJ|Hammonton|Taylor Ave|52;60;08037;R+59;51;08037;L|202477829|39.660181,-74.753618 39.660029,-74.754106
NJ|Hammonton|Taylor Ave|62;70;08037;R+69;61;08037;L|202477826|39.660029,-74.754106 39.659769,-74.754528

This is easier to understand via a schematic, where the edges have
been connected left to right, with left side addresses at the top,
and right side ranges at the bottom.

49-21  5-1    59-51   69-61
------------+-------+------
 2-6  12-16   52-60   62-70

Each side of each edge is self-consistent, but in opposition to the
other side, so we can't determine which side is reversed from the
edge itself.  But collectively, it is quite obvious that the left
(top) side addresses are reversed.  If they were correct, we'd have
decreasing addresses suddenly increasing as we move from edge to edge.
If we reverse them, all addresses smoothly increase along the entire
street.  I haven't (yet) written the code necessary to hook the
edges of a given street together, so I can't say how often this
enables us to "correct violations" of monotonicity or parity.
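
Once the edges are hooked together, the decision itself could be
quite short.  The Python below is only a sketch of one possible
approach, not tested against real data beyond this example.  It
assumes the edges of the street have already been put in chain
order, head to tail, and that each side of each edge has been
reduced to a single overall (from, to) span.  All names are
invented.

```python
def overall_span(ranges):
    """Collapse the ranges on one side of one edge to one (from, to),
    assuming they all run in the same direction."""
    low = min(min(f, t) for f, t in ranges)
    high = max(max(f, t) for f, t in ranges)
    decreasing = any(f > t for f, t in ranges)
    return (high, low) if decreasing else (low, high)

def is_monotone(seq):
    pairs = list(zip(seq, seq[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

def flatten(spans):
    return [n for span in spans for n in span]

def orient_side(spans):
    """Decide how one side of a chain of edges should be read.

    `spans` holds one (from, to) per edge, in chain order.  Returns
    'as given', 'reversed' or 'unclear'.
    """
    as_given = is_monotone(flatten(spans))
    swapped = is_monotone(flatten([(t, f) for f, t in spans]))
    if as_given and not swapped:
        return "as given"
    if swapped and not as_given:
        return "reversed"
    return "unclear"

# Taylor Ave, edges 202477830, 202477829, 202477826 in chain order.
left = [overall_span([(49, 21), (5, 1), (15, 11)]), (59, 51), (69, 61)]
right = [overall_span([(2, 6), (12, 16), (22, 50)]), (52, 60), (62, 70)]
print(orient_side(left), "/", orient_side(right))
```

A single edge on its own always comes out "unclear", which matches
the observation that the edge itself cannot tell us which side is
reversed.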

So, it looks like there are fairly straightforward ways to
enforce monotonicity and parity using just the TIGER/Line data.
And I can imagine using the USPS data I have to do checks for
suspiciously large ranges, and maybe even adding the missing
ZIP+4 information, although this would probably require the
approval of the US Postal Service, since it represents a
major release of information not in the public domain, and
undercuts their TIGER-related product

  http://www.usps.com/ncsc/addressmgmt/tiger.htm

(but I can't seem to find pricing information for that product). -- jpl