[Geodata] [Tiger] A few interesting observations on the Tiger2007fe data

Wed Jul 23 08:55:51 EDT 2008

As I mentioned earlier in this thread, one aid to disambiguating
street segments where address ranges are increasing on one side,
but decreasing on the other, is to link segments for the same
street together, and see how adjacent ranges line up.  But
sometimes a street enters an intersection from more than two
directions.  See, for example, 40.670528,-74.457660 in Google
Maps, where Chaucer Dr forms a topological "lollipop", with houses
on both sides of the loop and along the stem.  If you come to the
loop from the stem, and want to choose which edge to take out,
the address ranges may be helpful in making the choice.  For
example, if the houses along the outside of the loop have odd
addresses, and those on the inside, even addresses, then the
"obvious" choice would be the edge preserving the parity of the stem.
That is, if the stem has odd addresses on the right approaching the loop,
then one would want to turn right, keeping the odd addresses on the right
(outside), and vice versa.  Address numbers might also be a guide,
trying to minimize the gap as one moves from edge to edge.  But to
measure gaps, or establish parity, there must be a "number",
which is obvious when address ranges are purely numeric,
but less obvious when there are non-digits involved.
So, before worrying about linking edges together, I wanted to
get a handle on the nature of individual ranges from the
Address Ranges Relationship File.
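
As a rough illustration of the edge choice described above, here is a
minimal Python sketch.  The dictionary keys (right_parity, right_from,
right_to) and the sample numbers are hypothetical, chosen only for this
example; they are not TIGER field names.  It prefers an edge that keeps
the stem's parity on the right, and breaks ties by the smallest gap.

```python
def pick_next_edge(stem, candidates):
    """Choose the outgoing edge that best continues the stem.

    Each edge is a dict with hypothetical keys:
      right_parity: 'E' or 'O', parity of addresses on the right
                    in the direction of travel
      right_from / right_to: first and last right-side numbers
                    in the direction of travel
    """
    def score(edge):
        # A parity mismatch outranks any gap; False sorts before True.
        parity_mismatch = edge["right_parity"] != stem["right_parity"]
        gap = abs(edge["right_from"] - stem["right_to"])
        return (parity_mismatch, gap)

    return min(candidates, key=score)


# Made-up numbers: the stem ends at 15 (odd on the right).
stem = {"right_parity": "O", "right_from": 1, "right_to": 15}
candidates = [
    {"name": "turn left", "right_parity": "E", "right_from": 16, "right_to": 30},
    {"name": "turn right", "right_parity": "O", "right_from": 17, "right_to": 41},
]
chosen = pick_next_edge(stem, candidates)
```

Ranking parity ahead of gap is a design assumption; one could just as
well weight the two.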

Here are some summary data for the entire Tiger2007 distribution.

    Address range grand totals
	All Digits: 70505410
	     Mixed:  1608442
	 No Digits:       36
	     Total: 72113888

    Errors in all address ranges
	  Errors:    14937
	Warnings:      742
	   Clean: 36041265
	   Total: 36056944
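
The three buckets in the first table can be reproduced with a test
along these lines.  This is a Python sketch of the assumed rule, not
the script that produced the counts; the function name is invented,
and empty FROMHN/TOHN values are assumed to have been skipped already.

```python
import re


def classify(hn):
    """Bucket one FROMHN or TOHN value (assumed non-empty)."""
    if re.fullmatch(r"[0-9]+", hn):
        return "All Digits"
    if re.search(r"[0-9]", hn):
        return "Mixed"      # at least one digit and one non-digit
    return "No Digits"


buckets = [classify(hn) for hn in ("4232", "1F2", "OO-227", "-B1")]
```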

There are (exactly) twice as many sample points in the "grand totals"
summary as the error summary because each range has a TO and FROM
address.  There are very few addresses with no digits (a few each in
MI, WI and PR).  We can ignore them completely without much loss of
generality.  But there are enough with both digits and non-digits
that we had best do something sensible.  So I wrote some scripts to
"extend" the basic record, attempting to add a FROM_num and TO_num
field that is always digits only.  For All Digits addresses, these
are the same as the FROMHN and TOHN fields.  If FROMHN and TOHN
agree on everything but a single all-digit subsequence, that
subsequence is a logical choice for the FROM_num and TO_num fields.
When FROMHN and TOHN differ elsewhere, the errors and warnings
start popping up.  The difference between an error and a warning
is not crystal clear; if I can see a simple way to adjust the
range to "make sense", I do so, and it's a warning.  If the range
is pretty hopeless, it's an error.  Here's an example of a warning
(and the last time I'll include the entire "extended" record,
where any field name including a lower-case letter was added by me,
except for "_deleted", which is there in the original record).

County 01005 (Barbour, AL), record 5468
$record = {
            'SIDE' => 'L',
            'TO_num' => '6',
            '_deleted' => '',
            'TLID' => '69077027',
            'TOTYP' => '',
            'FROMTYP' => 'I',
            'PLUS4' => '',
            'ARID' => '400541338686',
            'warnings' => [
                            'trimmed 1 off 1F2'
                          ],
            'TOHN' => 'F6',
            'FROM_parts' => [
                              'F',
                              '2'
                            ],
            'FROMHN' => '1F2',
            'FROM_num' => 2,
            'MTFCC' => 'D1000',
            'addresses' => 3,
            'ZIP' => '36027',
            'TO_parts' => [
                            'F',
                            '6'
                          ],
            'parity' => 'E'
          };

The original range was 1F2 => F6, a pattern (extraneous digits at the
front of one address endpoint) that happens often enough (about 650
times in the entire distribution) that it might (or might not) be worth
correcting.  I simply drop the extraneous digits, with a warning,
yielding range F2 => F6, 3 addresses with Even parity.
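
A minimal Python sketch of this "extend" step follows.  It is a
reconstruction from the description above, not the original script:
the helper names and the error strings are invented, and only the
leading-digit repair is handled.  Field names in the result mirror the
extended record shown earlier.

```python
import re


def split_parts(hn):
    """Split a house number into runs of digits and runs of non-digits."""
    return re.findall(r"[0-9]+|[^0-9]+", hn)


def is_num(part):
    return part.isascii() and part.isdigit()


def shape(parts):
    """Non-digit parts kept verbatim, digit runs replaced by None."""
    return tuple(None if is_num(p) else p for p in parts)


def extend(fromhn, tohn):
    out = {"FROMHN": fromhn, "TOHN": tohn, "warnings": [], "errors": []}
    ends = {"FROM": split_parts(fromhn), "TO": split_parts(tohn)}
    # Assumed repair: drop a leading digit run that only one endpoint has.
    for a, b in (("FROM", "TO"), ("TO", "FROM")):
        pa, pb = ends[a], ends[b]
        if (pb and len(pa) == len(pb) + 1 and is_num(pa[0])
                and shape(pa[1:]) == shape(pb)):
            original = out[a + "HN"]
            out["warnings"].append(f"trimmed {pa[0]} off {original}")
            ends[a] = pa[1:]
            break
    fp, tp = ends["FROM"], ends["TO"]
    out["FROM_parts"], out["TO_parts"] = fp, tp
    if shape(fp) != shape(tp):
        out["errors"].append("non-numeric components differ")
        return out
    nums = [i for i, p in enumerate(fp) if is_num(p)]
    diff = [i for i in nums if fp[i] != tp[i]]
    if len(nums) == 1:
        pick = nums[0]          # only one digit run: nothing to decide
    elif len(diff) == 1:
        pick = diff[0]          # the single run on which the ends differ
    else:
        out["errors"].append(f"differs on {len(diff)} numeric components")
        return out
    out["FROM_num"], out["TO_num"] = int(fp[pick]), int(tp[pick])
    return out


rec = extend("1F2", "F6")
```

For the Barbour record, rec carries FROM_num 2, TO_num 6 and the
warning 'trimmed 1 off 1F2'.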

Another, less common, pattern is an extraneous "-" at the start of
one address endpoint, with 92 occurrences in the distribution.  For example,

County 10001 (Kent, DE), record 97
            'TLID' => '68092276',
            'ARID' => '400404723907',
            'TOHN' => 'B9',
            'FROMHN' => '-B1',

Here the original range, -B1 => B9, gets converted to the reasonably
obvious B1 => B9.  After this correction, in all but about 50 cases,
mixed from/to addresses agree on all the non-numeric components.
One of the exceptions is

County 72021 (Bayamon, PR), record 3144
            'TLID' => '206027274',
            'ARID' => '400583928652',
            'TOHN' => 'OO-227',
            'FROMHN' => 'O3',

This range is so far off the wall that I can't think of any way
to adjust it that isn't an outright guess.  But losing 50 address
ranges is certainly tolerable.  By far the largest class of what
I categorized as errors is mixed addresses differing on two or
more numerical components, which occurred about 13200 times.
All but 2 of these differed at the first and second numerical
component.  A typical instance is

County 06037 (Los_Angeles, CA), record 34696
            'TLID' => '141604200',
            'ARID' => '4001117732741',
            'TOHN' => '1318-9',
            'FROMHN' => '1316-5',

When the second components are the same length, as I believe is
usually the case (but I'll have to check), it's not unreasonable
to simply drop whatever separates the components, which would
yield FROM_num => 13165 and TO_num => 13189, an Odd range
having 13 addresses.  Given any odd number in that range,
we could reconstruct the "real" address by re-inserting the
non-digit components, for example, 13171 => 1317-1.
I'll probably do that, and turn the errors into warnings,
but it's almost certainly going to mask some real errors, like

County 06035 (Lassen, CA), record 3888
            'TLID' => '126954239',
            'ARID' => '400360492549',
            'TOHN' => '708-402',
            'FROMHN' => '463-500',

It is improbable that there are really 122452 even addresses
along the street.  But we've seen preposterously large
all-numeric ranges before in this thread, so maybe that
should just be a warning of its own.
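
The arithmetic behind both examples can be sketched in Python as
follows.  The function names and the plausibility limit of 1000 are
assumptions made for illustration; nothing in the data suggests a
particular cutoff.

```python
import re

DIGITS = "0123456789"


def join_digits(hn):
    """Drop every non-digit: '1316-5' -> 13165."""
    return int(re.sub(r"[^0-9]", "", hn))


def count_same_parity(lo, hi):
    """Addresses from lo to hi in steps of 2 (endpoints share parity)."""
    return abs(hi - lo) // 2 + 1


def reinsert(num, template):
    """Pour the digits of num back into the digit slots of template."""
    digits = str(num)
    slots = sum(c in DIGITS for c in template)
    if len(digits) != slots:
        raise ValueError("digit count does not match template")
    it = iter(digits)
    return "".join(next(it) if c in DIGITS else c for c in template)


def looks_preposterous(count, limit=1000):
    """Hypothetical sanity check; the limit is an arbitrary guess."""
    return count > limit


la = count_same_parity(join_digits("1316-5"), join_digits("1318-9"))
lassen = count_same_parity(join_digits("463-500"), join_digits("708-402"))
rebuilt = reinsert(13171, "1316-5")
```

Here la is 13, lassen is 122452, and rebuilt is '1317-1'.  A check
like looks_preposterous() would flag the Lassen range while leaving
the Los Angeles one alone.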

I figured that there wouldn't be any parity errors, since
that's so easy to check for, but there were nearly 1700
in the distribution.  For example

County 55035 (Eau_Claire, WI), record 4214
            'TLID' => '600641201',
            'ARID' => '400696181628',
            'TOHN' => '4253',
            'FROMHN' => '4232',

The FROMHN is even, the TOHN is odd.  I don't believe this
is supposed to happen, but I happen to like the ability to
express the concept that all the numbers from 4232 through
4253 can appear.  The US Postal Service data include a
parity character, E, O or B, for Even, Odd or Both,
in their address ranges, a scheme I prefer.  However,
assuming the endpoints really were intended to have the
same parity, perhaps we can use the opposite side,
or adjacent edges, to resolve the ambiguity, much as I
hope to do for ambiguities about increasing/decreasing
ranges.
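
For comparison, here is a small Python sketch of the E/O/B idea.
Treating mismatched endpoints as 'B' is only one reading of such a
range (the other being that it is an error to resolve from neighboring
edges), and the function names are made up.

```python
def range_parity(from_num, to_num):
    """Return 'E', 'O' or 'B' (Even, Odd, Both) for a numeric range."""
    if from_num % 2 != to_num % 2:
        return "B"
    return "E" if from_num % 2 == 0 else "O"


def count_addresses(from_num, to_num):
    """Count the numbers a range covers, given its parity code."""
    span = abs(to_num - from_num)
    if range_parity(from_num, to_num) == "B":
        return span + 1         # every number in the range
    return span // 2 + 1        # every other number


eau_claire = (range_parity(4232, 4253), count_addresses(4232, 4253))
barbour = (range_parity(2, 6), count_addresses(2, 6))
```

Under this reading the Eau Claire range is 'B' with 22 possible
numbers, and the Barbour range is 'E' with 3.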

Next step: start linking adjacent edges.  -- jpl