[Geodata] [Tiger] A few interesting observations on the Tiger2007fedata

John P. Linderman jpl at research.att.com
Mon Jul 28 09:43:24 EDT 2008


An update on address ranges:

When I regarded any mixed range that differed on more than one
numeric component as an error, I saw

    Address range grand totals
	All Digits: 70505410
	     Mixed:  1608442
	 No Digits:       36
	     Total: 72113888

    Errors in all address ranges
	  Errors:    14937
	Warnings:      742
	   Clean: 36041265
	   Total: 36056944

The vast majority of the errors were caused by mixed addresses
differing on more than one numerical component.

I changed my interpretation to allow differences in more than
one numerical component, as long as the FROM and TO addresses
agreed on all non-numerical components, and the lengths of all
numerical components but the first were the same.  The reasons
for these restrictions are to make it possible to reconstruct
the mixed-component addresses from a purely numerical counterpart.
For example, given the range

            'TLID' => '63653257',
            'ARID' => '400679390810',
            'TOHN' => 'L5',
            'FROMHN' => 'LOT37',

it's ambiguous whether number 25 should be converted to L25 or
to LOT25.  And, although I found no cases where the lengths
of numerical components differed except on the first component,
if there had been a range like

            'TOHN' => '1-000',
            'FROMHN' => '300-0',

then 2000 might be 2-000 or 200-0.  But if all the trailing
numerical components have the same length, then it's clear
how to peel off the appropriate number of digits from a
numerical input to repopulate the mixed-components form.
You get the single numerical equivalent of a mixed-mode
address by ignoring all the non-numerical components,
and concatenating the remaining numerical components.
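That numerical-equivalent step is simple enough to sketch in a few
lines (a hypothetical Python sketch, not the original code):

```python
import re

def numeric_equivalent(addr):
    """Concatenate the digit runs of a mixed-format house number,
    ignoring all non-numerical components.  Returns 0 when the
    address contains no digits at all."""
    runs = re.findall(r"\d+", addr)
    return int("".join(runs)) if runs else 0

print(numeric_equivalent("1F2"))    # digit runs "1", "2"
print(numeric_equivalent("MI1-6"))  # digit runs "1", "6"
```

Dropping the non-digits before concatenating is what lets missing
leading-0 components fall out of the comparison later: "F6" and a
hypothetical "0F6" both reduce to the number 6.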

As I was implementing a single-number-to-mixed-format routine,
I realized that one had to be careful to provide enough leading
0's so that all numerical components got a numerical value.
And, in an AHA moment, I wondered if failing to do that
might account for some of the bogus ranges where the FROM
and TO addresses had different numbers of components.
Given the conversion routine, it was easy enough to check.
Take the address with fewer components, putatively missing
some leading 0 components, take its numerical equivalent
by concatenating all the numerical components (so the
missing leading 0's won't influence the numerical
equivalent), reformat (properly) using the address with more
components, and see if you get the shorter address if you
remove the appropriate number of leading 0's.  For example,

            'TLID' => '69077027',
            'ARID' => '400541338686',
            'TOHN' => 'F6',
            'FROMHN' => '1F2',

The TO address has one fewer numerical component than the
FROM address.  Extract the numerical equivalent, 6,
from the short address, F6, by dropping all the non-digits.
A simple way of avoiding errors is to paste a bunch
of leading 0's onto the number to be reformatted,
so 6 => 06.  Reformat 06 using the longer address, 1F2.
We work from right to left.  The final numerical component
of 1F2 (2) has length one, so peel one digit off the input value,
yielding 6 and leaving 0.  Paste the non-numerical component
(F) onto the front, yielding F6.  The next numerical component
of 1F2 is 1, and it's the only remaining numerical component,
so put the rest of the input there, yielding 0F6.
The original TO address had 1 fewer numerical component,
so see if the result has 1 leading 0 component.  Yep, 0F6 does.
Remove it, and see if you get the original address.
0F6 => F6, the original address.  So, accept the properly
converted address, 0F6, as the intended TO address.
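The right-to-left reformatting walked through above might look like
this (a hypothetical Python sketch of the routine described, with
names of my own choosing):

```python
import re

def reformat(number, template):
    """Redistribute the digits of `number` into the mixed format of
    `template`, working right to left.  Every numerical component
    after the first keeps its template length; the first absorbs
    whatever digits remain.  Enough leading 0's are supplied so that
    every numerical component gets a value."""
    parts = re.findall(r"\d+|\D+", template)      # alternating runs
    width = sum(len(p) for p in parts if p.isdigit())
    digits = str(number).rjust(width, "0")        # pad with leading 0's
    first = next(i for i, p in enumerate(parts) if p.isdigit())
    out = []
    for i in range(len(parts) - 1, -1, -1):
        p = parts[i]
        if not p.isdigit():
            out.append(p)                 # copy non-numerical component
        elif i == first:
            out.append(digits)            # the rest goes to the first
        else:
            out.append(digits[-len(p):])  # peel len(p) digits off
            digits = digits[:-len(p)]
    return "".join(reversed(out))

print(reformat(6, "1F2"))   # the worked example above
```

Running the worked example, reformat(6, "1F2") peels the 6 for the
final component, copies the F, and leaves the padding 0 for the first
component, yielding 0F6.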

In a previous posting, I special-cased extraneous leading
digits, and would have turned 1F2 into F2.  I like this
approach better.  There's less special-case code, and it
picks up many more ranges that I would otherwise have
rejected outright, like

            'TLID' => '191019110',
            'ARID' => '40045715742',
            'TOHN' => 'MI-2',
            'FROMHN' => 'MI1-6',

Here the "missing" numerical component isn't at the start,
it's hidden between an I and a dash.  But the approach
above works just fine, reformatting 2 as MI0-2 via the
longer address, then verifying that the deletion of the
leading 0 component matches the truncated address.
So we act as though TOHN had been MI0-2.
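The verification step, deleting the zero-valued leading numerical
components and comparing against the shorter original, can be
sketched the same way (hypothetical Python, names are mine):

```python
import re

def drop_zero_components(addr, k):
    """Remove the first k numerical components of `addr`, provided
    they are all 0; return None if any of them is non-zero, so the
    candidate reconstruction is rejected."""
    parts = re.findall(r"\d+|\D+", addr)
    dropped, out = 0, []
    for p in parts:
        if p.isdigit() and dropped < k:
            if int(p) != 0:
                return None      # non-zero component: not a match
            dropped += 1         # drop this 0 component
        else:
            out.append(p)
    return "".join(out) if dropped == k else None

print(drop_zero_components("0F6", 1))    # matches TOHN F6
print(drop_zero_components("MI0-2", 1))  # matches TOHN MI-2
```

Note that for MI0-2 the dropped component is the leading *numerical*
component, even though it sits in the middle of the string, which is
exactly why the MI-2/MI1-6 range is recoverable by this approach.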

Using this approach, I saw

    Address range grand totals
	All Digits: 70505410
	     Mixed:  1608474
	 No Digits:        4
	     Total: 72113888

    Errors in all address ranges
	  Errors:     1549
	Warnings:      741
	   Clean: 36054654
	   Total: 36056944

Those paying (too much :-) attention may have noticed that
the grand totals for Mixed: and No Digits: changed.  That's
because digits were added to 32 addresses where none had
existed before, as in

            'TLID' => '181906924',
            'ARID' => '400176495890',
            'TOHN' => 'E1099',
            'FROMHN' => 'E',

where the new FROMHN is treated as E0.  The check-for-leading-0's
approach turned 663 errors into warnings.

The bulk of the remaining errors are parity errors, 1524 of them,
and even they were reduced (from just under 1700) by the new
interpretation of mixed numbers, although some new parity
errors were introduced, as the most recent example demonstrates.
Since there's hope for correcting parity errors "by context",
just as I hope to do for ambiguous ascending/descending ranges
on a single edge, these are less serious than just ignoring
the addresses altogether.  So I find the results enormously
encouraging.  There are just 25 or so "hopeless" address
ranges, many of which involve an extraneous '-'

            'TLID' => '204876438',
            'ARID' => '400582571260',
            'TOHN' => 'F-11824',
            'FROMHN' => 'F0',

or a doubled character

            'TLID' => '206226294',
            'ARID' => '40083493197',
            'TOHN' => 'J9',
            'FROMHN' => 'JJ115',

We could probably even devise simple rules for cleaning up
many of the remaining 25, but it would be guesswork of a less
systematic nature than the "missing leading 0's" model,
and there's many other things to worry about first.  -- jpl



