[Geodata] [Tiger] A few interesting observations on the Tiger2007fedata

Joe Bussell joe.bussell at gmail.com
Wed Jul 23 16:54:13 EDT 2008


Have they added z-ordering to the Tiger data?  I am still hoping to build a
routing application on Tiger data, which requires that I can
disambiguate overpasses from the roads they "touch".

Cordially,

Joe Bussell


On Wed, Jul 23, 2008 at 8:13 AM, Stephen Woodbridge <woodbri at swoodbridge.com>
wrote:

> Hi John,
>
> These reports and analysis are really great! Sorry that I have not been
> able to contribute more, but I have been distracted by a client on another
> project. I hope to get back to this and add my two cents as well.
>
> I have also run into the address number range issues that you mentioned, and
> I like your idea of dropping the extraneous characters; I was just throwing
> an error in my code.
>
> Can you talk a little bit about your development environment? It looks like
> you are using Perl. Are you using a database as a backing store to help with
> the processing? MySQL, PostgreSQL, SQLite, other?
>
> Thanks,
>  -Steve
>
> John P. Linderman wrote:
>
>> As I mentioned earlier in this thread, one aid to disambiguating
>> street segments where address ranges are increasing on one side,
>> but decreasing on the other, is to link segments for the same
>> street together, and see how adjacent ranges line up.  But
>> sometimes a street enters an intersection from more than two
>> directions.  See, for example, 40.670528,-74.457660 in Google
>> Maps, where Chaucer Dr forms a topological "lollipop", with houses
>> on both sides of the loop and along the stem.  If you come to the
>> loop from the stem, and want to choose which edge to take out,
>> the address ranges may be helpful in making the choice.  For
>> example, if the houses along the outside of the loop have odd
>> addresses, and those on the inside, even addresses, then the
>> "obvious" choice would be the edge preserving the parity of the stem.
>> That is, if the stem has odd edges on the right approaching the loop,
>> then one would want to turn right, keeping the odd edges on the right
>> (outside), and vice versa.  Address numbers might also be a guide,
>> trying to minimize the gap as one moves from edge to edge.  But to
>> measure gaps, or establish parity, there must be a "number",
>> which is obvious when address ranges are purely numeric,
>> but less obvious when there are non-digits involved.
>> So, before worrying about linking edges together, I wanted to
>> get a handle on the nature of individual ranges from the
>> Address Ranges Relationship File.
>>
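The edge choice at the lollipop junction can be sketched in Python.
This is only an illustration under stated assumptions: the Edge
record, its field names, and the rule "parity first, then smallest
gap" are hypothetical, not something taken from John's scripts or
from the Tiger files.

```python
from typing import NamedTuple

class Edge(NamedTuple):
    tlid: str          # edge identifier (TLID in the Tiger records)
    right_parity: str  # 'E' or 'O': parity of the addresses on the right
    first_num: int     # address number nearest the shared node

def pick_next_edge(stem_right_parity, stem_last_num, candidates):
    """Prefer an edge that keeps the stem's parity on the right-hand side;
    among those, take the one with the smallest address gap.
    Raises ValueError if candidates is empty."""
    return min(
        candidates,
        key=lambda e: (e.right_parity != stem_right_parity,
                       abs(e.first_num - stem_last_num)),
    )
```

Sorting on the pair (parity mismatch, gap) makes parity the primary
criterion and the gap only a tie-breaker.  Whether that ordering is
right for real data is an open question.
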
>> Here are some summary data for the entire Tiger2007 distribution.
>>
>>    Address range grand totals
>>        All Digits: 70505410
>>             Mixed:  1608442
>>         No Digits:       36
>>             Total: 72113888
>>
>>    Errors in all address ranges
>>          Errors:    14937
>>        Warnings:      742
>>           Clean: 36041265
>>           Total: 36056944
>>
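The three categories in the first table can be reproduced with a
plain digit test.  A minimal Python sketch follows; the function name
is made up here, and treating an empty string as "No Digits" is an
assumption, since the message does not say how blanks were counted.

```python
ASCII_DIGITS = "0123456789"

def classify_hn(hn):
    """Classify a FROMHN/TOHN string as in the 'grand totals' table."""
    digit_flags = [c in ASCII_DIGITS for c in hn]
    if not any(digit_flags):
        return "No Digits"   # also covers the empty string (assumption)
    if all(digit_flags):
        return "All Digits"
    return "Mixed"
```

For the endpoints quoted in this thread, classify_hn("4232") gives
"All Digits" and classify_hn("1F2") gives "Mixed".
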
>> There are (exactly) twice as many sample points in the "grand totals"
>> summary as the error summary because each range has a TO and FROM
>> address.  There are very few addresses with no digits (a few each in
>> MI, WI and PR).  We can ignore them completely without much loss of
>> generality.  But there are enough with both digits and non-digits
>> that we had best do something sensible.  So I wrote some scripts to
>> "extend" the basic record, attempting to add a FROM_num and TO_num
>> field that is always digits only.  For All Digits addresses, these
>> are the same as the FROMHN and TOHN fields.  If FROMHN and TOHN
>> agree on everything but a single all-digit subsequence, that
>> subsequence is a logical choice for the FROM_num and TO_num fields.
>> When FROMHN and TOHN differ elsewhere, the errors and warnings
>> start popping up.  The difference between an error and a warning
>> is not crystal clear; if I can see a simple way to adjust the
>> range to "make sense", I do so, and it's a warning.  If the range
>> is pretty hopeless, it's an error.  Here's an example of a warning
>> (and the last time I'll include the entire "extended" record,
>> where any field name including a lower-case letter was added by me,
>> except for "_deleted", which is there in the original record).
>>
>> County 01005 (Barbour, AL), record 5468
>> $record = {
>>            'SIDE' => 'L',
>>            'TO_num' => '6',
>>            '_deleted' => '',
>>            'TLID' => '69077027',
>>            'TOTYP' => '',
>>            'FROMTYP' => 'I',
>>            'PLUS4' => '',
>>            'ARID' => '400541338686',
>>            'warnings' => [
>>                            'trimmed 1 off 1F2'
>>                          ],
>>            'TOHN' => 'F6',
>>            'FROM_parts' => [
>>                              'F',
>>                              '2'
>>                            ],
>>            'FROMHN' => '1F2',
>>            'FROM_num' => 2,
>>            'MTFCC' => 'D1000',
>>            'addresses' => 3,
>>            'ZIP' => '36027',
>>            'TO_parts' => [
>>                            'F',
>>                            '6'
>>                          ],
>>            'parity' => 'E'
>>          };
>>
>> The original range was 1F2 => F6, a pattern (extraneous digits at the
>> front of one address endpoint) that happens often enough (about 650
>> times in the entire distribution) that it might (or might not) be worth
>> correcting.  I simply drop the extraneous digits, with a warning,
>> yielding range F2 => F6, 3 addresses with Even parity.
>>
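The "single differing all-digit subsequence" rule, together with the
two simple repairs (extra leading digits, and the stray leading "-"
that John describes next), might look roughly like this in Python.
It is a sketch, not John's code: the function names, the wording of
the hyphen warning, and the choice of the last digit run when both
endpoints are identical are all assumptions.

```python
import re

def is_digits(part):
    return part.isascii() and part.isdigit()

def split_parts(hn):
    """Split into alternating runs of ASCII digits and non-digits."""
    return re.findall(r"[0-9]+|[^0-9]+", hn)

def extend_range(fromhn, tohn):
    """Return (FROM_num, TO_num, warnings); raise ValueError if hopeless."""
    warnings, ends = [], []
    for hn, other in ((fromhn, tohn), (tohn, fromhn)):
        if hn.startswith("-") and not other.startswith("-"):
            warnings.append(f"dropped leading - from {hn}")  # hypothetical text
            hn = hn[1:]
        ends.append(split_parts(hn))
    f, t = ends
    # Extraneous digits at the front of exactly one endpoint.
    if len(f) == len(t) + 1 and is_digits(f[0]):
        warnings.append(f"trimmed {f[0]} off {fromhn}")
        f = f[1:]
    elif len(t) == len(f) + 1 and is_digits(t[0]):
        warnings.append(f"trimmed {t[0]} off {tohn}")
        t = t[1:]
    if len(f) != len(t):
        raise ValueError(f"shapes differ: {fromhn} => {tohn}")
    pairs = list(zip(f, t))
    diffs = [i for i, (a, b) in enumerate(pairs) if a != b]
    numeric = [i for i, (a, b) in enumerate(pairs)
               if is_digits(a) and is_digits(b)]
    if not numeric:
        raise ValueError(f"no numeric component: {fromhn} => {tohn}")
    if len(diffs) > 1 or (diffs and diffs[0] not in numeric):
        raise ValueError(f"cannot reconcile: {fromhn} => {tohn}")
    # Identical endpoints: fall back to the last digit run (assumption).
    pos = diffs[0] if diffs else numeric[-1]
    return int(f[pos]), int(t[pos]), warnings
```

With the Barbour County record, extend_range("1F2", "F6") returns
(2, 6, ['trimmed 1 off 1F2']).  Ranges such as 'O3' => 'OO-227' or
'1316-5' => '1318-9' raise ValueError, which corresponds to the
"error" class.
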
>> Another, less common, pattern is an extraneous - at the start of
>> one address endpoint, 92 occurrences in the distribution.  For example,
>>
>> County 10001 (Kent, DE), record 97
>>            'TLID' => '68092276',
>>            'ARID' => '400404723907',
>>            'TOHN' => 'B9',
>>            'FROMHN' => '-B1',
>>
>> Here the original range, -B1 => B9, gets converted to the reasonably
>> obvious B1 => B9.  After this correction, in all but about 50 cases,
>> mixed from/to addresses agree on all the non-numeric components.
>> One of the exceptions is
>>
>> County 72021 (Bayamon, PR), record 3144
>>            'TLID' => '206027274',
>>            'ARID' => '400583928652',
>>            'TOHN' => 'OO-227',
>>            'FROMHN' => 'O3',
>>
>> This range is so far off the wall that I can't think of any way
>> to adjust it that isn't an outright guess.  But losing 50 address
>> ranges is certainly tolerable.  By far the largest class of what
>> I categorized as errors is mixed addresses differing on two or
>> more numerical components, which occurred about 13200 times.
>> All but 2 of these differed at the first and second numerical
>> component.  A typical instance is
>>
>> County 06037 (Los_Angeles, CA), record 34696
>>            'TLID' => '141604200',
>>            'ARID' => '4001117732741',
>>            'TOHN' => '1318-9',
>>            'FROMHN' => '1316-5',
>>
>> When the second components are the same length, as I believe is
>> usually the case (but I'll have to check), it's not unreasonable
>> to simply drop whatever separates the components, which would
>> yield FROM_num => 13165 and TO_num => 13189, an Odd range
>> having 13 addresses.  Given any odd number in that range,
>> we could reconstruct the "real" address by re-inserting the
>> non-digit components, for example, 13171 => 1317-1.
>> I'll probably do that, and turn the errors into warnings,
>> but it's almost certainly going to mask some real errors, like
>>
>> County 06035 (Lassen, CA), record 3888
>>            'TLID' => '126954239',
>>            'ARID' => '400360492549',
>>            'TOHN' => '708-402',
>>            'FROMHN' => '463-500',
>>
>> It is improbable that there are really 122452 even addresses
>> along the street.  But we've seen preposterously large
>> all-numeric ranges before in this thread, so maybe that
>> should just be a warning of its own.
>>
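Dropping the separator, counting the addresses, and re-inserting the
non-digit components can be sketched as below.  The helper names and
the default limit are invented for illustration; in particular the
value 1000 is an arbitrary placeholder, not a figure from this thread.

```python
import re

ASCII_DIGITS = "0123456789"

def join_digits(hn):
    """Drop every non-digit and read what is left as one integer."""
    return int(re.sub(r"[^0-9]", "", hn))

def range_size(from_num, to_num):
    """Addresses in a same-parity range, counting both endpoints."""
    return abs(to_num - from_num) // 2 + 1

def reinsert(num, template):
    """Pour the digits of a non-negative num back into template's digit
    runs, e.g. 13171 with template '1316-5' becomes '1317-1'."""
    parts = re.findall(r"[0-9]+|[^0-9]+", template)
    width = sum(len(p) for p in parts if p[0] in ASCII_DIGITS)
    digits = str(num).zfill(width)
    if len(digits) != width:
        raise ValueError(f"{num} does not fit the shape of {template}")
    out, i = [], 0
    for p in parts:
        if p[0] in ASCII_DIGITS:
            out.append(digits[i:i + len(p)])
            i += len(p)
        else:
            out.append(p)
    return "".join(out)

def is_suspicious(from_num, to_num, limit=1000):
    """Flag ranges holding more addresses than an arbitrary limit."""
    return range_size(from_num, to_num) > limit
```

For the Los Angeles record the joined endpoints are 13165 and 13189,
and range_size gives 13.  For the Lassen record the joined endpoints
are 463500 and 708402, range_size gives 122452, and is_suspicious
returns True.
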
>> I figured that there wouldn't be any parity errors, since
>> that's so easy to check for, but there were nearly 1700
>> in the distribution.  For example
>>
>> County 55035 (Eau_Claire, WI), record 4214
>>            'TLID' => '600641201',
>>            'ARID' => '400696181628',
>>            'TOHN' => '4253',
>>            'FROMHN' => '4232',
>>
>> The FROMHN is even, the TOHN is odd.  I don't believe this
>> is supposed to happen, but I happen to like the ability to
>> express the concept that all the numbers from 4232 through
>> 4253 can appear.  The US postal service data include a
>> parity character, E, O or B, for Even, Odd or Both,
>> in their address ranges, a scheme I prefer.  However,
>> assuming the endpoints really were intended to have the
>> same parity, perhaps we can use the opposite side,
>> or adjacent edges, to resolve the ambiguity, much as I
>> hope to do for ambiguities about increasing/decreasing
>> ranges.
>>
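A parity check using the E/O/B letters John mentions can be sketched
like this.  Reading mismatched endpoints as "B" (every number in the
range) is one possible interpretation, not something the Tiger
documentation is known to sanction; the function names are made up.

```python
def range_parity(from_num, to_num):
    """'E' or 'O' when both endpoints agree, 'B' when they do not."""
    if from_num % 2 != to_num % 2:
        return "B"
    return "E" if from_num % 2 == 0 else "O"

def addresses_in_range(from_num, to_num):
    """List the candidate address numbers implied by the endpoints."""
    lo, hi = sorted((from_num, to_num))
    step = 1 if range_parity(lo, hi) == "B" else 2
    return list(range(lo, hi + 1, step))
```

For the Eau Claire record, range_parity(4232, 4253) returns "B" and
the expanded list has 22 entries.
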
>> Next step: start linking adjacent edges.  -- jpl
>>
>>
>> _______________________________________________
>> Geodata mailing list
>> Geodata at lists.osgeo.org
>> http://lists.osgeo.org/mailman/listinfo/geodata
>>