[Geodata] [Tiger] A few interesting observations on the Tiger2007fe data
Joe Bussell
joe.bussell at gmail.com
Wed Jul 23 16:54:13 EDT 2008
Have they added z-ordering to the Tiger data? I am still hoping to build a
routing application based on Tiger data, which requires that I can
disambiguate overpasses from the roads they "touch".
Cordially,
Joe Bussell
On Wed, Jul 23, 2008 at 8:13 AM, Stephen Woodbridge <woodbri at swoodbridge.com>
wrote:
> Hi John,
>
> These reports and analyses are really great! Sorry that I have not been
> able to contribute more, but I have been distracted by a client on another
> project. I hope to get back to this and add my two cents into it also.
>
> I have also run into the address number range issues that you mentioned, and
> I like your idea of dropping the extraneous characters; I was just throwing
> an error in my code.
>
> Can you talk a little bit about your development environment? It looks like
> you are using Perl. Are you using a database as a backing store to help with
> the processing? MySQL, PostgreSQL, SQLite, other?
>
> Thanks,
> -Steve
>
> John P. Linderman wrote:
>
>> As I mentioned earlier in this thread, one aid to disambiguating
>> street segments where address ranges are increasing on one side,
>> but decreasing on the other, is to link segments for the same
>> street together, and see how adjacent ranges line up. But
>> sometimes a street enters an intersection from more than two
>> directions. See, for example, 40.670528,-74.457660 in Google
>> maps, where Chaucer Dr forms a topological "lollipop", with houses
>> on both sides of the loop and along the stem. If you come to the
>> loop from the stem, and want to choose which edge to take out,
>> the address ranges may be helpful in making the choice. For
>> example, if the houses along the outside of the loop have odd
>> addresses, and those on the inside, even addresses, then the
>> "obvious" choice would be the edge preserving the parity of the stem.
>> That is, if the stem has odd addresses on the right approaching the loop,
>> then one would want to turn right, keeping the odd addresses on the right
>> (outside), and vice versa. Address numbers might also be a guide,
>> trying to minimize the gap as one moves from edge to edge. But to
>> measure gaps, or establish parity, there must be a "number",
>> which is obvious when address ranges are purely numeric,
>> but less obvious when there are non-digits involved.
>> So, before worrying about linking edges together, I wanted to
>> get a handle on the nature of individual ranges from the
>> Address Ranges Relationship File.
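The edge-selection heuristic described above (preserve the parity on the right-hand side, then prefer the smallest gap in address numbers) could be sketched roughly as follows. This is only an illustration in Python, not the scripts discussed in the thread; the `Edge` class, its fields, and the house numbers are all made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    name: str
    right_parity: str  # 'O' or 'E': parity of addresses on the right, in travel direction
    first_num: int     # house number nearest the intersection (hypothetical field)

def choose_next_edge(stem_right_parity, stem_last_num, candidates):
    """Prefer edges that keep the same parity on the right; break ties by smallest gap."""
    def key(edge):
        parity_mismatch = edge.right_parity != stem_right_parity  # False sorts first
        gap = abs(edge.first_num - stem_last_num)
        return (parity_mismatch, gap)
    return min(candidates, key=key)

# Invented numbers: the stem ends at 19 with odd addresses on the right.
turn_right = Edge("loop, turning right", "O", 21)
turn_left = Edge("loop, turning left", "E", 20)
print(choose_next_edge("O", 19, [turn_left, turn_right]).name)
# prints "loop, turning right"
```

Parity is treated as the stronger signal here, with the gap only breaking ties; whether that ordering is right for real data is an open question.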
>>
>> Here are some summary data for the entire Tiger2007 distribution.
>>
>> Address range grand totals
>> All Digits: 70505410
>> Mixed: 1608442
>> No Digits: 36
>> Total: 72113888
>>
>> Errors in all address ranges
>> Errors: 14937
>> Warnings: 742
>> Clean: 36041265
>> Total: 36056944
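For reference, the three categories in the grand totals could be assigned to each FROMHN or TOHN value along these lines. This is a minimal Python sketch under the assumption that "digits" means ASCII 0-9; the function name and the sample value "A" are invented for illustration.

```python
import re
from collections import Counter

def classify_endpoint(house_number):
    """Return the summary category for one FROMHN or TOHN value."""
    if re.fullmatch(r"[0-9]+", house_number):
        return "All Digits"
    if re.search(r"[0-9]", house_number):
        return "Mixed"
    return "No Digits"  # also covers an empty string in this sketch

# Endpoints quoted in this thread, plus "A" as a made-up no-digit value.
samples = ["4232", "4253", "F6", "1316-5", "OO-227", "A"]
tally = Counter(classify_endpoint(hn) for hn in samples)
print(tally)

# Consistency check on the posted totals: every range contributes two endpoints.
endpoint_total = 70505410 + 1608442 + 36
range_total = 14937 + 742 + 36041265
assert endpoint_total == 72113888 == 2 * range_total
```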
>>
>> There are (exactly) twice as many sample points in the "grand totals"
>> summary as the error summary because each range has a TO and FROM
>> address. There are very few addresses with no digits (a few each in
>> MI, WI and PR). We can ignore them completely without much loss of
>> generality. But there are enough with both digits and non-digits
>> that we had best do something sensible. So I wrote some scripts to
>> "extend" the basic record, attempting to add a FROM_num and TO_num
>> field that is always digits only. For All Digits addresses, these
>> are the same as the FROMHN and TOHN fields. If FROMHN and TOHN
>> agree on everything but a single all-digit subsequence, that
>> subsequence is a logical choice for the FROM_num and TO_num fields.
>> When FROMHN and TOHN differ elsewhere, the errors and warnings
>> start popping up. The difference between an error and a warning
>> is not crystal clear; if I can see a simple way to adjust the
>> range to "make sense", I do so, and it's a warning. If the range
>> is pretty hopeless, it's an error. Here's an example of a warning
>> (and the last time I'll include the entire "extended" record,
>> where any field name including a lowercase letter is added by me,
>> except for "_deleted", which is there in the original record).
>>
>> County 01005 (Barbour, AL), record 5468
>> $record = {
>>   'SIDE' => 'L',
>>   'TO_num' => '6',
>>   '_deleted' => '',
>>   'TLID' => '69077027',
>>   'TOTYP' => '',
>>   'FROMTYP' => 'I',
>>   'PLUS4' => '',
>>   'ARID' => '400541338686',
>>   'warnings' => [
>>     'trimmed 1 off 1F2'
>>   ],
>>   'TOHN' => 'F6',
>>   'FROM_parts' => [
>>     'F',
>>     '2'
>>   ],
>>   'FROMHN' => '1F2',
>>   'FROM_num' => 2,
>>   'MTFCC' => 'D1000',
>>   'addresses' => 3,
>>   'ZIP' => '36027',
>>   'TO_parts' => [
>>     'F',
>>     '6'
>>   ],
>>   'parity' => 'E'
>> };
>>
>> The original range was 1F2 => F6, a pattern (extraneous digits at the
>> front of one address endpoint) that happens often enough (about 650
>> times in the entire distribution) that it might (or might not) be worth
>> correcting. I simply drop the extraneous digits, with a warning,
>> yielding range F2 => F6, 3 addresses with Even parity.
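One way the derivation of FROM_num and TO_num might look, written as a small Python sketch rather than the Perl scripts mentioned in the thread. The function names and the returned dictionary are assumptions made for illustration; only the field names FROM_num, TO_num, parity, addresses and warnings come from the record above. The sketch handles the "single differing all-digit subsequence" rule and the extraneous-leading-digits case, and raises an error for anything else.

```python
import re

DIGITS = re.compile(r"[0-9]+")

def split_parts(hn):
    """Split a house number into alternating digit and non-digit runs."""
    return re.findall(r"[0-9]+|[^0-9]+", hn)

def numeric_range(fromhn, tohn):
    """Return FROM_num, TO_num, parity, address count and warnings, or raise ValueError."""
    warnings = []
    parts = {"FROM": split_parts(fromhn), "TO": split_parts(tohn)}
    # An extra leading digit run on one endpoint (1F2 vs F6) is dropped with a warning.
    for longer, shorter, raw in (("FROM", "TO", fromhn), ("TO", "FROM", tohn)):
        if (len(parts[longer]) == len(parts[shorter]) + 1
                and DIGITS.fullmatch(parts[longer][0])):
            warnings.append(f"trimmed {parts[longer][0]} off {raw}")
            parts[longer] = parts[longer][1:]
            break
    fp, tp = parts["FROM"], parts["TO"]
    if len(fp) != len(tp):
        raise ValueError(f"endpoints do not line up: {fromhn} => {tohn}")
    differing = [i for i in range(len(fp)) if fp[i] != tp[i]]
    digit_runs = [i for i, p in enumerate(fp) if DIGITS.fullmatch(p)]
    if not differing and len(digit_runs) == 1:
        differing = digit_runs  # identical endpoints with a single digit run
    if len(differing) != 1:
        raise ValueError(f"no single differing component: {fromhn} => {tohn}")
    i = differing[0]
    if not (DIGITS.fullmatch(fp[i]) and DIGITS.fullmatch(tp[i])):
        raise ValueError(f"non-numeric components differ: {fromhn} => {tohn}")
    lo, hi = int(fp[i]), int(tp[i])
    if lo % 2 != hi % 2:
        raise ValueError(f"parity mismatch: {fromhn} => {tohn}")
    return {
        "FROM_num": lo,
        "TO_num": hi,
        "parity": "E" if lo % 2 == 0 else "O",
        "addresses": abs(hi - lo) // 2 + 1,
        "warnings": warnings,
    }

print(numeric_range("1F2", "F6"))
# {'FROM_num': 2, 'TO_num': 6, 'parity': 'E', 'addresses': 3, 'warnings': ['trimmed 1 off 1F2']}
```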
>>
>> Another, less common, pattern is an extraneous "-" at the start of
>> one address endpoint, with 92 occurrences in the distribution. For example,
>>
>> County 10001 (Kent, DE), record 97
>> 'TLID' => '68092276',
>> 'ARID' => '400404723907',
>> 'TOHN' => 'B9',
>> 'FROMHN' => '-B1',
>>
>> Here the original range, -B1 => B9, gets converted to the reasonably
>> obvious B1 => B9. After this correction, in all but about 50 cases,
>> mixed from/to addresses agree on all the non-numeric components.
>> One of the exceptions is
>>
>> County 72021 (Bayamon, PR), record 3144
>> 'TLID' => '206027274',
>> 'ARID' => '400583928652',
>> 'TOHN' => 'OO-227',
>> 'FROMHN' => 'O3',
>>
>> This range is so far off the wall that I can't think of any way
>> to adjust it that isn't an outright guess. But losing 50 address
>> ranges is certainly tolerable. By far the largest class of what
>> I categorized as errors is mixed addresses differing on two or
>> more numerical components, which occurred about 13200 times.
>> All but 2 of these differed at the first and second numerical
>> component. A typical instance is
>>
>> County 06037 (Los_Angeles, CA), record 34696
>> 'TLID' => '141604200',
>> 'ARID' => '4001117732741',
>> 'TOHN' => '1318-9',
>> 'FROMHN' => '1316-5',
>>
>> When the second components are the same length, as I believe is
>> usually the case (but I'll have to check), it's not unreasonable
>> to simply drop whatever separates the components, which would
>> yield FROM_num => 13165 and TO_num => 13189, an Odd range
>> having 13 addresses. Given any odd number in that range,
>> we could reconstruct the "real" address by re-inserting the
>> non-digit components, for example, 13171 => 1317-1.
>> I'll probably do that, and turn the errors into warnings,
>> but it's almost certainly going to mask some real errors, like
>>
>> County 06035 (Lassen, CA), record 3888
>> 'TLID' => '126954239',
>> 'ARID' => '400360492549',
>> 'TOHN' => '708-402',
>> 'FROMHN' => '463-500',
>>
>> It is improbable that there are really 122452 even addresses
>> along the street. But we've seen preposterously large
>> all-numeric ranges before in this thread, so maybe that
>> should just be a warning of its own.
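The "drop the separator, then re-insert it" idea can be illustrated with a short Python sketch. The helper names and the tuple "shape" representation are invented for this example, and it assumes both endpoints share the same layout of digit-run widths and separators, which the thread notes still has to be checked against the real data.

```python
import re

def split_parts(hn):
    """Split a house number into alternating digit and non-digit runs."""
    return re.findall(r"[0-9]+|[^0-9]+", hn)

def is_digits(part):
    return re.fullmatch(r"[0-9]+", part) is not None

def collapse(hn):
    """'1316-5' -> (13165, (4, '-', 1)): the number plus the layout needed to undo it."""
    parts = split_parts(hn)
    shape = tuple(len(p) if is_digits(p) else p for p in parts)
    return int("".join(p for p in parts if is_digits(p))), shape

def reinsert(number, shape):
    """Rebuild the original-style address from a collapsed number and a layout."""
    width = sum(s for s in shape if isinstance(s, int))
    digits = str(number).zfill(width)
    if len(digits) != width:
        raise ValueError(f"{number} does not fit layout {shape}")
    out, pos = [], 0
    for s in shape:
        if isinstance(s, int):
            out.append(digits[pos:pos + s])
            pos += s
        else:
            out.append(s)
    return "".join(out)

def collapsed_range(fromhn, tohn):
    """Return (FROM_num, TO_num, address count, layout) for a same-layout range."""
    lo, shape_from = collapse(fromhn)
    hi, shape_to = collapse(tohn)
    if shape_from != shape_to:
        raise ValueError(f"layouts differ: {fromhn} => {tohn}")
    if lo % 2 != hi % 2:
        raise ValueError(f"parity mismatch: {fromhn} => {tohn}")
    return lo, hi, abs(hi - lo) // 2 + 1, shape_from

lo, hi, count, shape = collapsed_range("1316-5", "1318-9")
print(lo, hi, count)            # 13165 13189 13
print(reinsert(13171, shape))   # 1317-1
print(collapsed_range("463-500", "708-402")[2])  # 122452
```

A count like the last one is what a "preposterously large range" warning would need to catch; any threshold for that would be a judgment call not settled in the thread.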
>>
>> I figured that there wouldn't be any parity errors, since
>> that's so easy to check for, but there were nearly 1700
>> in the distribution. For example
>>
>> County 55035 (Eau_Claire, WI), record 4214
>> 'TLID' => '600641201',
>> 'ARID' => '400696181628',
>> 'TOHN' => '4253',
>> 'FROMHN' => '4232',
>>
>> The FROMHN is even, the TOHN is odd. I don't believe this
>> is supposed to happen, but I happen to like the ability to
>> express the concept that all the numbers from 4232 through
>> 4253 can appear. The US Postal Service data include a
>> parity character, E, O or B, for Even, Odd or Both,
>> in their address ranges, a scheme I prefer. However,
>> assuming the endpoints really were intended to have the
>> same parity, perhaps we can use the opposite side,
>> or adjacent edges, to resolve the ambiguity, much as I
>> hope to do for ambiguities about increasing/decreasing
>> ranges.
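A small Python sketch of the E/O/B scheme mentioned above. Treating mismatched endpoints as "Both" is the interpretation floated in the message, not something the Tiger data itself specifies, and the function names here are invented.

```python
def parity_code(from_num, to_num):
    """'E' or 'O' when both endpoints share a parity, 'B' (Both) when they differ."""
    if from_num % 2 != to_num % 2:
        return "B"
    return "E" if from_num % 2 == 0 else "O"

def addresses_in_range(from_num, to_num):
    """Candidate house numbers implied by the endpoints under the E/O/B reading."""
    lo, hi = sorted((from_num, to_num))
    step = 1 if parity_code(from_num, to_num) == "B" else 2
    return range(lo, hi + 1, step)

print(parity_code(4232, 4253))              # B
print(len(addresses_in_range(4232, 4253)))  # 22
print(list(addresses_in_range(2, 6)))       # [2, 4, 6]
```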
>>
>> Next step: start linking adjacent edges. -- jpl
>>
>>
>> _______________________________________________
>> Geodata mailing list
>> Geodata at lists.osgeo.org
>> http://lists.osgeo.org/mailman/listinfo/geodata
>>