[Geodata] [Tiger] A few interesting observations on the Tiger2007fedata
Stephen Woodbridge
woodbri at swoodbridge.com
Wed Jul 23 11:13:17 EDT 2008
Hi John,
These reports and analysis are really great! Sorry that I have not been
able to contribute more, but I have been distracted by a client on
another project. I hope to get back to this and add my two cents as
well.
I have also run into the address number range issues that you mentioned,
and I like your idea of dropping the extraneous characters; I was just
throwing an error in my code.
Can you talk a little bit about your development environment? It looks
like you are using Perl. Are you using a database as a backing store to
help with the processing? MySQL, PostgreSQL, SQLite, other?
Thanks,
-Steve
John P. Linderman wrote:
> As I mentioned earlier in this thread, one aid to disambiguating
> street segments where address ranges are increasing on one side,
> but decreasing on the other, is to link segments for the same
> street together, and see how adjacent ranges line up. But
> sometimes a street enters an intersection from more than two
> directions. See, for example, 40.670528,-74.457660 in Google
> maps, where Chaucer Dr forms a topological "lollipop", with houses
> on both sides of the loop and along the stem. If you come to the
> loop from the stem, and want to choose which edge to take out,
> the address ranges may be helpful in making the choice. For
> example, if the houses along the outside of the loop have odd
> addresses, and those on the inside, even addresses, then the
> "obvious" choice would be the edge preserving the parity of the stem.
> That is, if the stem has odd edges on the right approaching the loop,
> then one would want to turn right, keeping the odd edges on the right
> (outside), and vice versa. Address numbers might also be a guide,
> trying to minimize the gap as one moves from edge to edge. But to
> measure gaps, or establish parity, there must be a "number",
> which is obvious when address ranges are purely numeric,
> but less obvious when there are non-digits involved.
> So, before worrying about linking edges together, I wanted to
> get a handle on the nature of individual ranges from the
> Address Ranges Relationship File.
>
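The edge choice described above can be sketched roughly as follows. This is Python rather than the Perl used in the thread, and it is only an illustration: the function name choose_next_edge, the dictionary keys (right_parity, start_num, end_num, id) and the sample numbers are all hypothetical, not taken from the TIGER data or from John's scripts. It prefers candidate edges that keep the stem's parity on the right, then picks the one with the smallest address gap.

```python
def choose_next_edge(stem, candidates):
    """Pick the outgoing edge that best continues the stem.

    stem and candidates are plain dicts (hypothetical layout):
      'right_parity': 'E' or 'O', parity of addresses on the right side
      'end_num' (stem) / 'start_num' (candidate): numeric address at the
      shared intersection
    """
    # Prefer edges that keep the same parity on the right-hand side.
    same_parity = [c for c in candidates
                   if c["right_parity"] == stem["right_parity"]]
    pool = same_parity or candidates  # fall back to all edges if none match
    # Among those, minimize the jump in address numbers.
    return min(pool, key=lambda c: abs(c["start_num"] - stem["end_num"]))


# Made-up example: the stem ends at 21 with odd numbers on the right.
stem = {"right_parity": "O", "end_num": 21}
candidates = [
    {"id": "a", "right_parity": "E", "start_num": 22},
    {"id": "b", "right_parity": "O", "start_num": 23},
]
chosen = choose_next_edge(stem, candidates)
```

Here parity wins over the smaller gap, so edge "b" is chosen even though edge "a" starts one number closer. Whether parity should always outrank the gap is a design choice the thread leaves open.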
> Here are some summary data for the entire Tiger2007 distribution.
>
> Address range grand totals
> All Digits: 70505410
> Mixed: 1608442
> No Digits: 36
> Total: 72113888
>
> Errors in all address ranges
> Errors: 14937
> Warnings: 742
> Clean: 36041265
> Total: 36056944
>
> There are (exactly) twice as many sample points in the "grand totals"
> summary as the error summary because each range has a TO and FROM
> address. There are very few addresses with no digits (a few each in
> MI, WI and PR). We can ignore them completely without much loss of
> generality. But there are enough with both digits and non-digits
> that we had best do something sensible. So I wrote some scripts to
> "extend" the basic record, attempting to add a FROM_num and TO_num
> field that is always digits only. For All Digits addresses, these
> are the same as the FROMHN and TOHN fields. If FROMHN and TOHN
> agree on everything but a single all-digit subsequence, that
> subsequence is a logical choice for the FROM_num and TO_num fields.
> When FROMHN and TOHN differ elsewhere, the errors and warnings
> start popping up. The difference between an error and a warning
> is not crystal clear; if I can see a simple way to adjust the
> range to "make sense", I do so, and it's a warning. If the range
> is pretty hopeless, it's an error. Here's an example of a warning
> (and the last time I'll include the entire "extended" record,
> where any field name including a lower-case letter is added by me,
> except for "_deleted", which is there in the original record).
>
> County 01005 (Barbour, AL), record 5468
> $record = {
>   'SIDE' => 'L',
>   'TO_num' => '6',
>   '_deleted' => '',
>   'TLID' => '69077027',
>   'TOTYP' => '',
>   'FROMTYP' => 'I',
>   'PLUS4' => '',
>   'ARID' => '400541338686',
>   'warnings' => [
>     'trimmed 1 off 1F2'
>   ],
>   'TOHN' => 'F6',
>   'FROM_parts' => [
>     'F',
>     '2'
>   ],
>   'FROMHN' => '1F2',
>   'FROM_num' => 2,
>   'MTFCC' => 'D1000',
>   'addresses' => 3,
>   'ZIP' => '36027',
>   'TO_parts' => [
>     'F',
>     '6'
>   ],
>   'parity' => 'E'
> };
>
> The original range was 1F2 => F6, a pattern (extraneous digits at the
> front of one address endpoint) that happens often enough (about 650
> times in the entire distribution) that it might (or might not) be worth
> correcting. I simply drop the extraneous digits, with a warning,
> yielding range F2 => F6, 3 addresses with Even parity.
>
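A minimal sketch of the "extend" step described above, in Python rather than the Perl used in the thread. It is not John's script: the helper names (shape, strip_prefix, extend_range) and the error strings are assumptions, and so is the rule of using the last digit run when FROMHN and TOHN are identical. It compares the layout of the two endpoints, drops extraneous leading digits or a leading '-' with a warning when that makes the layouts agree, and then takes the single differing all-digit run as FROM_num and TO_num.

```python
import re

# Extraneous leading digits, or a leading '-', directly before a non-digit.
_PREFIX = re.compile(r"^(\d+|-)(?=\D)")


def shape(hn):
    """Replace every digit run with '9' so layouts can be compared."""
    return re.sub(r"\d+", "9", hn)


def strip_prefix(hn):
    """Return (prefix, rest); prefix is '' when nothing looks extraneous."""
    m = _PREFIX.match(hn)
    return (m.group(1), hn[m.end():]) if m else ("", hn)


def extend_range(fromhn, tohn):
    out = {"FROM_num": None, "TO_num": None, "warnings": [], "errors": []}
    if shape(fromhn) != shape(tohn):
        f_pre, f_rest = strip_prefix(fromhn)
        t_pre, t_rest = strip_prefix(tohn)
        if f_pre and shape(f_rest) == shape(tohn):
            out["warnings"].append(f"trimmed {f_pre} off {fromhn}")
            fromhn = f_rest
        elif t_pre and shape(t_rest) == shape(fromhn):
            out["warnings"].append(f"trimmed {t_pre} off {tohn}")
            tohn = t_rest
        else:
            out["errors"].append("non-numeric components differ")
            return out
    # Layouts now agree, so both lists have the same length.
    f_nums = re.findall(r"\d+", fromhn)
    t_nums = re.findall(r"\d+", tohn)
    if not f_nums:
        out["errors"].append("no digits")
        return out
    differing = [i for i, (a, b) in enumerate(zip(f_nums, t_nums)) if a != b]
    if len(differing) > 1:
        out["errors"].append("two or more numeric components differ")
        return out
    # Assumption: if nothing differs, fall back to the last digit run.
    i = differing[0] if differing else len(f_nums) - 1
    out["FROM_num"], out["TO_num"] = int(f_nums[i]), int(t_nums[i])
    return out


barbour = extend_range("1F2", "F6")
```

For the Barbour County record this yields FROM_num 2, TO_num 6 and the warning "trimmed 1 off 1F2". The same function reports an error for ranges such as O3 => OO-227 or 1316-5 => 1318-9.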
> Another, less common, pattern is an extraneous - at the start of
> one address endpoint, 92 occurrences in the distribution. For example,
>
> County 10001 (Kent, DE), record 97
> 'TLID' => '68092276',
> 'ARID' => '400404723907',
> 'TOHN' => 'B9',
> 'FROMHN' => '-B1',
>
> Here the original range, -B1 => B9, gets converted to the reasonably
> obvious B1 => B9. After this correction, in all but about 50 cases,
> mixed from/to addresses agree on all the non-numeric components.
> One of the exceptions is
>
> County 72021 (Bayamon, PR), record 3144
> 'TLID' => '206027274',
> 'ARID' => '400583928652',
> 'TOHN' => 'OO-227',
> 'FROMHN' => 'O3',
>
> This range is so far off the wall that I can't think of any way
> to adjust it that isn't an outright guess. But losing 50 address
> ranges is certainly tolerable. By far the largest class of what
> I categorized as errors is mixed addresses differing on two or
> more numerical components, which occurred about 13200 times.
> All but 2 of these differed at the first and second numerical
> component. A typical instance is
>
> County 06037 (Los_Angeles, CA), record 34696
> 'TLID' => '141604200',
> 'ARID' => '4001117732741',
> 'TOHN' => '1318-9',
> 'FROMHN' => '1316-5',
>
> When the second components are the same length, as I believe is
> usually the case (but I'll have to check), it's not unreasonable
> to simply drop whatever separates the components, which would
> yield FROM_num => 13165 and TO_num => 13189, an Odd range
> having 13 addresses. Given any odd number in that range,
> we could reconstruct the "real" address by re-inserting the
> non-digit components, for example, 13171 => 1317-1.
> I'll probably do that, and turn the errors into warnings,
> but it's almost certainly going to mask some real errors, like
>
> County 06035 (Lassen, CA), record 3888
> 'TLID' => '126954239',
> 'ARID' => '400360492549',
> 'TOHN' => '708-402',
> 'FROMHN' => '463-500',
>
> It is improbable that there are really 122452 even addresses
> along the street. But we've seen preposterously large
> all-numeric ranges before in this thread, so maybe that
> should just be a warning of its own.
>
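The arithmetic in the two examples above can be checked with a short sketch, again in Python rather than the thread's Perl. The function names (join_components, count_addresses, reinsert) are hypothetical, and the sketch assumes both endpoints share the same layout and the same parity.

```python
import re


def join_components(hn):
    """Drop every non-digit, e.g. '1316-5' -> 13165."""
    return int(re.sub(r"\D", "", hn))


def count_addresses(from_num, to_num):
    """Number of addresses in a same-parity range, stepping by 2."""
    lo, hi = sorted((from_num, to_num))
    return (hi - lo) // 2 + 1


def reinsert(num, template):
    """Rebuild a displayed address from a joined number, using the
    layout of a template endpoint such as '1316-5'."""
    slots = sum(ch.isdigit() for ch in template)
    digits = str(num).zfill(slots)
    if len(digits) != slots:
        raise ValueError("number has more digits than the template")
    it = iter(digits)
    # Fill digit positions left to right; copy separators unchanged.
    return "".join(next(it) if ch.isdigit() else ch for ch in template)


la_from, la_to = join_components("1316-5"), join_components("1318-9")
la_count = count_addresses(la_from, la_to)
lassen_count = count_addresses(join_components("463-500"),
                               join_components("708-402"))
rebuilt = reinsert(13171, "1316-5")
```

The Los Angeles range comes out as 13165 to 13189 with 13 addresses, 13171 maps back to 1317-1, and the Lassen range produces the improbable 122452 addresses mentioned above.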
> I figured that there wouldn't be any parity errors, since
> that's so easy to check for, but there were nearly 1700
> in the distribution. For example
>
> County 55035 (Eau_Claire, WI), record 4214
> 'TLID' => '600641201',
> 'ARID' => '400696181628',
> 'TOHN' => '4253',
> 'FROMHN' => '4232',
>
> The FROMHN is even, the TOHN is odd. I don't believe this
> is supposed to happen, but I happen to like the ability to
> express the concept that all the numbers from 4232 through
> 4253 can appear. The US Postal Service data include a
> parity character, E, O or B, for Even, Odd or Both,
> in their address ranges, a scheme I prefer. However,
> assuming the endpoints really were intended to have the
> same parity, perhaps we can use the opposite side,
> or adjacent edges, to resolve the ambiguity, much as I
> hope to do for ambiguities about increasing/decreasing
> ranges.
>
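A small sketch of the parity check, in Python rather than the thread's Perl. The function names are hypothetical. Treating mismatched endpoints as 'B' (Both) follows the Postal Service style scheme John says he prefers; it is not necessarily what the TIGER data intend, so a real pipeline would probably record a warning alongside it.

```python
def parity(n):
    """'E' for even numbers, 'O' for odd numbers."""
    return "E" if n % 2 == 0 else "O"


def range_parity(from_num, to_num):
    """'E' or 'O' when both endpoints agree, 'B' (both) when they do not."""
    pf, pt = parity(from_num), parity(to_num)
    return pf if pf == pt else "B"


def addresses_in_range(from_num, to_num):
    """Count addresses, stepping by 2 for E/O ranges and by 1 for B."""
    lo, hi = sorted((from_num, to_num))
    step = 1 if range_parity(from_num, to_num) == "B" else 2
    return (hi - lo) // step + 1


eau_claire = range_parity(4232, 4253)
```

For the Eau Claire record the result is 'B', covering all 22 numbers from 4232 through 4253, while the Barbour County range 2 to 6 stays 'E' with 3 addresses.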
> Next step: start linking adjacent edges. -- jpl
>
>
> _______________________________________________
> Geodata mailing list
> Geodata at lists.osgeo.org
> http://lists.osgeo.org/mailman/listinfo/geodata