[Geodata] [Tiger] A few interesting observations on the Tiger2007fedata

Wed Jul 23 11:13:17 EDT 2008

Hi John,

These reports and analyses are really great! Sorry that I have not been 
able to contribute more, but I have been distracted by a client on 
another project. I hope to get back to this and add my two cents as 
well.

I have also run into the address number range issues that you mentioned, 
and I like your idea of dropping the extraneous characters; I was just 
throwing an error in my code.

Can you talk a little bit about your development environment? It looks 
like you are using Perl. Are you using a database as a backing store to 
help with the processing? MySQL, PostgreSQL, SQLite, or something else?

Thanks,
   -Steve

John P. Linderman wrote:
> As I mentioned earlier in this thread, one aid to disambiguating
> street segments where address ranges are increasing on one side,
> but decreasing on the other, is to link segments for the same
> street together, and see how adjacent ranges line up.  But
> sometimes a street enters an intersection from more than two
> directions.  See, for example, 40.670528,-74.457660 in Google
> Maps, where Chaucer Dr forms a topological "lollipop", with houses
> on both sides of the loop and along the stem.  If you come to the
> loop from the stem, and want to choose which edge to take out,
> the address ranges may be helpful in making the choice.  For
> example, if the houses along the outside of the loop have odd
> addresses, and those on the inside, even addresses, then the
> "obvious" choice would be the edge preserving the parity of the stem.
> That is, if the stem has odd edges on the right approaching the loop,
> then one would want to turn right, keeping the odd edges on the right
> (outside), and vice versa.  Address numbers might also be a guide,
> trying to minimize the gap as one moves from edge to edge.  But to
> measure gaps, or establish parity, there must be a "number",
> which is obvious when address ranges are purely numeric,
> but less obvious when there are non-digits involved.
> So, before worrying about linking edges together, I wanted to
> get a handle on the nature of individual ranges from the
> Address Ranges Relationship File.
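
A minimal sketch of the tie-breaking idea described above: prefer the exit edge that keeps the stem's parity on the right, then prefer the smallest gap in address numbers. Python is used purely for illustration (the scripts in this thread appear to be Perl). The Edge class, its fields, and the sample numbers are all hypothetical and are not taken from the TIGER files.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    name: str          # hypothetical label for a candidate exit edge
    right_parity: str  # 'E' or 'O': parity of addresses on the right-hand side
    near_num: int      # address number at the end touching the intersection

def choose_exit(stem_right_parity, stem_end_num, candidates):
    """Prefer an edge that keeps the stem's right-side parity,
    then break ties by the smallest gap in address numbers."""
    return min(
        candidates,
        key=lambda e: (e.right_parity != stem_right_parity,
                       abs(e.near_num - stem_end_num)),
    )

# Made-up numbers: the stem ends at 19 with odd addresses on the right.
loop_edges = [Edge("turn_left", "E", 20), Edge("turn_right", "O", 21)]
best = choose_exit("O", 19, loop_edges)
print(best.name)
```

Here parity is ranked ahead of the gap, so the odd-on-the-right edge wins even though its gap is slightly larger. That ordering is a design assumption, not something the thread settles.
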
> 
> Here are some summary data for the entire Tiger2007 distribution.
> 
>     Address range grand totals
> 	All Digits: 70505410
> 	     Mixed:  1608442
> 	 No Digits:       36
> 	     Total: 72113888
> 
>     Errors in all address ranges
> 	  Errors:    14937
> 	Warnings:      742
> 	   Clean: 36041265
> 	   Total: 36056944
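
The three categories in the grand totals can be reproduced with a test like the following. This is a sketch in Python for illustration only, not the script that produced the numbers above. The function name is made up, and how blank fields were counted in the original tally is unknown.

```python
import re

def classify_house_number(hn: str) -> str:
    """Assign a FROMHN/TOHN value to one of the three summary categories."""
    if re.fullmatch(r"[0-9]+", hn):
        return "All Digits"
    if re.search(r"[0-9]", hn):
        return "Mixed"
    # Assumption: anything without an ASCII digit, including an empty
    # string, lands here.
    return "No Digits"

# "ABC" is an invented sample; the other values appear in this thread.
for hn in ["4232", "1F2", "1316-5", "ABC"]:
    print(hn, "->", classify_house_number(hn))
```
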
> 
> There are (exactly) twice as many sample points in the "grand totals"
> summary as the error summary because each range has a TO and FROM
> address.  There are very few addresses with no digits (a few each in
> MI, WI and PR).  We can ignore them completely without much loss of
> generality.  But there are enough with both digits and non-digits
> that we had best do something sensible.  So I wrote some scripts to
> "extend" the basic record, attempting to add a FROM_num and TO_num
> field that is always digits only.  For All Digits addresses, these
> are the same as the FROMHN and TOHN fields.  If FROMHN and TOHN
> agree on everything but a single all-digit subsequence, that
> subsequence is a logical choice for the FROM_num and TO_num fields.
> When FROMHN and TOHN differ elsewhere, the errors and warnings
> start popping up.  The difference between an error and a warning
> is not crystal clear; if I can see a simple way to adjust the
> range to "make sense", I do so, and it's a warning.  If the range
> is pretty hopeless, it's an error.  Here's an example of a warning
> (and the last time I'll include the entire "extended" record,
> where any field name that includes a lower-case letter is added by me,
> except for "_deleted", which is there in the original record).
> 
> County 01005 (Barbour, AL), record 5468
> $record = {
>             'SIDE' => 'L',
>             'TO_num' => '6',
>             '_deleted' => '',
>             'TLID' => '69077027',
>             'TOTYP' => '',
>             'FROMTYP' => 'I',
>             'PLUS4' => '',
>             'ARID' => '400541338686',
>             'warnings' => [
>                             'trimmed 1 off 1F2'
>                           ],
>             'TOHN' => 'F6',
>             'FROM_parts' => [
>                               'F',
>                               '2'
>                             ],
>             'FROMHN' => '1F2',
>             'FROM_num' => 2,
>             'MTFCC' => 'D1000',
>             'addresses' => 3,
>             'ZIP' => '36027',
>             'TO_parts' => [
>                             'F',
>                             '6'
>                           ],
>             'parity' => 'E'
>           };
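
The rule for picking FROM_num and TO_num (endpoints that agree on everything but a single all-digit subsequence) might look roughly like this. It is a Python sketch for illustration, not the Perl that generated the record above. The helper names are hypothetical, and the handling of identical endpoints is my own assumption.

```python
import re

def split_parts(hn):
    """Split a house number into alternating digit and non-digit runs."""
    return re.findall(r"[0-9]+|[^0-9]+", hn)

def is_digits(part):
    return re.fullmatch(r"[0-9]+", part) is not None

def numeric_endpoints(from_hn, to_hn):
    """Return (from_num, to_num) when the endpoints differ in exactly one
    all-digit run; return None when no such single run exists."""
    fp, tp = split_parts(from_hn), split_parts(to_hn)
    if len(fp) != len(tp):
        return None
    diffs = [i for i, (a, b) in enumerate(zip(fp, tp)) if a != b]
    if not diffs:
        # Assumption: for identical endpoints, use the only digit run, if unique.
        diffs = [i for i, p in enumerate(fp) if is_digits(p)]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    if not (is_digits(fp[i]) and is_digits(tp[i])):
        return None
    return int(fp[i]), int(tp[i])

print(split_parts("F2"))                       # parts of the trimmed FROMHN
print(numeric_endpoints("F2", "F6"))           # single differing digit run
print(numeric_endpoints("1316-5", "1318-9"))   # two differing runs
```

Ranges that return None here are the ones that would need the error and warning handling discussed in the rest of the message.
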
> 
> The original range was 1F2 => F6, a pattern (extraneous digits at the
> front of one address endpoint) that happens often enough (about 650
> times in the entire distribution) that it might (or might not) be worth
> correcting.  I simply drop the extraneous digits, with a warning,
> yielding range F2 => F6, 3 addresses with Even parity.
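
For the 1F2 => F6 pattern, one way to express the trim and the resulting address count is sketched below. This is Python for illustration only. The function names are hypothetical, and the pattern matched here (stray digits, then a non-digit prefix shared with the other endpoint, then digits) is my guess at the rule rather than a description of the actual script.

```python
import re

def trim_extraneous_prefix(hn, other):
    """Drop leading digits from hn when what follows starts with the same
    non-digit prefix as the other endpoint. Returns (value, warning)."""
    m = re.fullmatch(r"([0-9]+)([^0-9]+)([0-9]+)", hn)
    o = re.fullmatch(r"([^0-9]+)([0-9]+)", other)
    if m and o and m.group(2) == o.group(1):
        return m.group(2) + m.group(3), f"trimmed {m.group(1)} off {hn}"
    return hn, None

def count_addresses(from_num, to_num):
    """Number of addresses in a same-parity range, stepping by 2."""
    return abs(to_num - from_num) // 2 + 1

print(trim_extraneous_prefix("1F2", "F6"))
print(count_addresses(2, 6))
```
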
> 
> Another, less common, pattern is an extraneous - at the start of
> one address endpoint, 92 occurrences in the distribution.  For example,
> 
> County 10001 (Kent, DE), record 97
>             'TLID' => '68092276',
>             'ARID' => '400404723907',
>             'TOHN' => 'B9',
>             'FROMHN' => '-B1',
> 
> Here the original range, -B1 => B9, gets converted to the reasonably
> obvious B1 => B9.  After this correction, in all but about 50 cases,
> mixed from/to addresses agree on all the non-numeric components.
> One of the exceptions is
> 
> County 72021 (Bayamon, PR), record 3144
>             'TLID' => '206027274',
>             'ARID' => '400583928652',
>             'TOHN' => 'OO-227',
>             'FROMHN' => 'O3',
> 
> This range is so far off the wall that I can't think of any way
> to adjust it that isn't an outright guess.  But losing 50 address
> ranges is certainly tolerable.  By far the largest class of what
> I categorized as errors is mixed addresses differing on two or
> more numerical components, which occurred about 13200 times.
> All but 2 of these differed at the first and second numerical
> component.  A typical instance is
> 
> County 06037 (Los_Angeles, CA), record 34696
>             'TLID' => '141604200',
>             'ARID' => '4001117732741',
>             'TOHN' => '1318-9',
>             'FROMHN' => '1316-5',
> 
> When the second components are the same length, as I believe is
> usually the case (but I'll have to check), it's not unreasonable
> to simply drop whatever separates the components, which would
> yield FROM_num => 13165 and TO_num => 13189, an Odd range
> having 13 addresses.  Given any odd number in that range,
> we could reconstruct the "real" address by re-inserting the
> non-digit components, for example, 13171 => 1317-1.
> I'll probably do that, and turn the errors into warnings,
> but it's almost certainly going to mask some real errors, like
> 
> County 06035 (Lassen, CA), record 3888
>             'TLID' => '126954239',
>             'ARID' => '400360492549',
>             'TOHN' => '708-402',
>             'FROMHN' => '463-500',
> 
> It is improbable that there are really 122452 even addresses
> along the street.  But we've seen preposterously large
> all-numeric ranges before in this thread, so maybe that
> should just be a warning of its own.
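
The separator-dropping idea and the size check can be worked through with the two examples above. This is a Python sketch for illustration. The function names and the MAX_PLAUSIBLE threshold are hypothetical, and the reconstruction assumes the digit runs keep the same lengths as in the template endpoint.

```python
import re

def join_digits(hn):
    """Drop every non-digit character and read the rest as one integer.
    Raises ValueError if hn contains no digits at all."""
    return int(re.sub(r"[^0-9]", "", hn))

def reinsert_separators(num, template):
    """Rebuild a mixed address from num using template's non-digit runs
    and the lengths of template's digit runs."""
    parts = re.findall(r"[0-9]+|[^0-9]+", template)
    digits = str(num)
    if len(digits) != sum(len(p) for p in parts if p[0] in "0123456789"):
        raise ValueError("digit count does not match template")
    out, pos = [], 0
    for p in parts:
        if p[0] in "0123456789":
            out.append(digits[pos:pos + len(p)])
            pos += len(p)
        else:
            out.append(p)
    return "".join(out)

def same_parity_count(from_num, to_num):
    """Number of addresses in a same-parity range, stepping by 2."""
    return abs(to_num - from_num) // 2 + 1

MAX_PLAUSIBLE = 1000  # hypothetical cutoff for flagging a range as suspicious

for from_hn, to_hn in [("1316-5", "1318-9"), ("463-500", "708-402")]:
    n = same_parity_count(join_digits(from_hn), join_digits(to_hn))
    flag = "suspicious" if n > MAX_PLAUSIBLE else "ok"
    print(from_hn, "=>", to_hn, n, flag)

print(reinsert_separators(13171, "1316-5"))
```

Going through an integer loses leading zeros, so a second component such as "05" would need the digits kept as strings instead.
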
> 
> I figured that there wouldn't be any parity errors, since
> that's so easy to check for, but there were nearly 1700
> in the distribution.  For example
> 
> County 55035 (Eau_Claire, WI), record 4214
>             'TLID' => '600641201',
>             'ARID' => '400696181628',
>             'TOHN' => '4253',
>             'FROMHN' => '4232',
> 
> The FROMHN is even, the TOHN is odd.  I don't believe this
> is supposed to happen, but I happen to like the ability to
> express the concept that all the numbers from 4232 through
> 4253 can appear.  The US Postal Service data include a
> parity character, E, O or B, for Even, Odd or Both,
> in their address ranges, a scheme I prefer.  However,
> assuming the endpoints really were intended to have the
> same parity, perhaps we can use the opposite side,
> or adjacent edges, to resolve the ambiguity, much as I
> hope to do for ambiguities about increasing/decreasing
> ranges.
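
The E/O/B scheme mentioned above could be applied to a numeric range along these lines. This is a Python sketch for illustration. Treating a mixed-parity pair of endpoints as "B" reflects the preference expressed in the message, not anything the TIGER files themselves promise, and the function names are made up.

```python
def range_parity(from_num, to_num):
    """Return 'E', 'O', or 'B' based on the parity of the two endpoints."""
    if from_num % 2 != to_num % 2:
        return "B"  # assumption: mixed endpoints mean every number may appear
    return "E" if from_num % 2 == 0 else "O"

def expand_range(from_num, to_num):
    """List the address numbers implied by the endpoints and their parity."""
    lo, hi = sorted((from_num, to_num))
    step = 1 if range_parity(from_num, to_num) == "B" else 2
    return list(range(lo, hi + 1, step))

print(range_parity(4232, 4253), len(expand_range(4232, 4253)))
print(range_parity(2, 6), expand_range(2, 6))
```

If the endpoints were really meant to share a parity, this reading over-generates addresses, which is where the opposite side or adjacent edges would have to settle it.
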
> 
> Next step: start linking adjacent edges.  -- jpl
> 
> 