[Geodata] [Tiger] A few interesting observations on the Tiger2007fedata

Stephen Woodbridge woodbri at swoodbridge.com
Wed Jul 23 18:14:38 EDT 2008


No, that is not there, and I do not think it is likely that Census will
add it, but you never know. They were/are supposed to be adding node IDs
to the data so that we would have persistent node IDs, which would help
anyone trying to use the data in graph models.

You might want to look at OpenStreetMap. They have loaded all the
Tiger2006se data, and I think people have started updating that data. I
know in other countries they have added zlevels to the data and
people are using it for routing. But my guess is that the US in general
is a ways away from having that close to done.

I have thought about this quite a lot and think that it might be
possible to write a graph traversal algorithm that would analyze things
like interstates (basically A1? roads) and, based on geometry, ramps,
and intersecting roads, assign some zlevels and oneway flags, which
would be a huge start at solving this problem and might make the data
marginally useful. But like most of us, I have not had the time to
tackle half the ideas I have. :)

-Steve

Joe Bussell wrote:
> Have they added z-ordering to the Tiger data?  I am still hoping to get 
> a routing application based on Tiger data, which requires that I can 
> disambiguate the overpasses from the roads they "touch"
> 
> Cordially,
> 
> Joe Bussell
> 
> 
> On Wed, Jul 23, 2008 at 8:13 AM, Stephen Woodbridge
> <woodbri at swoodbridge.com> wrote:
> 
>     Hi John,
> 
>     These reports and analysis are really great! Sorry that I have not
>     been able to contribute more, but I have been distracted by a client
>     on another project. I hope to get back to this and add my two cents
>     into it also.
> 
>     I have also run into the address number range issues that you
>     mentioned, and I like your idea of dropping the extraneous characters;
>     I was just throwing an error in my code.
> 
>     Can you talk a little bit about your development environment? It
>     looks like you are using Perl. Are you using a database as a backing
>     store to help with the processing? MySQL, PostgreSQL, SQLite, other?
> 
>     Thanks,
> 
>      -Steve
> 
>     John P. Linderman wrote:
> 
>         As I mentioned earlier in this thread, one aid to disambiguating
>         street segments where address ranges are increasing on one side,
>         but decreasing on the other, is to link segments for the same
>         street together, and see how adjacent ranges line up.  But
>         sometimes a street enters an intersection from more than two
>         directions.  See, for example, 40.670528,-74.457660 in google
>         maps, where Chaucer Dr forms a topological "lollipop", with houses
>         on both sides of the loop and along the stem.  If you come to the
>         loop from the stem, and want to choose which edge to take out,
>         the address ranges may be helpful in making the choice.  For
>         example, if the houses along the outside of the loop have odd
>         addresses, and those on the inside, even addresses, then the
>         "obvious" choice would be the edge preserving the parity of the stem.
>         That is, if the stem has odd addresses on the right approaching the
>         loop, then one would want to turn right, keeping the odd addresses
>         on the right (outside), and vice versa.  Address numbers might also
>         be a guide,
>         trying to minimize the gap as one moves from edge to edge.  But to
>         measure gaps, or establish parity, there must be a "number",
>         which is obvious when address ranges are purely numeric,
>         but less obvious when there are non-digits involved.
>         So, before worrying about linking edges together, I wanted to
>         get a handle on the nature of individual ranges from the
>         Address Ranges Relationship File.
> 
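The edge-choice heuristic described above could be sketched roughly as follows. This is only an illustration, written in Python rather than the Perl the thread's scripts appear to use. The `Edge` tuple, `choose_next_edge`, and the sample numbers are all hypothetical, and the sketch assumes each candidate edge is oriented so that `from_num` is the end nearest the junction.

```python
from typing import NamedTuple, Optional, Sequence

class Edge(NamedTuple):
    """Hypothetical candidate edge: the numeric range on one side."""
    name: str
    from_num: int  # assumed to be the end nearest the junction
    to_num: int

def parity(n: int) -> str:
    return "E" if n % 2 == 0 else "O"

def choose_next_edge(last_num: int,
                     candidates: Sequence[Edge]) -> Optional[Edge]:
    """Prefer edges that preserve parity; break ties by smallest gap."""
    same = [e for e in candidates if parity(e.from_num) == parity(last_num)]
    pool = same or list(candidates)  # fall back to gap alone
    if not pool:
        return None
    return min(pool, key=lambda e: abs(e.from_num - last_num))

# Made-up lollipop: the stem's right side ends at 19 (odd).
loop_edges = [Edge("outer", 21, 39), Edge("inner", 20, 38)]
best = choose_next_edge(19, loop_edges)
```

Here `best` is the "outer" edge, since it keeps the odd numbers going with the smallest jump. Real data would also need to handle edges digitized in the opposite direction.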
>         Here are some summary data for the entire Tiger2007 distribution.
> 
>            Address range grand totals
>                All Digits: 70505410
>                     Mixed:  1608442
>                 No Digits:       36
>                     Total: 72113888
> 
>            Errors in all address ranges
>                  Errors:    14937
>                Warnings:      742
>                   Clean: 36041265
>                   Total: 36056944
> 
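As a rough illustration of how the three buckets in the grand totals could be computed, here is a minimal Python sketch. The `classify` function and the sample list are assumptions made for illustration (the endpoints are ones quoted in this message, plus a made-up "ABC"); they are not the scripts that produced the numbers above.

```python
from collections import Counter

ASCII_DIGITS = set("0123456789")

def classify(house_number: str) -> str:
    """Bucket an address endpoint the way the grand-totals table does."""
    has_digit = any(c in ASCII_DIGITS for c in house_number)
    has_other = any(c not in ASCII_DIGITS for c in house_number)
    if has_digit and not has_other:
        return "All Digits"
    if has_digit:
        return "Mixed"
    return "No Digits"  # assumption: an empty string would also land here

# Endpoints quoted in this message, plus a made-up "ABC".
samples = ["4232", "4253", "1F2", "F6", "-B1", "OO-227", "ABC"]
counts = Counter(classify(s) for s in samples)

# Sanity check on the published totals: the buckets sum to the total,
# which is exactly twice the number of ranges in the error summary.
assert 70505410 + 1608442 + 36 == 72113888
assert 2 * (14937 + 742 + 36041265) == 72113888
```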
>         There are (exactly) twice as many sample points in the "grand totals"
>         summary as the error summary because each range has a TO and FROM
>         address.  There are very few addresses with no digits (a few each in
>         MI, WI and PR).  We can ignore them completely without much loss of
>         generality.  But there are enough with both digits and non-digits
>         that we had best do something sensible.  So I wrote some scripts to
>         "extend" the basic record, attempting to add a FROM_num and TO_num
>         field that is always digits only.  For All Digits addresses, these
>         are the same as the FROMHN and TOHN fields.  If FROMHN and TOHN
>         agree on everything but a single all-digit subsequence, that
>         subsequence is a logical choice for the FROM_num and TO_num fields.
>         When FROMHN and TOHN differ elsewhere, the errors and warnings
>         start popping up.  The difference between an error and a warning
>         is not crystal clear; if I can see a simple way to adjust the
>         range to "make sense", I do so, and it's a warning.  If the range
>         is pretty hopeless, it's an error.  Here's an example of a warning
>         (and the last time I'll include the entire "extended" record,
>         where any field name that includes a lower-case letter was added by
>         me, except for "_deleted", which is there in the original record).
> 
>         County 01005 (Barbour, AL), record 5468
>         $record = {
>                    'SIDE' => 'L',
>                    'TO_num' => '6',
>                    '_deleted' => '',
>                    'TLID' => '69077027',
>                    'TOTYP' => '',
>                    'FROMTYP' => 'I',
>                    'PLUS4' => '',
>                    'ARID' => '400541338686',
>                    'warnings' => [
>                                    'trimmed 1 off 1F2'
>                                  ],
>                    'TOHN' => 'F6',
>                    'FROM_parts' => [
>                                      'F',
>                                      '2'
>                                    ],
>                    'FROMHN' => '1F2',
>                    'FROM_num' => 2,
>                    'MTFCC' => 'D1000',
>                    'addresses' => 3,
>                    'ZIP' => '36027',
>                    'TO_parts' => [
>                                    'F',
>                                    '6'
>                                  ],
>                    'parity' => 'E'
>                  };
> 
>         The original range was 1F2 => F6, a pattern (extraneous digits at
>         the front of one address endpoint) that happens often enough (about
>         650 times in the entire distribution) that it might (or might not)
>         be worth correcting.  I simply drop the extraneous digits, with a
>         warning, yielding range F2 => F6, 3 addresses with Even parity.
> 
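The "extended" record for this example could be derived along the following lines. This is a Python sketch under stated assumptions, not the Perl that produced the record above: `split_parts` and `extend_range` are hypothetical names, the trimming rule is one guess at "extraneous digits at the front of one endpoint", and treating mismatched parity as an error is a simplification.

```python
import re

def is_num(part: str) -> bool:
    return part[0] in "0123456789"

def split_parts(hn: str) -> list:
    """Split a house number into alternating digit / non-digit runs."""
    return re.findall(r"[0-9]+|[^0-9]+", hn)

def extend_range(from_hn: str, to_hn: str) -> dict:
    rec = {"FROMHN": from_hn, "TOHN": to_hn, "warnings": [], "errors": []}
    parts = {"FROM": split_parts(from_hn), "TO": split_parts(to_hn)}
    # Drop a leading digit run that the other endpoint does not have.
    for a, b, hn in (("FROM", "TO", from_hn), ("TO", "FROM", to_hn)):
        pa, pb = parts[a], parts[b]
        if (pb and len(pa) == len(pb) + 1
                and is_num(pa[0]) and not is_num(pb[0])):
            rec["warnings"].append(f"trimmed {pa[0]} off {hn}")
            parts[a] = pa[1:]
    rec["FROM_parts"], rec["TO_parts"] = parts["FROM"], parts["TO"]
    # Endpoints must agree everywhere except one all-digit run.
    shape_f = [None if is_num(p) else p for p in parts["FROM"]]
    shape_t = [None if is_num(p) else p for p in parts["TO"]]
    if shape_f != shape_t or shape_f.count(None) != 1:
        rec["errors"].append("cannot isolate a single numeric component")
        return rec
    i = shape_f.index(None)
    f, t = int(parts["FROM"][i]), int(parts["TO"][i])
    rec["FROM_num"], rec["TO_num"] = f, t
    if f % 2 != t % 2:
        rec["errors"].append("parity mismatch")
        return rec
    rec["parity"] = "E" if f % 2 == 0 else "O"
    rec["addresses"] = abs(t - f) // 2 + 1
    return rec

rec = extend_range("1F2", "F6")
```

For the Barbour County range this gives parts F/2 and F/6, 3 addresses, parity E, and the warning "trimmed 1 off 1F2", matching the record shown above.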
>         Another, less common, pattern is an extraneous - at the start of
>         one address endpoint, 92 occurrences in the distribution.  For
>         example,
> 
>         County 10001 (Kent, DE), record 97
>                    'TLID' => '68092276',
>                    'ARID' => '400404723907',
>                    'TOHN' => 'B9',
>                    'FROMHN' => '-B1',
> 
>         Here the original range, -B1 => B9, gets converted to the reasonably
>         obvious B1 => B9.  After this correction, in all but about 50 cases,
>         mixed from/to addresses agree on all the non-numeric components.
>         One of the exceptions is
> 
>         County 72021 (Bayamon, PR), record 3144
>                    'TLID' => '206027274',
>                    'ARID' => '400583928652',
>                    'TOHN' => 'OO-227',
>                    'FROMHN' => 'O3',
> 
>         This range is so far off the wall that I can't think of any way
>         to adjust it that isn't an outright guess.  But losing 50 address
>         ranges is certainly tolerable.  By far the largest class of what
>         I categorized as errors is mixed addresses differing on two or
>         more numerical components, which occurred about 13200 times.
>         All but 2 of these differed at the first and second numerical
>         component.  A typical instance is
> 
>         County 06037 (Los_Angeles, CA), record 34696
>                    'TLID' => '141604200',
>                    'ARID' => '4001117732741',
>                    'TOHN' => '1318-9',
>                    'FROMHN' => '1316-5',
> 
>         When the second components are the same length, as I believe is
>         usually the case (but I'll have to check), it's not unreasonable
>         to simply drop whatever separates the components, which would
>         yield FROM_num => 13165 and TO_num => 13189, an Odd range
>         having 13 addresses.  Given any odd number in that range,
>         we could reconstruct the "real" address by re-inserting the
>         non-digit components, for example, 13171 => 1317-1.
>         I'll probably do that, and turn the errors into warnings,
>         but it's almost certainly going to mask some real errors, like
> 
>         County 06035 (Lassen, CA), record 3888
>                    'TLID' => '126954239',
>                    'ARID' => '400360492549',
>                    'TOHN' => '708-402',
>                    'FROMHN' => '463-500',
> 
>         It is improbable that there are really 122452 even addresses
>         along the street.  But we've seen preposterously large
>         all-numeric ranges before in this thread, so maybe that
>         should just be a warning of its own.
> 
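The "drop the separator, then re-insert it" idea can be checked with a short Python sketch. The function names here are hypothetical, and the sketch assumes the digit runs keep the same widths across the range, which the message above notes still needs checking.

```python
import re

def merge_components(hn: str):
    """Concatenate the digit runs; keep the parts as a template."""
    parts = re.findall(r"[0-9]+|[^0-9]+", hn)
    digits = "".join(p for p in parts if p[0] in "0123456789")
    return int(digits), parts

def reinsert(num: int, template: list) -> str:
    """Pour the digits of num back into the template's digit slots."""
    width = sum(len(p) for p in template if p[0] in "0123456789")
    s = str(num).zfill(width)
    if len(s) != width:
        raise ValueError("number does not fit the template")
    out, i = [], 0
    for p in template:
        if p[0] in "0123456789":
            out.append(s[i:i + len(p)])
            i += len(p)
        else:
            out.append(p)
    return "".join(out)

def count_addresses(from_num: int, to_num: int) -> int:
    """Addresses in a same-parity range, endpoints included."""
    return abs(to_num - from_num) // 2 + 1

la_from, template = merge_components("1316-5")   # 13165
la_to, _ = merge_components("1318-9")            # 13189
la_count = count_addresses(la_from, la_to)       # 13
rebuilt = reinsert(13171, template)              # "1317-1"

lassen_from, _ = merge_components("463-500")
lassen_to, _ = merge_components("708-402")
lassen_count = count_addresses(lassen_from, lassen_to)  # 122452
```

The Lassen County count is what makes that range look like a real error rather than a formatting quirk; a size threshold for flagging such ranges would have to be chosen separately.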
>         I figured that there wouldn't be any parity errors, since
>         that's so easy to check for, but there were nearly 1700
>         in the distribution.  For example
> 
>         County 55035 (Eau_Claire, WI), record 4214
>                    'TLID' => '600641201',
>                    'ARID' => '400696181628',
>                    'TOHN' => '4253',
>                    'FROMHN' => '4232',
> 
>         The FROMHN is even, the TOHN is odd.  I don't believe this
>         is supposed to happen, but I happen to like the ability to
>         express the concept that all the numbers from 4232 through
>         4253 can appear.  The US postal service data include a
>         parity character, E, O or B, for Even, Odd or Both,
>         in their address ranges, a scheme I prefer.  However,
>         assuming the endpoints really were intended to have the
>         same parity, perhaps we can use the opposite side,
>         or adjacent edges, to resolve the ambiguity, much as I
>         hope to do for ambiguities about increasing/decreasing
>         ranges.
> 
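A USPS-style parity flag is easy to sketch in Python. Note that reading a mixed-parity range as "B" (Both) is only the interpretation preferred above, not something the TIGER data itself specifies; `range_parity` and `expand` are hypothetical names.

```python
def range_parity(from_num: int, to_num: int) -> str:
    """Return 'E', 'O', or 'B' (Both) for a numeric address range."""
    if from_num % 2 != to_num % 2:
        return "B"  # assumption: mismatched endpoints mean "all numbers"
    return "E" if from_num % 2 == 0 else "O"

def expand(from_num: int, to_num: int) -> list:
    """List every house number the range could contain."""
    step = 1 if range_parity(from_num, to_num) == "B" else 2
    lo, hi = sorted((from_num, to_num))
    return list(range(lo, hi + 1, step))

eau_claire = range_parity(4232, 4253)   # "B"
numbers = expand(4232, 4253)            # 4232, 4233, ..., 4253
```

If the endpoints were really meant to share a parity, this reading over-generates addresses, which is where the opposite side or adjacent edges would have to settle the question.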
>         Next step: start linking adjacent edges.  -- jpl
> 
> 
>         _______________________________________________
>         Geodata mailing list
>         Geodata at lists.osgeo.org
>         http://lists.osgeo.org/mailman/listinfo/geodata


