<div dir="ltr">Have they added z-ordering to the Tiger data? I am still hoping to build a routing application based on Tiger data, which requires that I can disambiguate the overpasses from the roads they "touch".<br>
<br>Cordially,<br><br>Joe Bussell<br><br><br><div class="gmail_quote">On Wed, Jul 23, 2008 at 8:13 AM, Stephen Woodbridge <<a href="mailto:woodbri@swoodbridge.com">woodbri@swoodbridge.com</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi John,<br>
<br>
These reports and analyses are really great! Sorry that I have not been able to contribute more; I have been distracted by a client on another project. I hope to get back to this and add my two cents as well.<br>
<br>
I have also run into the address number range issues that you mentioned, and I like your idea of dropping the extraneous characters; I was just throwing an error in my code.<br>
<br>
Can you talk a little bit about your development environment? It looks like you are using Perl. Are you using a database as a backing store to help with the processing? MySQL, PostgreSQL, SQLite, other?<br>
<br>
Thanks,<div class="Ih2E3d"><br>
-Steve<br>
<br>
John P. Linderman wrote:<br>
</div><div><div></div><div class="Wj3C7c"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
As I mentioned earlier in this thread, one aid to disambiguating<br>
street segments where address ranges are increasing on one side,<br>
but decreasing on the other, is to link segments for the same<br>
street together, and see how adjacent ranges line up. But<br>
sometimes a street enters an intersection from more than two<br>
directions. See, for example, 40.670528,-74.457660 in google<br>
maps, where Chaucer Dr forms a topological "lollipop", with houses<br>
on both sides of the loop and along the stem. If you come to the<br>
loop from the stem, and want to choose which edge to take out,<br>
the address ranges may be helpful in making the choice. For<br>
example, if the houses along the outside of the loop have odd<br>
addresses, and those on the inside, even addresses, then the<br>
"obvious" choice would be the edge preserving the parity of the stem.<br>
That is, if the stem has odd edges on the right approaching the loop,<br>
then one would want to turn right, keeping the odd edges on the right<br>
(outside), and vice versa. Address numbers might also be a guide,<br>
trying to minimize the gap as one moves from edge to edge. But to<br>
measure gaps, or establish parity, there must be a "number",<br>
which is obvious when address ranges are purely numeric,<br>
but less obvious when there are non-digits involved.<br>
So, before worrying about linking edges together, I wanted to<br>
get a handle on the nature of individual ranges from the<br>
Address Ranges Relationship File.<br>
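<br>
A minimal sketch of the parity-then-gap heuristic described above, in Python<br>
rather than the Perl used for the actual processing. The Edge type, its field<br>
names and the sample numbers are all hypothetical; the sketch assumes each<br>
candidate edge is already oriented in the direction of travel and that its<br>
right-hand range has been reduced to a parity and numeric bounds.<br>
<br>
```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edge:
    """Hypothetical edge summary, oriented in the direction of travel."""
    name: str
    right_parity: str  # 'E' or 'O': parity of the addresses on the right-hand side
    right_low: int     # smallest address number on the right-hand side
    right_high: int    # largest address number on the right-hand side


def gap(a: Edge, b: Edge) -> int:
    """Numeric distance between the right-hand ranges (0 if they overlap)."""
    return max(b.right_low - a.right_high, a.right_low - b.right_high, 0)


def choose_next(stem: Edge, candidates: List[Edge]) -> Optional[Edge]:
    """Prefer edges keeping the stem's parity on the right; break ties by gap."""
    if not candidates:
        return None
    # False sorts before True, so parity-preserving edges always win.
    return min(
        candidates,
        key=lambda e: (e.right_parity != stem.right_parity, gap(stem, e)),
    )


# Made-up numbers for a lollipop: odd addresses on the outside of the loop.
stem = Edge("stem", "O", 1, 15)
loop_right = Edge("loop, turning right", "O", 17, 35)
loop_left = Edge("loop, turning left", "E", 16, 34)
print(choose_next(stem, [loop_left, loop_right]).name)
```
<br>
Parity is checked first and the gap only breaks ties, so a small gap on the<br>
wrong side of the street never wins. Both criteria need a numeric value for<br>
every range endpoint.<br>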
<br>
Here are some summary data for the entire Tiger2007 distribution.<br>
<br>
Address range grand totals<br>
All Digits: 70505410<br>
Mixed: 1608442<br>
No Digits: 36<br>
Total: 72113888<br>
<br>
Errors in all address ranges<br>
Errors: 14937<br>
Warnings: 742<br>
Clean: 36041265<br>
Total: 36056944<br>
<br>
There are (exactly) twice as many sample points in the "grand totals"<br>
summary as the error summary because each range has a TO and FROM<br>
address. There are very few addresses with no digits (a few each in<br>
MI, WI and PR). We can ignore them completely without much loss of<br>
generality. But there are enough with both digits and non-digits<br>
that we had best do something sensible. So I wrote some scripts to<br>
"extend" the basic record, attempting to add a FROM_num and TO_num<br>
field that is always digits only. For All Digits addresses, these<br>
are the same as the FROMHN and TOHN fields. If FROMHN and TOHN<br>
agree on everything but a single all-digit subsequence, that<br>
subsequence is a logical choice for the FROM_num and TO_num fields.<br>
When FROMHN and TOHN differ elsewhere, the errors and warnings<br>
start popping up. The difference between an error and a warning<br>
is not crystal clear; if I can see a simple way to adjust the<br>
range to "make sense", I do so, and it's a warning. If the range<br>
is pretty hopeless, it's an error. Here's an example of a warning<br>
(and the last time I'll include the entire "extended" record,<br>
where any field name including a lower case letter was added by me,<br>
except for "_deleted", which is there in the original record).<br>
<br>
County 01005 (Barbour, AL), record 5468<br>
$record = {<br>
'SIDE' => 'L',<br>
'TO_num' => '6',<br>
'_deleted' => '',<br>
'TLID' => '69077027',<br>
'TOTYP' => '',<br>
'FROMTYP' => 'I',<br>
'PLUS4' => '',<br>
'ARID' => '400541338686',<br>
'warnings' => [<br>
'trimmed 1 off 1F2'<br>
],<br>
'TOHN' => 'F6',<br>
'FROM_parts' => [<br>
'F',<br>
'2'<br>
],<br>
'FROMHN' => '1F2',<br>
'FROM_num' => 2,<br>
'MTFCC' => 'D1000',<br>
'addresses' => 3,<br>
'ZIP' => '36027',<br>
'TO_parts' => [<br>
'F',<br>
'6'<br>
],<br>
'parity' => 'E'<br>
};<br>
<br>
The original range was 1F2 => F6, a pattern (extraneous digits at the<br>
front of one address endpoint) that happens often enough (about 650<br>
times in the entire distribution) that it might (or might not) be worth<br>
correcting. I simply drop the extraneous digits, with a warning,<br>
yielding range F2 => F6, 3 addresses with Even parity.<br>
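<br>
A rough Python sketch of this kind of extension (not the actual Perl scripts).<br>
The function names and the returned dictionary layout are assumptions made for<br>
illustration; only the field names FROM_num and TO_num, the parity letter and<br>
the wording of the trim warning are taken from the record above. Ranges whose<br>
endpoints disagree anywhere other than a single digit run are simply returned<br>
as None here.<br>
<br>
```python
import re
from typing import List, Optional


def split_parts(hn: str) -> List[str]:
    """Split a house number into alternating digit and non-digit runs."""
    return re.findall(r"[0-9]+|[^0-9]+", hn)


def is_digits(part: str) -> bool:
    return re.fullmatch(r"[0-9]+", part) is not None


def extend_range(fromhn: str, tohn: str) -> Optional[dict]:
    """Return numeric endpoints, parity and address count, or None if unresolved."""
    warnings = []
    f, t = split_parts(fromhn), split_parts(tohn)
    # Drop one extraneous leading digit run, so 1F2 => F6 becomes F2 => F6.
    for parts, other, raw in ((f, t, fromhn), (t, f, tohn)):
        if len(parts) == len(other) + 1 and is_digits(parts[0]):
            warnings.append("trimmed %s off %s" % (parts[0], raw))
            del parts[0]
    if len(f) != len(t):
        return None
    differing = [i for i in range(len(f)) if f[i] != t[i]]
    digit_runs = [i for i in range(len(f)) if is_digits(f[i])]
    if len(differing) == 1 and is_digits(f[differing[0]]) and is_digits(t[differing[0]]):
        i = differing[0]   # endpoints agree except for one all-digit run
    elif not differing and len(digit_runs) == 1:
        i = digit_runs[0]  # identical endpoints with a single digit run
    else:
        return None        # endpoints disagree elsewhere: left as an error
    from_num, to_num = int(f[i]), int(t[i])
    same_parity = from_num % 2 == to_num % 2
    if not same_parity:
        warnings.append("endpoints differ in parity")
    return {
        "FROM_num": from_num,
        "TO_num": to_num,
        "parity": ("E" if from_num % 2 == 0 else "O") if same_parity else None,
        "addresses": abs(to_num - from_num) // 2 + 1 if same_parity else None,
        "warnings": warnings,
    }


print(extend_range("1F2", "F6"))
```
<br>
For the Barbour County record this gives FROM_num 2, TO_num 6, parity E and<br>
3 addresses, with the single trim warning.<br>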
<br>
Another, less common, pattern is an extraneous - at the start of<br>
one address endpoint, 92 occurrences in the distribution. For example,<br>
<br>
County 10001 (Kent, DE), record 97<br>
'TLID' => '68092276',<br>
'ARID' => '400404723907',<br>
'TOHN' => 'B9',<br>
'FROMHN' => '-B1',<br>
<br>
Here the original range, -B1 => B9, gets converted to the reasonably<br>
obvious B1 => B9. After this correction, in all but about 50 cases,<br>
mixed from/to addresses agree on all the non-numeric components.<br>
One of the exceptions is<br>
<br>
County 72021 (Bayamon, PR), record 3144<br>
'TLID' => '206027274',<br>
'ARID' => '400583928652',<br>
'TOHN' => 'OO-227',<br>
'FROMHN' => 'O3',<br>
<br>
This range is so far off the wall that I can't think of any way<br>
to adjust it that isn't an outright guess. But losing 50 address<br>
ranges is certainly tolerable. By far the largest class of what<br>
I categorized as errors is mixed addresses differing on two or<br>
more numerical components, which occurred about 13200 times.<br>
All but 2 of these differed at the first and second numerical<br>
component. A typical instance is<br>
<br>
County 06037 (Los_Angeles, CA), record 34696<br>
'TLID' => '141604200',<br>
'ARID' => '4001117732741',<br>
'TOHN' => '1318-9',<br>
'FROMHN' => '1316-5',<br>
<br>
When the second components are the same length, as I believe is<br>
usually the case (but I'll have to check), it's not unreasonable<br>
to simply drop whatever separates the components, which would<br>
yield FROM_num => 13165 and TO_num => 13189, an Odd range<br>
having 13 addresses. Given any odd number in that range,<br>
we could reconstruct the "real" address by re-inserting the<br>
non-digit components, for example, 13171 => 1317-1.<br>
I'll probably do that, and turn the errors into warnings,<br>
but it's almost certainly going to mask some real errors, like<br>
<br>
County 06035 (Lassen, CA), record 3888<br>
'TLID' => '126954239',<br>
'ARID' => '400360492549',<br>
'TOHN' => '708-402',<br>
'FROMHN' => '463-500',<br>
<br>
It is improbable that there are really 122452 even addresses<br>
along the street. But we've seen preposterously large<br>
all-numeric ranges before in this thread, so maybe that<br>
should just be a warning of its own.<br>
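<br>
A small Python sketch of the collapse-and-reinsert idea (again, not the Perl<br>
actually used). The function names are hypothetical, and the max_addresses<br>
threshold of 1000 is an arbitrary assumption for flagging implausibly large<br>
ranges. It assumes both endpoints contain at least one digit and have digit<br>
runs of matching lengths.<br>
<br>
```python
import re


def collapse(hn):
    """Split e.g. '1316-5' into a joined number, digit-run widths and separators."""
    digits = re.findall(r"[0-9]+", hn)
    seps = re.split(r"[0-9]+", hn)  # always one more separator than digit runs
    return int("".join(digits)), [len(d) for d in digits], seps


def expand(num, widths, seps):
    """Re-insert the separators into a collapsed number, e.g. 13171 -> '1317-1'."""
    s = str(num).zfill(sum(widths))
    out, pos = [seps[0]], 0
    for width, sep in zip(widths, seps[1:]):
        out.append(s[pos:pos + width])
        out.append(sep)
        pos += width
    return "".join(out)


def collapsed_range(fromhn, tohn, max_addresses=1000):
    """Collapse both endpoints; return None unless their layouts match exactly."""
    f_num, f_widths, f_seps = collapse(fromhn)
    t_num, t_widths, t_seps = collapse(tohn)
    if f_widths != t_widths or f_seps != t_seps:
        return None
    if f_num % 2 != t_num % 2:
        return None
    count = abs(t_num - f_num) // 2 + 1
    warnings = []
    if count > max_addresses:
        warnings.append("implausibly large range: %d addresses" % count)
    return {
        "FROM_num": f_num,
        "TO_num": t_num,
        "parity": "E" if f_num % 2 == 0 else "O",
        "addresses": count,
        "warnings": warnings,
    }


print(collapsed_range("1316-5", "1318-9"))
print(expand(13171, [4, 1], ["", "-", ""]))
print(collapsed_range("463-500", "708-402"))
```
<br>
The Los Angeles range comes out as 13165 to 13189, Odd, 13 addresses, and<br>
13171 expands back to 1317-1. The Lassen range collapses to 122452 even<br>
addresses, which trips the size warning.<br>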
<br>
I figured that there wouldn't be any parity errors, since<br>
that's so easy to check for, but there were nearly 1700<br>
in the distribution. For example<br>
<br>
County 55035 (Eau_Claire, WI), record 4214<br>
'TLID' => '600641201',<br>
'ARID' => '400696181628',<br>
'TOHN' => '4253',<br>
'FROMHN' => '4232',<br>
<br>
The FROMHN is even, the TOHN is odd. I don't believe this<br>
is supposed to happen, but I happen to like the ability to<br>
express the concept that all the numbers from 4232 through<br>
4253 can appear. The US postal service data include a<br>
parity character, E, O or B, for Even, Odd or Both,<br>
in their address ranges, a scheme I prefer. However,<br>
assuming the endpoints really were intended to have the<br>
same parity, perhaps we can use the opposite side,<br>
or adjacent edges, to resolve the ambiguity, much as I<br>
hope to do for ambiguities about increasing/decreasing<br>
ranges.<br>
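<br>
For illustration, a tiny Python sketch of the E/O/B classification mentioned<br>
above. Treating mixed-parity endpoints as "Both" is only one possible reading,<br>
borrowed from the postal-style scheme; whether that is what the Tiger data<br>
intends is exactly the open question. The function name is hypothetical.<br>
<br>
```python
def classify(from_num, to_num):
    """Return (parity, address_count) for a numeric range, in either direction."""
    lo, hi = sorted((from_num, to_num))
    if lo % 2 == hi % 2:
        parity = "E" if lo % 2 == 0 else "O"
        # Same parity: every second number belongs to the range.
        return parity, (hi - lo) // 2 + 1
    # Mixed parity, read as "Both": every number from lo through hi.
    return "B", hi - lo + 1


print(classify(4232, 4253))
print(classify(2, 6))
```
<br>
Under that reading the Eau Claire range 4232 to 4253 holds 22 addresses of<br>
either parity.<br>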
<br>
Next step: start linking adjacent edges. -- jpl<br>
<br>
<br>
_______________________________________________<br>
Geodata mailing list<br>
<a href="mailto:Geodata@lists.osgeo.org" target="_blank">Geodata@lists.osgeo.org</a><br>
<a href="http://lists.osgeo.org/mailman/listinfo/geodata" target="_blank">http://lists.osgeo.org/mailman/listinfo/geodata</a><br>
</blockquote>
<br>
</div></div></blockquote></div><br></div>