[Geodata] [Tiger] A few interesting observations on the Tiger2007fe data
Stephen Woodbridge
woodbri at swoodbridge.com
Mon Jun 30 18:08:21 EDT 2008
Hi John,
Thank you for the feedback. I added the geodata list to this response so
it will get archived there.
One problem in interpreting the data for geocoding will be how to decide
what the various address ranges mean, i.e., do they overlap each other, are
they stacked like |-- range 1 --+--- range 2 --|, or are they just random
house numbers that sit on the street given some other range along the
street, etc.
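A minimal sketch of how two ranges on the same side might be sorted into
those cases. The function name and the "gap of at most 2 counts as stacked"
threshold are assumptions for illustration, not anything defined by the
Census data.

```python
def classify_ranges(a, b):
    """Classify how two (from, to) house-number ranges relate.

    Returns 'overlap', 'stacked' or 'disjoint'.
    """
    # Normalize so low <= high; ranges may run in either direction.
    a_lo, a_hi = sorted(a)
    b_lo, b_hi = sorted(b)
    # Put the range with the lower start first.
    if b_lo < a_lo:
        a_lo, a_hi, b_lo, b_hi = b_lo, b_hi, a_lo, a_hi
    if b_lo <= a_hi:
        return "overlap"
    # Assumption: same-parity ranges step by 2, so a gap of up to 2
    # means one range picks up where the other leaves off.
    if b_lo - a_hi <= 2:
        return "stacked"
    return "disjoint"


print(classify_ranges((1, 99), (2, 100)))
print(classify_ranges((100, 198), (200, 298)))
print(classify_ranges((6973, 6981), (44124, 44136)))
```

This only looks at the numbers; it cannot tell whether an overlapping pair
is really one both-parity range or two separate runs of houses.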
The other fear I have is that this is rev 0.0 of a new process and
format for the Census, and it is likely buggy as all get out as most rev
0.0 products are. For example we know future versions are supposed to
have the inclusive ranges on the edges shapefile.
Glad to have you on board and sharing your experiences.
It would be great if someone from the US Census would monitor this list.
I'll have to see if I can find anyone that might be interested. It would
also be neat to set up some kind of database like:
user|date|tlid|ss|ccc|file|action|fieldname|oldval|newval
This would allow us to create a database of corrections, errors, etc.
that could be automatically applied to the data when processing it and
could be given to the Census if they are interested.
Any thoughts on this, on setting something like this up? Maybe it is not
worth the effort.
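A rough sketch of what applying such a log could look like, assuming the
pipe-delimited layout above. The "update" action, the TOHN field name, and
the sample row (including its ss, ccc and newval values) are made up for
illustration; ss and ccc are presumably state and county codes and are
simply carried along here.

```python
import csv
import io

FIELDS = ["user", "date", "tlid", "ss", "ccc", "file",
          "action", "fieldname", "oldval", "newval"]


def load_corrections(text):
    """Parse pipe-delimited correction lines into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS,
                            delimiter="|")
    return list(reader)


def apply_corrections(records, corrections):
    """Apply 'update' corrections to records (dict: tlid -> field dict).

    A correction is applied only if the record still holds oldval, so
    re-running the log against already-fixed data changes nothing.
    Returns the number of corrections applied.
    """
    applied = 0
    for c in corrections:
        rec = records.get(c["tlid"])
        if rec is None or c["action"] != "update":
            continue
        if rec.get(c["fieldname"]) == c["oldval"]:
            rec[c["fieldname"]] = c["newval"]
            applied += 1
    return applied


# Illustrative values only; not a verified correction.
log = "example|2008-06-30|134042726|00|000|addr|update|TOHN|39329|3929\n"
records = {"134042726": {"FROMHN": "3901", "TOHN": "39329"}}
print(apply_corrections(records, load_corrections(log)), records)
```

Checking oldval before writing is a design choice: it keeps a stale
correction from clobbering a value the Census has since fixed upstream.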
-Steve
John P. Linderman wrote:
> Schuyler Erle told me about this list, so blame him if I'm
> being annoying :-) I, too, have been pawing through the
> Tiger2007 distribution, and trying to make sense of it.
> I've already mentioned a number of suspicious conditions
> to Schuyler, and I hope they'll be of interest here.
>
> Top-level quotes here are from Stephen Woodbridge <woodbri at swoodbridge.com>
>> Other issues inline below ...
>>
>> Bob Basques wrote:
>>> Stephen,
>>>
>>> I ran into some of these problems with our local dataset. The
>>> multiple Zip code assignments are explainable up to 4 (or possibly 6)
>>> by the segment ending up at the same spot where four different zip
>>> code boundaries come together. If it's more than four, that would be
>>> something that would be harder to explain.
>> Well it is understandable from the point of view that zipcodes are NOT
>> areas but postal carrier routes and one street segment might be serviced
>> by multiple routes. In fact in many city streets the right and left side
>> are often serviced by separate routes. I am just surprised to see it
>> here in the census data and surprised to find that it is extremely common
>> to find one side that has multiple zipcodes!
>>
>> I found 466354 cases of this in the tiger data!
>
> It's not too surprising that there might be two different ZIP codes
> along an uninterrupted edge, but I am also seeing instances of 3
> ZIPs on a single side of a single edge, eg
> 'TLID' => '134034777',
> ZIPs 08060 08048 08036. These are all ZIPs appearing in cities
> that are close together on the map, so they don't appear to be
> fat-finger errors. But are we to believe that letter carriers
> are being sent in from both ends of the segment, and a third
> is leap-frogging past another to reach the middle? Possible,
> I suppose, that one ZIP is for a building that has a ZIP of
> its own, but this area is fairly rural, so I suspect something
> else. There were only 32 instances of 3 or more ZIPs on a
> single side in New Jersey, so the condition is probably rare
> country-wide.
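A small sketch of how such a count could be reproduced, assuming the address
rows have already been flattened to (tlid, side, zip) tuples. That layout,
the function names, and the "L" side in the sample rows are assumptions; the
message above does not say which side TLID 134034777's ZIPs are on.

```python
from collections import defaultdict


def zips_per_side(rows):
    """Map (tlid, side) -> set of distinct ZIPs seen on that side."""
    seen = defaultdict(set)
    for tlid, side, zipcode in rows:
        if zipcode:  # skip blank ZIPs
            seen[(tlid, side)].add(zipcode)
    return seen


def sides_with_many_zips(rows, minimum=3):
    """Return the edge sides carrying at least `minimum` distinct ZIPs."""
    return {key: sorted(zips)
            for key, zips in zips_per_side(rows).items()
            if len(zips) >= minimum}


rows = [
    ("134034777", "L", "08060"),
    ("134034777", "L", "08048"),
    ("134034777", "L", "08036"),
    ("134034777", "R", "08060"),
]
print(sides_with_many_zips(rows))
```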
>
>>> As for the multiple ranges, there are possible valid reasons for
>>> this. One is that a long segment is broken by ranges not at an
>>> intersection but by some arbitrary line that denotes address
>>> directionals. Basically, the segment could have a Zero in the middle
>>> and go up towards each end.
>> Yes, but that arbitrary line is NOT represented in Tiger or the street
>> segment would have been split at the intersection point.
>>
>> I found 232723 cases of this in the tiger data!
>>
>> More food for thought.
>>
>> -Steve
>
> I see (at least) three reasons for multiple address ranges.
> 1) Since TigerLine, unlike the US Postal service, does not
> have a field to distinguish between Even, Odd, and Both,
> a range may be added to indicate that both even and odd
> addresses occur on the same side, not terribly unusual
> on circles.
> 2) What appears to be a single range may be lacking one
> or more addresses, perhaps because a single even address
> appears on the "odd side" of the street, or because some
> addresses are simply missing.
> 3) Sometimes there are two widely separated address ranges,
> like
> 'TLID' => '202495150',
> Ranges 6973-6981 and 44124-44136
>
> I spot checked a few of these against Postal Service
> address ranges, and some agree, some don't.
>
> In the latter two cases, but certainly in the final one,
> it may be best to recognize the gap rather than geocode the
> entire range along the edge. Failure to do so for the
> final example will stack all the valid addresses at the
> extreme ends of the edge, with a huge gap in between.
> And, if an even address is on the "odd side", you want
> to code it there, or maybe miscode it on the even side,
> but not code it on both.
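A minimal sketch of "recognizing the gap": each range gets a share of the
edge proportional to its own span, and numbers that fall between ranges are
not placed at all. The function names are hypothetical, parity is ignored,
and the sketch assumes the ranges are listed in the order they lie along
the edge, which the data itself does not guarantee.

```python
def naive_position(house, lo, hi):
    """Interpolate over the full min..max span, ignoring any gap."""
    return (house - lo) / (hi - lo)


def locate(house, ranges):
    """Return a fractional position (0..1) along the edge, or None.

    ranges: list of (from, to) tuples, assumed ordered along the edge.
    """
    widths = [abs(to - frm) + 1 for frm, to in ranges]
    total = sum(widths)
    start = 0
    for (frm, to), width in zip(ranges, widths):
        if min(frm, to) <= house <= max(frm, to):
            # Fraction of the way through this range from its FROM end.
            t = abs(house - frm) / (width - 1) if width > 1 else 0.5
            return (start + t * width) / total
        start += width
    return None  # house number falls in a gap between ranges


ranges = [(6973, 6981), (44124, 44136)]
print(naive_position(6981, 6973, 44136))  # crowded against the start
print(locate(6981, ranges))
print(locate(10000, ranges))
```

Weighting by span is only one choice; splitting the edge evenly between the
ranges would be just as defensible without knowing where the houses are.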
>
> Then there are ranges like
> 'TLID' => '134042726',
> left side range 3901-39329, right side range 3914-3946.
> This looks more like someone's finger "bounced" on some key
> once too often in setting the TO range.
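One crude way to flag ranges like that is to compare the spans on the two
sides of an edge. This is only a heuristic sketch: the function name, the
10x ratio, and the floor of 10 are arbitrary assumptions, and legitimately
lopsided edges will be flagged too.

```python
def suspicious_side(left, right, ratio=10, floor=10):
    """Return 'L' or 'R' if that side's span dwarfs the other, else None.

    left, right: (from, to) house-number ranges for each side.
    floor keeps a one-house side from making the other look huge.
    """
    l_span = abs(left[1] - left[0])
    r_span = abs(right[1] - right[0])
    if l_span > ratio * max(r_span, floor):
        return "L"
    if r_span > ratio * max(l_span, floor):
        return "R"
    return None


print(suspicious_side((3901, 39329), (3914, 3946)))
print(suspicious_side((3901, 3929), (3914, 3946)))
```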
>
> and the ranges already mentioned elsewhere, where ranges
> on a single side are both increasing and decreasing, or
> ranges are increasing on one side, but not the other.
> I have spot checked a number of such cases, and examining
> "adjacent edges" (those sharing an end point) with the
> same feature name often clarifies what is probably correct.
> (If adjacent edges bracket the TO and FROM addresses,
> but have opposite sense, the addresses on this edge
> are probably in the wrong order, for example, or if
> an edge adjacent to range 3901-39329 has address 3941,
> then the upper range of the huge range probably needs to
> be adjusted down to not overlap.)
>
> If TigerLine really *does* want to tolerate ranges that
> change sense in mid-edge, then they'll have to add fields
> to enforce an order on the multiple ranges, and/or add
> parity indicators, so we can distinguish ranges
> 1-99 + 2-100 as meaning 1-100 (Both parities) from
> 1-99 followed by 2-100. For now, it seems we must
> tolerate or ignore ambiguity.
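A short sketch of inferring parity from a range's endpoints and spotting
pairs that could be a single both-parity range. The E/O/B labels mirror the
Even/Odd/Both distinction mentioned above; the function names are
hypothetical, and the check cannot resolve the ambiguity, only flag it.

```python
def parity(frm, to):
    """'E' or 'O' if both endpoints share a parity, else 'B' (both)."""
    if frm % 2 == to % 2:
        return "E" if frm % 2 == 0 else "O"
    return "B"


def interleaved_candidate(a, b):
    """True if a and b might be the odd and even halves of one range.

    Requires opposite parities and endpoints within 1 of each other.
    """
    if {parity(*a), parity(*b)} != {"E", "O"}:
        return False
    (a_lo, a_hi), (b_lo, b_hi) = sorted(a), sorted(b)
    return abs(a_lo - b_lo) == 1 and abs(a_hi - b_hi) == 1


print(parity(1, 99), parity(2, 100), parity(1, 100))
print(interleaved_candidate((1, 99), (2, 100)))
print(interleaved_candidate((6973, 6981), (44124, 44136)))
```

A True result here still covers both readings: 1-100 with both parities,
or 1-99 followed by 2-100 further along the edge.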
>
>>> Just some food for thought.
>>>
>>> bobb
>>>
>>>
>>>
>>>>>> Stephen Woodbridge <woodbri at swoodbridge.com> 06/29/08 10:06 PM
>>> Hi all,
>>>
>>> The more I play with this data the stranger it is!
>
> I said pretty much the same thing to Schuyler -- jpl
>