[Geodata] [Tiger] A few interesting observations on the Tiger2007fe data

John P. Linderman jpl at research.att.com
Tue Jul 1 11:21:47 EDT 2008


Stephen Woodbridge <woodbri at swoodbridge.com> said (in part):
> It would be great if someone from the US Census would monitor
> this list.  I'll have to see if I can find anyone that might be
> interested. It would also be neat to set up some kind of database like:
> 
> user|date|tlid|ss|ccc|file|action|fieldname|oldval|newval
> 
> This would allow us to create a database of corrections, errors, etc 
> that could be automatically applied to the data when processing it and 
> could be given to the Census if they are interested?
> 
> Any thoughts on this, on setting something like this up? Maybe it is not 
> worth the effort.
> 
> -Steve

I think that's a fabulous idea.  It sounds, from mail I got back
from TigerLine, that they don't expect to do a real cleanup
until after the 2010 census...

> John,
> 
> The Census Bureau geography staff ran an address edit program to fix
> address range inconsistencies,  like the one that you mention below,
> however some of the address ranges could have been missed.  It is also
> possible that the 'mixed order range' is due to the introduction of new
> address ranges to the database.  Depending on the source file that was used
> to update the database, the data could have been entered in according to
> the source file or it could have been entered in reverse order.  In either
> case, no more address range edits will be run on the data until sometime
> after the 2010 Census therefore these inconsistencies will continue to
> appear in the data.

So we either live with dirty data for two more years, correct our own
copies, or make the corrections irresistible to the TigerLine people.
That suggests to me that we should invite them to tell us which
format they would find most useful.  I'll copy the generic tiger email
contact, and suggest that they have someone sign up at

http://lists.osgeo.org/mailman/listinfo/geodata

if they are interested in what we are trying to do (it's low volume).
But let's not cc them, unsolicited, on too much stuff, lest we wear
out our welcome.

As for your specific suggestion,

> user|date|tlid|ss|ccc|file|action|fieldname|oldval|newval

since not every file has tlid as a key, I think we might want
something closer to

ss|ccc|file|key|fieldname|date|action|oldval|newval|user|comments

If sorted, this brings all the action for a given file, record,
and field together, which would be pretty handy for seeing if
a proposed change is already present.  (I tend to think in terms
of sorted flat files, since they are highly portable, but this
doesn't exclude a database version with indexes on all the
important fields.)  The comments (which might be broken down
into further fields, if they are structured) would make the
changes easier to understand, like

reduced address number 277 to 27 because range 29-99 appears in tlid 60569908
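
As a rough illustration of the "is this change already present" check,
here is a minimal Python sketch.  It assumes the field order proposed
above, ISO-style dates, and that a '|' can only appear inside the final
comments field.  The function names and the sample key, fieldname, date
and user values are made up for the example; they are not taken from
any real file.

```python
# Field order from the proposed layout; the first five fields identify
# the file, record, and field that a correction applies to.
FIELDS = ["ss", "ccc", "file", "key", "fieldname", "date",
          "action", "oldval", "newval", "user", "comments"]


def parse_record(line):
    """Split one pipe-delimited line into a dict keyed by FIELDS."""
    # Limiting the split keeps any '|' inside the trailing comments intact.
    values = line.rstrip("\n").split("|", len(FIELDS) - 1)
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def target(record):
    """Return the (ss, ccc, file, key, fieldname) tuple for a record."""
    return tuple(record[name] for name in FIELDS[:5])


def already_present(proposed, existing):
    """True if an existing record sets the same target to the same newval."""
    return any(
        target(old) == target(proposed) and old["newval"] == proposed["newval"]
        for old in existing
    )


# Hypothetical sample data: key, fieldname, date and user are placeholders.
existing = sorted(
    (parse_record(line) for line in [
        "34|039|edges|12345678|fromadd|2008-07-01|change|277|27|jpl|"
        "range 29-99 appears in tlid 60569908",
        "34|039|edges|12345670|toadd|2008-06-30|change|10|98|jpl|",
    ]),
    # Sorting on the fields in order groups records by file, record, field.
    key=lambda r: [r[name] for name in FIELDS],
)
proposed = parse_record(
    "34|039|edges|12345678|fromadd|2008-07-02|change|277|27|woodbri|same fix"
)
print(already_present(proposed, existing))
```

With a genuinely sorted flat file, the linear scan could be replaced by
a binary search on the five-field prefix, but that is a refinement
rather than part of the proposal.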

And, given that we probably haven't thought of all the fields,
and that comments might get pretty long, we might want key/value
pairs on multiple lines, separated by empty lines, like

ss	34
ccc	039
file	edges
...
#	adjacent edge 60569908 begins at address 29
#	adjacent edge 60569922 ends at address 21
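
Reading such a file back is simple.  The sketch below is one possible
reader, not a settled format: it assumes the key and value are
separated by a single tab, that records are separated by empty lines,
and that lines whose key is '#' are free-form comments collected in
order.  The name parse_blocks and the 'comments' dictionary key are
inventions for the example.

```python
def parse_blocks(text):
    """Parse blank-line-separated blocks of 'key<TAB>value' lines.

    Returns a list of dicts, one per block.  Lines whose key is '#'
    are gathered, in order, into a list under the 'comments' entry.
    """
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # An empty line ends the current record, if there is one.
            if current:
                records.append(current)
                current = {}
            continue
        key, _, value = line.partition("\t")
        if key == "#":
            current.setdefault("comments", []).append(value)
        else:
            current[key] = value
    if current:
        # The last record need not be followed by an empty line.
        records.append(current)
    return records


# Two small records built from the fields shown above.
sample = (
    "ss\t34\n"
    "ccc\t039\n"
    "file\tedges\n"
    "#\tadjacent edge 60569908 begins at address 29\n"
    "#\tadjacent edge 60569922 ends at address 21\n"
    "\n"
    "ss\t34\n"
    "ccc\t039\n"
    "file\tedges\n"
)
for record in parse_blocks(sample):
    print(record)
```

Unknown keys are simply carried along, which fits the point that we
probably haven't thought of all the fields yet.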

In any event, the ability to share corrections in a structured
way seems most worthwhile.

PS: I'll comment on other stuff from Stephen in separate mail,
since this seems like an important thread of its own. -- jpl


