[postgis-users] TIP: Loading TIGER data with Primary Keys

Stephen Woodbridge woodbri at swoodbridge.com
Tue Dec 20 21:36:52 PST 2011


Hi Steve,

This is a nice tip but it does have a problem. In the tiger data "County 
Line Rd" will only be populated on the right OR left side that is in the 
county dataset it belongs to. So if you load CountyA first then you will 
get one side of the data right or left. And when you get the primary key 
failure on the next county that has that TLID, it will get skipped and 
you will loss the other side attributes. :(

You might be able to write a plpgsql trigger that can catch the primary 
key violation and update the existing record, but I haven't tried to do 
that.

-Steve W

On 12/20/2011 10:33 PM, Steve Walker wrote:
> Hi,
>
> I want to offer the following tip on how to properly build and load a
> nation-wide (or subset thereof) US Census Bureau TIGER database into
> PostGIS using ogr2ogr, in such a manner as to maintain the primary key
> integrity inherent in the native TIGER database.
>
> For the purpose of this example, I'll identify two adjacent counties
> within the same state:  CountyA and CountyB, and we will work only with
> the edges theme.
>
> We can imagine that a road named "County Line Rd" forms part of the
> boundary between these two counties.
>
> In the TIGER database, the road will represented by one or more EDGEs.
> We'll simplify this to assume that there is exactly one edge feature
> representing "County Line Rd"  and that the edge feature has a unique
> Tiger Line Identifier (TLID) of 12345687890  which is unique throughout
> the entire TIGER database, nationwide.   (We'll also simplify this to
> ignore N-to-N relations between edges, roads, and names etc.  They're
> important, but irrelevant to this discussion.)
>
> So far so good.   Unfortunately however, the countyA_edges.shp and
> countyB_edges.shp each maintain a representation of this feature.  TLID
> 1234567890, although a unique primary key, is duplicated between two
> shapefiles.
>
> Normally, ogr2ogr would neither notice nor would it care.  This is
> because ogr2ogr will by default add and populate an arbitrary primary
> key attribute 'ogc_fid' when loading data.
>
> Thus, the plain vanilla ogr2ogr load of TIGER will not fail due to any
> duplicate primary key violations, but at the expense of losing the
> normalized integrity of the database.  We will end up with County Line
> Rd and TLID 1234567890 being duplicated in the database, and we won't be
> able to leverage the native TLID primary key in all our subsequent
> operations.
>
> Here is how to fix that:
>
>
> So the first thing we would reasonably do would be something like this:
>
> ALTER TABLE edges DROP COLUMN ogc_fid;
>
> This would dump the arbitrary primary key set up by ogr2ogr
>
> And then we would attempt:
>
> ALTER TABLE edges ADD PRIMARY KEY (tlid);
>
> Which would fail once it hit the second record with a TLID of 1234567890
> (ie because it duplicates the primary key constraint.)
>
> So, this cannot be easily done after the fact:  IE when we have a whole
> state or the whole nation already loaded into the database.
> -------
>
>
> OK, so here is the TIP   (pseudo-code):
>
> Starting with an totally empty database:
>
>
> shell:
> #ogr2ogr countyA_edges.shp EDGES    #### load the first shapefile.
>
> psql:
> psql: ALTER TABLE edges DROP column ogc_fid     ### drop the arbitrary
> primary key
>
> psql:  ALTER TABLE edges ADD PRIMARY KEY (tlid);   ### add the real
> primary key.
>
> (back to the shell:)
>
> #ogr2ogr -skipfailures -gt 1 countyB_edges.shp EDGES
> #ogr2ogr -skipfailures -gt 1 countyC_edges.shp EDGES
> #ogr2ogr -skipfailures -gt 1 countyD_edges.shp EDGES
> #ogr2ogr -skipfailures -gt 1 countyE_edges.shp EDGES
> #ogr2ogr -skipfailures -gt 1 countyF_edges.shp EDGES
> #ogr2ogr -skipfailures -gt 1 countyG_edges.shp EDGES
> #ogr2ogr -skipfailures -gt 1 countyH_edges.shp EDGES
> #ogr2ogr -skipfailures -gt 1 countyI_edges.shp EDGES
> #etc..... through all 3000+ counties
>
> The two flags 'skipfailures' and 'gt 1' make this possible.
>
> -gt 1 isolates each and every feature in the shapefile into its own
> transition block, (BEGIN  ... INSERT .... COMMIT|ROLLBACK)
>
> thus, the second attempt to INSERT tlid 1234567890 will fail with a
> primary key constraint violation.    But when it fails, it will only
> ROLLBACK this single individual INSERT.  the previous record and the
> following record won't be affected (without gt -1 it would ROLLBACK
> everything)
>
> -skipfailures flag simply allows the bulk load procedure to continue
> when the individual INSERTS fail.
>
>
> ====
>
> That's really what it takes to build primary key integrity into your
> TIGER database from the outset.
>
> -S
>
>
>
>
>
>
>




More information about the postgis-users mailing list