[postgis-users] TIP: Loading TIGER data with Primary Keys
Steve Walker
walker at mfgis.com
Wed Dec 21 08:31:31 PST 2011
Steve,
I think, actually, perhaps, no.
The EDGES feature and its attributes should be the same in both county
shapefiles.
The ADDR record is where the feature's address attributes reside,
including the 'side' attribute.
tlid | numeric(10,0) |
fromhn | character(12) |
tohn | character(12) |
side | character(1) |
zip | character(5) |
plus4 | character(4) |
fromtyp | character(1) |
totyp | character(1) |
arid | character(22) | not null
mtfcc | character(5) |
A single line segment, with a unique TLID, may be represented in
multiple linear features, represented by LINEARID, which may then be
linked to mutiple address records, via the ARID primary key of the ADDR
table.
So, CountyA and CountyB will share the same EDGE, but their ADDR
records will differ as you noted. But since ADDR has its own primary
key in ARID, the failure of the line feature in the CountyB_edges
shapefile will not affect loading of the ADDR record, since it will have
a differing ARID primary key value than the first.
-
On Wed, 2011-12-21 at 00:36 -0500, Stephen Woodbridge wrote:
> Hi Steve,
>
> This is a nice tip but it does have a problem. In the tiger data "County
> Line Rd" will only be populated on the right OR left side that is in the
> county dataset it belongs to. So if you load CountyA first then you will
> get one side of the data right or left. And when you get the primary key
> failure on the next county that has that TLID, it will get skipped and
> you will loss the other side attributes. :(
>
> You might be able to write a plpgsql trigger that can catch the primary
> key violation and update the existing record, but I haven't tried to do
> that.
>
> -Steve W
>
> On 12/20/2011 10:33 PM, Steve Walker wrote:
> > Hi,
> >
> > I want to offer the following tip on how to properly build and load a
> > nation-wide (or subset thereof) US Census Bureau TIGER database into
> > PostGIS using ogr2ogr, in such a manner as to maintain the primary key
> > integrity inherent in the native TIGER database.
> >
> > For the purpose of this example, I'll identify two adjacent counties
> > within the same state: CountyA and CountyB, and we will work only with
> > the edges theme.
> >
> > We can imagine that a road named "County Line Rd" forms part of the
> > boundary between these two counties.
> >
> > In the TIGER database, the road will represented by one or more EDGEs.
> > We'll simplify this to assume that there is exactly one edge feature
> > representing "County Line Rd" and that the edge feature has a unique
> > Tiger Line Identifier (TLID) of 12345687890 which is unique throughout
> > the entire TIGER database, nationwide. (We'll also simplify this to
> > ignore N-to-N relations between edges, roads, and names etc. They're
> > important, but irrelevant to this discussion.)
> >
> > So far so good. Unfortunately however, the countyA_edges.shp and
> > countyB_edges.shp each maintain a representation of this feature. TLID
> > 1234567890, although a unique primary key, is duplicated between two
> > shapefiles.
> >
> > Normally, ogr2ogr would neither notice nor would it care. This is
> > because ogr2ogr will by default add and populate an arbitrary primary
> > key attribute 'ogc_fid' when loading data.
> >
> > Thus, the plain vanilla ogr2ogr load of TIGER will not fail due to any
> > duplicate primary key violations, but at the expense of losing the
> > normalized integrity of the database. We will end up with County Line
> > Rd and TLID 1234567890 being duplicated in the database, and we won't be
> > able to leverage the native TLID primary key in all our subsequent
> > operations.
> >
> > Here is how to fix that:
> >
> >
> > So the first thing we would reasonably do would be something like this:
> >
> > ALTER TABLE edges DROP COLUMN ogc_fid;
> >
> > This would dump the arbitrary primary key set up by ogr2ogr
> >
> > And then we would attempt:
> >
> > ALTER TABLE edges ADD PRIMARY KEY (tlid);
> >
> > Which would fail once it hit the second record with a TLID of 1234567890
> > (ie because it duplicates the primary key constraint.)
> >
> > So, this cannot be easily done after the fact: IE when we have a whole
> > state or the whole nation already loaded into the database.
> > -------
> >
> >
> > OK, so here is the TIP (pseudo-code):
> >
> > Starting with an totally empty database:
> >
> >
> > shell:
> > #ogr2ogr countyA_edges.shp EDGES #### load the first shapefile.
> >
> > psql:
> > psql: ALTER TABLE edges DROP column ogc_fid ### drop the arbitrary
> > primary key
> >
> > psql: ALTER TABLE edges ADD PRIMARY KEY (tlid); ### add the real
> > primary key.
> >
> > (back to the shell:)
> >
> > #ogr2ogr -skipfailures -gt 1 countyB_edges.shp EDGES
> > #ogr2ogr -skipfailures -gt 1 countyC_edges.shp EDGES
> > #ogr2ogr -skipfailures -gt 1 countyD_edges.shp EDGES
> > #ogr2ogr -skipfailures -gt 1 countyE_edges.shp EDGES
> > #ogr2ogr -skipfailures -gt 1 countyF_edges.shp EDGES
> > #ogr2ogr -skipfailures -gt 1 countyG_edges.shp EDGES
> > #ogr2ogr -skipfailures -gt 1 countyH_edges.shp EDGES
> > #ogr2ogr -skipfailures -gt 1 countyI_edges.shp EDGES
> > #etc..... through all 3000+ counties
> >
> > The two flags 'skipfailures' and 'gt 1' make this possible.
> >
> > -gt 1 isolates each and every feature in the shapefile into its own
> > transition block, (BEGIN ... INSERT .... COMMIT|ROLLBACK)
> >
> > thus, the second attempt to INSERT tlid 1234567890 will fail with a
> > primary key constraint violation. But when it fails, it will only
> > ROLLBACK this single individual INSERT. the previous record and the
> > following record won't be affected (without gt -1 it would ROLLBACK
> > everything)
> >
> > -skipfailures flag simply allows the bulk load procedure to continue
> > when the individual INSERTS fail.
> >
> >
> > ====
> >
> > That's really what it takes to build primary key integrity into your
> > TIGER database from the outset.
> >
> > -S
> >
> >
> >
> >
> >
> >
> >
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
--
Steve Walker
Middle Fork Geographic Information Services
(360)671-2505
More information about the postgis-users
mailing list