[postgis-users] DB design considerations : geoAttributes vs geometry

Bill Binko bill at binko.net
Tue Apr 12 10:55:21 PDT 2005


On Tue, 12 Apr 2005, Moisan Yves wrote:

> Hi All,
> 
> Where 
> 
> - geoInfoColumns = string {country; province; adminRegion; city ...} 
> 
> - dataColumns = the real data
> 
> The_geom and dataColumns are hard data.  The question is about optional
> geoInfoColumns that I put in square brackets.  When I upload a new
> station location, is it worth adding geoInfoColumns *to use in filters*
> or can I rely on the_geom only?


Personally, I would have them send the data for several reasons.  

First, you don't HAVE to use everything they send you, but it will be
easier for them to set up acquisition and reporting procedures and
training once, rather than having you add two attributes each month as
you add filters.

Second, the extra data lets you use two big features of PostgreSQL that 
can significantly speed things up: expression indexes and partial indexes.  
These have made HUGE differences in non-GIS databases I've managed, and I 
wouldn't want to NOT have the data I need to use them.
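
For instance (rough sketches only; the "stations" table and its columns
here are assumptions, not necessarily your schema):

-- expression index: lets a filter like lower(city) = 'montreal' use an index
create index stations_city_lower_idx on stations (lower(city));

-- partial index: only indexes the rows matching the predicate
create index stations_qc_geom_idx on stations using GIST (the_geom)
    where province = 'Quebec';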

From what I've seen, most GIS apps are read-heavy, with nightly (or
rarer) updates to the data.  Given that, I wouldn't worry about the
extra update time the indexes add.  If you're worried about adding
width to the rows, I'd just keep your extra information in a separate
table (or more) and join to it when you need to.
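
Something along these lines would do it (the table and column names are
just for illustration):

-- hypothetical side table holding the optional geo attributes
create table station_regions (
    station_id integer primary key references stations (station_id),
    country    varchar,
    province   varchar,
    city       varchar
);

-- join it back in only when a filter actually needs it
select s.*
  from stations s
  join station_regions r on r.station_id = s.station_id
 where r.city = 'Montreal';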

> 
> The point is that I can make an arbitrary number of such optional
> sub-regions available that I could index on, e.g. "select * from table
> where sub-sub-sub-region = 'aRegion'", even though those optional
> columns really represent an over-determination in the sense that they
> are not needed to locate the station : they would be used to speed up
> access.
> 
> I am not asking to what level the over-determination can go.  That I can
> play with.  I am just asking if making the DB much bigger by adding
> redundant information in the form of geoInfoColumns attributes is worth
> it in terms of access time.  I would rather write utilities to translate
> queries filtered by sub-region so that they map on the_geom (e.g. have
> some table or function or stored proc that returns a polygon when I feed
> it the region name) rather than add an arbitrary number of attributes
> (plus I would avoid being tied to geographical names that are very
> likely to change with time, e.g. a political riding or some other
> administrative region!).
> 

You will usually be using a bounding box for determining what area to look 
at (at least if you're doing online mapping).  Therefore, you'll be using 
GIST indexes and boxes from your mapping software or map generator.  
However, those indexes get far more useful when combined with other 
attributes.  

Take for example a set of parcels that have a shape, along with other 
attributes, including the most recent sale date.  If I want to show a map 
of recent sales, I can create a partial index like this:

create index parcels_2005_gist_idx on parcels using GIST (parcel_shape)
    where sale_year = '2005';

If I then include that on a mapserver map like this (adjust your SRID as 
needed):

DATA "parcels_shape from (select oid, parcel_shape from parcels where
sale_year = '2005') as foo using unique oid using SRID=26759"

It will use the relatively small partial index, rather than the big index 
of all the parcels.
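
You can double-check which index the planner picks by running the same
subquery through EXPLAIN, with the bounding-box filter MapServer would
add (the box coordinates below are placeholders; adjust the box and SRID
to your data):

explain
select oid, parcel_shape
  from parcels
 where sale_year = '2005'
   and parcel_shape && setsrid('BOX3D(500000 150000, 510000 160000)'::box3d, 26759);

You should see a scan of the small partial index rather than a
sequential scan of the whole table or a scan of the full GIST index.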

Of course, in this case sale_year isn't redundant data; it's needed.
However, the concept is still valid: data that is geographically
redundant can sometimes help you tune a database significantly.
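
As for translating a sub-region name into a polygon on the fly: that
works too.  A minimal sketch, assuming a hypothetical regions lookup
table (names are made up, and within() is ST_Within in newer PostGIS
versions):

-- lookup table of region polygons
create table regions (name varchar primary key);
select AddGeometryColumn('regions', 'the_geom', 26759, 'MULTIPOLYGON', 2);  -- match your stations' SRID

-- stations that fall inside a named region
select s.*
  from stations s, regions r
 where r.name = 'aRegion'
   and s.the_geom && r.the_geom          -- bbox filter, can use the GIST index
   and within(s.the_geom, r.the_geom);   -- exact containment test

The tradeoff is the extra join and geometry test at query time, which is
exactly what the redundant attribute columns would let you skip.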

Hope this helps.
Bill


