[gdal-dev] Optimizing access to shapefiles

Mon Jul 19 13:54:20 EDT 2010

Ragi Burhum wrote:
> Would it make sense, instead of implementing a SetDesiredFields(..), to
> implement a SetSubFields(string fieldnames), where the function
> takes a comma-delimited list of subfields that the shapefile driver
> then parses to find out which field values to fetch? That way,
> other drivers that have a SQL-based underlying datastore could
> implement that fetching behavior by putting that
> content between the SELECT and the FROM.

Ragi,

I don't get the distinction here.  Why can't the RDBMS-based providers
just construct their SELECT clause based on the names of the fields
selected with SetDesiredFields()?  Are you seeking a chance for the
app to insert arbitrary field operations?  If so, ExecuteSQL() is the
right avenue for that (IMHO).
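
To make the point concrete, here is a minimal sketch of how an
RDBMS-based driver might turn the field names selected through a
SetDesiredFields()-style call into the column list of its SELECT
statement.  BuildSelectStatement() and its parameters are hypothetical
names, not part of OGR, and a real driver would also have to quote
identifiers and add its FID and geometry columns.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not an OGR function: builds the statement a
// SQL-backed driver could issue for the fields the application asked for.
std::string BuildSelectStatement(const std::vector<std::string>& desiredFields,
                                 const std::string& tableName)
{
    assert(!tableName.empty());

    // Assumption: no explicit selection means "fetch every column".
    if (desiredFields.empty())
        return "SELECT * FROM " + tableName;

    std::string sql = "SELECT ";
    for (std::size_t i = 0; i < desiredFields.size(); ++i)
    {
        if (i > 0)
            sql += ", ";
        sql += desiredFields[i];  // plain names only, no expressions
    }
    sql += " FROM " + tableName;
    return sql;
}
```

Because the driver only ever sees plain field names, it stays in control
of the generated SQL; arbitrary expressions would go through
ExecuteSQL() instead.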

Martin Dobias wrote:
 > One note to avoid confusion: the suggestion I've made above relates
 > only to shapefile driver in OGR and doesn't impose any changes to the
 > API. The suggested patch reuses OGRShape instances which are passed
 > between OGR shapefile driver and shapelib. These OGRShape instances
 > never get to the user, so it's just a matter of internal working of
 > the shapefile driver. Please take a look at the patch if still
 > unclear.

I'm not sure what an OGRShape is.  Perhaps you are referring to
OGRFeature?  Or SHPObject?  If the optimization is to reuse
a SHPObject across repeated calls to Shapelib, then this is indeed
something that could be pursued without impact on the broader
OGR API, though I'd be amazed to find it makes a really big
difference.

 > GetFeature() returns a new instance and DestroyFeature() deletes that
 > instance. My idea is that DestroyFeature() call would save the
 > instance in a pool (list) of "returned" feature instances. These
 > returned features could be reused by GetFeature(): it would take
 > one from the list instead of creating a new instance. I think this
 > has no effect on the public OGR API, because the
 > semantics will be the same. Only the OGR internals would be modified so
 > that an OGRFeature instance is not destroyed immediately, on the
 > assumption that more GetFeature() calls will be issued.
 >
 > If the pool were specific to each OGRLayer, many
 > allocations/deallocations of OGRFeature and OGRField instances could
 > be saved: because the features contain the same fields, they would
 > only have to be cleared (the array would stay as-is). A layer
 > usually has the same type of geometry for all features, so even geometries
 > could be kept and only the size of the coordinate array would be
 > altered between the calls.

This seems *possible* but pretty complicated, and if not done very
carefully it could introduce additional problems.  I can't help but
wonder if you aren't just using a poor heap implementation which
is making allocations and deallocations unnecessarily expensive.
Reworking huge amounts of code around the assumption that
new/delete are terribly expensive does not seem entirely prudent.
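
For reference, the per-layer pooling idea quoted above could be
sketched roughly as follows (C++14).  Feature and FeaturePool are
hypothetical stand-ins, not OGR classes: Acquire() plays the role of
GetFeature() and Release() the role of DestroyFeature().  This is only
an illustration of the technique under those assumptions, not a
proposed patch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for OGRFeature: one string value per field.
struct Feature
{
    explicit Feature(std::size_t fieldCount) : fields(fieldCount) {}
    std::vector<std::string> fields;
};

// Hypothetical per-layer pool; nothing like this exists in OGR.
class FeaturePool
{
public:
    explicit FeaturePool(std::size_t fieldCount) : fieldCount_(fieldCount) {}

    // Hands out a previously released instance if one is available,
    // and allocates a new one only when the pool is empty.
    std::unique_ptr<Feature> Acquire()
    {
        if (free_.empty())
        {
            ++allocations_;
            return std::make_unique<Feature>(fieldCount_);
        }
        std::unique_ptr<Feature> feature = std::move(free_.back());
        free_.pop_back();
        return feature;
    }

    // Clears the field values but keeps the field array allocated,
    // then stores the instance for the next Acquire().
    void Release(std::unique_ptr<Feature> feature)
    {
        assert(feature && feature->fields.size() == fieldCount_);
        for (std::string& value : feature->fields)
            value.clear();
        free_.push_back(std::move(feature));
    }

    // Number of Feature objects actually allocated so far.
    std::size_t Allocations() const { return allocations_; }

private:
    std::size_t fieldCount_;
    std::size_t allocations_ = 0;
    std::vector<std::unique_ptr<Feature>> free_;
};
```

Even in this toy form the hazard is visible: any caller that keeps a
raw pointer to a feature after releasing it would silently see the
recycled contents of a later feature rather than a crash.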

Best regards,
-- 
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam at pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush    | Geospatial Programmer for Rent