[gdal-dev] Optimizing access to shapefiles

Mon Jul 19 12:50:20 EDT 2010

>
> Date: Mon, 19 Jul 2010 16:34:40 +0200
> From: Martin Dobias <wonder.sk at gmail.com>
> Subject: Re: [gdal-dev] Optimizing access to shapefiles
> To: Frank Warmerdam <warmerdam at pobox.com>
> Cc: gdal-dev at lists.osgeo.org
> Message-ID:
>        <AANLkTilLtLqDCDhx06Smxyhtxp7WSCTAhPGzSy3w0vP7 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi Frank
>
> On Mon, Jul 19, 2010 at 3:46 PM, Frank Warmerdam <warmerdam at pobox.com>
> wrote:
> >> 1. allow users of OGR library set which fields they really need. Most
> >> of time is wasted by fetching all the attributes, but typically none
> >> or just one attribute is necessary when rendering. For that, I've
> >> added the following call:
> >> OGRLayer::SetDesiredFields(int numFields, int* fields);
> >> The user passes an array of ints, each item tells whether the field
> >> should be fetched (1) or not (0). The numFields tells the size of the
> >> array. If numFields < 0 then the layer will return all fields (default
> >> behavior). The driver implementation then just before fetching a field
> >> checks whether to fetch the field or not. This optimization could be
> >> easily used in any driver, I've implemented it only for shapefiles.
> >> The speedup will vary depending on the size of the attribute table and
> >> number of desired fields. On my test shapefile containing 16 fields,
> >> the data has been fetched up to 3x faster when no fields were set as
> >> desired.
>

Instead of SetDesiredFields(..), would it make sense to implement
SetSubFields(string fieldnames), taking a comma-delimited list of field
names? The shapefile driver would parse the list to decide which field
values to fetch, and drivers backed by a SQL datastore could implement the
same behavior by placing the list between the SELECT and the FROM of the
query they issue.
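
To make the comma-delimited idea concrete, here is a rough sketch and not a
proposed patch. SubFieldsToFlags and SubFieldsToSelect are hypothetical
helpers that do not exist in OGR. The first turns the list into the same
1/0 flag array that SetDesiredFields() would take, and the second shows where
the list would land in a SQL statement. It assumes exact, case-sensitive name
matching and silently ignores unknown names; a real driver would probably
want neither.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of OGR: convert "NAME, POP" into one flag
// per layer field (1 = fetch, 0 = skip).
std::vector<int> SubFieldsToFlags(const std::string& subFields,
                                  const std::vector<std::string>& layerFields)
{
    std::vector<int> flags(layerFields.size(), 0);
    std::istringstream in(subFields);
    std::string name;
    while (std::getline(in, name, ',')) {
        // Trim surrounding spaces so "NAME, POP" also works.
        const std::string::size_type first = name.find_first_not_of(' ');
        if (first == std::string::npos) continue;  // empty token
        const std::string::size_type last = name.find_last_not_of(' ');
        name = name.substr(first, last - first + 1);
        // Assumption: exact, case-sensitive match; unknown names are ignored.
        for (std::size_t i = 0; i < layerFields.size(); ++i) {
            if (layerFields[i] == name) flags[i] = 1;
        }
    }
    return flags;
}

// Hypothetical helper for a SQL-backed driver: the list goes between SELECT
// and FROM. No quoting or validation is done here.
std::string SubFieldsToSelect(const std::string& subFields,
                              const std::string& table)
{
    return "SELECT " + subFields + " FROM " + table;
}
```

With layer fields ID, NAME, POP and the list "NAME, POP", only the last two
flags are set, so the shapefile driver would skip ID when reading the DBF
record.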

> >
> > Martin,
> >
> > Would GetFeature() still return a feature with a full vector of
> > fields, but those not desired just being left in the null state?
>
> Yes, that's what the patch does - it only omits fetching the value of
> some fields.
>

Of course, if returning the full vector of fields is a requirement, the
approach I describe above would need some extra work to satisfy it.

> > If so, I think such an approach would be reasonable.  However, it will
> > require an RFC process to update the core OGR API.  Are you willing
> > to prepare such an RFC?
>
> Will do.
>
>
> >> 2. reuse allocated memory. When a new shape is going to be read within
> >> shapelib, new OGRShape object and its coordinate arrays are allocated.
> >> By reusing one such temporary OGRShape object within a layer together
> >> with the coordinate arrays (only allowing them to grow - to
> >> accommodate larger shapes), I have obtained further speedup of about
> >> 30%.
> >
> > As GetFeature() returns a feature instance that becomes owned by the
> > caller I do not see how this could be made to function without a
> > fundamental change in the OGR API.  Perhaps you can explain?
>
> One note to avoid confusion: the suggestion I've made above relates
> only to shapefile driver in OGR and doesn't impose any changes to the
> API. The suggested patch reuses OGRShape instances which are passed
> between OGR shapefile driver and shapelib. These OGRShape instances
> never get to the user, so it's just a matter of internal working of
> the shapefile driver. Please take a look at the patch if still
> unclear.
>
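
For readers who have not looked at the patch, the grow-only reuse described
in point 2 can be pictured roughly as below. This is only an illustration
under my own assumptions. CoordScratch is a made-up class, not the code in
Martin's patch or anything in shapelib, and it assumes plain x/y pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scratch buffer that a layer would own and reuse for every
// shape it reads, instead of allocating coordinate arrays per shape.
class CoordScratch {
public:
    // Return storage for nPoints x/y pairs. The buffer only ever grows, so
    // reading a smaller shape after a larger one causes no allocation.
    // The returned pointer is invalidated by a later call that grows it.
    double* Acquire(std::size_t nPoints) {
        const std::size_t needed = nPoints * 2;
        if (needed > buf_.size()) {
            buf_.resize(needed);
            ++growCount_;
        }
        return buf_.data();
    }

    std::size_t Capacity() const { return buf_.size(); }   // in doubles
    std::size_t GrowCount() const { return growCount_; }   // reallocations

private:
    std::vector<double> buf_;
    std::size_t growCount_ = 0;
};
```

Because the buffer never leaves the driver, the caller still receives its own
geometry copy, which is why the public API is unaffected.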

IMHO having a way to avoid fetching data would benefit all drivers.

>
> Below I explain the further idea which I haven't implemented yet,
> which should save allocations/deallocations of OGRFeature instances
> and which could boost the speed of retrieval of data from any OGR
> driver:
>
> GetFeature() returns a new instance and DestroyFeature() deletes that
> instance. My idea is that DestroyFeature() call would save the
> instance in a pool (list) of "returned" feature instances. These
> returned features could be reused by the GetFeature() - it will take
> one from the list instead of creating a new instance. I think this
> doesn't make any influence on the public OGR API, because the
> semantics will be the same. Only the OGR internals will be modified so
> that it will not destroy OGRFeature instance immediately, because it
> will assume that more GetFeature() calls will be issued.
>
> If the pool would be specific for each OGRLayer, many
> allocations/deallocations of OGRFeature and OGRField instances could
> be saved, because the features contain the same fields, they would
> only have to be cleaned (but the array would stay as-is). A layer has
> usually the same type of geometry for all features, so even geometries
> could be kept and only the size of the coordinate array would be
> altered between the calls.
>

This is effectively what ArcObjects cursors do (recycling vs. non-recycling
behavior). All drawing in ArcMap (except when in an edit session) uses
recycling cursors combined with a subfields clause, since that makes drawing
*much* faster.
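
A per-layer pool along the lines Martin describes might look something like
the sketch below. Everything in it is hypothetical. Feature is a stand-in
struct rather than OGRFeature, and FeaturePool, Get and Release are names I
made up to show the recycle-on-destroy idea; none of this is existing OGR
code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Stand-in for OGRFeature: a fixed set of fields plus a coordinate array.
struct Feature {
    std::vector<std::string> fields;
    std::vector<double> coords;

    // Blank the values but keep the field array and the coordinate capacity.
    void Clear() {
        for (std::size_t i = 0; i < fields.size(); ++i) fields[i].clear();
        coords.clear();
    }
};

// Hypothetical per-layer pool: Release() plays the role of DestroyFeature()
// and Get() the role of GetFeature().
class FeaturePool {
public:
    explicit FeaturePool(std::size_t nFields) : nFields_(nFields) {}

    std::unique_ptr<Feature> Get() {
        if (free_.empty()) {
            // Nothing to recycle yet: allocate a fresh instance.
            ++allocations_;
            std::unique_ptr<Feature> f(new Feature());
            f->fields.resize(nFields_);
            return f;
        }
        std::unique_ptr<Feature> f = std::move(free_.back());
        free_.pop_back();
        return f;
    }

    void Release(std::unique_ptr<Feature> f) {
        if (!f) return;
        f->Clear();                      // cleaned, but not deallocated
        free_.push_back(std::move(f));
    }

    std::size_t Allocations() const { return allocations_; }

private:
    std::size_t nFields_;
    std::size_t allocations_ = 0;
    std::vector<std::unique_ptr<Feature>> free_;
};
```

In a typical render loop (get, draw, destroy, repeat) this would allocate one
feature for the whole layer. A caller that holds on to many features at once
gets no benefit, which is roughly the recycling vs. non-recycling distinction
in ArcObjects.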

My two cents,

- Ragi