[gdal-dev] Optimizing access to shapefiles

Martin Dobias wonder.sk at gmail.com
Mon Jul 19 10:34:40 EDT 2010


Hi Frank

On Mon, Jul 19, 2010 at 3:46 PM, Frank Warmerdam <warmerdam at pobox.com> wrote:
>> 1. allow users of the OGR library to set which fields they really need.
>> Most of the time is wasted fetching all the attributes, but typically
>> none or just one attribute is necessary when rendering. For that, I've
>> added the following call:
>> OGRLayer::SetDesiredFields(int numFields, int* fields);
>> The user passes an array of ints, each item tells whether the field
>> should be fetched (1) or not (0). The numFields tells the size of the
>> array. If numFields < 0 then the layer will return all fields (default
>> behavior). The driver implementation then just before fetching a field
>> checks whether to fetch the field or not. This optimization could be
>> easily used in any driver, I've implemented it only for shapefiles.
>> The speedup will vary depending on the size of the attribute table and
>> number of desired fields. On my test shapefile containing 16 fields,
>> the data has been fetched up to 3x faster when no fields were set as
>> desired.
>
> Martin,
>
> Would GetFeature() still return a feature with a full vector of
> fields, but those not desired just being left in the null state?

Yes, that's what the patch does: the feature keeps its full set of
fields, and only fetching the values of the undesired ones is skipped.

> If so, I think such an approach would be reasonable.  However, it will
> require an RFC process to update the core OGR API.  Are you willing
> to prepare such an RFC?

Will do.
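
To illustrate the field mask from point 1, here is a minimal,
self-contained sketch. It is not the actual patch: Layer, Feature and
Field are hypothetical stand-ins for the OGR classes, and the handling
of a mask shorter than the field count (extra fields are skipped) is
an assumption. Only the SetDesiredFields(int, int*) shape and the
"numFields < 0 means all fields" rule come from the proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins, not the real OGRField / OGRFeature.
struct Field {
    bool isSet = false;   // false = left in the null/unset state
    std::string value;
};

struct Feature {
    std::vector<Field> fields;
};

// Hypothetical layer over in-memory rows of attribute strings.
class Layer {
public:
    Layer(std::size_t fieldCount, std::vector<std::vector<std::string>> rows)
        : fieldCount_(fieldCount), rows_(std::move(rows)) {}

    // Each mask item says whether a field is fetched (1) or not (0).
    // numFields < 0 restores the default: fetch every field.
    void SetDesiredFields(int numFields, const int* fields) {
        if (numFields < 0) {
            hasMask_ = false;
            mask_.clear();
            return;
        }
        hasMask_ = true;
        mask_.assign(fields, fields + numFields);
    }

    // Always returns a feature with the full vector of fields;
    // fields that are not desired are simply left unset.
    Feature GetFeature(std::size_t row) const {
        assert(row < rows_.size());
        assert(rows_[row].size() == fieldCount_);
        Feature feature;
        feature.fields.resize(fieldCount_);
        for (std::size_t i = 0; i < fieldCount_; ++i) {
            if (!IsDesired(i)) continue;  // skip the costly fetch
            feature.fields[i].isSet = true;
            feature.fields[i].value = rows_[row][i];
        }
        return feature;
    }

private:
    bool IsDesired(std::size_t i) const {
        if (!hasMask_) return true;
        // Assumption: fields beyond the end of the mask are skipped.
        return i < mask_.size() && mask_[i] != 0;
    }

    std::size_t fieldCount_;
    std::vector<std::vector<std::string>> rows_;
    bool hasMask_ = false;
    std::vector<int> mask_;
};
```

A renderer that labels by one attribute would pass a mask with a
single 1 and still index the returned fields by their usual positions.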


>> 2. reuse allocated memory. When a new shape is going to be read within
>> shapelib, new OGRShape object and its coordinate arrays are allocated.
>> By reusing one such temporary OGRShape object within a layer together
>> with the coordinate arrays (only allowing them to grow - to
>> accommodate larger shapes), I have obtained further speedup of about
>> 30%.
>
> As GetFeature() returns a feature instance that becomes owned by the
> caller I do not see how this could be made to function without a
> fundamental change in the OGR API.  Perhaps you can explain?

One note to avoid confusion: the suggestion above relates only to the
shapefile driver in OGR and doesn't impose any changes on the API. The
suggested patch reuses the OGRShape instances that are passed between
the OGR shapefile driver and shapelib. These OGRShape instances never
reach the user, so it's purely a matter of the shapefile driver's
internals. Please take a look at the patch if this is still unclear.
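
As a rough illustration of the grow-only reuse (again not the patch
itself), the sketch below keeps one scratch shape per reader and only
ever enlarges its coordinate arrays. ScratchShape and ShapeReader are
hypothetical names, and the input vectors stand in for a record read
from disk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical temporary shape that never leaves the driver.
struct ScratchShape {
    std::vector<double> x;
    std::vector<double> y;
    std::size_t vertexCount = 0;  // valid entries; arrays may be larger

    // Grow-only: the arrays are enlarged for bigger shapes, never shrunk.
    void Prepare(std::size_t n) {
        if (x.size() < n) {
            x.resize(n);
            y.resize(n);
        }
        vertexCount = n;
    }
};

// Hypothetical per-layer reader owning a single scratch shape.
class ShapeReader {
public:
    // Copies one record into the reused scratch shape. The returned
    // reference is only valid until the next call to Read().
    const ScratchShape& Read(const std::vector<double>& xs,
                             const std::vector<double>& ys) {
        assert(xs.size() == ys.size());
        scratch_.Prepare(xs.size());
        for (std::size_t i = 0; i < xs.size(); ++i) {
            scratch_.x[i] = xs[i];
            scratch_.y[i] = ys[i];
        }
        return scratch_;
    }

private:
    ScratchShape scratch_;
};
```

After the largest shape in a layer has been seen, further reads in
this sketch allocate nothing for coordinates.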

Below I explain a further idea, which I haven't implemented yet. It
should save allocations/deallocations of OGRFeature instances and
could speed up data retrieval from any OGR driver:

GetFeature() returns a new instance and DestroyFeature() deletes that
instance. My idea is that the DestroyFeature() call would instead save
the instance in a pool (list) of "returned" feature instances.
GetFeature() could then reuse these returned features: it would take
one from the list instead of creating a new instance. I think this
doesn't affect the public OGR API, because the semantics stay the
same. Only the OGR internals would change, so that an OGRFeature
instance is not destroyed immediately, on the assumption that more
GetFeature() calls will follow.

If the pool were specific to each OGRLayer, many
allocations/deallocations of OGRFeature and OGRField instances could
be saved: the features contain the same fields, so they would only
have to be cleared (the array itself would stay as-is). A layer
usually has the same geometry type for all features, so even the
geometries could be kept, and only the size of the coordinate array
would change between calls.
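
The per-layer pool could look roughly like the sketch below. Since
this idea is not implemented, everything here is hypothetical:
FeaturePool, Acquire() and Release() are illustrative names for the
GetFeature() and DestroyFeature() sides, and Feature/Field stand in
for OGRFeature/OGRField.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins, not the real OGRField / OGRFeature.
struct Field {
    bool isSet = false;
    std::string value;
};

struct Feature {
    std::vector<Field> fields;
};

// Hypothetical per-layer pool of "returned" features.
class FeaturePool {
public:
    explicit FeaturePool(std::size_t fieldCount) : fieldCount_(fieldCount) {}

    // GetFeature() side: reuse a returned feature if one is available,
    // otherwise allocate a new one with the layer's field count.
    std::unique_ptr<Feature> Acquire() {
        if (free_.empty()) {
            std::unique_ptr<Feature> feature(new Feature);
            feature->fields.resize(fieldCount_);
            ++allocations_;
            return feature;
        }
        std::unique_ptr<Feature> feature = std::move(free_.back());
        free_.pop_back();
        return feature;
    }

    // DestroyFeature() side: clear the values but keep the field array,
    // then store the instance for the next Acquire().
    void Release(std::unique_ptr<Feature> feature) {
        assert(feature && feature->fields.size() == fieldCount_);
        for (Field& field : feature->fields) {
            field.isSet = false;
            field.value.clear();
        }
        free_.push_back(std::move(feature));
    }

    std::size_t allocations() const { return allocations_; }

private:
    std::size_t fieldCount_;
    std::size_t allocations_ = 0;
    std::vector<std::unique_ptr<Feature>> free_;
};
```

In this sketch the saving only materializes when callers hand
features back through Release(); a feature that is deleted directly
never returns to the pool.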

Hopefully now it makes more sense.

Regards
Martin

