[Gdal-dev] new RFC 13: Improved Feature Insertion/Update/Delete Performance in Batch Mode

Fri May 18 07:22:12 EDT 2007

> -----Original Message-----
> From: Frank Warmerdam [mailto:warmerdam at pobox.com]
> Sent: Wednesday, May 16, 2007 4:06 PM
> To: Baumann, Konstantin
> Cc: gdal-dev at lists.maptools.org
> Subject: Re: [Gdal-dev] new RFC 13: Improved Feature
> Insertion/Update/Delete Performance in Batch Mode
> 
> Baumann, Konstantin wrote:
> > Hi!
> >
> > I would like to announce the new RFC 13:
> > <http://trac.osgeo.org/gdal/wiki/rfc13_createfeatures>
> >
> > I would like to add some new methods to OGRLayer for "batched"
> > inserting/updating/deleting features. These virtual methods allow
OGR
> > drivers to implement optimized handling of these batched operations
> > (e.g. resulting in a performance boost).
> >
> > For example, I was able to improve the "batched insertion" scenario
for
> > the MySQL driver from ~40 features per second to something about
> > 800-1.500 features per second; for layers with >=50.000 features
(with
> > and wihtout geometry)...
> >
> > Any further comments and suggestions are wellcome.
> 
> Kosta,
> 
> I presume the optimization in the mysql driver was accomplished by
doing
> all the features in a single CreateFeatures() call as a single
> transaction.
> Is that right?

Yes.

> There are other mechanisms to accomplish this - based on the OGR
> transaction
> semantics.  For instance, I think the postgres and oracle drivers
> implement
> ogr transaction semantics and can do a bunch of CreateFeature() calls
in
> a single transaction.  The ogr2ogr program already defaults to doing
> feature
> creation in groups of 200 by default, with transaction semantics
around
> this.

The startTransaction()/commitTransaction() methods are only implemented
by the SQLite and PostGis drivers...

> On the other hand, the transaction semantics are quite complicated to
> implement properly in drivers, which is why I suspect they haven't
been
> done in the MySQL driver.  So if we were to proceed with this RFC I
think
> it would be because it provides a simplier approach to grouping create
> features than the transaction semantics.

Yes, for example: if you do an OGRLayer::CreateFeature() call inside of
an active transaction, and you rollback that transaction, the FID has
still be changed for the given feature given. This may or may not be a
problem (due to the also vague FID semantics in OGR)...

And the implementation complexity is not to neglect, too.

> But because this introduces an extra interface and the related
complexity
> I'm hesitant to support it when there is another way to accomplish the
> same thing.

I've done some performance testings, I would like to share. The used
dataset was a 25.000 element polygon shapefile, which should be imported
into a MySQL database.

1. Test:
current OGR-MySQL-Impl: the generated SQL-commands are like this:

INSERT INTO table (<field_1>, <field_2>, ...) VALUES (<value_1_1>,
<value_1_2>, ...);
INSERT INTO table (<field_1>, <field_2>, ...) VALUES (<value_2_1>,
<value_2_2>, ...);
INSERT INTO table (<field_1>, <field_2>, ...) VALUES (<value_3_1>,
<value_3_2>, ...);
...
=> ~40 features per second

2. Test:
Wrapped current OGR-MySQL-Impl with "START TRANSACTION" and "COMMIT":

START TRANSACTION;
INSERT INTO table (<field_1>, <field_2>, ...) VALUES (<value_1_1>,
<value_1_2>, ...);
INSERT INTO table (<field_1>, <field_2>, ...) VALUES (<value_2_1>,
<value_2_2>, ...);
INSERT INTO table (<field_1>, <field_2>, ...) VALUES (<value_3_1>,
<value_3_2>, ...);
...
COMMIT;
=> ~750 features per second

3. Test:
Optimized CreateFeatures() impl., by using multiple (250) values
definitions inside a single INSERT statement:

INSERT INTO table (<field_1>, <field_2>, ...) VALUES (<value_1_1>,
<value_1_2>, ...), (<value_2_1>, <value_2_2>, ...), (...);
INSERT INTO table (<field_1>, <field_2>, ...) VALUES (<value_251_1>,
<value_251_2>, ...), (<value_252_1>, <value_252_2>, ...), (...);
...
=> ~850 features per second

4. Test:
Optimized CreateFeatures() impl. (like test #3) with transaction
wrapper:

START TRANSACTION;
INSERT INTO table (<field_1>, <field_2>, ...) VALUES (<value_1_1>,
<value_1_2>, ...), (<value_2_1>, <value_2_2>, ...), (...);
INSERT INTO table (<field_1>, <field_2>, ...) VALUES (<value_251_1>,
<value_251_2>, ...), (<value_252_1>, <value_252_2>, ...), (...);
...
COMMIT;
=> ~850 features per second (more or less as test #3)

BTW, the timings are more or less indepenent of runing the MySQL server
on localhost or on a remote machine connected via 1 GBit LAN...

===> Draw your own conclusions, here... :-) <===

> One other small concern.  I think it can be inconvenient to implement
> swig wrappers for interfaces taking arrays of objects.  Can any of the
> swig folks comment on this?

I am not familiar with SWIG, but could the following approach by
providing an "extra array object" easier to implement:

struct OGRFeatureArray {
    OGRFeature** papoFeatures;
    int iFeatureCount;
};

OGRErr OGR_L_CreateFeatures( OGRFeatureArray featureArray );

Kind regards,
    Kosta