[GRASS-dev] Python loop and SQLite performance issue?

Blumentrath, Stefan Stefan.Blumentrath at nina.no
Tue Mar 3 01:13:48 PST 2015


Hi,

I am struggling with the performance of SQLite (I think), especially when I use it from a Python loop run in parallel processes (using xargs).

I am analyzing characteristics of a relatively large number (270k) of overlapping lake catchments which were generated in GRASS and are now stored in a PostGIS DB. I split the data into 10 chunks and analyse each chunk in its own GRASS 7.0 mapset (with SQLite backend), where I loop over the catchments one by one (in Python).
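To illustrate the structure (this is only a stripped-down sketch, not the actual script; the file name and the per-catchment steps are placeholders), each chunk worker looks roughly like this, and xargs starts one of these per chunk:

# process_chunk.py -- sketch of one chunk worker; xargs starts one of
# these per chunk, each running inside its own GRASS 7.0 mapset
import sys

# one catchment id per line; the chunk file is passed on the command line
with open(sys.argv[1]) as f:
    catchment_ids = [int(line.strip()) for line in f if line.strip()]

for cid in catchment_ids:
    # per-catchment steps described below: rasterize + r.external,
    # v.db.addcolumn, r.univar, write results back to PostGIS
    print("processing catchment %d" % cid)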

At first I tried to import the individual catchments using v.in.ogr, but v.in.ogr was slowing down the process significantly: it took 11 seconds to import a single, not very complex polygon (probably related to https://trac.osgeo.org/grass/ticket/2185 ?; btw. my GRASSDB is not on NFS). So I switched to gdal_rasterize and linked the resulting raster into GRASS with r.external (I plan to work with rasters later anyway). Rasterization and linking together take less than a second. It made no difference to the speed of v.in.ogr whether I imported the attribute table or not. However, converting the raster back to vector in GRASS takes less than a second, so topology creation does not seem to be the issue, and an attribute table is created there as well...
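The rasterize-and-link step is essentially the following (a sketch only; the connection string, resolution, table/column names and output path are placeholders, and cid is the current catchment id from the loop above):

import subprocess
import grass.script as grass

# cid: current catchment id from the loop above
# rasterize one catchment polygon straight from PostGIS
subprocess.check_call([
    "gdal_rasterize", "-burn", "1", "-tr", "10", "10",
    "-sql", "SELECT geom FROM catchments WHERE cat_id = %d" % cid,
    "PG:dbname=mydb host=myhost", "/tmp/catchment_%d.tif" % cid])

# link the GeoTIFF into the current mapset instead of importing it
grass.run_command("r.external",
                  input="/tmp/catchment_%d.tif" % cid,
                  output="catchment_%d" % cid,
                  overwrite=True)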

Then I add a relatively large number of columns (400 or so) using v.db.addcolumn. That again takes 19 seconds in my single test process. If I run all 10 chunks in parallel (I have 24 cores and lots of memory available), adding the columns takes 30 seconds per catchment, almost twice as long. During the loop the time spent on adding the columns keeps increasing, up to almost 30 minutes (at which point I cancelled the process)... There is obviously something not working as it should...
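The column-adding step boils down to something like this (again only a sketch; the column names are placeholders, and I show the single-call form where all definitions go into one comma-separated columns= list):

import grass.script as grass

# cid: current catchment id from the loop above
# placeholder column definitions; the real script builds ~400 of these
column_defs = ["stat_%03d double precision" % i for i in range(400)]

# columns= accepts a comma-separated list, so all columns can be added
# with a single v.db.addcolumn call per catchment map
grass.run_command("v.db.addcolumn",
                  map="catchment_%d" % cid,
                  columns=",".join(column_defs))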

Analysing ca. 40 raster maps (r.univar) takes less than 5 seconds altogether for the smaller catchments.
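The statistics step is basically this (a sketch; the raster names are placeholders, and the region/MASK is assumed to be set to the current catchment beforehand):

import grass.script as grass

stats = {}
# placeholder raster names; the real script loops over ~40 maps
for rast in ["dem", "slope", "landcover_fraction"]:
    # -g gives shell-style key=value output, which parse_command
    # returns as a dict (n, min, max, mean, stddev, ...)
    stats[rast] = grass.parse_command("r.univar", map=rast, flags="g")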

After that I removed all SQLite-related code from my script and wrote the results directly back to PostgreSQL/PostGIS. Now the smaller catchments are done in less than 5 seconds...
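Writing the results back without touching the SQLite attribute table is roughly this (connection details and table/column names are placeholders; stats and cid come from the snippets above):

import psycopg2

# placeholder connection string and table/column names
conn = psycopg2.connect("dbname=mydb host=myhost user=me")
cur = conn.cursor()
cur.execute(
    "UPDATE catchments SET mean_dem = %s, mean_slope = %s WHERE cat_id = %s",
    (stats["dem"]["mean"], stats["slope"]["mean"], cid))
conn.commit()
cur.close()
conn.close()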

Does anyone have an idea what this performance loss is due to? Is it bad practice to call a Python script (v.db.addcolumn) from a Python loop, or is this related to SQLite journal files, or ...?
I am grateful for any hints!

Cheers
Stefan

P.S.: I can share the script if that helps identify the issue...


More information about the grass-dev mailing list