[gdal-dev] gdal_polygonize.py TIF to JSON performance

Even Rouault even.rouault at spatialys.com
Mon Jan 12 03:09:59 PST 2015


Chris,

As underlined by David, the time spent in raster I/O is presumably negligible
and not the issue here.
How many polygons were generated in this execution?
A good way of identifying a bottleneck is to run the process under gdb,
regularly interrupt it with Ctrl+C to display the backtrace, and repeat that a
few times.
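
For instance (the path to gdal_polygonize.py depends on your installation):

    $ gdb --args python /usr/bin/gdal_polygonize.py big.tif -f GeoJSON out.geojson
    (gdb) run
    ... let it run for a while, then press Ctrl+C ...
    (gdb) bt
    (gdb) continue
    ... and repeat the Ctrl+C / bt cycle a few times ...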

Threading the algorithm is indeed a potential way of speeding things up, but
reconciling the various outputs isn't necessarily trivial. And currently the
algorithm works in a streaming mode regarding the output, which makes it
possible to write directly to GeoJSON, a format that only supports streamed
writes. A multithreaded version would presumably need a temporary file to
handle intermediate results.
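
For reference, gdal_polygonize.py essentially boils down to a single
Polygonize() call streaming into the output layer. A minimal sketch with the
Python bindings (single-band input, no mask, and the script's 'DN' field
convention assumed):

    from osgeo import gdal, ogr, osr

    src_ds = gdal.Open('input.tif')
    src_band = src_ds.GetRasterBand(1)

    drv = ogr.GetDriverByName('GeoJSON')
    dst_ds = drv.CreateDataSource('out.geojson')
    srs = osr.SpatialReference(wkt=src_ds.GetProjection())
    layer = dst_ds.CreateLayer('out', srs=srs, geom_type=ogr.wkbPolygon)
    layer.CreateField(ogr.FieldDefn('DN', ogr.OFTInteger))

    # Features are written to the layer as each polygon is completed,
    # which is why the streaming-only GeoJSON writer works here.
    gdal.Polygonize(src_band, None, layer, 0, [],
                    callback=gdal.TermProgress_nocb)
    dst_ds = None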

Even

> Hi David,
>
> Thanks for your response.  I have a little more information since
> feeding your response to the project team:
>
> "The tif file is around 1.4GB as you noted and the data is similar to
> that of the result of an image classification where each pixel value
> is in a range between (say) 1-5. After a classification this image is
> usually exported as a vector file (EVF or Shapefile) but in this case
> we want to use geojson. This has taken both Mark and myself weeks to
> complete with gdal_polygonize as you noted.
>
> I think an obvious way to speed this up would be threading: breaking
> the TIFF file into tiles (say 1024x1024) and spreading these over the
> available cores, then there would need to be a way to dissolve the
> tile boundaries to complete the polygons as we would not want obvious
> tile lines."
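
For what it's worth, the windowing half of that idea is straightforward with
the Python bindings; the hard part is dissolving polygons across tile edges.
A rough, untested sketch of the tiling side only (tile size arbitrary,
per-tile georeferencing and the dissolve step omitted):

    from osgeo import gdal

    src = gdal.Open('input.tif')
    band = src.GetRasterBand(1)
    xsize, ysize, tile = src.RasterXSize, src.RasterYSize, 1024
    mem_drv = gdal.GetDriverByName('MEM')

    for yoff in range(0, ysize, tile):
        for xoff in range(0, xsize, tile):
            w = min(tile, xsize - xoff)
            h = min(tile, ysize - yoff)
            tile_ds = mem_drv.Create('', w, h, 1, band.DataType)
            tile_ds.GetRasterBand(1).WriteArray(
                band.ReadAsArray(xoff, yoff, w, h))
            # ... polygonize tile_ds into a per-tile scratch layer here,
            # then merge features sharing tile edges afterwards ...
            tile_ds = None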
>
> Does this help?
>
> Many thanks,
>
> Chris
>
> On 11 January 2015 at 18:31, David Strip <gdal at stripfamily.net> wrote:
> > I'm surprised at your colleague's experience. We've run some polygonize on
> > large images and have never had this problem. The g2.2xlarge instance is
> > overkill in the sense that the code is not multi-threaded, so the extra
> > CPUs don't help. Also, as you have already determined, the image is read
> > in small chunks, so you don't need large buffers for the image. But two
> > weeks makes no sense. In fact, your run shows that the job reaches 5%
> > completion in a couple of hours.
> >
> > The reason for so many reads (though 2.3 seconds out of "a few hours" is
> > negligible overhead) is that the algorithm operates on a pair of adjacent
> > raster lines at a time. This allows processing of extremely large images
> > with very modest memory requirements. It's been a while since I've looked
> > at the code, but from my recollection, the algorithm should scale
> > approximately linearly in the number of pixels and polygons in the image.
> > Far more important to the run-time is the nature of the image itself. If
> > the input is something like a satellite photo, your output can be orders
> > of magnitude larger than the input image, as you can get a polygon for
> > nearly every pixel. If the output format is a verbose format like KML or
> > JSON, the number of bytes to describe each pixel is large. How big was the
> > output in your colleague's run?
> >
> > The algorithm runs in two passes. If I'm reading the code right, the
> > progress indicator is designed to show 10% at the end of the first pass.
> > You will have a better estimate of the run-time on your VM by noting the
> > elapsed time to 10%, then the elapsed time from 10% to 20%.
> >
> > Also, tell us more about the image. Is it a continuous-scale raster, e.g. a
> > photo? One way to significantly reduce the output size (and hence runtime),
> > as well as to get a more meaningful output in most cases, is to posterize
> > the image into a small number of colors/tones. Then run a filter to remove
> > isolated pixels or small groups of pixels. Polygonize run on this
> > pre-processed image should perform better.
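
The "filter to remove isolated pixels" step above maps nicely to GDAL's sieve
algorithm. A quick, untested sketch with the Python bindings (the 16-pixel
threshold is an arbitrary example; it works on a copy so the original
classification is kept):

    from osgeo import gdal

    src = gdal.Open('classified.tif')
    dst = gdal.GetDriverByName('GTiff').CreateCopy('sieved.tif', src)
    band = dst.GetRasterBand(1)

    # Merge 4-connected groups of fewer than 16 pixels into their largest
    # neighbour before polygonizing.
    gdal.SieveFilter(srcBand=band, maskBand=None, dstBand=band,
                     threshold=16, connectedness=4)
    dst = None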
> >
> > Bear in mind that the algorithm is such that the first pass will be very
> > similar in run-time for the unprocessed and pre-processed image. However,
> > the second pass is more sensitive to the number of polygons and should
> > improve for the posterized image.
> >
> > Hopefully Frank will weigh in where I've gotten it wrong or missed
> > something.
> >
> >
> > On 1/11/2015 10:11 AM, chris snow wrote:
> >
> > I have been informed by a colleague attempting to convert a 1.4GB TIF file
> > using gdal_polygonize.py on a g2.2xlarge Amazon instance (8 vCPU, 15 GB
> > RAM) that the processing took over 2 weeks running constantly. I have also
> > been told that the same conversion using commercial tooling was completed
> > in a few minutes.
> >
> > As a result, I'm currently investigating to see if there is an opportunity
> > for improving the performance of the gdal_polygonize.py TIF to JSON
> > conversion.  I have run a strace while attempting the same conversion, but
> > stopped after a few hours (the gdal_polygonize.py status indicator was
> > showing between 5% and 7.5% complete).  The strace results are:
> >
> >
> > % time     seconds  usecs/call     calls    errors syscall
> > ------ ----------- ----------- --------- --------- ----------------
> >  99.40    2.348443           9    252474           read
> >  ...
> >   0.00    0.000000           0         1           set_robust_list
> > ------ ----------- ----------- --------- --------- ----------------
> > 100.00    2.362624                256268       459 total
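
(For reference, a per-syscall summary like the above is what "strace -c"
prints; the conversion can be profiled the same way with something like
"strace -c python gdal_polygonize.py ...".)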
> >
> >
> > FYI - I performed my test inside a Vagrant VirtualBox guest with 30 GB of
> > memory and 8 CPUs assigned to the guest.
> >
> > It appears that the input TIF file is read in many small pieces.
> >
> > I have shared the results here in case anyone else is looking at
> > optimising the performance of the conversion or already has ideas about
> > where the code can be optimised.
> >
> > Best regards,
> >
> > Chris
> >
> >


-- 
Spatialys - Geospatial professional services
http://www.spatialys.com

