[gdal-dev] gdal_polygonize.py TIF to JSON performance

chris snow chsnow123 at gmail.com
Mon Jan 12 02:07:56 PST 2015


Hi David,

Thanks for your response.  I have a little more information after
sharing your response with the project team:

"The tif file is around 1.4GB as you noted and the data is similar to
that of the result of an image classification where each pixel value
is in a range between (say) 1-5. After a classification this image is
usually exported as a vector file (EVF of Shapefile) but in this case
we want to use geojson. This has taken both Mark and myself weeks to
complete with gdal_polygonize as you noted.

I think an obvious way to speed this up would be threading by breaking
the tiff file in tiles (say 1024x1024) and spreading these over the
available cores, then there would need to be a way to dissolve the
tile boundaries to complete the polygons as we would not want obvious
tile lines."

Does this help?

Many thanks,

Chris

On 11 January 2015 at 18:31, David Strip <gdal at stripfamily.net> wrote:
> I'm surprised at your colleague's experience. We've run polygonize on
> some large images and have never had this problem. The g2.2xlarge
> instance is overkill in the sense that the code is not multi-threaded,
> so the extra CPUs don't help. Also, as you have already determined, the
> image is read in small chunks, so you don't need large buffers for the
> image. But two weeks makes no sense. In fact, your run shows that the
> job reaches 5% completion in a couple of hours.
>
> The reason for so many reads (though 2.3 seconds out of "a few hours" is
> negligible overhead) is that the algorithm operates on a pair of adjacent
> raster lines at a time. This allows processing of extremely large images
> with very modest memory requirements. It's been a while since I've looked at
> the code, but from my recollection, the algorithm should scale approximately
> linearly in the number of pixels and polygons in the image. Far more
> important to the run-time is the nature of the image itself. If the input is
> something like a satellite photo, your output can be orders of magnitude
> larger than the input image, as you can get a polygon for nearly every
> pixel. If the output format is a verbose format like KML or JSON, the number
> of bytes to describe each pixel is large. How big was the output in your
> colleague's run?
>
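A minimal sketch of that two-scanline access pattern, using the GDAL
Python bindings (numpy required); this illustrates the idea, not the
actual GDALPolygonize code, which is C inside GDAL:

    # Only the current and previous raster rows are held in memory,
    # which explains the many small reads in the strace output.
    from osgeo import gdal

    ds = gdal.Open("input.tif")
    band = ds.GetRasterBand(1)
    prev = None
    for row in range(ds.RasterYSize):
        cur = band.ReadAsArray(0, row, ds.RasterXSize, 1)[0]
        if prev is not None:
            pass  # compare prev/cur here to grow or merge polygon ids
        prev = cur
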
> The algorithm runs in two passes. If I'm reading the code right, the
> progress indicator is designed to show 10% at the end of the first pass. You
> will have a better estimate of the run-time on your VM by noting the elapsed
> time to 10%, then the elapsed time from 10% to 20%.
>
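To make that concrete with hypothetical numbers: if the bar reaches
10% after 3 hours and 20% one hour later, pass one took about 3 hours
and the remaining 80% of the bar should take roughly another 8 hours,
for an estimated 12 hours in total (assuming the second pass
progresses linearly).
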
> Also, tell us more about the image. Is it a continuous-scale raster
> (e.g., a photo)? One way to significantly reduce the output size (and
> hence runtime), as well as to get a more meaningful output in most
> cases, is to posterize the image into a small number of colors/tones.
> Then run a filter to remove isolated pixels or small groups of pixels.
> Polygonize run on this pre-processed image should perform better.
>
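A sketch of that preprocessing using the GDAL Python bindings and
numpy; the file names, class breaks and 16-pixel sieve threshold are
all illustrative:

    # Posterize a band into a few classes, then sieve out small pixel
    # groups before running polygonize. Reads the whole band at once,
    # which is fine for a sketch; tile it for a 1.4 GB file.
    import numpy as np
    from osgeo import gdal

    src = gdal.Open("input.tif")  # assumed 8-bit single band
    data = src.GetRasterBand(1).ReadAsArray()

    # Five classes, with breaks chosen for a 0-255 range.
    classes = np.digitize(data, bins=[51, 102, 153, 204]).astype(np.uint8)

    dst = gdal.GetDriverByName("GTiff").CreateCopy("posterized.tif", src)
    dst.GetRasterBand(1).WriteArray(classes)

    # Merge groups smaller than 16 pixels into neighbours (8-connected),
    # in place.
    gdal.SieveFilter(dst.GetRasterBand(1), None, dst.GetRasterBand(1),
                     16, 8)
    dst.FlushCache()
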
> Bear in mind that the algorithm is such that the first pass will be very
> similar in run-time for the unprocessed and pre-processed image. However,
> the second pass is more sensitive to the number of polygons and should
> improve for the posterized image.
>
> Hopefully Frank will weigh in where I've gotten it wrong or missed
> something.
>
>
> On 1/11/2015 10:11 AM, chris snow wrote:
>
> I have been informed by a colleague attempting to convert a 1.4 GB TIF
> file using gdal_polygonize.py on a g2.2xlarge Amazon instance (8 vCPUs,
> 15 GB RAM) that the processing took over two weeks running constantly.
> I have also been told that the same conversion using commercial tooling
> was completed in a few minutes.
>
> As a result, I'm currently investigating to see if there is an opportunity
> for improving the performance of the gdal_polygonize.py TIF to JSON
> conversion.  I have run a strace while attempting the same conversion, but
> stopped after a few hours (the gdal_polygonize.py status indicator was
> showing between 5% and 7.5% complete).  The strace results are:
>
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  99.40    2.348443           9    252474           read
>  ...
>   0.00    0.000000           0         1           set_robust_list
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    2.362624                256268       459 total
>
>
> FYI - I performed my test inside a Vagrant VirtualBox guest with 30 GB
> memory and 8 CPUs assigned to the guest.
>
> It appears that the input TIF file is read a small piece at a time.
>
> I have shared the results here in case anyone else is looking at
> optimising the performance of the conversion or already has ideas about
> where the code can be optimised.
>
> Best regards,
>
> Chris
>
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> http://lists.osgeo.org/mailman/listinfo/gdal-dev
>
>

