<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
I'm surprised at your colleague's experience. We've run polygonize
on some large images and have never had this problem. The
g2.2xlarge instance is overkill in the sense that the code is not
multi-threaded, so the extra CPUs don't help. Also, as you have
already determined, the image is read in small chunks, so you don't
need large buffers for the image. But two weeks makes no sense. In
fact, your own run shows the job reaching 5% completion in a couple
of hours. <br>
<br>
The reason for so many reads (though 2.3 seconds out of "a few
hours" is negligible overhead) is that the algorithm operates on a
pair of adjacent raster lines at a time. This allows processing of
extremely large images with very modest memory requirements. It's
been a while since I've looked at the code, but from my
recollection, the algorithm should scale approximately linearly in
the number of pixels and polygons in the image. Far more important
to the run-time is the nature of the image itself. If the input is
something like a satellite photo, your output can be orders of
magnitude larger than the input image, as you can get a polygon for
nearly every pixel. And if the output is a verbose text format like
KML or JSON, the number of bytes needed to describe each polygon is large.
How big was the output in your colleague's run?<br>
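<br>
If it helps, here is a minimal sketch of driving the same routine
from Python (gdal.Polygonize is what gdal_polygonize.py wraps). The
file names and the choice of GeoPackage output are my own
illustrative assumptions, but a binary format does keep each polygon
far smaller on disk than KML or JSON:<br>
<pre>
# A minimal sketch, assuming a single-band "input.tif" (placeholder
# name). gdal.Polygonize is the routine behind gdal_polygonize.py.
from osgeo import gdal, ogr

src = gdal.Open("input.tif")
band = src.GetRasterBand(1)

# A binary format such as GeoPackage stays far smaller than KML/JSON.
drv = ogr.GetDriverByName("GPKG")
dst = drv.CreateDataSource("output.gpkg")
layer = dst.CreateLayer("polygons", geom_type=ogr.wkbPolygon)
layer.CreateField(ogr.FieldDefn("DN", ogr.OFTInteger))

# Field index 0 is "DN"; the mask band suppresses nodata areas.
gdal.Polygonize(band, band.GetMaskBand(), layer, 0, [],
                callback=gdal.TermProgress_nocb)
dst = None  # close and flush to disk
</pre>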
<br>
The algorithm runs in two passes. If I'm reading the code right, the
progress indicator is designed to show 10% at the end of the first
pass. You can get a better estimate of the run-time on your VM by
noting the elapsed time to 10%, then the elapsed time from 10% to
20%.<br>
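<br>
To make that concrete, a back-of-the-envelope estimate (the two
timings below are made-up illustrative values, not measurements):<br>
<pre>
# Rough runtime estimate, assuming the first pass ends at the 10%
# mark and the second pass then advances roughly evenly.
t_first = 2.0   # hours elapsed when the meter reached 10% (made up)
t_decile = 1.0  # hours from 10% to 20% (made up)
total = t_first + 9 * t_decile  # second pass covers the remaining 90%
print("estimated total: %.1f hours" % total)  # -> 11.0 hours
</pre>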
<br>
Also, tell us more about the image. Is it a continuous-scale raster,
e.g. a photo? One way to significantly reduce the output size (and
hence the runtime), and to get more meaningful output in most cases,
is to posterize the image into a small number of colors/tones, then
run a filter to remove isolated pixels or small groups of pixels.
Polygonize run on this pre-processed image should perform
better. <br>
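<br>
As a sketch of that pre-processing (the 32-level bucket width and
the 9-pixel sieve threshold are arbitrary placeholders, not
recommendations):<br>
<pre>
# Posterize a band, then sieve out small pixel groups before
# polygonizing. "input.tif" and the constants are assumptions.
from osgeo import gdal

src = gdal.Open("input.tif")
data = src.GetRasterBand(1).ReadAsArray()  # fine for modest images;
                                           # read in chunks if huge

# Posterize: collapse 256 grey levels into 8 tones of width 32.
data = (data // 32) * 32

dst = gdal.GetDriverByName("GTiff").CreateCopy("posterized.tif", src)
dst.GetRasterBand(1).WriteArray(data)

# Drop pixel groups smaller than 9 pixels, 8-connected (the same
# operation gdal_sieve.py exposes via its -st option).
gdal.SieveFilter(dst.GetRasterBand(1), None, dst.GetRasterBand(1), 9, 8)
dst = None  # flush to disk
</pre>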
<br>
Bear in mind that the first pass will take a very similar run-time
on the unprocessed and the pre-processed image. However, the second
pass is more sensitive to the number of polygons and should be
faster on the posterized image.<br>
<br>
Hopefully Frank will weigh in where I've gotten it wrong or missed
something. <br>
<br>
<br>
On 1/11/2015 10:11 AM, chris snow wrote:
<blockquote type="cite">
<div dir="ltr">
<div><span style="font-family:monospace,monospace">I have been
informed by a colleague attempting to convert a 1.4GB TIF
file using gdal_polygonize.py on a g2.2xlarge Amazon
instance (8 vCPU, 15 GB RAM) that the processing took over 2
weeks running constantly. I have also been told that the
same conversion using commercial tooling was completed in a
few minutes.<br>
<br>
As a result, I'm currently investigating to see if there is
an opportunity for improving the performance of the
gdal_polygonize.py TIF to JSON conversion. I ran strace
while attempting the same conversion, but stopped
after a few hours (the gdal_polygonize.py status indicator
was showing between 5% and 7.5% complete). The strace
results are:<br>
<br>
<br>
<pre>
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.40    2.348443           9    252474           read
...
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    2.362624                256268       459 total
</pre>
<br>
<br>
</span></div>
<span style="font-family:monospace,monospace">FYI - I performed
my test inside a vagrant virtualbox guest with 30GB memory and
8 CPUS assigned to the guest.<br>
</span>
<div><span style="font-family:monospace,monospace"><br>
It appears that the input TIF file is read a small piece at a
time.<br>
</span>
<div><span style="font-family:monospace,monospace"><br>
I have shared the results here in case anyone else is
looking at optimising the performance of the conversion or
already has ideas about where the code can be optimised.<br>
<br>
</span></div>
<div><span style="font-family:monospace,monospace">Best
regards,<br>
<br>
</span></div>
<div><span style="font-family:monospace,monospace">Chris<br>
</span></div>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
gdal-dev mailing list
<a class="moz-txt-link-abbreviated" href="mailto:gdal-dev@lists.osgeo.org">gdal-dev@lists.osgeo.org</a>
<a class="moz-txt-link-freetext" href="http://lists.osgeo.org/mailman/listinfo/gdal-dev">http://lists.osgeo.org/mailman/listinfo/gdal-dev</a></pre>
</blockquote>
<br>
</body>
</html>