<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
I'm surprised at your colleague's experience. We've run polygonize
on some large images and have never had this problem. The
g2.2xlarge instance is overkill in the sense that the code is not
multi-threaded, so the extra CPUs don't help. Also, as you have
already determined, the image is read in small chunks, so you don't
need large buffers for the image. But two weeks makes no sense. In
fact, your own run shows the job reaching 5% completion in a couple
of hours. <br>
<br>
The reason for so many reads (though 2.3 seconds out of "a few
hours" is negligible overhead) is that the algorithm operates on a
pair of adjacent raster lines at a time. This allows processing of
extremely large images with very modest memory requirements. It's
been a while since I've looked at the code, but from my
recollection, the algorithm should scale approximately linearly in
the number of pixels and polygons in the image. Far more important
to the run-time is the nature of the image itself. If the input is
something like a satellite photo, your output can be orders of
magnitude larger than the input image, as you can get a polygon for
nearly every pixel. And if the output is a verbose text format like
KML or JSON, the number of bytes needed to describe each polygon is large.
How big was the output in your colleague's run?<br>
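<br>
If it helps, here is a minimal sketch of driving the same routine
from Python (gdal.Polygonize is what gdal_polygonize.py wraps). The
file names and the choice of GeoPackage output are my own
illustrative assumptions, but a binary format does keep each polygon
far smaller on disk than KML or JSON:<br>
<pre>
# A minimal sketch, assuming a single-band "input.tif" (placeholder
# name). gdal.Polygonize is the routine behind gdal_polygonize.py.
from osgeo import gdal, ogr

src = gdal.Open("input.tif")
band = src.GetRasterBand(1)

# A binary format such as GeoPackage stays far smaller than KML/JSON.
drv = ogr.GetDriverByName("GPKG")
dst = drv.CreateDataSource("output.gpkg")
layer = dst.CreateLayer("polygons", geom_type=ogr.wkbPolygon)
layer.CreateField(ogr.FieldDefn("DN", ogr.OFTInteger))

# Field index 0 is "DN"; the mask band suppresses nodata areas.
gdal.Polygonize(band, band.GetMaskBand(), layer, 0, [],
                callback=gdal.TermProgress_nocb)
dst = None  # close and flush to disk
</pre>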
<br>
The algorithm runs in two passes. If I'm reading the code right, the
progress indicator is designed to show 10% at the end of the first
pass. You can get a better estimate of the run-time on your VM by
noting the elapsed time to 10%, then the elapsed time from 10% to
20%.<br>
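<br>
To make that concrete, a back-of-the-envelope estimate (the two
timings below are made-up illustrative values, not measurements):<br>
<pre>
# Rough runtime estimate, assuming the first pass ends at the 10%
# mark and the second pass then advances roughly evenly.
t_first = 2.0   # hours elapsed when the meter reached 10% (made up)
t_decile = 1.0  # hours from 10% to 20% (made up)
total = t_first + 9 * t_decile  # second pass covers the remaining 90%
print("estimated total: %.1f hours" % total)  # -> 11.0 hours
</pre>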
<br>
Also, tell us more about the image. Is it a continuous-scale raster,
e.g. a photo? One way to significantly reduce the output size (and
hence the runtime), and to get more meaningful output in most cases,
is to posterize the image into a small number of colors/tones, then
run a filter to remove isolated pixels or small groups of pixels.
Polygonize run on this pre-processed image should perform
better. <br>
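<br>
As a sketch of that pre-processing (the 32-level bucket width and
the 9-pixel sieve threshold are arbitrary placeholders, not
recommendations):<br>
<pre>
# Posterize a band, then sieve out small pixel groups before
# polygonizing. "input.tif" and the constants are assumptions.
from osgeo import gdal

src = gdal.Open("input.tif")
data = src.GetRasterBand(1).ReadAsArray()  # fine for modest images;
                                           # read in chunks if huge

# Posterize: collapse 256 grey levels into 8 tones of width 32.
data = (data // 32) * 32

dst = gdal.GetDriverByName("GTiff").CreateCopy("posterized.tif", src)
dst.GetRasterBand(1).WriteArray(data)

# Drop pixel groups smaller than 9 pixels, 8-connected (the same
# operation gdal_sieve.py exposes via its -st option).
gdal.SieveFilter(dst.GetRasterBand(1), None, dst.GetRasterBand(1), 9, 8)
dst = None  # flush to disk
</pre>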
<br>
Bear in mind that the first pass will take a very similar run-time
on the unprocessed and the pre-processed image. However, the second
pass is more sensitive to the number of polygons and should be
faster on the posterized image.<br>
<br>
Hopefully Frank will weigh in where I've gotten it wrong or missed
something. <br>
<br>
<br>
On 1/11/2015 10:11 AM, chris snow wrote:
<blockquote type="cite">
<div dir="ltr">
<div><span style="font-family:monospace,monospace">I have been
informed by a colleague attempting to convert a 1.4GB TIF
file using gdal_polygonize.py on a g2.2xlarge Amazon
instance (8 vCPU, 15 GB RAM) that the processing took over 2
weeks running constantly. I have also been told that the
same conversion using commercial tooling was completed in a
few minutes.<br>
<br>
As a result, I'm currently investigating to see if there is
an opportunity for improving the performance of the
gdal_polygonize.py TIF to JSON conversion. I ran strace
while attempting the same conversion, but stopped
after a few hours (the gdal_polygonize.py status indicator
was showing between 5% and 7.5% complete). The strace
results are:<br>
<br>
<br>
<pre>
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.40    2.348443           9    252474           read
...
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    2.362624                256268       459 total
</pre>
<br>
<br>
</span></div>
<span style="font-family:monospace,monospace">FYI - I performed
my test inside a vagrant virtualbox guest with 30GB memory and
8 CPUS assigned to the guest.<br>
</span>
<div><span style="font-family:monospace,monospace"><br>
It appears that the input TIF file is read a small piece at a
time.<br>
</span>
<div><span style="font-family:monospace,monospace"><br>
I have shared the results here in case anyone else is
looking at optimising the performance of the conversion or
already has ideas about where the code can be optimised.<br>
<br>
</span></div>
<div><span style="font-family:monospace,monospace">Best
regards,<br>
<br>
</span></div>
<div><span style="font-family:monospace,monospace">Chris<br>
</span></div>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
gdal-dev mailing list
<a class="moz-txt-link-abbreviated" href="mailto:gdal-dev@lists.osgeo.org">gdal-dev@lists.osgeo.org</a>
<a class="moz-txt-link-freetext" href="http://lists.osgeo.org/mailman/listinfo/gdal-dev">http://lists.osgeo.org/mailman/listinfo/gdal-dev</a></pre>
</blockquote>
<br>
</body>
</html>