Parallelizing calls to msDrawLayer()

David Fuhry dfuhry at CS.KENT.EDU
Sat Oct 13 16:41:13 EDT 2007


Thomas,

    Thanks for the thoughts & profile.  Agreed, these "prepare" and 
"merge" steps would be the major costs, which can't be easily parallelized.

    I wrote a small program to measure the time to msImageCreateGD(), 
and then msFreeImage(), n RGBA layers (see below).  Allocating RGBA 
canvases for seven 800x600 layers takes < 30 ms.  I'd guess that for 
most people this would be reasonable.
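The alloc_img_simult.c attachment was scrubbed from the archive, so here is a rough sketch of what such a timing test might look like; time_alloc is a hypothetical name, and plain calloc() stands in for msImageCreateGD()/msFreeImage():

```c
#include <stdlib.h>
#include <time.h>

/* Allocate n w x h RGBA canvases, then free them all, returning the
 * elapsed CPU seconds.  calloc() is a stand-in for the real
 * msImageCreateGD()/msFreeImage() calls. */
double time_alloc(int w, int h, int n)
{
    clock_t start = clock();
    unsigned char **canvases = malloc(n * sizeof *canvases);
    for (int i = 0; i < n; i++)
        canvases[i] = calloc((size_t)w * h, 4);  /* 4 bytes per RGBA pixel */
    for (int i = 0; i < n; i++)
        free(canvases[i]);
    free(canvases);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Wrapped in a small main() taking width, height, and layer count on the command line, it can be driven by the shell loops shown at the end of this message.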

    Agreed, the merge/compositing step will be the major cost.  Don't 
you think a 99%-opacity profile is a little unfair, though, if GD (as 
you suggest) fast-paths the cases where the source image's pixel is 0% 
or 100% opaque?  Even with antialiasing-happy AGG, I think the 0%- or 
100%-opacity case will be overwhelmingly common for all but line layers. 
  I suggest, then, that compositing will not be as costly as it was in 
your 99%-opacity profile.  To test the typical likelihood & effect of 
"free" pixel compositing, we'd need to hack msDrawMap(), unless you can 
think of an easier way?
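To make the fast-path argument concrete, here is a minimal sketch of per-pixel source-over compositing with the two shortcuts described above; the pixel layout and function names are illustrative assumptions, not MapServer or GD code:

```c
#include <stdint.h>

/* One channel of source-over blending: out = src*a + dst*(1 - a). */
static inline uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t a)
{
    return (uint8_t)((src * a + dst * (255 - a)) / 255);
}

/* Composite one RGBA source pixel onto a destination pixel.  The 0%-
 * and 100%-opacity fast paths avoid the per-channel arithmetic, which
 * is why mostly-empty or mostly-opaque layers composite cheaply. */
void composite_pixel(const uint8_t src[4], uint8_t dst[4])
{
    uint8_t a = src[3];
    if (a == 0)                 /* fully transparent: nothing to do */
        return;
    if (a == 255) {             /* fully opaque: plain copy */
        dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = 255;
        return;
    }
    dst[0] = blend_channel(src[0], dst[0], a);
    dst[1] = blend_channel(src[1], dst[1], a);
    dst[2] = blend_channel(src[2], dst[2], a);
    /* simple non-premultiplied alpha update; GD's exact rule may differ */
    dst[3] = (uint8_t)(a + dst[3] * (255 - a) / 255);
}
```

Only pixels with intermediate alpha (typically just antialiased edges) pay for the multiplies, which is why a uniform 99%-opacity layer is close to the worst case.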

    Given that compositing will still be expensive though, here's a 
(rather wild) idea.  If the machine has multiple cores, then we build a 
composite tree, with leaf nodes corresponding to the layers' imageObjs:


                  *
                 /  \
               /      \
             /          \
           /              \
         /                 *
       /                  /  \
      *                  /    \
     /  \               /      \
    /    \             /        \
   /      *           *          *
  /     /   \       /   \       /  \
1     2     3     4     5     6    7

Step 1 (in parallel):
Core 1 composes layers 2 & 3
Core 2 composes layers 4 & 5
Core 3 composes layers 6 & 7

Step 2 (in parallel):
Core 1 composes layers 1 & {2 & 3}
Core 2 composes layers {4 & 5} & {6 & 7}

Step 3:
Core 1 composes layers {1 & {2 & 3}} & {{4 & 5} & {6 & 7}}

Obviously we leave the exact mapping of composition processes ==> cores 
to the OS scheduler.
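The pairwise steps above amount to a tree reduction over the layer canvases. A minimal pthreads sketch (toy "compose" operation, names hypothetical; a real version would blend RGBA canvases and, as noted, adapt the tree shape to render times):

```c
#include <pthread.h>

#define NPIX 16   /* toy canvas size; a real canvas would be w*h RGBA */

typedef struct {
    unsigned char *dst;          /* composed in place: dst = src over dst */
    const unsigned char *src;
} compose_job;

/* Toy "compose": take the src pixel wherever it is nonzero (opaque),
 * keep dst otherwise.  Stands in for a real RGBA blend. */
static void *compose(void *arg)
{
    compose_job *job = arg;
    for (int i = 0; i < NPIX; i++)
        if (job->src[i])
            job->dst[i] = job->src[i];
    return NULL;
}

/* Pairwise tree reduction over n layer buffers (bottom layer first),
 * one thread per pair per level, as in Steps 1-3 above.  Compacting
 * the survivors left-to-right preserves compositing order, which
 * matters because compositing is not commutative. */
void compose_tree(unsigned char *layers[], int n)
{
    while (n > 1) {
        int pairs = n / 2;
        pthread_t tids[pairs];
        compose_job jobs[pairs];
        for (int p = 0; p < pairs; p++) {
            jobs[p].dst = layers[2 * p];      /* left child absorbs right */
            jobs[p].src = layers[2 * p + 1];
            pthread_create(&tids[p], NULL, compose, &jobs[p]);
        }
        for (int p = 0; p < pairs; p++)
            pthread_join(tids[p], NULL);
        int m = 0;
        for (int p = 0; p < pairs; p++)
            layers[m++] = layers[2 * p];
        if (n % 2)                            /* odd layer survives as-is */
            layers[m++] = layers[n - 1];
        n = m;
    }
}
```

The mapping of threads to cores is left to the OS scheduler, as above; seven layers reduce in three joins instead of six sequential blends.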

The tree's edges could be adaptive, so that if layer 1 takes a long time 
to render, layers 2 - 7 could be composed in the meantime.  In fact, any 
contiguous group of rendered layers could be composed while waiting on 
the others.

The "render tree" idea starts to sound a little less like the product of 
a warped mind in light of the recent and ongoing multi-core explosion, 
courtesy of the good folks at AMD & Intel. :)

Thoughts?

Thanks,

Dave Fuhry


Athlon 64 3200+, 800 x 600:
$ for i in 1 10 20 30 40 50 60 70 80 90 100; do echo -n "$i layers " && 
  /usr/bin/time -f "%e sec" ./alloc_img_simult 800 600 $i; done
1 layers 0.00 sec
10 layers 0.03 sec
20 layers 0.06 sec
30 layers 0.08 sec
40 layers 0.11 sec
50 layers 0.13 sec
60 layers 0.16 sec
70 layers 0.18 sec
80 layers 0.21 sec
90 layers 0.24 sec
100 layers 0.27 sec

Athlon 64 3200+, 1000 x 1000:
$ for i in 1 10 20 30 40 50 60 70 80 90 100; do echo -n "$i layers " && 
  /usr/bin/time -f "%e sec" ./alloc_img_simult 1000 1000 $i; done
1 layers 0.00 sec
10 layers 0.06 sec
20 layers 0.11 sec
30 layers 0.16 sec
40 layers 0.22 sec
50 layers 0.27 sec
60 layers 0.33 sec
70 layers 0.37 sec
80 layers 0.44 sec
90 layers 0.48 sec
100 layers 0.54 sec

P4 3.2GHz, 800x600:
$ for i in 1 10 20 30 40 50 60 70 80 90 100; do echo -n "$i layers " && 
  /usr/bin/time -f "%e sec" ./alloc_img_simult 800 600 $i; done
1 layers 0.00 sec
10 layers 0.02 sec
20 layers 0.04 sec
30 layers 0.06 sec
40 layers 0.07 sec
50 layers 0.09 sec
60 layers 0.11 sec
70 layers 0.13 sec
80 layers 0.15 sec
90 layers 0.17 sec
100 layers 0.19 sec

P4 3.2GHz, 1000x1000:
$ for i in 1 10 20 30 40 50 60 70 80 90 100; do echo -n "$i layers " && 
  /usr/bin/time -f "%e sec" ./alloc_img_simult 1000 1000 $i; done
1 layers 0.00 sec
10 layers 0.04 sec
20 layers 0.08 sec
30 layers 0.12 sec
40 layers 0.16 sec
50 layers 0.19 sec
60 layers 0.23 sec
70 layers 0.27 sec
80 layers 0.31 sec
90 layers 0.35 sec
100 layers 0.39 sec


thomas bonfort wrote:
> interesting idea...
> the problem with this approach is that parallelizing the rendering is
> far from being a "free" operation:
> - memory-wise, you have to allocate n-layers images to render each of
> them separately
> - computationally, you have n-layers full-image blending operations,
> which is far from being a light operation.
> 
> I've profiled the rendering of a 7-layer map setting each layer to 99%
> opacity, which currently means that the layer is rendered on a temp
> image and blended onto the map image for each layer (this roughly
> simulates the operations that would happen if using your approach,
> except that we don't get some of the "free" blended pixels, i.e. pixels
> whose opacity is 100% and therefore are just replaced in the final
> image instead of composited).  the result is that 50% of computing time
> is spent on the blending operations.
> 
> seeing this, we'd have to make sure that this non-negligible processing
> supplement is compensated by the time 'won' by overlapping the i/o for
> each layer.
> 
> tb
> 
> On 10/12/07, *David Fuhry* < dfuhry at cs.kent.edu 
> <mailto:dfuhry at cs.kent.edu>> wrote:
> 
>     Has anyone looked into parallelizing the calls to msDraw[Query]Layer()
>     in msDrawMap()?
> 
>     Although I'm new to the codebase, it seems that near the top of
>     msDrawMap(), we could launch a thread for each (non-WMS/WFS) layer,
>     rendering the layer's output onto its own imageObj.  Then, where we
>     now call msDraw[Query]Layer(), we'd wait for thread i to complete and
>     compose that layer's imageObj onto the map's imageObj.
> 
>     In msDraw[Query]Layer(), critical sections of the mapObj (adding labels
>     to the label cache, for instance) would need to be protected by a
>     mutex.
> 
>     A threaded approach would let some layers get drawn while others are
>     waiting on I/O or for query results, instead of the current serial
>     approach where each layer is drawn in turn.  Multiprocessor machines
>     could schedule the threads across all of their cores for simultaneous
>     layer rendering.
> 
>     It seems this could significantly speed up common-case rendering,
>     especially on big machines, for very little overhead.  Has there been
>     previous work in this area, or are any major drawbacks evident?
> 
>     Thanks,
> 
>     Dave Fuhry
> 
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: alloc_img_simult.c
Type: text/x-csrc
Size: 661 bytes
Desc: not available
Url : http://lists.osgeo.org/pipermail/mapserver-dev/attachments/20071013/ac935c2c/alloc_img_simult.bin
