[Gdal-dev] Slow JPEG2000 Reading with JP2KAK driver

Fri Jun 30 00:53:27 EDT 2006

Folks,

A client recently complained of poor performance reading with the
JP2KAK jpeg2000 driver.  I thought the points raised in my response
might be of somewhat general interest, as some apply to a variety of
drivers and situations, so I am forwarding a slightly edited copy of the
message here.

=== response:

There are a few issues with the current GDAL JP2KAK (Kakadu) driver.

The first is that it takes a tiled approach to reading the jpeg2000 files.
The code you have defaults to a 512x128 blocking size in the GDAL cache.
So if you want to read the whole file, it will do this as quite a large
number of 512x128 reads.  Each read involves "setting a window" on the
jpeg2000 through the Kakadu library.  This process seems to be fairly
expensive.  A preferable approach is to set a single full file size window
and then read through that one line at a time.
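
As a rough illustration of how many window setups the blocked approach
implies, the sketch below counts the blocks needed to cover one band.
It is plain Python arithmetic, not code from the driver; the function
name is made up for this example, and the 3200 x 2600 image size is
borrowed from the test file discussed later in this message.

```python
import math

def blocks_per_band(width, height, block_x, block_y):
    """Number of blocks (one window setup each) needed to cover one band."""
    return math.ceil(width / block_x) * math.ceil(height / block_y)

# Blocked reading with the 512x128 default: one window setup per block.
print(blocks_per_band(3200, 2600, 512, 128))    # 147

# One window covering the whole image, read a line at a time.
print(blocks_per_band(3200, 2600, 3200, 2600))  # 1
```

For a three band image read one band at a time, the blocked count is
multiplied by three.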

However, because GDAL does not know the access pattern of the calling
application in advance, it can be very hard to decide on a window strategy.
If the application ends up requesting lots of little bits of imagery
then I would either read through a lot of unused imagery from a big window,
or have to reset the window for each request which can be terrible if
each request is just one scanline for instance.

So I take this blocked approach that is rather suboptimal in the best
case, but at least not too bad in the worst case.

The second factor is that, as you supposed, multi-band files are read
one band at a time.  I'm not sure this is really a problem for some
datasets which are internally essentially band oriented anyway.  But
for some it might mean that extra processing, or at least extra file io,
is being done by making distinct passes for red, green and blue.  For RGB
images, I actually just ask the Kakadu library for one band at a time,
so at least it is optimizing the access somewhat.  But for YCbCr images
I need to read all the bands just to compute any of red, green or blue.

So I have done some restructuring internally in the YCbCr case so that
once a chunk is read, the red, green and blue components are all
pushed into the block cache.  For my test YCbCr file (a common arrangement
for color jpeg2000 files) this sped things up by a factor of about 2.5.
So a substantial improvement, but not an order of magnitude.
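
The idea of pushing all three components into the cache after a single
read can be sketched with a toy model.  This is not the driver code:
the class, its method names and the dummy byte strings are all
hypothetical, and the "decode" is only a counter standing in for one
Kakadu read of a chunk.

```python
class ChunkReader:
    """Toy reader: counts decodes with and without caching sibling bands."""

    BANDS = ("red", "green", "blue")

    def __init__(self, push_all_bands):
        self.push_all_bands = push_all_bands
        self.block_cache = {}
        self.decodes = 0

    def _decode_chunk(self, block_id):
        # Stand-in for decoding Y, Cb and Cr and converting to RGB.
        self.decodes += 1
        return {band: b"pixels" for band in self.BANDS}

    def read_block(self, band, block_id):
        key = (band, block_id)
        if key not in self.block_cache:
            chunk = self._decode_chunk(block_id)
            if self.push_all_bands:
                # Keep every component we already paid to compute.
                for name, data in chunk.items():
                    self.block_cache[(name, block_id)] = data
            else:
                # Keep only the requested band; the others are thrown away.
                self.block_cache[key] = chunk[band]
        return self.block_cache[key]

def count_decodes(push_all_bands, n_blocks=10):
    reader = ChunkReader(push_all_bands)
    for band in ChunkReader.BANDS:          # band-at-a-time access pattern
        for block_id in range(n_blocks):
            reader.read_block(band, block_id)
    return reader.decodes

print(count_decodes(False))  # 30
print(count_decodes(True))   # 10
```

The model only counts decodes, so it shows a factor of three; the
measured speedup quoted above is smaller because decoding is not the
only cost.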

I also did some tuning on the default block size, increasing it to
2048x128.  This sped processing up by about 25%.   Overall a translation
of a 3200 x 2600 file went from about 12 seconds to about 3.5 seconds
with these two code changes.   The old speed was about 2MB/second read
speed, while the new speed is about 7MB/second.
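
The read rates quoted above can be checked with a little arithmetic.
The sketch assumes three 8-bit bands and 1MB = 1000000 bytes; neither
is stated explicitly in the message, and the function name is made up
for this example.

```python
def read_rate_mb_per_s(width, height, bands, seconds, bytes_per_sample=1):
    """Uncompressed bytes delivered per second, in MB (10**6 bytes)."""
    total_bytes = width * height * bands * bytes_per_sample
    return total_bytes / seconds / 1e6

print(round(read_rate_mb_per_s(3200, 2600, 3, 12.0), 2))  # 2.08
print(round(read_rate_mb_per_s(3200, 2600, 3, 3.5), 2))   # 7.13
```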

I'm not sure why your colleague was seeing processing speeds of about
0.3MB/second.  That does seem unreasonably slow.

But the *third* thing to keep in mind is that tile-blocked images
such as jpeg2000 are quite sensitive to "cache thrashing".  That is,
if the application asks for a single scanline of a very wide image,
it may be that the cache cannot hold a whole "row" of tiles in the
cache at one time.  If not, the caching system will start throwing
away the early tiles and the next scanline request will result in
all the tiles being re-read.  So it is very important from a
performance point of view to ensure that the GDAL caching limit is
at least large enough to hold a whole row of blocks.  For a 100000
pixel wide RGB image, you would need roughly 38MB of cache space
to hold a complete row of blocks (100000 x 128 x 3 bytes).  This cost of a
row is why I use a very asymmetric size for the blocks - to reduce
the likelihood of cache thrashing.
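
A small calculation shows how the cache needed for one row of blocks
depends on the block height.  This is a sketch, not GDAL code: the
function name is hypothetical, samples are assumed to be one byte each,
and the width is rounded up to whole 2048 pixel wide blocks.

```python
import math

def row_of_blocks_bytes(width, block_x, block_y, bands, bytes_per_sample=1):
    """Bytes needed to hold one full row of blocks for all bands."""
    blocks_across = math.ceil(width / block_x)
    return blocks_across * block_x * block_y * bands * bytes_per_sample

# 100000 pixel wide, three band image with flat 2048x128 blocks.
print(row_of_blocks_bytes(100000, 2048, 128, 3))   # 38535168

# The same image with square 2048x2048 blocks needs 16 times as much.
print(row_of_blocks_bytes(100000, 2048, 2048, 3))  # 616562688
```

If the result exceeds the current cache limit, the limit can be raised
with the GDAL_CACHEMAX configuration option or with GDALSetCacheMax().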

I am wondering if this might have been happening to your co-worker.

There are a few other things that I can potentially do to improve
the efficiency of the driver.

1) Provide overridden IRasterIO() methods that process large
    windowed requests as a single "set window and read" operation
    on the jpeg2000 files instead of using the blocking approach.
    This would help for applications that ask for data in large
    chunks.  I already do this in the MrSID and ECW SDK based
    jpeg2000 readers, though deciding when to use the special
    case can be problematic.

2) Modify the block based RasterIO() implementation to copy all
    data out of an acquired block into the working buffer before
    moving on to the next block.  Currently the logic attempts to
    satisfy all pixels in our destination buffer scanline before
    starting the next, which can result in cache thrashing even
    for large request windows.  This problem has been reported in
    the past by another client and I do intend to address it at
    some point.

    This would help a variety of block oriented drivers that can
    suffer cache thrashing on large datasets.

3) I could temporarily force the GDAL cache max to be large enough
    to hold one whole row of blocks while large files are open.
    This would avoid "cache thrashing" within GDAL, but might easily
    result in overall virtual memory thrashing.
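
The effect described in point 2 can be sketched with a toy cache
model.  It assumes a simple least-recently-used eviction policy and a
request covering one full row of blocks; the class and function names
are hypothetical and nothing here is taken from the GDAL source.

```python
from collections import OrderedDict

class BlockCache:
    """Tiny LRU block cache that counts how often a block must be read."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.loads = 0

    def touch(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # mark as recently used
            return
        self.loads += 1                         # cache miss: (re)read block
        self.blocks[block_id] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict least recently used

def scanline_order_loads(blocks_across, lines_per_block, capacity):
    """Fill each output scanline completely before starting the next."""
    cache = BlockCache(capacity)
    for _line in range(lines_per_block):
        for bx in range(blocks_across):
            cache.touch(bx)
    return cache.loads

def block_order_loads(blocks_across, lines_per_block, capacity):
    """Copy everything out of one block before moving to the next."""
    cache = BlockCache(capacity)
    for bx in range(blocks_across):
        for _line in range(lines_per_block):
            cache.touch(bx)
    return cache.loads

# 49 blocks across, 128 lines per block, room for only 40 blocks.
print(scanline_order_loads(49, 128, 40))  # 6272
print(block_order_loads(49, 128, 40))     # 49

# With room for the whole row, scanline order is fine as well.
print(scanline_order_loads(49, 128, 49))  # 49
```

In this model the block-at-a-time order reads each block once however
small the cache is, while the scanline order only behaves once the
cache can hold the whole row, which is what point 3 would force.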

Best regards,
-- 
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam at pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush    | President OSGF, http://osgeo.org