[Gdal-dev] GDAL raster block caching issues

Frank Warmerdam fwarmerdam at gmail.com
Fri Aug 26 19:13:22 EDT 2005


On 8/26/05, Steve Soule <Steve.Soule at vexcel.com> wrote:
> Issue 1:  Global vs. dataset LRUL
>
> Currently, the LRUL is global, that is, it contains blocks from all
> open datasets.  I think it would be better if each dataset had its
> own LRUL (or possibly each raster band).  This would have the following
> advantages:
>
> 1.  Thread-safety.  A global LRUL is difficult to make thread-safe.
> This work has not yet been done in GDAL.  One thing that would be
> particularly difficult to resolve is how to handle dirty blocks, that
> is, blocks that need to be written to disk before they can be flushed.
> A dataset or raster band level LRUL would be trivially thread-safe.

Steve,

I concur that using a per-dataset LRUL would make handling the
write-cache-flushing-multi-threading problem go away completely.
A huge benefit since I have been thinking about this issue all
summer with no obvious simple solution.

> 2.  Cache size flexibility.  With a global LRUL, there's one cache
> limit (nCacheMax) for the entire process.  This is simple but
> not flexible.  When you have more than one dataset open, it may be
> that some datasets need caching more than others.  In particular,
> datasets accessed over the network are more likely to need caching
> than datasets stored on a local disk.  If each dataset had its own
> cache limit, it would be easy to tailor it to the individual dataset's
> block loading time.

Frankly, for most uses I don't see this as being much benefit.  In
my experience 98% of the time having one "cache size" knob
to turn is already too much control for people, and results in
lots of confusion.  Having per-dataset control will not be useful
except in some very carefully managed circumstances.

> 3.  Matches user behavior better.  The application where dataset LRUL
> could give a big performance improvement over global LRUL is in image
> viewing and marking.  In such an application, the user typically has
> from two to six images open at once.  They tend to scroll image one
> and mark a point, then scroll image two and mark a point, and so on
> until the point has been marked in all images, then start over with
> image one.  With global LRUL, if the scrolling is sufficient to consume
> all of the cache, then by the time the user gets back to image one,
> all of image one's raster blocks have been flushed.  So you get nearly
> 100% cache misses.  With dataset LRUL, this problem disappears.

Well, my viewer counterexample would be that in a viewer people
will often open several files, keeping around old views to potentially
return to but generally only interacting with one view at a time.
To reserve large amounts of cache memory for open datasets that
haven't been used for some time seems like a very poor use of
cache memory.

Also, it seems like it will be hard to decide how to set the
per dataset cache size.  Should we divide a total cache size by the
number of datasets open at any given time?  Should it be a fixed size
per dataset?  If the latter approach is used, having several open
datasets could easily overwhelm physical RAM.

Nevertheless, I concede that your scenario also happens and
the current behavior is annoying.
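
The round-robin scenario can be reproduced with a small simulation. What
follows is only a sketch under stated assumptions, not GDAL code: LruList,
AccessPattern and both counting functions are hypothetical names, every block
is assumed to be the same size, and each visit to a dataset touches `span`
consecutive blocks starting `step` blocks further along than the previous
visit, so consecutive visits overlap by `span - step` blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model: a block is identified by (dataset index, block index).
using BlockId = std::pair<int, int>;

// Least-recently-used list with a fixed capacity counted in blocks.
class LruList {
public:
    explicit LruList(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Records an access and returns true if the block was already cached.
    bool Touch(const BlockId& id) {
        auto it = std::find(blocks_.begin(), blocks_.end(), id);
        const bool hit = (it != blocks_.end());
        if (hit) {
            blocks_.erase(it);               // re-added below as most recent
        } else if (blocks_.size() == capacity_) {
            blocks_.erase(blocks_.begin());  // evict the least recently used
        }
        blocks_.push_back(id);
        return hit;
    }

private:
    std::size_t capacity_;
    std::vector<BlockId> blocks_;  // oldest first, most recent last
};

// Round-robin access: in each round, every dataset is visited once.
struct AccessPattern {
    int datasets;  // number of open datasets
    int rounds;    // number of passes over all datasets
    int span;      // blocks touched per visit
    int step;      // how far the view advances between visits
};

// Cache hits when all datasets share one list of total_capacity blocks.
std::size_t HitsWithGlobalList(const AccessPattern& p, std::size_t total_capacity) {
    LruList cache(total_capacity);
    std::size_t hits = 0;
    for (int round = 0; round < p.rounds; ++round)
        for (int ds = 0; ds < p.datasets; ++ds)
            for (int i = 0; i < p.span; ++i)
                if (cache.Touch({ds, round * p.step + i})) ++hits;
    return hits;
}

// Cache hits when the same total capacity is split evenly per dataset.
std::size_t HitsWithPerDatasetLists(const AccessPattern& p, std::size_t total_capacity) {
    const std::size_t count = static_cast<std::size_t>(p.datasets);
    std::vector<LruList> caches(count, LruList(total_capacity / count));
    std::size_t hits = 0;
    for (int round = 0; round < p.rounds; ++round)
        for (int ds = 0; ds < p.datasets; ++ds)
            for (int i = 0; i < p.span; ++i)
                if (caches[static_cast<std::size_t>(ds)].Touch({ds, round * p.step + i})) ++hits;
    return hits;
}
```

With three datasets, three rounds, four blocks per visit, a two-block advance
between visits and a total budget of six blocks, the global list scores 0 hits
out of 36 accesses, while three lists of two blocks each score 12. The model
says nothing about the opposite case, where most open datasets sit idle and a
per-dataset reservation would be wasted.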

> This isn't an artificial example.  In my mind, there are two types
> of applications for GDAL:  image processing and user interface.
> Image processing applications typically process all of the blocks
> in an image in order; caching those blocks is useless.

At this point, I think I need to highlight the *main* reason GDAL has
block caching.  That is because applications are welcome to request
data in any pattern they want.  Thus many processing applications
just request it one line at a time because that is convenient and
generally quite efficient.

However, imagine what would happen internally if blocks were
not cached on a tiled dataset.  Each scanline request would
result in reading one whole row of tiles just to extract the portion of
the scanline that passes through each tile, after which the tile is
discarded.  Without caching, scanline-oriented access to an image with
256x256 tiles would result in 256 reads of each tile.  Of course,
the tiles would presumably be cached by the operating system, but
you would still be moving around a lot of unnecessary data.  In
cases where the data is compressed on disk, it would be uncompressed
256 times instead of once.

So, block caching *mostly* exists so that applications can access
data on non-block boundaries without suffering too much of a
performance hit.
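
To put rough numbers on this, the following C++ sketch counts tile loads for a
scanline-by-scanline pass over a tiled image, with and without a block cache.
It is a toy model, not GDAL code: CountTileLoads and TileLoadCount are made-up
names, a "load" stands for one read plus any decompression, and the cache is
assumed to be large enough to keep every tile it has seen.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>

// Tile loads triggered by reading an image one scanline at a time.
struct TileLoadCount {
    std::size_t without_cache;  // every scanline reloads each tile it crosses
    std::size_t with_cache;     // each tile is loaded once and then reused
};

// Hypothetical helper: simulates scanline access over square tiles.
TileLoadCount CountTileLoads(int width, int height, int tile_size) {
    assert(width > 0 && height > 0 && tile_size > 0);
    const int tiles_across = (width + tile_size - 1) / tile_size;

    TileLoadCount count{0, 0};
    std::set<std::pair<int, int>> cached;  // (tile row, tile column) already loaded

    for (int y = 0; y < height; ++y) {
        const int tile_row = y / tile_size;
        for (int tile_col = 0; tile_col < tiles_across; ++tile_col) {
            ++count.without_cache;  // no cache: load the tile for this scanline
            if (cached.insert({tile_row, tile_col}).second) {
                ++count.with_cache;  // first touch: load once, keep it
            }
        }
    }
    return count;
}
```

For a 1024x1024 image with 256x256 tiles, this model gives 4096 loads without
a cache and 16 with one, that is, each tile is loaded 256 times instead of
once.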

> 5.  Makes turning off write caching possible.  I personally would
> prefer not to have write caching in my applications.  If the cache
> size were a property of the dataset rather than being global, I could
> turn off caching for datasets in which I'm writing data.

Similarly to the above, write caching mostly exists so that applications
can write data on non-block boundaries without having to constantly
re-read, and re-write blocks to disk just to update a little bit of data.
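
The same effect on the write side can be sketched as a tiny write-back model.
This is an illustration under assumptions rather than GDAL's implementation:
WriteCacheModel and DiskWritesForScanlines are hypothetical names, and the
model only counts how many block writes reach disk when partial updates are
held in memory until a flush.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical write-back cache: partial updates only mark a block dirty.
class WriteCacheModel {
public:
    // Applies a partial update in memory and remembers the block as dirty.
    void WritePartial(int block) {
        assert(block >= 0);
        dirty_.insert(block);
    }

    // Writes each dirty block to disk once and returns how many were written.
    std::size_t Flush() {
        const std::size_t written = dirty_.size();
        disk_writes_ += written;
        dirty_.clear();
        return written;
    }

    std::size_t disk_writes() const { return disk_writes_; }

private:
    std::set<int> dirty_;          // blocks modified since the last flush
    std::size_t disk_writes_ = 0;  // total block writes that reached disk
};

// Writes `lines` scanlines across one row of `tiles_across` tiles, flushes
// once at the end, and returns the number of block writes that hit disk.
std::size_t DiskWritesForScanlines(int lines, int tiles_across) {
    WriteCacheModel cache;
    for (int line = 0; line < lines; ++line)
        for (int tile = 0; tile < tiles_across; ++tile)
            cache.WritePartial(tile);
    cache.Flush();
    return cache.disk_writes();
}
```

Writing 256 scanlines across four tiles produces 1024 partial updates but only
4 block writes in this model; without the cache, each partial update would
need its own read and write of the block.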

> Issue 2:  Client-controlled block flushing
>
> Currently, if block caching is turned on in GDAL, the algorithm used
> for flushing cache blocks is LRUL.  It would be nice if the client
> could manually control loading and flushing of blocks in order to
> use a different flushing algorithm.  After some thought on how to
> do this, I realized that the capability already exists in GDAL:
> to override LRUL for a block, you lock it with GetLockedBlockRef,
> and when you want to return caching control to LRUL, you call DropLock.
>
> Though these two functions are technically part of the GDAL API since
> they're declared public in gdal_priv.h, it may be that you don't want
> these to be officially part of the API.  If you don't want these to
> be officially part of the API, perhaps you should provide an alternate
> mechanism for overriding LRUL.

These are public methods and their use is permitted, but it is my
intent that they would rarely be used by applications.  In fact, it
is my intent that nearly all raster access go through the RasterIO()
call, rather than accessing blocks directly.   Nevertheless, for an
application wishing to use the C++ block API, it is OK.

> My conclusion?  I don't really have one.  On the one hand, one could
> claim that this test shows that even in the best case, GDAL's cache
> isn't enough better than the OS cache to be worth it, and that the OS
> cache could outperform GDAL's cache in real-world examples.  If so,
> then removing GDAL's raster block cache would simplify the code and
> improve performance.  On the other hand, one could claim that this
> test shows that the OS cache performs sufficiently poorly compared to
> GDAL's cache that GDAL's cache is clearly of great benefit.  The only
> trick is setting the cache size to maximize the benefit.

Some interesting numbers, though I think they go to show that in
performance testing there are a lot of factors and strategies.

Back to your overall point, I am interested in offering per-dataset
caching rather than global caching as an option (presumably
non-default).  Would you be interested in trying to implement this?

My hope is that this could be mostly done in gdalrasterblock.cpp, likely
with a block list stored on the GDALDataset.  Actually this may be
a bit messy, since there are such things as free standing
GDALRasterBand objects not associated with any GDALDataset.  I'm
not sure how you would want to address that issue.
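
One possible way to handle free-standing bands, offered purely as a sketch, is
to fall back to a shared list whenever a band has no owning dataset. The types
below are simplified stand-ins, not the real GDALDataset or GDALRasterBand
classes, and BlockListFor and SharedFallbackList are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Placeholder for whatever structure would hold the cached blocks.
struct BlockList {
    std::vector<int> block_ids;
};

// Stand-in for a dataset that owns its own block list.
struct Dataset {
    BlockList blocks;
};

// Stand-in for a raster band; owner is nullptr for a free-standing band.
struct Band {
    Dataset* owner;
};

// Shared list used by bands that belong to no dataset (an assumption).
BlockList& SharedFallbackList() {
    static BlockList list;
    return list;
}

// Picks the list a band's blocks would be tracked on.
BlockList& BlockListFor(Band& band) {
    return band.owner != nullptr ? band.owner->blocks : SharedFallbackList();
}
```

Whether a shared fallback is acceptable, or whether each free-standing band
should instead own a list, is a design choice this sketch does not settle.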

Ideally the policy could be set at runtime (checked via CPLGetConfigOption()).
If you are interested in doing that, then go ahead, but please let me know
when you commit the changes.
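
A runtime switch along these lines could look roughly like the sketch below.
The key name GDAL_BLOCK_CACHE_POLICY and the value PER_DATASET are invented
for illustration and are not existing GDAL options. To keep the example
self-contained it reads the environment with std::getenv; inside GDAL the
lookup would go through CPLGetConfigOption(), which takes a key and a default
value.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

enum class BlockCachePolicy { Global, PerDataset };

// Maps an option value to a policy. A missing or unrecognized value keeps
// the global list, so per-dataset caching stays opt-in.
BlockCachePolicy ParseBlockCachePolicy(const char* value) {
    if (value != nullptr && std::strcmp(value, "PER_DATASET") == 0) {
        return BlockCachePolicy::PerDataset;
    }
    return BlockCachePolicy::Global;
}

// Stand-in for a CPLGetConfigOption() lookup; the key name is hypothetical.
BlockCachePolicy PolicyFromEnvironment() {
    return ParseBlockCachePolicy(std::getenv("GDAL_BLOCK_CACHE_POLICY"));
}
```

Keeping the parsing separate from the lookup makes the default easy to check
without touching process-wide state.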

I would also ask that you avoid using too much fancy STL stuff.  I am
already getting significant negative feedback on the std::vector use
in the GDALColorTable, so I would like to avoid use of anything other
than std::string and std::vector for now.

Best regards,
--
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam at pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush    | Geospatial Programmer for Rent



