[gdal-dev] SoC Report: GDAL TMS driver
keo at keo.cz
Sat Aug 2 12:47:04 EDT 2008
Report for 14.7. -- 2.8. 2008.
Warning: lengthy and somewhat depressing account of my doings follows.
First two weeks I spent in a scout camp. Kids were great, no work on GDAL.
Then I returned to the driver. I started experimenting with the
simplified cache as mentioned in the previous report, hoping that I would
discover the data mangling bug in the process. Which I did. It was due to an
incorrect calculation of the gap on the top of tiles which caused the whole
image to be shifted one block up. Fixed that. Reading now works without a
hitch for me.
After that I spent some time on proper input validation and overall
hardening. So far so good.
But the whole time I have been pondering the IO performance and
how to do it best. The problem is this.
GDAL uniformly touches raster data in bandwise, top-down, left-right order.
This is the exact opposite of what the TMS driver needs. As I wrote in
the last post, TMS tiles don't map 1:1 to GDAL blocks. Two tiles, one
above the other, contain data for one block. To be efficient, the driver
must therefore cache the tiles. But in the top-down left-right order this
means that two whole lines of tiles must be in cache at all times. For
example, sixteen megabytes of memory is enough for sixty-four PNG tiles of
256x256 pixels. So the driver will operate efficiently only on requests that
read a raster at most 32x256 = 8192 pixels wide.
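
A rough sketch of that arithmetic, for anyone who wants to try other cache
sizes. It assumes a decoded tile takes four bytes per pixel (four 8-bit
bands), which is what makes sixty-four tiles fit into sixteen megabytes; the
function name and parameters are made up for illustration and are not part of
the driver.

```python
def max_efficient_width(cache_bytes, tile_size=256, bytes_per_pixel=4):
    """Widest raster (in pixels) whose two tile lines fit into the cache.

    Assumption: a decoded tile occupies tile_size * tile_size *
    bytes_per_pixel bytes, and two whole lines of tiles must stay cached.
    """
    tile_bytes = tile_size * tile_size * bytes_per_pixel
    tiles_in_cache = cache_bytes // tile_bytes   # 64 for a 16 MB cache
    tiles_per_line = tiles_in_cache // 2         # two lines of tiles share it
    return tiles_per_line * tile_size


if __name__ == "__main__":
    print(max_efficient_width(16 * 1024 * 1024))
```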
If the driver could work in the left-right top-down order, however, only the
two tiles composing a single block would have to be cached.
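
To illustrate why one block can straddle two tiles, here is a minimal sketch.
It assumes that tiles are aligned to the bottom edge of the raster, so that
the topmost tile row has a gap of unused pixel rows whenever the raster
height is not a multiple of the tile size. This is only one reading of the
layout described above, and the function and its parameters are hypothetical.

```python
def tiles_for_block(block_row, raster_height, tile_size=256):
    """Return (top_gap, tile_rows) for one row of GDAL blocks.

    tile_rows are tile row indices counted from the top of the tile grid.
    Assumption: tiles are anchored at the bottom of the raster, so the top
    tile row starts with `top_gap` pixel rows that hold no raster data.
    """
    top_gap = (-raster_height) % tile_size
    # Pixel rows of the block, measured from the top of the tile grid.
    first_row = block_row * tile_size + top_gap
    last_row = first_row + tile_size - 1
    # The last block may be shorter than a full block: clip to the raster.
    last_row = min(last_row, raster_height + top_gap - 1)
    first_tile = first_row // tile_size
    last_tile = last_row // tile_size
    return top_gap, list(range(first_tile, last_tile + 1))


if __name__ == "__main__":
    print(tiles_for_block(0, 1000))   # block straddles two tiles
    print(tiles_for_block(0, 1024))   # no gap, block maps to one tile
```

Getting top_gap wrong by a whole tile in a calculation like this would shift
every block by one tile row, which is the kind of error described earlier in
this report.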
Same goes for bands. The default implementation of Dataset::IRasterIO just
calls RasterBand::IRasterIO on each band in turn. And while one line of
tiles might usually fit into cache, the whole raster most definitely
wouldn't. This means that for PNG tiles, each file would be touched four
times.
Therefore the optimal order is left-right, top-down, bandwise. This way,
each tile file is read/written only once. How could this be achieved? The
only place where all necessary information is present is the
Dataset::IRasterIO method, because only it can know which bands were
requested. So, my idea was to write Dataset::IRasterIO to use some other
generic method DoIO based on the default IRasterIO but with the optimal
order of access. RasterBand::IRasterIO would call DoIO on its dataset with
appropriate arguments. DoIO would have to find the best overview when
reading and write into all overviews of course.
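
The effect of the access order can be checked with a small simulation. This
is only a sketch: it assumes every block straddles exactly two vertically
adjacent tiles, that one tile file holds all bands, and that the cache is a
plain LRU. All names are invented for the example; none of this is GDAL or
driver code.

```python
from collections import OrderedDict


def gdal_default_order(bands, rows, cols):
    """Bandwise, top-down, left-right (outermost loop first)."""
    for band in range(bands):
        for row in range(rows):
            for col in range(cols):
                yield band, row, col


def tile_friendly_order(bands, rows, cols):
    """Left-right, top-down, bandwise (outermost loop first)."""
    for col in range(cols):
        for row in range(rows):
            for band in range(bands):
                yield band, row, col


def count_tile_loads(order, block_rows, block_cols, bands, cache_tiles):
    """Count how many times a tile file has to be loaded from disk."""
    cache = OrderedDict()   # LRU cache keyed by (tile_row, tile_col)
    loads = 0
    for _band, row, col in order(bands, block_rows, block_cols):
        # Assumption: block (row, col) needs tile (row, col) and the one below.
        for tile in ((row, col), (row + 1, col)):
            if tile in cache:
                cache.move_to_end(tile)        # mark as recently used
            else:
                loads += 1
                cache[tile] = True
                if len(cache) > cache_tiles:
                    cache.popitem(last=False)  # evict least recently used
    return loads


if __name__ == "__main__":
    # 4x8 blocks, 4 bands, i.e. 5x8 = 40 distinct tile files.
    print(count_tile_loads(tile_friendly_order, 4, 8, 4, cache_tiles=2))
    print(count_tile_loads(gdal_default_order, 4, 8, 4, cache_tiles=16))
    print(count_tile_loads(gdal_default_order, 4, 8, 4, cache_tiles=2))
```

In this toy setup the tile-friendly order loads each of the 40 tiles once
with a cache of only two tiles. The default order needs a cache of two whole
tile lines (16 tiles) and still loads every tile once per band.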
So I started to dig into that. I finished the basic structure and decided to
test it to see if all methods were called correctly.
They weren't, of course. Mainly because gdal_translate (which I have been
using for testing) doesn't even call Dataset::RasterIO. It loops through the
raster bands and adds them one after each other to the output dataset.
Exactly what it must not do if access is to be optimal. And this is not
an exception. The notion of a raster band seems to be a pillar of GDAL, and
various parts of the code use bands frequently.
So, what do I do?
One theoretically possible way to optimise writing is to just store IO
requests in some data structure and only optimise and actually perform them
in the FlushCache method. But nothing like this is possible for reading since
the caller expects the data to be transferred by the end of the call. I don't
see a way out; the TMS driver will be slow and will thrash the hard drive.
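
The deferred-write idea could look roughly like the sketch below. It is a
hypothetical illustration, not driver code: the class, its methods and the
write_tile callback are invented here, and flush() merely stands in for what
FlushCache would do. Note that all queued data stays in memory until the
flush, which may be a problem of its own for large rasters.

```python
class DeferredTileWriter:
    """Queue write requests and perform them tile by tile on flush."""

    def __init__(self, write_tile):
        # write_tile(tile_key, bands) is a hypothetical callback that
        # writes one tile file; bands maps band index -> pixel data.
        self._write_tile = write_tile
        self._pending = []

    def queue(self, tile_col, tile_row, band, data):
        """Remember a request instead of touching the tile file now."""
        self._pending.append((tile_col, tile_row, band, data))

    def flush(self):
        """Group requests by tile and write every tile exactly once."""
        by_tile = {}
        for col, row, band, data in self._pending:
            by_tile.setdefault((col, row), {})[band] = data
        # Left-right, then top-down; all bands of a tile go out together.
        for key in sorted(by_tile):
            self._write_tile(key, by_tile[key])
        self._pending.clear()
        return len(by_tile)


if __name__ == "__main__":
    writer = DeferredTileWriter(lambda key, bands: print(key, sorted(bands)))
    for band in range(4):              # requests arrive band by band
        for row in range(2):
            for col in range(2):
                writer.queue(col, row, band, b"pixels")
    writer.flush()
```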
To sum up what I have in my hands right now:
Reading TMS datasets works. The infrastructure for writing blocks is mostly
in place (ten or so lines missing) but I don't have the code that creates
new datasets in the filesystem yet. The cache works but should be improved a
little. All this could be finished in a day or two. After that comes the
rest of GDALDataset boilerplate: transformations, GCPs etc. These I haven't
studied much yet.