[Benchmarking] data block caching technique

Andrea Aime aaime at opengeo.org
Mon Sep 6 14:00:44 EDT 2010


> Hi All,
>
> Looking at MapServer and Geoserver results, it seems clear to me that these
> numbers mainly result from server-side data block caching.

Neither MapServer nor GeoServer caches the data blocks itself.
It's the operating system that does it for us; we are simply using the
standard file reading API.

Actually, I tried loading the spatial indexes into memory (which is a
form of caching), and that did not improve the situation significantly
(it actually worsened the raster results, since less memory was left in
the Java heap for raster calculations).

The reason MapServer and GeoServer are faster is that their reads have
enough data locality that the OS can actually keep the blocks in memory.
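As a minimal illustration of what the OS is doing for us here (not GeoServer code; the file name and size are made up), a server can read through the plain file API and still benefit: the kernel keeps recently read blocks in its page cache, so a second pass over the same data is served from memory rather than disk.

```python
import os
import tempfile
import time

# Scratch file standing in for a data file the server reads repeatedly.
path = os.path.join(tempfile.mkdtemp(), "blocks.bin")
with open(path, "wb") as f:
    f.write(os.urandom(4 * 1024 * 1024))  # 4 MB of data blocks

def timed_read(p):
    """Read the whole file with the standard file API and time it."""
    start = time.perf_counter()
    with open(p, "rb") as f:
        data = f.read()
    return time.perf_counter() - start, len(data)

# No explicit caching in the application: both passes are identical code.
# Whether the blocks come from disk or from the OS page cache is entirely
# the kernel's decision, driven by data locality across requests.
t1, n1 = timed_read(path)
t2, n2 = timed_read(path)
print(f"pass1={t1:.4f}s pass2={t2:.4f}s bytes={n1}")
```

The application-side takeaway is that there is nothing to enable or disable: as long as successive requests touch overlapping blocks, the page cache does the work.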

> Many discussions have already been done on this subject and I think that
> many agree that caching data blocks and using the exact same requests in the
> 3 runs cannot produce realistic results. Indeed, in a realistic use-case,
> the exact same request with the exact same bounding box will never occur
> twice.
>
> I see that this has also been noted by Constellation-SDI during their own
> testing (http://wiki.osgeo.org/wiki/Benchmarking_2010/Constellation-SDI)
> "Note that this is merely an academic exercise since for any non-trivial
> dataset, the size of data on disk will be so much larger than the size of
> available memory that these numbers will never be achieved". Moreover,
>
> We think that this combination of unrealistic test conception and data block
> caching technique is unfair to other participants and will make their
> results looks bad, while they might perform as good or even better in a real
> world use-case.
>
> I think that every one should publish all 3 run results and guarantee that
> these have been measured just after server restarting. We would also like
> that the ones using such technique rerun their test after disabling it.

There is no technique; I think we just read less data (better-performing
indexes, avoiding opening the dbf and shp files when the index says there
is nothing to read, and so on).
I actually spent almost a week improving that side of GeoServer so that
the OS cache could be leveraged to the fullest.
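To sketch the "avoid opening files" idea: if the in-memory spatial index reports no features in the query bounding box, the .shp and .dbf files never need to be touched for that request. The index structure below is an illustrative stand-in, not GeoServer's actual index format.

```python
def bbox_intersects(a, b):
    """Axis-aligned bounding-box overlap test; boxes are (minx, miny, maxx, maxy)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def query(index, query_bbox):
    """Return record offsets whose bboxes overlap the query.

    An empty result means the request can be answered without opening
    the .shp or .dbf files at all, which is where the I/O savings come from.
    """
    return [rec for bbox, rec in index if bbox_intersects(bbox, query_bbox)]

# Toy index: (feature bbox, record offset) pairs kept in memory.
index = [((0, 0, 10, 10), 1), ((20, 20, 30, 30), 2)]

hits = query(index, (5, 5, 15, 15))        # overlaps the first feature
misses = query(index, (100, 100, 110, 110))  # empty: no file I/O needed
print(hits, misses)
```

The win is largest for tiled rendering, where many tiles fall entirely outside the data extent and can be answered from the index alone.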

Just out of curiosity, have you tried making 6 runs? Does that let you
break through the disk boundness?
For example, in my benchmarking setup the shapefile 4326 and the shapefile
3857 tests run in sequence, which basically accesses the same data 6 times
in a row, yet we only break free of disk boundness towards the end of the
2nd 4326 run.
But if more runs were sufficient we'd see people getting much better results
in the reprojected case, which is accessing the same data for the 4th, 5th
and 6th time. Is anyone actually seeing that?

Cheers
Andrea

