[mapserver-users] Ed's Rules for the Best Raster Performance
Jim Klassen
Jim.Klassen at ci.stpaul.mn.us
Tue Sep 16 13:05:39 PDT 2008
Ed,
Good points about using tiled tiffs so that mapserver doesn't have to read the whole file. I was thinking TIFFs were scanline based where you would have to do a lot of reading or a lot of seeking anyway if you wanted to pull out a subset of the file to render. JPEG in TIFF is also interesting... I have seen that before, but there were compatibility issues with it at the time (not with mapserver or gdal) so I've avoided it since. It sounds like this is at least worth some experimentation on my part.
Also, I agree this is a fairly complex problem when taking into account many simultaneous requests and that is why I asked the question.
BTW: if I were to do it again (using basically the same approach I did the first time), I'd use 1024x1024 pixel tiles instead of 1000x1000. That would make the factors of two in resolution easier to handle and would fit block boundaries better.
Jim
>>> Ed McNierney <ed at mcnierney.com> 09/15/08 9:14 PM >>>
Damn, I'm going to have to get around to unsubscribing soon so I can shut myself up!
Jim, please remember that your disk subsystem does not read only the precise amount of data you request. The most expensive step is telling the disk head to seek to a random location to start reading the data. The actual reading takes much less time in almost every case. Let's invent an example so we don't have to do too much hard research <g>.
A 7,200-RPM IDE drive has an average read seek time of about 9 ms, and most can sustain a real-world transfer rate of around 60 MB/s (these are very rough approximations). So to read 256KB of sequential data, you spend 9 ms seeking to the right track and then about 4 ms reading the data - that's 13 ms. Doubling the read size to 512KB will only take about 4 ms (or 30%) longer, not 100% longer. But even that's likely to be an exaggeration, because your disk drive - knowing that seeks are expensive - will typically read a LOT of data after doing a seek. Remember that "16MB buffer" on the package? The drive will likely read far more than you need, so the "improvement" you get by cutting the amount of data read in a given seek in half is likely to be nothing at all.
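The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model using the rough figures from the example (9 ms seek, 60 MB/s transfer). The function name and defaults are made up for illustration, and it ignores drive read-ahead and caching entirely.

```python
def read_time_ms(size_kb, seek_ms=9.0, transfer_mb_per_s=60.0):
    """Estimate one seek plus one sequential read of size_kb kilobytes.

    The defaults are the rough figures from the example, not measurements.
    """
    transfer_ms = (size_kb / 1024.0) / transfer_mb_per_s * 1000.0
    return seek_ms + transfer_ms


small = read_time_ms(256)   # seek + 256KB
large = read_time_ms(512)   # seek + 512KB
extra_fraction = (large - small) / small

print(f"256KB read: {small:.1f} ms")
print(f"512KB read: {large:.1f} ms")
print(f"doubling the read costs about {extra_fraction:.0%} more time")
```

With these inputs the seek dominates: doubling the read size adds roughly a third to the total, nowhere near 100%.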
There are limits, of course. The larger your data read is, the more likely it is to be split up into more than one location on disk. That would mean another seek, which would definitely hurt. But in general if you're already reading modest amounts of data in each shot, reducing the amount of data read by compression is likely to save you almost nothing in read time and cost you something in decompression time (CPUs are fast, so it might not cost much, but it will very likely require more RAM, boosting your per-request footprint, which means you're more at risk of starting to swap, etc.).
And remember that not all formats are created equal. In order to decompress ANY portion of a JPEG image, you must read the WHOLE file. If I have a 4,000x4,000 pixel 24-bit TIFF image that's 48 megabytes, and I want to read a 256x256 piece of it, I may only need to read one megabyte or less of that file. But if I convert it to a JPEG and compress it to only 10% of the TIFF's size, I'll have a 4.8 megabyte JPEG but I will need to read the whole 4.8 megabytes (and expand it into that RAM you're trying to conserve) in order to get that 256x256 piece!
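The numbers in that comparison can be checked quickly. The sketch below assumes 3 bytes per pixel (24-bit), a TIFF stored in 256x256 internal tiles (the block size mentioned elsewhere in this thread), and a JPEG that has to be read whole as described above. The variable names are illustrative only.

```python
BYTES_PER_PIXEL = 3  # 24-bit colour


def raw_bytes(width_px, height_px):
    """Uncompressed size of a width x height block of 24-bit pixels."""
    return width_px * height_px * BYTES_PER_PIXEL


tiff_bytes = raw_bytes(4000, 4000)      # the whole uncompressed image
piece_bytes = raw_bytes(256, 256)       # the piece we actually want
jpeg_bytes = tiff_bytes // 10           # JPEG at 10% of the TIFF's size

# Assumption: with 256x256 tiles, a window that is not aligned to the
# tile grid overlaps at most 2 x 2 = 4 tiles.
worst_case_tiled_read = 4 * raw_bytes(256, 256)

print(f"TIFF:             {tiff_bytes:>12,} bytes")
print(f"JPEG (read whole):{jpeg_bytes:>12,} bytes")
print(f"256x256 piece:    {piece_bytes:>12,} bytes")
print(f"worst-case tiles: {worst_case_tiled_read:>12,} bytes")
```

Under these assumptions the tiled TIFF read stays under a megabyte even in the unaligned case, while the JPEG costs the full 4.8 megabytes every time.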
Paul is right - sometimes compression is necessary when you run out of disk (but disks are pretty darn cheap - the cost per megabyte of the first hard drive I ever purchased (a Maynard Electronics 10 MB drive for my IBM PC) is approximately 450,000 times higher than it is today). If you are inclined toward JPEG compression, read about and think about using tiled TIFFs with JPEG compression in the tiles; it's a reasonable compromise that saves space while reducing the whole-file-read overhead of JPEG.
Where the heck is that unsubscribe button?
- Ed
On 9/15/08 9:23 PM, "Paul Spencer" <pspencer at dmsolutions.ca> wrote:
Jim, you would think that ;) However, in practice I wouldn't expect
the disk access time for geotiffs to be significantly different from
jpeg if you have properly optimized your geotiffs using gdal_translate
-co "TILED=YES" - the internal structure is efficiently indexed so
that gdal only has to read the minimum number of 256x256 blocks to
cover the requested extent. And using gdaladdo to generate overviews
just makes it that much more efficient.
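To make the "minimum number of 256x256 blocks" idea concrete, here is a small Python sketch of the index arithmetic. It is not GDAL's code, just an illustration of which blocks a pixel window overlaps; the function name and arguments are hypothetical.

```python
def blocks_for_window(x_off, y_off, width, height, block=256):
    """Return (col, row) indices of the blocks overlapping a pixel window.

    x_off, y_off: top-left pixel of the requested window
    width, height: window size in pixels
    block: internal tile size in pixels (256 assumed here)
    """
    first_col = x_off // block
    last_col = (x_off + width - 1) // block
    first_row = y_off // block
    last_row = (y_off + height - 1) // block
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


aligned = blocks_for_window(512, 512, 256, 256)
unaligned = blocks_for_window(600, 600, 256, 256)

print(len(aligned), "block(s) for an aligned 256x256 window")
print(len(unaligned), "block(s) for an unaligned 256x256 window")
```

A window that lines up with the tile grid touches a single block, and one that straddles tile edges touches four, so the amount read stays small regardless of how big the file is.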
Even if you are reading less physical data from the disk to get the
equivalent coverage from jpeg, the decompression overhead is enough to
negate the difference in IO time based on Ed's oft quoted advice (and
others' experience too, I think). The rules that apply in this case
seem to be 'tile your data', 'do not compress it' and 'buy the fastest
disk you can afford'.
Compression is useful and probably necessary if you hit disk space
limits.
Cheers
Paul
On 15-Sep-08, at 5:48 PM, Jim Klassen wrote:
> Just out of curiosity, has anyone tested the performance of Jpegs
> vs. GeoTiffs?
>
> I would expect at some point the additional disk access time
> required for GeoTiffs (of the same pixel count) as Jpegs would
> outweigh the additional processor time required to decompress the
> Jpegs. (Also the number of Jpegs that can fit in disk cache is
> greater than for similar GeoTiffs.)
>
> For reference we use 1000px by 1000px Jpeg tiles (with world files).
> We store multiple resolutions of the dataset, each in its own
> directory. We start at the native dataset resolution, and halve that
> for each step, stopping when there are fewer than 10 tiles produced
> at that particular resolution. (E.g., for one of our county-wide
> datasets 6in/px, 1ft/px, 2ft/px, ... 32ft/px). A tileindex is then
> created for each resolution (using gdaltindex followed by shptree)
> and a layer is created in the mapfile for each tileindex and
> appropriate min/maxscales are set. The outputformat in the mapfile
> is set to jpeg.
>
> Our typical tile size is 200KB. There are about 20k tiles in the
> 6in/px dataset, 80k tiles in the 3in/px dataset (actually 4in data, but
> stored in 3in so it fits with the rest of the datasets well). I have
> tested and this large number of files in a directory doesn't seem to
> affect performance on our system.
>
> Average access time for a 500x500px request to mapserver is 300ms
> measured at the client using perl/LWP and about 220ms with shp2img.
>
> Machine is mapserver 5.2.0/x86-64/2.8GHz Xeon/Linux 2.6.16/ext3
> filesystem.
>
> Jim Klassen
> City of Saint Paul
>
>>>> "Fawcett, David" <David.Fawcett at state.mn.us> 09/15/08 1:10 PM >>>
> Better yet,
>
> Add your comments to:
>
> http://mapserver.gis.umn.edu/docs/howto/optimizeraster
>
> and
>
> http://mapserver.gis.umn.edu/docs/howto/optimizevector
>
> I had always thought that all we needed to do to make these pages great
> was to grok the list for all of Ed's posts...
>
> David.
>
> -----Original Message-----
> From: mapserver-users-bounces at lists.osgeo.org
> [mailto:mapserver-users-bounces at lists.osgeo.org] On Behalf Of Brent Fraser
> Sent: Monday, September 15, 2008 12:55 PM
> To: mapserver-users at lists.osgeo.org
> Subject: [mapserver-users] Ed's Rules for the Best Raster Performance
>
>
> In honor of Ed's imminent retirement from the Mapserver Support Group,
> I've put together "Ed's List for the Best Raster Performance":
>
>
> #1. Pyramid the data
> - use MAXSCALE and MINSCALE in the LAYER object.
>
> #2. Tile the data (and merge your upper levels of the pyramid for fewer
> files).
> - see the TILEINDEX object
>
> #3. Don't compress your data
> - avoid jpg, ecw, and mrsid formats.
>
> #4. Don't re-project your data on-the-fly.
>
> #5. Get the fastest disks you can afford.
>
>
> (Ed, feel free to edit...)
>
> Brent Fraser
> _______________________________________________
> mapserver-users mailing list
> mapserver-users at lists.osgeo.org
> http://lists.osgeo.org/mailman/listinfo/mapserver-users
__________________________________________
Paul Spencer
Chief Technology Officer
DM Solutions Group Inc
http://www.dmsolutions.ca/