shapefile optimization for dynamic data

Ed McNierney ed at TOPOZONE.COM
Fri Apr 21 17:52:14 EDT 2006


Ben -

Thanks for the follow-up; storing your data on a RAM disk is not a
typical scenario!  And that (I think) is a major reason you're seeing
little benefit from a shptree index.  The chief benefit of a shptree
index is that it allows MapServer to avoid reading objects from a
shapefile that can't possibly be needed for the current request.
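
For reference, building that index is a one-liner (the shapefile name
here is just a placeholder):

    # Create a quadtree spatial index (points.qix) alongside the
    # shapefile; MapServer picks it up automatically at render time.
    shptree points.shp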

If your shapefile is relatively small, external factors (processing the
MAP file, locating the layer shapefile, opening the shapefile) will take
proportionally more of the overall time, so the actual shapefile reading
is a smaller piece of the puzzle and there's less benefit in optimizing
it.  That is, if the actual shapefile reading makes up 30% of the entire
MapServer runtime, and your indexing reduces the reading time by 50%,
your request will only be 15% faster overall.

In addition, disk seeking and reading are among the slowest things you
can do on a computer - unless it's a RAM disk, in which case they're
among the fastest!  Since there are, of course, no moving parts in a RAM
disk, seeking from one "location" to another just means updating a
pointer.  That's already so fast that optimizing those "seeks" - i.e.,
optimizing a few pointer updates - won't help much.

I think you're likely correct that the CPU speed is a major factor.
However, when you've tweaked everything else, pay attention to how much
time is spent simply processing the MAP file.  This is often not
obvious, since the profiling reported with DEBUG ON must, of necessity,
only cover operations after the MAP file is read (MapServer has to read
the MAP file before it even knows you've included a DEBUG ON statement
in it).
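
One rough way to see this (just a sketch - the mapfile names are made
up) is to time shp2img against your real mapfile and against a copy
with every layer set to STATUS OFF, so the second run is mostly MAP
file parsing:

    # Parse the MAP file and draw all layers
    time shp2img -m app.map -o full.png

    # Same MAP file with all layers STATUS OFF - mostly parse time
    time shp2img -m app_nolayers.map -o empty.png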

I've found applications that I thought were "pretty good" that were
spending three times as long processing the MAP file as they spent
drawing the map!

     - Ed

Ed McNierney
TopoZone.com

-----Original Message-----
From: UMN MapServer Users List [mailto:MAPSERVER-USERS at LISTS.UMN.EDU] On
Behalf Of Ben Eisenbraun
Sent: Friday, April 21, 2006 4:49 PM
To: MAPSERVER-USERS at LISTS.UMN.EDU
Subject: Re: [UMN_MAPSERVER-USERS] shapefile optimization for dynamic
data

> Ben Eisenbraun wrote:
>> I'm collecting data via a GPS and a sensor that reports a data
>> point once per second. I'm using MapServer CGI to generate an
>> overlay onto a map via a JavaScript frontend that auto-refreshes
>> every few seconds. The application has to run on a low-power
>> embedded hardware device (roughly a p2-266), and I'm running into
>> performance problems once I've collected a few thousand data
>> points. The MapServer CGI process tends to consume all the CPU
>> trying to render the overlays.
<snip my previous post>
> Stephen Woodbridge wrote:
> Using shptree will not help you that much in this scenario because of
> the frequency of updating the file. Your best bet would be to use
> multiple files and a tile index that you would have to add the new
> files to as they are created. Then you can run shptree on the
> non-active files, but not on the active file. That will probably be
> the best approach. Also make sure you shptree the tileindex.

A little follow up:

I tried this route.  I wrote the shapefile generation scripts so that
you could set a maximum number of points per shapefile, and the system
would create a shapefile of, e.g., 1000 points, shptree it, add it to
the tile index, shptree the tile index, and then create a new unindexed
shapefile for the next batch of 1000 points.
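
For the curious, the rotation boiled down to something like this (a
simplified sketch - the file names and the use of tile4ms are
illustrative, not the exact script):

    # The active shapefile just hit the point limit; index it
    shptree active_0042.shp

    # Rebuild the tile index over all finished tiles, then index it
    # too (tile4ms reads a list of shapefile names and writes a tile
    # index shapefile)
    ls active_*.shp > tiles.txt
    tile4ms tiles.txt tileindex
    shptree tileindex.shp

    # The generation script then starts a fresh, unindexed shapefile
    # for the next batch of points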

It had basically no effect.  I tested a range of shapefile sizes, from
5000 down to 50 points per shapefile, with almost identical performance
at every size for 20,000 total points.  Runs with shptree indexes were
likewise nearly indistinguishable from runs without; at the larger
sizes I saw roughly a 3-5% decrease in rendering time.

So... yuck.

Given how strongly recommended shptree indexes are on the list, I
thought my testing methodology might be flawed.

I have a list of URLs that represent tiles for the entire dataset, as
well as a list of URLs for common views of the data (zoom level and
number of points), that I used for all my tests.  I was using shell
scripts with 'time' and 'curl' to grab images across the network, and I
thought it might be a network or HTTP effect, but I was able to
duplicate the results locally with 'shp2img'.
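
The harness was nothing fancy - roughly this, with placeholder URL,
paths, and filenames:

    # Over HTTP: fetch a canned view and time it
    time curl -s -o /dev/null \
        'http://gps-box/cgi-bin/mapserv?map=/maps/app.map&mode=map'

    # Locally, cutting out the network and HTTP stack entirely
    time shp2img -m /maps/app.map -o test.png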

> If a shapefile does not have a qix spatial index, then MapServer
> creates one on the fly and throws it away. If you are adding a point
> a second, the file is probably getting updated faster than you can
> index it and then render it. Using the tileindex should really help
> in this case also, because only the files that intersect your current
> display window need to be opened and looked at.
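
For anyone following along, the tileindex layer in my mapfile looks
roughly like this (the layer and file names are placeholders):

    LAYER
      NAME "points"
      TYPE POINT
      STATUS ON
      TILEINDEX "tileindex.shp"  # the shptree'd index of tiles
      TILEITEM "LOCATION"        # attribute holding each tile's path
      CLASS
        STYLE
          COLOR 255 0 0
        END
      END
    END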

I think my shapefile sizes must be significantly smaller than the data
that most people are using.  20,000 points ends up being about 1.5 MB of
shapefiles with slightly larger dbf files for the attributes.  I'm
reading/writing these files to a ramdisk under Linux, so access should
be pretty quick.
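
The ramdisk is just a tmpfs mount along these lines (the size and
mount point are arbitrary):

    mount -t tmpfs -o size=16m tmpfs /mnt/ramdisk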

My suspicion at this point is that the CPU is simply under-specced for
this application.  It's not really a Pentium II; it's a low-power
586-class Geode CPU with no level 2 cache.

The biggest performance increase came from breaking the points out into
separate shapefiles based on their attributes.  I was previously
creating a single layer in the mapfile and using CLASS expressions to
colorize the features.  By pre-classifying the data into separate
shapefiles, I was able to decrease rendering time by 10-12%.
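
Concretely, the mapfile went from one layer that classifies per feature
to one layer per pre-split shapefile (attribute names and colors here
are made up):

    # Before: one layer, one EXPRESSION test per feature
    LAYER
      NAME "readings"
      DATA "readings"
      TYPE POINT
      STATUS ON
      CLASSITEM "LEVEL"
      CLASS
        EXPRESSION "high"
        STYLE
          COLOR 255 0 0
        END
      END
      CLASS
        EXPRESSION "low"
        STYLE
          COLOR 0 0 255
        END
      END
    END

    # After: the split happens when the shapefiles are written, so
    # each layer draws everything in its file with no expression test
    LAYER
      NAME "readings_high"
      DATA "readings_high"
      TYPE POINT
      STATUS ON
      CLASS
        STYLE
          COLOR 255 0 0
        END
      END
    END
    # ...plus a matching "readings_low" layer with the blue style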

I never did end up checking out PostGIS or SQLite for this application,
but I'm not sure either would have helped.  Creating and updating the
shapefiles is actually relatively low overhead compared to generating
the overlays.

Thanks for everyone's suggestions.

-ben

--
this machine kills fascists.                           <woody guthrie>


