MS RFC 22a: Feature cache for long running processes and query processing (update)

Thu Jun 28 08:31:33 EDT 2007

2007/6/28, Daniel Morissette <dmorissette at mapgears.com>:

> Life would be so much easier for us if all our users were of the type
> you describe here. Unfortunately my experience differs from yours: in my
> experience I have seen all sorts of users, some beginners, some
> experienced, some who read docs and some who don't. I have also learned
> that even the talented developers prefer when the software behaves in a
> natural way, does "the right thing" by default, and when the interface
> to use a feature is simple.

Daniel,

That cannot always be maintained, especially when the problems being
addressed are extensive and complex by nature. I don't think we have only
simple tools to solve the problems of the world. For example, SWIG itself
provides a fairly abstract way of creating wrappers, and it's quite
difficult for the average user to follow what's happening behind the scenes.
Though there are hundreds of pages documenting it, I suspect most of us
have to dig into the actual implementation to understand many of
its aspects. But that does not prevent users from utilizing the
capabilities of that project.

Moreover, in this particular case the user can continue to think in terms
of data providers and layers. The only addition is that some layers can use
other layers to obtain their features. I think this approach is new but not
unnatural from the point of view of the mapserver project.

>
> That being said, if what RFC-22a proposes is the simplest possible
> solution for the double-pass query then so be it (at least it's a
> solution), but you can be assured that this approach will prompt all
> sorts of questions from all sorts of users, especially with respect to
> the way it solves the double-pass query issue which is my main concern
> in this discussion.
>

If we think of this problem as a deficiency of some particular providers,
then it should be solved silently by those providers. But I think caching
data for subsequent renderings, and storing, preprocessing and
post-processing the query results, is a bit more than a fix for an issue.
This RFC addresses new features more than fixes for existing problems.

> >
> > Or alternatively we could focus only on the 2 pass query problem by not
> > utilizing the vtable. It will possibly require either to modify all of
> > the mapserver code involved in the query operations or modify all of the
> > providers suffering from this particular problem. This might require a
> > large amount of changes in the existing code and would solve at most 20%
> > of the problems I've addressed.
> >
>
> Well, that 20% (the double-pass query) is the one that keeps coming back
> every once in a while. The other 80% are bonus features for which there
> has been very little demand so far.

I'm not entirely sure about that. Many people are involved in handling
query results, which may require representing the features in separate
inline layers and writing a fair amount of code around the mapserver core to
process the results. The reason for the low demand is that they don't expect
mapserver ever to support a built-in solution for that.

>
> Let's keep in mind that "MapServer is not a full-featured GIS system,
> nor does it aspire to be". Transforming features on the fly is nice, but
> that kind of processing has never been MapServer's focus. I believe
> other tools such as PostGIS support this kind of operations and I always
> had a preference for letting them offer those features and letting
> MapServer concentrate on what it does best: publish maps on the web.
>

I consider that statement more of an apology than a design goal. These
query/selection processing options are related to how mapserver represents
the results on the map, which, I think, is one of its primary objectives.

I'm not sure about the restriction to Web mapping either. There are
a number of existing applications that use mapserver as the rendering engine
in a desktop environment with long running processes as well.

PostGIS might have support for applying such data transformations,
but using it might be suboptimal in many cases.
For example:

1. The user would not like to rely on a particular provider and wants
a provider-neutral solution.
2. The processing should be done in-process, and frequent database
access is undesirable in the query operations.
3. The user would want to involve various providers in the processing. For
example, the selection shapes for the filter would come from a
mapinfo.tab data source.
4. The user would want to support custom styles for the renderings based
on some external data, using styleitem "auto" with some additional
providers in the future.
5. Using queries based on other query results. This would eliminate the
need to pass the results back to the data source using some additional
interfaces unrelated to mapserver.
...

> I think the users who need a Web GIS should be looking more at MapGuide
> than MapServer.

If we always direct people to other projects to solve their problems and
keep mapserver supporting only a narrow subset of the Web GIS capabilities,
then the project will eventually die.

I'm not very familiar with MapGuide. But its user guide has some interesting
chapters, like the "Creating Buffers Around Map Features" one.

>
> Let's say that the first WhichShapes call loads 1,000 shapes, and then I
> do a query by point on that layer. Since there is no spatial index in
> memory, all the shapes in the cache will have to be accessed to identify
> the ones that are within tolerance of the query location. Sure, looking
> up the bounds of 1000 shapes is not a huge cost, but it's a cost, on top
> of all the memory used to cache all those shapes.
>

This question is related to how the cache is implemented, but it does not
invalidate the overall concept. I expect the actual implementation inside
the providers might change to handle the lookups more efficiently. At the
moment I would rather establish the framework than a fully implemented
solution inside it. I can eventually provide an option so that the cache is
reconstructed every time WhichShapes is called, but there's no way to
distinguish between the various purposes of the provider accesses (query or
rendering).
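
A minimal sketch of the scenario discussed above, assuming a cache that is
filled on a WhichShapes-like call and searched linearly on a point query.
The names Shape, FeatureCache, which_shapes, query_by_point and
demo_provider are hypothetical and are not part of the MapServer API; the
sketch only illustrates that without an index every cached shape's bounds
must be inspected.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    index: int
    bounds: tuple  # (minx, miny, maxx, maxy)


class FeatureCache:
    """Hypothetical in-memory feature cache (not MapServer's actual API)."""

    def __init__(self, provider):
        self.provider = provider  # callable: extent -> iterable of Shape
        self.shapes = {}
        self.extent = None

    def which_shapes(self, extent):
        # Refill the cache only when the requested extent changes.
        if extent != self.extent:
            self.shapes = {s.index: s for s in self.provider(extent)}
            self.extent = extent
        return len(self.shapes)

    def query_by_point(self, x, y, tol):
        # No spatial index: the bounds of every cached shape are checked.
        hits = []
        for s in self.shapes.values():
            minx, miny, maxx, maxy = s.bounds
            if minx - tol <= x <= maxx + tol and miny - tol <= y <= maxy + tol:
                hits.append(s.index)
        return hits


provider_calls = []


def demo_provider(extent):
    # Stand-in for a real data provider: 1000 unit squares in a row.
    provider_calls.append(extent)
    return [Shape(i, (i, 0, i + 1, 1)) for i in range(1000)]


cache = FeatureCache(demo_provider)
cache.which_shapes((0, 0, 1000, 1))
cache.which_shapes((0, 0, 1000, 1))  # same extent: served from the cache
print(cache.query_by_point(5.5, 0.5, 0.1))
```

The cache itself cannot tell whether a call came from a query or from a
rendering pass; it only sees the extent, which matches the limitation
described above.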

> OTOH, if the data provider supports a spatial index it can find the
> matching shapes (2 or 3 shapes in general) with very little work using
> its spatial index, removing any benefit of caching and without the cost
> of all the memory used to cache features.

It's up to the developer to decide which one to use in a particular
situation. But if he gets poor performance with the former, he might
want to test the other option.
Spatial indexes could even be implemented inside the cache. However,
constructing the index might cause some additional load.
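
As an illustration of an index inside the cache, here is a sketch of a
uniform grid index. This is only one possible choice, picked for brevity;
the class GridIndex and its methods are hypothetical and nothing here is
taken from the RFC. The insert loop also shows where the extra
construction cost comes from.

```python
from collections import defaultdict
from math import floor


class GridIndex:
    """Hypothetical uniform grid index over shape bounding boxes."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (cx, cy) -> [(shape_id, bounds)]

    def _cell(self, v):
        return floor(v / self.cell_size)

    def insert(self, shape_id, bounds):
        # Construction cost: each shape is registered in every cell
        # that its bounding box overlaps.
        minx, miny, maxx, maxy = bounds
        for cx in range(self._cell(minx), self._cell(maxx) + 1):
            for cy in range(self._cell(miny), self._cell(maxy) + 1):
                self.cells[(cx, cy)].append((shape_id, bounds))

    def query_point(self, x, y, tol=0.0):
        # Only the cells touched by the tolerance box are visited.
        found = set()
        for cx in range(self._cell(x - tol), self._cell(x + tol) + 1):
            for cy in range(self._cell(y - tol), self._cell(y + tol) + 1):
                for shape_id, b in self.cells.get((cx, cy), ()):
                    minx, miny, maxx, maxy = b
                    if (minx - tol <= x <= maxx + tol
                            and miny - tol <= y <= maxy + tol):
                        found.add(shape_id)
        return sorted(found)


index = GridIndex(cell_size=10)
for i in range(1000):
    index.insert(i, (i, 0, i + 1, 1))
print(index.query_point(5.5, 0.5, 0.1))
```

With this layout a point query looks at the handful of shapes registered
in one or a few cells instead of all 1000 cached shapes, at the price of
the additional memory and the build step.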

The increased memory usage is inevitable with the cache. If the memory
footprint is critical, then using the cache might not be the optimal way
to enhance performance.

>
> Of course if I render the same map area 20 times in a persistent process
> then I will benefit from the cache, but I never wrote any MapServer
> application that does that. The typical application renders a map once
> and then moves to a new area or zooms in a separate request which does
> not benefit from caching, so there is little benefit to caching when
> rendering a map.

This is only an initial solution. In the future I'm planning to support
caching features from multiple extents, sharing the cache within the
process, and persisting the data.

>
> OTOH there would be real benefits to caching the first pass of a
> double-pass query since we are assured that we'll read the shapes twice
> in this case, and there are usually very few shapes to cache. Thinking
> about it some more I think I'd like to see a mode of operation of the
> cache that only caches queries.

To support such an operation, the provider would have to be capable of
distinguishing between the purposes of the various WhichShapes calls. I
think that might confuse the original concept, in which the provider is not
aware of its clients. I would rather encourage defining separate layers for
query and rendering purposes.

> >
> >> and even if it has a NextItem method
> >> to walk through all objects, the order of objects is not maintained by a
> >> hashtable, so if a user has data sorted (by sortshp) then the sort order
> >> will be lost and rendering order will become pseudo-random if done via a
> >> cache layer (unless I'm missing something?).
> >>
> >
> > That's true. I'm not aware of the order of the renderings in this case.
> > In my practice I haven't found such a problem it was required.
> > However we could use an additional list to treat this issue if it is
> > significant.
> >
>
> This ordering of shapes at render time is a feature of MapServer, hence
> the command-line program sortshp. I don't use it myself but some users
> must rely on it otherwise it would not exist. I think it's a sad
> side-effect to not try to maintain the ordering but I'll let those who
> need this feature fight for it.

Currently there's no official mechanism for the provider to communicate the
order, other than the sequence in which the shapes are passed back from the
NextShape call. That might be sufficient when only one extent is cached
(as now), but there's no way to identify the order among subsequent
fetches. We might also rely on the shapeindex/tileindex values, but it
hasn't been officially declared that these values indicate the order, or
even that they are preserved among subsequent queries.
So the concept behind this option should be clarified first.
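
For the single-extent case, the "additional list" idea can be sketched as
follows. OrderedShapeCache and its methods are hypothetical names used
only for illustration. The sketch keeps a hash table for lookups plus a
list recording the sequence in which shapes arrived, which is assumed here
to be the sequence returned by NextShape.

```python
class OrderedShapeCache:
    """Hypothetical cache: hash lookup plus a list preserving fetch order."""

    def __init__(self):
        self._by_id = {}   # shape id -> shape, for fast lookup
        self._order = []   # shape ids in the order they were first added

    def add(self, shape_id, shape):
        # Record the position only the first time an id is seen.
        if shape_id not in self._by_id:
            self._order.append(shape_id)
        self._by_id[shape_id] = shape

    def get(self, shape_id):
        return self._by_id.get(shape_id)

    def __iter__(self):
        # Walk the shapes in their original fetch order.
        for shape_id in self._order:
            yield self._by_id[shape_id]


ordered = OrderedShapeCache()
for sid in (30, 10, 20):  # the order in which the provider returned shapes
    ordered.add(sid, f"shape-{sid}")
print(list(ordered))
```

This preserves the order of a single fetch only. It does not say how
shapes from several fetches should be interleaved, which is the open
question raised above.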

Best regards,

Tamas