[geos-devel] QgsVectorLayerCache

Sandro Mani manisandro at gmail.com
Mon Jun 15 07:52:45 PDT 2015


Hi Matthias

On 15.06.2015 15:34, Matthias Kuhn wrote:
> Hi Sandro
>
> On 06/15/2015 02:42 PM, Sandro Mani wrote:
>> Hello Matthias and List
>>
>> I have two questions about the QgsVectorLayerCache, which you
>> (@Matthias) have implemented [1].
>>
>> 1. First the easy one:
>> https://github.com/qgis/QGIS/blame/master/src/core/qgsvectorlayercache.cpp#L97
>>
>> -> I'm not sure what this block of code is supposed to do. As far as I
>> can see it just performs an empty iteration over all layer features,
>> but otherwise has no effect. Am I missing something? This block of
>> code is executed when loading the attribute table. If I comment it
>> out, I can't spot any side effects, except that the attribute table
>> loads faster.
> It's a QgsCachedFeatureWriterIterator, which fills the cache while
> being iterated over.
> It's only invoked when full caching is requested, to avoid incremental
> population of the cache with a lot of subsequent requests.
> If you have slow round-trips and disable this code, the effect should
> be noticeable. If it's not, there's something wrong with it.
Uhm ok, I'll need to investigate this together with my AFS provider
implementation; somehow the result of that code block was that all
features were fetched from the server twice.
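For reference, my reading of that block is essentially the following
pattern (paraphrased from the linked source, not a literal copy):

    // When full caching is requested, iterate once over all features.
    // The iterator is a QgsCachedFeatureWriterIterator, so each call to
    // nextFeature() writes the feature into the cache as a side effect,
    // hence the seemingly "empty" loop.
    QgsFeatureIterator it = getFeatures( QgsFeatureRequest() );
    QgsFeature f;
    while ( it.nextFeature( f ) )
    {
      // intentionally empty: iterating populates the cache
    }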
>
>> 2. Secondly, to the vector layer cache in general.
>> Some background: I've done an initial implementation of an ArcGIS
>> Feature Service ("AFS") data provider, and similarly to the WFS
>> provider, the question of intelligent caching arises, to reduce
>> round-trips with the server. The WFS provider just caches all features
>> in memory (if the corresponding option is checked), which is
>> suboptimal for large datasets.
>> I've hence been thinking about implementing a local disk-based cache
>> (say in the form of a SpatiaLite DB), which acts as a local feature
>> cache. The usefulness of this could however go beyond just WFS and
>> AFS, to include all non-local data sources. So my idea is to implement
>> something like a QgsVectorDataProviderProxy which
>>
>> - overrides getFeatures to return a QgsCacheProxyFeatureIterator: this
>> iterator first checks whether a feature is cached, and only fetches it
>> from the data provider if it is not. If the QgsFeatureRequest
>> includes an extent, entire pages of features could be loaded from
>> disk into memory (up to a specified threshold).
>>
>> - overrides all add/change/delete methods to ensure that the cache
>> remains consistent.
>>
>> Actually I think the most elegant approach would be to have
>> QgsVectorLayer::mDataProvider be an instance of this
>> QgsVectorDataProviderProxy. If the data source is local, the calls are
>> simply forwarded to the actual data provider; otherwise, the behavior
>> outlined above applies.
>>
>> So (@Matthias): such an implementation would pretty much overlap with
>> what you have implemented, but does the work directly at the provider
>> level. What are your thoughts on this? From your experience
>> implementing [1], do any alarm bells start ringing?
> I have thought about this approach as well, since it seems very nice
> to have one shared cache which is able to provide several consumers
> with cached data (canvas, attribute table...). Do you think you will
> be introducing a size limit?
Speaking of the disk cache: yes, I suppose that would make sense,
perhaps as a configurable option in the user preferences. For the memory
cache, there would clearly be a size limit.
>
> One risk I see is that if you have different consumers (with a shared
> cache), they have different requirements.
> For the canvas the requirement is usually to have some spatial index
> that keeps track of which regions are cached and whether a new request
> can be satisfied. It would even be easy/nice to do some look-ahead to
> pre-load features, or to only load part of the canvas if a big region
> is already loaded, or to do tiling.
Right, I'd like to model the cache around an idea of "pages", i.e. 
entire spatial regions which can be swapped in and out of memory 
depending on the current region of interest.
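As an illustrative sketch, such a page could look roughly like this
(all names are made up, nothing of this exists yet):

    // Hypothetical shape of a cache "page"; none of these names exist
    // in the QGIS API.
    struct CachePage
    {
      QgsRectangle extent;                      // spatial region this page covers
      QMap<QgsFeatureId, QgsFeature> features;  // features currently held in memory
      QDateTime lastAccess;                     // updated on every hit, for eviction
      bool swappedOut;                          // true if spilled to the SpatiaLite DB
    };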
>
> If another consumer then does a second request without a spatial filter
> (none, or an attribute filter instead), it may fetch a lot of features
> and pollute your cache with them. If the cache has a size limit, it can
> then be purged of previously cached features which would still be more
> important for drawing than the ones fetched for a different consumer,
> which may have been requested just once.
Yes, I see the problem. First, one would need to investigate how
expensive such cache thrashing is compared to the situation with no
cache at all. Then I suppose the usual ideas apply, like keeping an
access timestamp on each page loaded in memory; if a page needs to go,
the one whose last access lies furthest back gets thrown out.
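Sketched out, the eviction could look something like this (CachePage is
the hypothetical structure from above, writePageToDisk() a hypothetical
helper that spills a page into the SpatiaLite cache):

    // Hedged sketch of LRU-style page eviction.
    void evictIfNeeded( QList<CachePage*>& pages, int maxPages )
    {
      while ( pages.size() > maxPages )
      {
        // Find the page whose last access lies furthest back
        int oldest = 0;
        for ( int i = 1; i < pages.size(); ++i )
        {
          if ( pages[i]->lastAccess < pages[oldest]->lastAccess )
            oldest = i;
        }
        CachePage* victim = pages.takeAt( oldest );
        writePageToDisk( victim ); // keep it retrievable from disk
        delete victim;
      }
    }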
>
> You will also have to take care of multithreading since multiple
> iterators can run at the same time.
Definitely.
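I.e. every access to shared cache state would have to be guarded, along
these lines (all names here are hypothetical, mCacheMutex being a QMutex
member, pageForFeature() a page lookup helper):

    // Illustrative only: serialize access to shared cache state, since
    // several feature iterators may run concurrently.
    bool lookupCached( QgsFeatureId fid, QgsFeature& feature )
    {
      QMutexLocker locker( &mCacheMutex );      // hypothetical member mutex
      CachePage* page = pageForFeature( fid );  // hypothetical page lookup
      if ( !page )
        return false;
      page->lastAccess = QDateTime::currentDateTime();
      QMap<QgsFeatureId, QgsFeature>::const_iterator it = page->features.constFind( fid );
      if ( it == page->features.constEnd() )
        return false;
      feature = it.value();
      return true;
    }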
>
> It's probably also required to give some thought to how to invalidate
> the cache if the source data changes (a time-based limit, a button to
> trigger a reload...).
Perhaps it would generally be a good idea to have a clear user-facing
entry in the layer context menu to re-sync all cached provider data
with the data source.
>
> If this is implemented, it would surely be nice to have it not only for
> AFS but also for other providers. Either way, I would leave the choice
> of whether to use it to the user.
Sure, this would be a user-configurable option, which ideally would just
decide whether QgsVectorLayer::mDataProvider receives an actual provider
instance or a cache proxy instance.
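A rough sketch of what I have in mind (everything besides the
QgsVectorDataProvider interface itself is hypothetical):

    // Sketch only: a proxy which forwards to the actual provider for
    // local sources and routes remote sources through the cache.
    class QgsVectorDataProviderProxy : public QgsVectorDataProvider
    {
      public:
        QgsVectorDataProviderProxy( QgsVectorDataProvider* actual, bool sourceIsLocal )
            : mActual( actual ), mLocal( sourceIsLocal ) {}

        QgsFeatureIterator getFeatures( const QgsFeatureRequest& request )
        {
          if ( mLocal )
            return mActual->getFeatures( request ); // local source: just forward
          // Remote source: serve from the cache, fetch misses from the provider
          return QgsFeatureIterator( new QgsCacheProxyFeatureIterator( this, mActual, request ) );
        }

        bool changeAttributeValues( const QgsChangedAttributesMap& attrMap )
        {
          invalidateCachedFeatures( attrMap.keys() ); // hypothetical: keep the cache consistent
          return mActual->changeAttributeValues( attrMap );
        }

        // ... likewise for addFeatures, deleteFeatures, changeGeometryValues, ...

      private:
        QgsVectorDataProvider* mActual;
        bool mLocal;
    };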


> If there's a request with a subsetOfAttributes set or without geometry,
> it's important to know whether the request is going to be changed before
> sending it to the provider (so the cache contains all the information,
> but the request may take longer), or whether the request is sent as-is,
> requesting a reduced amount of information that is not going to be
> cached. Or whether it's going to be cached with reduced information,
> but then it has to be ensured later on that a subsequent request does
> not receive less information than it requires.

I'd go with fetching the reduced feature from the data provider and not
caching it, for a start at least. There are clearly niftier approaches
to be explored later on ;)
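Roughly, extending the hypothetical getFeatures() sketch from above:

    // Sketch only: bypass the cache for requests asking for reduced
    // information (no geometry or only a subset of attributes).
    QgsFeatureIterator QgsVectorDataProviderProxy::getFeatures( const QgsFeatureRequest& request )
    {
      bool reduced = ( request.flags() & QgsFeatureRequest::NoGeometry ) ||
                     ( request.flags() & QgsFeatureRequest::SubsetOfAttributes );
      if ( reduced )
        return mActual->getFeatures( request ); // fetch directly, don't cache
      return QgsFeatureIterator( new QgsCacheProxyFeatureIterator( this, mActual, request ) );
    }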

>
> I hope there are some good inputs in here
Yes definitely, thanks!

Sandro

