[Qgis-developer] QgsVectorLayerCache

Mon Jun 15 07:56:22 PDT 2015

(Forwarding message to list...)

On 15.06.2015 16:52, Sandro Mani wrote:
> Hi Matthias
>
> On 15.06.2015 15:34, Matthias Kuhn wrote:
>> Hi Sandro
>>
>> On 06/15/2015 02:42 PM, Sandro Mani wrote:
>>> Hello Matthias and List
>>>
>>> I have two questions about the QgsVectorLayerCache, which you
>>> (@Matthias) have implemented [1].
>>>
>>> 1. First the easy one:
>>> https://github.com/qgis/QGIS/blame/master/src/core/qgsvectorlayercache.cpp#L97
>>>
>>> -> I'm not sure what this block of code is supposed to do. As far as I
>>> can see, it just performs an empty iteration over all layer features
>>> and has no other effect. Am I missing something? This block of
>>> code is executed when loading the attribute table. If I comment it out,
>>> I can't spot any side effects, except that the attribute table loads
>>> faster.
>> It's a QgsCachedFeatureWriterIterator which fills the cache when
>> iterating over it.
>> It's only being invoked when full caching is requested to avoid
>> incremental population of the cache with a lot of subsequent requests.
>> If you have slow round-trips and disable this code the effect should be
>> noticeable. If it's not, there's something wrong with it.
> Uhm ok, I'll need to investigate this together with my AFS provider
> implementation. Somehow the result of that code block was that all
> features were fetched from the server twice.
>>
>>> 2. Secondly, to the vector layer cache in general.
>>> Some background: I've done an initial implementation of an ArcGIS
>>> Feature Service ("AFS") data provider, and similarly to the WFS
>>> provider, the question of intelligent caching arises, to reduce
>>> round-trips with the server. The WFS provider just caches all features
>>> in memory (if the corresponding option is checked), which is
>>> suboptimal for large datasets.
>>> I've hence been thinking about implementing a local disk-based
>>> feature cache (say, in the form of a SpatiaLite DB). Its usefulness
>>> could however go beyond just WFS and AFS, to include all non-local
>>> data sources. So my idea is to implement something like a
>>> QgsVectorDataProviderProxy which
>>>
>>> - overrides getFeatures to return a QgsCacheProxyFeatureIterator: this
>>> iterator first checks whether the Feature is cached, and if not, only
>>> then fetches it from the data provider. If the QgsFeatureRequest
>>> includes an extent, entire pages of features could be loaded from the
>>> disk to memory (up to a specified threshold).
>>>
>>> - overrides all add/change/delete methods to ensure that the cache
>>> remains consistent.
>>>
>>> Actually I think the most elegant approach would be to have
>>> QgsVectorLayer::mDataProvider be an instance of this
>>> QgsVectorDataProviderProxy. If the data source is local, the calls are
>>> simply forwarded to the actual data provider, otherwise, the above
>>> outlined behavior applies.
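
The proxy idea outlined above can be sketched roughly as follows. This is a minimal illustration in plain Python, not QGIS code: `RemoteProvider`, `CachingProviderProxy` and all of their methods are hypothetical stand-ins, the cache is a simple in-memory dict rather than a disk-based one, and listing the feature ids is assumed to be cheap.

```python
from typing import Dict, Iterator, List


class RemoteProvider:
    """Hypothetical stand-in for a slow, non-local data provider."""

    def __init__(self, features: Dict[int, dict]):
        self._features = dict(features)
        self.fetch_count = 0  # number of features read from the "server"

    def feature_ids(self) -> List[int]:
        # Assumption: listing ids is cheap compared to fetching features.
        return sorted(self._features)

    def get_feature(self, fid: int) -> dict:
        self.fetch_count += 1
        return dict(self._features[fid])

    def change_feature(self, fid: int, attrs: dict) -> None:
        self._features[fid].update(attrs)

    def delete_feature(self, fid: int) -> None:
        del self._features[fid]


class CachingProviderProxy:
    """Forwards calls to the wrapped provider, caching features if enabled."""

    def __init__(self, provider: RemoteProvider, use_cache: bool = True):
        self._provider = provider
        self._use_cache = use_cache  # False: plain forwarding (local sources)
        self._cache: Dict[int, dict] = {}

    def get_features(self) -> Iterator[dict]:
        for fid in self._provider.feature_ids():
            if not self._use_cache:
                yield self._provider.get_feature(fid)
                continue
            # Only hit the provider when the feature is not cached yet.
            if fid not in self._cache:
                self._cache[fid] = self._provider.get_feature(fid)
            yield dict(self._cache[fid])

    def change_feature(self, fid: int, attrs: dict) -> None:
        self._provider.change_feature(fid, attrs)
        self._cache.pop(fid, None)  # drop the stale copy

    def delete_feature(self, fid: int) -> None:
        self._provider.delete_feature(fid)
        self._cache.pop(fid, None)
```

Invalidating on write, rather than updating the cached copy in place, keeps the sketch simple: the next read refetches the changed feature from the provider.
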
>>>
>>> So (@Matthias): such an implementation would pretty much overlap with
>>> what you have implemented, but does the work directly at provider
>>> level. What are your thoughts on this? From your experience
>>> implementing [1], do any alarm bells start ringing?
>> I have thought about this approach as well, as it seems very nice to
>> have one shared cache which is able to provide several consumers with
>> cached data (canvas, attribute table...). Do you think you will be
>> introducing a size limit?
> Speaking of the disk-cache: Yes, I suppose that would make sense, 
> perhaps as a configurable option in the user preferences. For memory 
> cache, there clearly would be a size limit.
>>
>> One risk I see is that if you have different consumers (with a shared
>> cache), they have different requirements.
>> For the canvas the requirement is usually to have some spatial index
>> that keeps track of which regions are cached and whether a new request
>> can be satisfied. It would even be easy/nice to do some look-ahead to
>> pre-load features, or to load only part of the canvas if a big region is
>> already loaded, or to do tiling.
> Right, I'd like to model the cache around an idea of "pages", i.e. 
> entire spatial regions which can be swapped in and out of memory 
> depending on the current region of interest.
>>
>> If another consumer then does a second request without a spatial filter
>> (none, or an attribute filter instead), it may fetch a lot of features and
>> pollute your cache with them. If the cache has a size limit, it may then
>> evict earlier features which would still be more important for drawing
>> than the ones fetched for a different consumer, which may have been
>> requested just once.
> Yes, I see the problem. First, one would need to investigate how
> expensive such cache thrashing is compared to the situation with no
> cache at all. Then I suppose the usual ideas apply, such as keeping an
> access time stamp on each page loaded in memory, so that if a page needs
> to go, the least recently accessed one gets thrown out.
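
The eviction rule described here (throw out the page whose last access lies furthest back) is a least-recently-used policy. A minimal sketch in Python follows, with hypothetical names throughout: `PageCache`, `load_page` and the tuple page keys are illustrative only. Instead of an explicit time stamp, it relies on the ordering of an `OrderedDict`, which records the same information.

```python
from collections import OrderedDict
from typing import Callable, Hashable, List


class PageCache:
    """Keeps at most max_pages pages in memory, evicting the least recently used."""

    def __init__(self, max_pages: int, load_page: Callable[[Hashable], object]):
        if max_pages < 1:
            raise ValueError("max_pages must be at least 1")
        self._max_pages = max_pages
        self._load_page = load_page  # e.g. reads one page from the disk cache
        self._pages: "OrderedDict[Hashable, object]" = OrderedDict()

    def get(self, page_key: Hashable) -> object:
        if page_key in self._pages:
            # Mark as most recently used by moving it to the end.
            self._pages.move_to_end(page_key)
            return self._pages[page_key]
        page = self._load_page(page_key)
        self._pages[page_key] = page
        if len(self._pages) > self._max_pages:
            # The first entry is the one accessed longest ago.
            self._pages.popitem(last=False)
        return page

    def cached_keys(self) -> List[Hashable]:
        """Page keys currently in memory, least recently used first."""
        return list(self._pages)
```

With `max_pages=2`, requesting pages `(0, 0)`, `(0, 1)`, `(0, 0)`, `(1, 1)` in that order evicts `(0, 1)`, since `(0, 0)` was touched more recently. Note that this sketch is not thread-safe; concurrent iterators would need a lock around `get`.
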
>>
>> You will also have to take care of multithreading since multiple
>> iterators can run at the same time.
> Definitely.
>>
>> It's probably also necessary to give some thought to how to invalidate
>> the cache if the source data changes (a time-based limit, a button to
>> trigger a reload...).
> Perhaps it would generally be a good idea to have a clear user-facing 
> entry in the layer context menu to re-sync the entire provider data 
> with the data source.
>>
>> If this is implemented, it would surely be nice to have it not only for
>> AFS but also for other providers. Either way, I would leave the choice
>> of whether to use it to the user.
> Sure, this would be a user-configurable option, which ideally would
> just decide whether QgsVectorLayer::mDataProvider receives an actual
> provider instance or a cache proxy instance.
>
>
>> If there's a request with a subsetOfAttributes set or without geometry,
>> it's important to know if the request is going to be changed before
>> sending it to the provider (so the cache contains all the information
>> but the request may take longer) or if the request is going to be sent,
>> requesting a reduced amount of information but not going to be cached.
>> Or if it's going to be cached with reduced information, but then it has
>> to be ensured later on that a subsequent request does not receive less
>> information than it requires.
>
> I'd go with fetching the reduced feature from the data provider, and 
> not caching it, for a start at least. There are clearly more nifty 
> approaches to be explored later on ;)
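
The policy chosen here (send reduced requests to the provider as they are, and do not cache the result) could look roughly like the sketch below. It is plain Python with hypothetical names (`Request`, `Source`, `CachedSource`), loosely modelled on the subsetOfAttributes and no-geometry options mentioned above, and is not the QGIS API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Request:
    fid: int
    subset_of_attributes: Optional[Tuple[str, ...]] = None
    with_geometry: bool = True

    def is_reduced(self) -> bool:
        return self.subset_of_attributes is not None or not self.with_geometry


class Source:
    """Hypothetical provider that honours reduced requests."""

    def __init__(self, features: Dict[int, dict]):
        self._features = features
        self.calls = 0  # number of round-trips to the "server"

    def fetch(self, request: Request) -> dict:
        self.calls += 1
        full = self._features[request.fid]
        attrs = full["attributes"]
        if request.subset_of_attributes is not None:
            attrs = {k: attrs[k] for k in request.subset_of_attributes}
        geometry = full["geometry"] if request.with_geometry else None
        return {"attributes": dict(attrs), "geometry": geometry}


class CachedSource:
    """Caches only complete features; reduced requests bypass the cache."""

    def __init__(self, source: Source):
        self._source = source
        self._cache: Dict[int, dict] = {}

    def fetch(self, request: Request) -> dict:
        if request.is_reduced():
            # Reduced feature: fetch as requested, never store it.
            return self._source.fetch(request)
        if request.fid not in self._cache:
            self._cache[request.fid] = self._source.fetch(request)
        return self._cache[request.fid]

    def cached_ids(self) -> List[int]:
        return sorted(self._cache)
```

Because the cache only ever holds complete features, a later full request can never be answered with less information than it needs, at the cost of an extra round-trip for every reduced request.
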
>
>>
>> I hope there are some good inputs in here
> Yes definitely, thanks!
>
> Sandro