[postgis-devel] Question on "libpgcommon/lwgeom_transform.c"

Tue Apr 23 15:56:05 PDT 2013

On 23/04/13 01:08, Stephen Woodbridge wrote:

> Hi Guys,
>
> On my continuing review of "libpgcommon/lwgeom_transform.c" in an effort
> to create something similar for the address standardizer, you are
> getting a detailed code review.
>
> The current question is related to the usage of
> PROJ4Cache->PROJ4SRSCacheCount in:
>
> AddToPROJ4SRSCache(PROJ4PortalCache *PROJ4Cache, int srid, int other_srid)
>
> It looks like the usage of this has morphed over time as it does not
> seem to reflect what it's name intends.
>
> At line 511, we use it to check if the cache is full
> At line 523, we use it to remember which item we just deleted
> At lines 553-555, we use it to fill the slot just deleted
> At line 556, we have a problem!
>
> If we delete any entry less than PROJ4_CACHE_ITEMS - 2, then
> PROJ4Cache->PROJ4SRSCacheCount++ will be less than a full cache and the
> next time we will fill that slot regardless, even if it is other_srid.
>
> While this is not a fatal error, I do not think it is what was intended
> and it leads to some confusion when trying to understand the code.
>
> I'll bug this if you concur with this analysis.
>
> I am thinking that I would like to implement a LRU algorithm for my
> replacement algorithm.
>
> I'm thinking of changing the counter to a index pointer to the next
> available slot. I think I can do this with:
>
> index = (index + 1) % NUM_CACHE_ITEMS;
>
> This isn't an LRU, but rather Least Recently Added (LRA) algorithm.
>
> To implement a LRU algorithm I would need some like
>
> int lru[NUM_CACHE_ITEMS];
>
> then when a slot is used set the lru[i] = 0; and increment all other
> slots (watching for integer overflow). Then when you need to add a new
> slot, you search for the highest lru[i] and use that slot and update the
> slot ranks.
>
> I like the both these, LRA is simple to implement but the LRU might give
> better results in a web portal like application. I think wrapping this
> in a function would allow the algorithm to be changed without impacting
> the code.

Hi Steve,

I can't really comment on the current version of the code because I 
haven't looked at it for a while, however the original version of the 
code was very simple in that it had a small number of slots (5?) that 
were just cycled through in order.

The thinking behind this was because there was a static PROJ.4 cache for 
each individual backend (and of course each backend can only run a 
single query at once) then you would likely have one of three scenarios:

1) The table in your current query contains geometries of a single SRID 
which you are querying with a query using a single ST_Transform()

2) The table in your current query contains a mixture of SRIDs but 
likely the majority of the table will contain less than 5 different SRIDs

3) The query itself contains a complicated set of subqueries with 
multiple ST_Transform() statements between different SRIDs

Case 1 is handled fine by such a cache algorithm as-is. Case 3 was the 
reason for increasing the number of cache slots to 5 rather than having 
just 2, so there was some leeway for multiple ST_Transform()s in a 
single query. Case 2 was the worst case scenario, in that with more than 
5 SRIDs all equally distributed throughout the table then the cache 
would cease to be effective.

Now I never came across a real-life degenerate version of case 2 to make 
the extra effort worthwhile (and to the best of my knowledge we've had 
no bug reports about it), but if you think that having more than 5 mixed 
SRIDs evenly distributed in a table and reprojecting the entire table is 
a common use-case then I won't object to a change in the algorithm 
and/or increase in the number of slots (although some real-life examples 
with numbers to show the before and after effects of your change would 
help to justify such a change).

HTH,

Mark.