[postgis-devel] Re: geometry stats

Thu Mar 4 02:24:34 PST 2004

On Wed, Mar 03, 2004 at 06:17:18PM -0000, Mark Cave-Ayland wrote:
> Hi strk,
> 
> > -----Original Message-----
> > From: strk [mailto:strk at keybit.net] 
> > Sent: 01 March 2004 16:08
> > To: Mark Cave-Ayland
> > Cc: 'David Blasby'; postgis-devel at postgis.refractions.net; 
> > 'Paul Ramsey'
> > Subject: Re: [postgis-devel] Re: geometry stats
> > 
> > 
> > I've committed the change. It uses 40 to 400 boxesPerSide.
> > (just to keep current 40x40 default).
> > 
> > --strk;
> 
> 
> Thanks for this. I've just tried testing the new code on a layer of
> polygons and found that the estimates returned were just far too low
> (e.g. return estimated 20 compared to actual 800!). This is compared to
> the good results obtained when using a layer of points.
> 
> I think the problem lies in the section of code where the AOI is
> calculated. Thinking about this, for a point layer then the histogram
> box is incremented by one *per point* where as for a polygon it was
> incremented by (1 / area). This means that if the whole histogram box
> were covered by non-overlapping polygons then the histogram box would
> ever only be increased by one for the entire set of polygons (!).

Nope. You're wrong here.
Histogram box is incremented by one per *fully overlapping* sample.
If a polygon bbox overlaps a 3x3 portion of the histogram grid
the central cell will be incremented by 1, while the peripheral 8
cells will be incremented by the portion of actual overlap.
This means that if the whole histogram box were covered by
non-overlapping polygons *leaving no holes in the histogram*
every cell would get a value of 1.

> It seemed that since the data is normalised then we should just
> increment the histogram box by one if the polygon bounding box intersets
> the histogram box. I experimented with only allowing polygons > 5% of
> the histogram box size to increment its value, but that filtered out a
> lot of polygons and the estimates became quite far out.
> 
> So the best results I obtained were by removing the AOI / area increment
> and replacing it by 1, e.g. changing the following around line 1388 so
> the algorithm was the same for points/polygons:

(cuts)

Again its AOI / *cell* area.
I think it can be removed, but I did not remove it yet because
the biggest errors are introduced by the estimation code, and
I'd like -again- to keep the histogram as strict as possible
in meaning (cell value provide the 'crowd factor').

> The estimates for the polygon layers are definitely not as good as the
> point layers (average error can be +/- 20%) so perhaps the algorithm
> could do with some more refining. Any thoughts about improving this
> would be greatly appreciated :)

This is probably due to a 'gain' I applied to the sum of the search_box
overlapping cells values. This gain is applied dividing the sum
by the minimun value between the number of cells which the search_box
overlaps and the average number of cells which a feature overlaps.

This is probably not appropriate.

What I was trying to handle is the fact that a single feature
could be counted many times by the estimator.

An example: a single polygon in the dataset gives an histogram with
all '1' in its cells (1/1 after normalization).
With this stats a search_box will get a value being exactly the number
of cells it overlaps, while the actual result should be just 1.
With my algorithm it would have taken always 1, because the sum
would be divided by the number of cells overlap by the search_box.
Does it sound ?

This problem will increase with the increasing number of boxesPerSide,
because every sample feature will overlap far more histogram cells
and thus increase the estimated count.

We need a way to guess how many duplicates we have counted.
For point layers this would be about none (apart exceptional
points falling on cells borders - thus appearing in 2-4 cells ).

There is a long comment block just before the application of the
gain in the estimator expressing this need. My bet was to use
the intersection between number of cells overlapped by the search_box
and the number of cells overlapped by the average feature. 
This 'intersection' was computed as the minimun value between the
two factors. This can only lead to a reduction of the computed
sum, intentionally removing counted duplicates.

Now that I removed that 'final' gain please make some more tests
to better understand the problem. 

> Finally, I found that there were circumstances where rounding errors in
> the mathematics would return a 'selectivity out of range' error by
> returning something like 1.0000000001, so I added the following at the
> bottom of the selectivity function just before the free_attstatslot()
> call:
> 
> 
> 	// Make sure that the selectivity will be returned ok
> 	if (selectivity > 1.0)
> 		selectivity = 1.0;
> 
> 	if (selectivity < 0)
> 		selectivity = 0.0;
> 

committed.

> 
> More when time allows :)
> 
> Mark.

--strk;