[postgis-devel] Re: getrelid && list_nth

Mark Cave-Ayland m.cave-ayland at webbased.co.uk
Thu Jun 10 04:36:35 PDT 2004


Hi strk,

> -----Original Message-----
> From: 'strk' [mailto:strk at keybit.net] 
> Sent: 09 June 2004 11:46
> To: Mark Cave-Ayland; postgis-devel at postgis.refractions.net
> Subject: Re: [postgis-devel] Re: getrelid && list_nth
> 
>
> Ok. I've found the reason... it is due to the fact that
> my table contained only equal points, so the histogram
> consisted of 40x40 infinitesimal cells... We should probably
> handle similar cases quickly, returning 0 when the data point
> does not overlap the search box and
> 1 when the data point is contained in the search box.
> 
> What do you think?
> 
> --strk;

Ahhhh another corner case! :)

I think the simplest thing to do in this case would be to expand the
area of the histogram by a small amount, for example if
sample_extent->LLB.x == sample_extent->URT.x then:

geomstats->xmin = sample_extent->LLB.x - 1.0
geomstats->xmax = sample_extent->URT.x + 1.0

and similarly, if sample_extent->LLB.y == sample_extent->URT.y then:

geomstats->ymin = sample_extent->LLB.y - 1.0
geomstats->ymax = sample_extent->URT.y + 1.0 


The other corner case I have seen was when one of our import scripts
went wrong and we ended up with several very large x/y coordinates in
our table alongside our normal data. When we ran ANALYZE, the overall
extents were so large that all of the geometries landed in a single box
in the bottom left-hand corner of the histogram, which produced really
bad query plans. Deleting these erroneous geometries and re-running
ANALYZE solved the problem and everything went back to normal.

The solution I thought of was to try and reduce the effect of outliers
using some basic statistics: for normally distributed data, roughly 95%
of the values lie within 2 standard deviations of the mean. So I was
considering calculating the mean of all the LLB.x/y and URT.x/y values,
using that to calculate the standard deviation, and then deriving the
histogram extents from those. This should hopefully reduce the effect
of excessively large coordinates on the extents of the selectivity
histogram, at the cost of making the histogram area slightly smaller
than that currently indicated by the overall sample extents.

The pseudo-code example for LLB.x would look something like:

	// Calculate mean
	sumLLB.x = 0
	for each LLB.x in sample
		sumLLB.x += LLB.x
	next
	meanLLB.x = sumLLB.x / samplerows

	// Calculate standard deviation (accumulating the squared deviations)
	s2 = 0
	for each LLB.x in sample
		s2 += (LLB.x - meanLLB.x) * (LLB.x - meanLLB.x)
	next
	s2 /= (samplerows - 1)
	s = sqrt(s2)

	// For ~95% coverage we should lie within 2 standard deviations of the mean
	geomstats->xmin = meanLLB.x - 2 * s
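
For what it's worth, here is a self-contained C version of the same
two-pass calculation over a plain array of doubles (just to show the
maths; the function and variable names are made up for illustration and
are not from the actual stats-gathering code):

	#include <math.h>

	/* Two-pass mean / sample standard deviation over an array of
	 * sample values (e.g. the LLB.x of each sampled geometry).
	 * Returns the lower histogram bound at mean - 2 standard
	 * deviations. */
	static double
	lower_bound_2sd(const double *values, int n)
	{
		double	sum = 0.0;
		double	mean;
		double	s2 = 0.0;
		int	i;

		/* First pass: mean */
		for (i = 0; i < n; i++)
			sum += values[i];
		mean = sum / n;

		/* Second pass: sample variance */
		for (i = 0; i < n; i++)
			s2 += (values[i] - mean) * (values[i] - mean);
		s2 /= (n - 1);

		/* ~95% of normally distributed values lie within 2
		 * standard deviations of the mean */
		return mean - 2.0 * sqrt(s2);
	}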


The downside is that we have to iterate through the sample tuples
twice: once to calculate the mean and once again to calculate the
standard deviation. However, since we are only working on a sample of
rows from the table, the effect on larger tables should hopefully be
negligible. Comments anyone?
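
As an aside (purely a sketch, not part of the proposal above): if the
double iteration ever did turn out to be a problem, a single-pass
running mean/variance update, i.e. Welford's method, would give the
same numbers in one scan of the sample:

	/* Single-pass (Welford) mean/variance, an alternative to the
	 * two-pass loops sketched above. */
	static double
	lower_bound_2sd_onepass(const double *values, int n)
	{
		double	mean = 0.0;
		double	m2 = 0.0;
		int	i;

		for (i = 0; i < n; i++)
		{
			double	delta = values[i] - mean;

			mean += delta / (i + 1);
			m2 += delta * (values[i] - mean);
		}

		/* m2 / (n - 1) is the sample variance */
		return mean - 2.0 * sqrt(m2 / (n - 1));
	}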


Cheers,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446






