[postgis-devel] Re: getrelid && list_nth

Thu Jun 10 05:24:08 PDT 2004

On Thu, Jun 10, 2004 at 12:36:35PM +0100, Mark Cave-Ayland wrote:
> Hi strk,
> 
> > -----Original Message-----
> > From: 'strk' [mailto:strk at keybit.net] 
> > Sent: 09 June 2004 11:46
> > To: Mark Cave-Ayland; postgis-devel at postgis.refractions.net
> > Subject: Re: [postgis-devel] Re: getrelid && list_nth
> > 
> >
> > Ok. I've found the reason... it is due to the fact that
> > my table contained only equal points, so the histogram 
> > consisted in 40x40 infinitesimal cells... Whe should probably 
> > handle similar cases quickly returning 0 for data-point not 
> > contained overlap search-box and 
> > 1 for data-point contained in search-box
> > 
> > what do you think ?
> > 
> > --strk;
> 
> Ahhhh another corner case! :)
> 
> I think the simplest thing to do in this case would be to expand the
> area of the histogram by a small amount, for example if
> sample_extent->LLB.x == sample_extent->URT.x then:
> 
> geomstats->xmin = sample_extent->LLB.x - 1.0
> geomstats->xmax = sample_extent->URT.x + 1.0
> 
> and similarly if if sample_extent->LLB.y == sample_extent->URT.y then:
> 
> geomstats->ymin = sample_extent->LLB.y - 1.0
> geomstats->ymax = sample_extent->URT.y + 1.0 

Makes sense, but it's a waste of time and resources keeping
building and analyzing an histogram when the full stat has
an extent reduced to a point. We could instead make a single box
histogram with a value of 1...

> The other corner case I have seen was when one of our import scripts
> went wrong and we ended up with several very large x/y coords in our
> table along with our normal data. When we did an ANALYZE, because the
> overall extents were so large, all of the geometries appeared in a
> single box in the bottom left hand corner of histogram and hence
> provided really bad query plans - deleting these erroneus geometries and
> doing ANALYZE solved the problem and everything went back to normal.
> 
> The solution I thought of was to try and reduce the effect of outliers
> using some statistical theory: for example we know that for a random
> sample of data, 95% of the data lies within 2 standard deviations of the
> mean. So I was considering calculating the mean of all LLB.x/y and
> URT.x/y, using this to calculate the standard deviation, and hence
> calculate the geomextents from this. This should hopefully reduce the
> effect of excessively large coordinates on the extents of the
> selectivity histogram at the cost of making the histogram area slightly
> smaller than that currently indicated by the overall sample extents.
> 
> The pseudo-code example for LLB.x would look something like:
> 
> 	// Calculate mean
> 	for each LLB.x in sample
> 		sumLLB.x += LLB.x
> 	next
> 	meanLLB.x = sumLLB.x / samplerows
> 
> 	// Calculate standard deviation
> 	for each LLB.x in sample
> 		s2 = (LLB.x - meanLLB.x) * (LLB.x - meanLLB.x)
> 	next
> 	s2 /= (samplerows - 1)
> 	s = sqrt(s2)
> 
> 	// For 95% coverage we should lie within 2sds of the mean
> 	geomstats->xmin = meanLLB.x - 2 * s
> 
> 
> The downside is that we have to iterate through the sample tuples twice,
> once to calculate the mean and then once again to calculate the standard
> deviation. However, since we are only using a sample of rows from the
> table then hopefully the effect on larger tables will be negligible.
> Comments anyone?

Interesting. Am I right saying this computation would make an
extent MUCH smaller for your corner case ?

We already make two scans BTW:
@1357@
        /*
         * First scan:
         *  o find extent of the sample rows
         *  o count null/not-null values
         *  o compute total_width
         *  o compute total features's box area (for avgFeatureArea)
         */
@1357@
        /*
         * Second scan:
         *  o fill histogram values with the number of
         *    features' bbox overlaps: a feature's bvol
         *    can fully overlap (1) or partially overlap
         *    (fraction of 1) an histogram cell.
         *
         *  o compute total cells occupation
         */

As you can see the first scan could also compute standard deviation
and handle 0-extent case.

Thinking deeper about these cases, we should probably abandone BoxesPerSide
and use Columns / Rows instead as a bunch of points laying on the
same horizontal line would require many coluns and a single row...
We could calculate the width/height factor and use that to split
the geometry_stats_target*160 in Rows and Columns. We would end up
with always-near-square histogram cells but I don't see a big problem
about it. Moreover, we could set a minimum-histogram-cell size which
would in turn automate the extent-enrlargment you were suggesting.

--strk;

> 
> 
> Cheers,
> 
> Mark.
> 
> ---
> 
> Mark Cave-Ayland
> Webbased Ltd.
> Tamar Science Park
> Derriford
> Plymouth
> PL6 8BX
> England
> 
> Tel: +44 (0)1752 764445
> Fax: +44 (0)1752 764446
> 
> 
> This email and any attachments are confidential to the intended
> recipient and may also be privileged. If you are not the intended
> recipient please delete it from your system and notify the sender. You
> should not copy it or use it for any purpose nor disclose or distribute
> its contents to any other person.
>