[postgis-devel] Re: getrelid && list_nth

'strk' strk at keybit.net
Thu Jun 10 07:09:54 PDT 2004


On Thu, Jun 10, 2004 at 02:04:13PM +0100, Mark Cave-Ayland wrote:
> Hi strk,
> 
> > -----Original Message-----
> > From: 'strk' [mailto:strk at keybit.net] 
> > Sent: 10 June 2004 13:24
> > To: Mark Cave-Ayland
> > Cc: postgis-devel at postgis.refractions.net
> > Subject: Re: [postgis-devel] Re: getrelid && list_nth
> 
> (lots more cut) 
> 
> > Interesting. Am I right saying this computation would make an 
> > extent MUCH smaller for your corner case ?
> 
> Well IANAM (I am not a mathematician) but the calculation *should* bias
> results further away from the mean as being more erroneus according to a
> normal distribution. Really need to dig out some of those old maths
> textbooks ;) The test case is easy: load in a real data set and then add
> a couple of really large points like POINT(1.0E20 1.0E20) and do an
> ANALYZE; this will give a really unbalanced histogram which makes really
> bad choices for query plans.
> 
> It would be useful to get some data and play in a spreadsheet but I
> haven't got time to do that right now...... The other option would be to
> simply filter out any data above the 2 SD threshold and not use it when
> calculating the extents - but then we will lose a small percentage of
> good data.... although we can increase the limit to 2.5 SDs to make sure
> we lose as little as possible. Methinks some experimentation is required
> :)

I've separated the estimator code in an
estimate_selectivity(BOX *search_box, GEOM_STATS *geomstats) function
for easier mantainance.

I've handled there complete containment and complete missing of histogram
extent by search box, which catches my single-point histogram and speeds up
some corner cases on 'normal' datasets avoiding histogram scan.

(cuts)

> > As you can see the first scan could also compute standard 
> > deviation and handle 0-extent case.
> 
> I'm not sure we can do this.... we need the mean first so that's one
> iteration to find the mean before we can calculate the SD (second
> iteration) .... and we need the SD to calculate the histogram extents
> before we can do the main computation (third iteration). I'm not sure I
> can see a way around this?

Nope. you are right. We can't. I was wrong.

I don't think making three scans is that bad, as we cache bounding boxes,
so no further fetch is required. Memory is occupied anyway ... and we
are vacuuming, so we have time (we provide hooks for interruptions also!).

I would not filter out sampled geometries, unless they are invalid
(!finite geoms shoul be added for example). I'll try your code
and make extent reduction be printed on DEBUG 1.

>  
> > Thinking deeper about these cases, we should probably 
> > abandone BoxesPerSide and use Columns / Rows instead as a 
> > bunch of points laying on the same horizontal line would 
> > require many coluns and a single row... We could calculate 
> > the width/height factor and use that to split the 
> > geometry_stats_target*160 in Rows and Columns. We would end 
> > up with always-near-square histogram cells but I don't see a 
> > big problem about it. Moreover, we could set a 
> > minimum-histogram-cell size which would in turn automate the 
> > extent-enrlargment you were suggesting.
> 
> Yes, that is a good point about having lots of points on the same
> horizontal line.... another corner case!

I think this is a general issue, as geometries have usually the
same resolution in both directions, so there's no point in squeezing
histogram cells... On the other hand having near-squared histogram
cells would increase estimator precision using the same number of total
cells.

--strk;

> 
> 
> Cheers,
> 
> Mark.
> 
> ---
> 
> Mark Cave-Ayland
> Webbased Ltd.
> Tamar Science Park
> Derriford
> Plymouth
> PL6 8BX
> England
> 
> Tel: +44 (0)1752 764445
> Fax: +44 (0)1752 764446
> 
> 
> This email and any attachments are confidential to the intended
> recipient and may also be privileged. If you are not the intended
> recipient please delete it from your system and notify the sender. You
> should not copy it or use it for any purpose nor disclose or distribute
> its contents to any other person.
> 



More information about the postgis-devel mailing list