[postgis-devel] Re: getrelid && list_nth

Thu Jun 10 06:04:13 PDT 2004

Hi strk,

> -----Original Message-----
> From: 'strk' [mailto:strk at keybit.net] 
> Sent: 10 June 2004 13:24
> To: Mark Cave-Ayland
> Cc: postgis-devel at postgis.refractions.net
> Subject: Re: [postgis-devel] Re: getrelid && list_nth

(lots more cut) 

> Interesting. Am I right saying this computation would make an 
> extent MUCH smaller for your corner case ?

Well IANAM (I am not a mathematician) but the calculation *should* bias
results further away from the mean as being more erroneous according to a
normal distribution. Really need to dig out some of those old maths
textbooks ;) The test case is easy: load in a real data set and then add
a couple of really large points like POINT(1.0E20 1.0E20) and do an
ANALYZE; this will give a really unbalanced histogram which makes really
bad choices for query plans.

It would be useful to get some data and play in a spreadsheet but I
haven't got time to do that right now...... The other option would be to
simply filter out any data above the 2 SD threshold and not use it when
calculating the extents - but then we will lose a small percentage of
good data.... although we can increase the limit to 2.5 SDs to make sure
we lose as little as possible. Methinks some experimentation is required
:)
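
A minimal sketch of the filtering option, for one ordinate only. It is
not taken from the PostGIS source: trimmed_extent, TrimmedExtent and the
parameter k (the SD multiple, e.g. 2.0 or 2.5) are made-up names, and it
assumes the sample fits in a plain array of doubles. It compares squared
deviations against k*k times the variance, so no square root is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: extent of the values that survived the filter. */
typedef struct {
    double min;
    double max;
    size_t kept;   /* how many values were within k SDs of the mean */
} TrimmedExtent;

/* Hypothetical helper: extent of v[0..n-1], ignoring values further than
 * k standard deviations from the mean (population SD). */
TrimmedExtent trimmed_extent(const double *v, size_t n, double k)
{
    TrimmedExtent out = {0.0, 0.0, 0};
    double mean = 0.0, var = 0.0;
    size_t i;

    assert(v != NULL && n > 0);

    /* Iteration 1: mean. */
    for (i = 0; i < n; i++)
        mean += v[i];
    mean /= (double) n;

    /* Iteration 2: population variance (SD squared). */
    for (i = 0; i < n; i++)
        var += (v[i] - mean) * (v[i] - mean);
    var /= (double) n;

    /* Iteration 3: extent of the values with |v - mean| <= k * SD,
     * tested in squared form. */
    for (i = 0; i < n; i++) {
        double d = v[i] - mean;

        if (d * d > k * k * var)
            continue;               /* outlier: leave it out of the extent */
        if (out.kept == 0 || v[i] < out.min)
            out.min = v[i];
        if (out.kept == 0 || v[i] > out.max)
            out.max = v[i];
        out.kept++;
    }
    return out;
}
```

With the values 1 to 9 plus a single 1.0E20, the outlier sits about 3 SDs
from the mean, so it is dropped at either k = 2.0 or k = 2.5 and the
extent comes back as 1 to 9. Note that a large outlier inflates the SD
itself, so on a very small sample it may still fall inside the threshold.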

> We already make two scans BTW:
> @1357@
>         /*
>          * First scan:
>          *  o find extent of the sample rows
>          *  o count null/not-null values
>          *  o compute total_width
>          *  o compute total features's box area (for avgFeatureArea)
>          */
> @1357@
>         /*
>          * Second scan:
>          *  o fill histogram values with the number of
>          *    features' bbox overlaps: a feature's bvol
>          *    can fully overlap (1) or partially overlap
>          *    (fraction of 1) an histogram cell.
>          *
>          *  o compute total cells occupation
>          */
>  
> As you can see the first scan could also compute standard 
> deviation and handle 0-extent case.

I'm not sure we can do this.... we need the mean first so that's one
iteration to find the mean before we can calculate the SD (second
iteration) .... and we need the SD to calculate the histogram extents
before we can do the main computation (third iteration). I'm not sure I
can see a way around this?
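
One possibility, offered only as a sketch and not as anything in the
PostGIS source: the mean and variance can be accumulated together in a
single iteration with Welford's running-update method, so the mean would
not need a scan of its own. The histogram scan still has to come
afterwards, because the extents depend on the final SD. RunningStats,
running_stats_add and running_stats_variance are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accumulator for Welford's method. */
typedef struct {
    size_t n;      /* number of values seen so far */
    double mean;   /* running mean */
    double m2;     /* running sum of squared deviations from the mean */
} RunningStats;

/* Fold one value into the running mean and squared-deviation sum. */
void running_stats_add(RunningStats *s, double x)
{
    double delta;

    assert(s != NULL);
    s->n++;
    delta = x - s->mean;
    s->mean += delta / (double) s->n;
    s->m2 += delta * (x - s->mean);   /* uses the updated mean */
}

/* Population variance; the SD is its square root. */
double running_stats_variance(const RunningStats *s)
{
    assert(s != NULL);
    return s->n > 0 ? s->m2 / (double) s->n : 0.0;
}
```

Starting from a zeroed RunningStats and calling running_stats_add once
per sample row inside the existing first scan would give both the mean
and the variance at the end of that scan.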

> Thinking deeper about these cases, we should probably 
> abandon BoxesPerSide and use Columns / Rows instead as a 
> bunch of points lying on the same horizontal line would 
> require many columns and a single row... We could calculate 
> the width/height factor and use that to split the 
> geometry_stats_target*160 into Rows and Columns. We would end 
> up with always-near-square histogram cells but I don't see a 
> big problem with it. Moreover, we could set a 
> minimum-histogram-cell size which would in turn automate the 
> extent-enlargement you were suggesting.

Yes, that is a good point about having lots of points on the same
horizontal line.... another corner case!
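
A rough sketch of how the Columns / Rows split might work, assuming the
total cell count (geometry_stats_target*160 in the quoted text) is passed
in as total_cells. split_cells and its parameters are hypothetical names,
not existing PostGIS functions. It picks the largest column count whose
square does not exceed total_cells * width / height, which keeps cells
close to square, and it avoids dividing by the height so a zero-height
extent does not blow up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split total_cells into columns and rows so that
 * cols / rows roughly follows the extent's width / height ratio. */
void split_cells(int total_cells, double width, double height,
                 int *cols, int *rows)
{
    int c = 1;

    assert(total_cells > 0 && width >= 0.0 && height >= 0.0);
    assert(cols != NULL && rows != NULL);

    /* Grow c while (c+1)^2 * height <= total_cells * width, capped at
     * total_cells. With height == 0 (points on one horizontal line) the
     * test always passes, so c runs up to total_cells and rows becomes 1. */
    while (c < total_cells &&
           (double) (c + 1) * (double) (c + 1) * height
               <= (double) total_cells * width)
        c++;

    *cols = c;
    *rows = total_cells / c;   /* integer division; at least 1 since c <= total_cells */
}
```

For example, 1600 cells over an extent four times wider than it is tall
come out as 80 columns by 20 rows, and a zero-height extent comes out as
1600 columns by 1 row. A zero-width extent gives the mirror case, 1
column by 1600 rows.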

Cheers,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446
