[postgis-devel] Re: geometry stats

Mark Cave-Ayland m.cave-ayland at webbased.co.uk
Wed Mar 3 10:17:18 PST 2004


Hi strk,

> -----Original Message-----
> From: strk [mailto:strk at keybit.net] 
> Sent: 01 March 2004 16:08
> To: Mark Cave-Ayland
> Cc: 'David Blasby'; postgis-devel at postgis.refractions.net; 
> 'Paul Ramsey'
> Subject: Re: [postgis-devel] Re: geometry stats
> 
> 
> I've committed the change. It uses 40 to 400 boxesPerSide.
> (just to keep current 40x40 default).
> 
> --strk;


Thanks for this. I've just tried testing the new code on a layer of
polygons and found that the estimates returned were just far too low
(e.g. return estimated 20 compared to actual 800!). This is compared to
the good results obtained when using a layer of points.

I think the problem lies in the section of code where the AOI is
calculated. Thinking about this, for a point layer then the histogram
box is incremented by one *per point* where as for a polygon it was
incremented by (1 / area). This means that if the whole histogram box
were covered by non-overlapping polygons then the histogram box would
ever only be increased by one for the entire set of polygons (!).

It seemed that since the data is normalised then we should just
increment the histogram box by one if the polygon bounding box intersets
the histogram box. I experimented with only allowing polygons > 5% of
the histogram box size to increment its value, but that filtered out a
lot of polygons and the estimates became quite far out.

So the best results I obtained were by removing the AOI / area increment
and replacing it by 1, e.g. changing the following around line 1388 so
the algorithm was the same for points/polygons:


		/*
		 * the {x,y}_idx_{min,max}
		 * define the grid squares that the box intersects
		 */
		for (y=y_idx_min; y<=y_idx_max; y++)
		{
			for (x=x_idx_min; x<=x_idx_max; x++)
			{
				intersect_x = min(box->high.x,
geomstats->xmin + (x+1) * geow / bps) -
					max(box->low.x, geomstats->xmin
+ x * geow / bps );
				intersect_y = min(box->high.y,
geomstats->ymin + (y+1) * geoh / bps) -
					max(box->low.y, geomstats->ymin+
y * geoh / bps) ;
				//AOI = intersect_x*intersect_y;
				//if ( ! AOI ) // must be a point
				//{
				geomstats->value[x+y*bps] += 1;
				//}
				//else
				//{
			//geomstats->value[x+y*bps] += AOI / cell_area;
				//}
				numcells++;
			}
		}


The estimates for the polygon layers are definitely not as good as the
point layers (average error can be +/- 20%) so perhaps the algorithm
could do with some more refining. Any thoughts about improving this
would be greatly appreciated :)

Finally, I found that there were circumstances where rounding errors in
the mathematics would return a 'selectivity out of range' error by
returning something like 1.0000000001, so I added the following at the
bottom of the selectivity function just before the free_attstatslot()
call:


	// Make sure that the selectivity will be returned ok
	if (selectivity > 1.0)
		selectivity = 1.0;

	if (selectivity < 0)
		selectivity = 0.0;


More when time allows :)

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446


This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.





More information about the postgis-devel mailing list