[postgis-users] Histogram2d formation

Tue Oct 8 15:42:41 PDT 2002

> This should work well for large query windows, and somewhat accurate for
> small ones. What think?
>
> dave

Seems like this is quickly turning into "real" work. :-)

Re small windows, the overall goal here is to give the planner decent 
information about how much data will come back so it can decide whether to 
use the index or not.

If the histogram returns a number that's too small, then the geographic index 
will always be used because it'll appear to be better than almost any other 
index that might be available. If the histogram cell size is large, then this 
would be a problem--the cost of actually using the geographic index wouldn't 
be reflected by the histogram.

If the cell size is small enough (relative to the query size), then returning 
a lowball estimate from the histogram won't be a big deal. The worst case is 
that the histogram returns 1 while the query actually returns a complete 
cell's worth of rows. Make the cell size small "enough" and this won't 
matter. It makes me think the key issue is the relative size between the 
query window and the histogram cells.

As a thought experiment, suppose we were trying to create a histogram for the 
48 continental states (I'd say North America, but I don't have data for 
Canada or Mexico :-). How much real-world area would each histogram cell 
cover?

The Tiger data that I'm playing with is organized by county. I've noticed that 
Los Angeles county has over 100,000 street segments, while some of the more 
backwater counties only have a few thousand--a dynamic range of about 30 to 
1. If the query planner wouldn't be able to distinguish between LA county and 
East Bumscratch county in Pennsyltucky, then we'd have to resort to other 
methods to make sure indexes got used.

For what it's worth, I haven't had any problems yet with PG not using a GiST 
index. However, my more interesting queries are written in two parts. The 
first is a very basic SQL query designed to use the appropriate index and 
then a second pass in PLPGSQL runs over the rows that come back and weeds out 
the records I don't want.