[postgis-users] Histogram2d formation
Andy Turk
andy at streetlight.com
Tue Oct 8 15:42:41 PDT 2002
> This should work well for large query windows, and somewhat accurate for
> small ones. What think?
>
> dave
Seems like this is quickly turning into "real" work. :-)
Re small windows, the overall goal here is to give the planner decent
information about how much data will come back so it can decide whether to
use the index or not.
If the histogram returns a number that's too small, then the geographic index
will always be used because it'll appear to be better than almost any other
index that might be available. If the histogram cell size is large, then this
would be a problem--the cost of actually using the geographic index wouldn't
be reflected by the histogram.
If the cell size is small enough (relative to the query size), then returning
a lowball estimate from the histogram won't be a big deal. The worst case is
that the histogram returns 1 while the query actually returns a complete
cell's worth of rows. Make the cell size small "enough" and this won't
matter. It makes me think the key issue is the relative size between the
query window and the histogram cells.
As a thought experiment, suppose we were trying to create a histogram for the
48 continental states (I'd say North America, but I don't have data for
Canada or Mexico :-). How much real-world area would each histogram cell
cover?
The Tiger data that I'm playing with is organized by county. I've noticed that
Los Angeles county has over 100,000 street segments, while some of the more
backwater counties only have a few thousand--a dynamic range of about 30 to
1. If the query planner wouldn't be able to distinguish between LA county and
East Bumscratch county in Pennsyltucky, then we'd have to resort to other
methods to make sure indexes got used.
For what it's worth, I haven't had any problems yet with PG not using a GiST
index. However, my more interesting queries are written in two parts. The
first is a very basic SQL query designed to use the appropriate index and
then a second pass in PLPGSQL runs over the rows that come back and weeds out
the records I don't want.
More information about the postgis-users
mailing list