[postgis-users] Histogram2d formation

Tue Oct 8 12:55:33 PDT 2002

> How many rows in the dataset? Or another way to size the computation... If it
> only took a few seconds to calculate the histogram, how long did it take to
> create a GiST index for the same data?

There's about 195,000 rows in the geometry table.

>Looking at your .gifs, there seems to be a lot of "blank" areas where the
>histogram is close to zero. Could these be collapsed out with a quad-tree
>implementation that didn't need to represent equal-sized cells?

I was originally thinking of doing a quad-tree type histogram, but decided that
it would be too much work (and memory) for very little gain.

> > Next, I'm going to try to find a way of estimating the result set size
> > based on a query window.  If anyone has any ideas how to go about doing
> > this (esp for very small query windows), I'd like to hear from you.
> >
> > dave
>
> If you're keeping a count of the rows intersected by each grid cell, then you
> could calculate the sum of the grid cell values that are completely enclosed
> by the query, and then add a linear interpolation for those that partially
> overlap (e.g., area-of-cell/area-of-query*value-of-cell).

Yes, the histogram structure looks like:

// --------------------------------------------
// histogram2d type

// 2d histogram is a bounding box with a bunch of cells in it.
// The cells will have width (xmax-xmin)/boxesPerSide
//                 and height(ymax-ymin)/boxesPerSide
// The first box is the ll corner's box, the send is directly to the right
//   (row-major).
//
//  Size of structure is:
//        4 (size) + 32 (box) + 4 (boxesPerSide) +
//      boxesPerSide*boxesPerSide*4 (data)
typedef struct histotag
{
 int32  size;  // postgres variable-length type requirement
 int   boxesPerSide;   //boxesPerSide * boxesPerSide = total boxes in grid
 double  avgFeatureArea; // average bbox area of features in this histogram
  // these will be correctly aligned
 double      xmin,ymin, xmax, ymax; // BOX of area
 unsigned int value[1]; // variable length # of ints for histogram
} HISTOGRAM2D;

So, there's a list of grid boxes and each of them contain the number of features
that intersect the box by at least 5%.

I just added the avgFeatureArea so I could do estimates like this:

For each grid cell that intersects the query box
    Calculate area of intersection (AOI)
    IF AOI < avgFeatureArea THEN set AOI = avgFeatureArea
    SUM AOI/area-of-cell*value-of-cell

Which is pretty much what you said except its more stable when the query window
is very small (ie. has zero area).  In this situation, the naive estimate will
always be zero, when in fact it should be larger.
If both the query window and avgFeatureArea are zero, I'll estimate 1 return.

This should work well for large query windows, and somewhat accurate for small
ones. What think?

dave