[postgis-users] Histogram2d formation

David Blasby dblasby at refractions.net
Wed Oct 9 14:20:01 PDT 2002


I've been testing the accuracy of the histogram estimate.

I'm using the BC roads data - which has lots and lots (195,000) of small
segments.  Most of the roads are in Vancouver, so the 4 histogram grid squares
that cover Vancouver hold the majority of the data.  I formed a histogram with
40*40 cells (each about 36,000m * 34,000m).

I changed the estimator slightly - for small search boxes it now uses the average
feature area OR 10% of a grid cell, whichever is smaller.  In this case the
median feature was about 2500 square meters, but the average was 1236765 square
meters.  As you can see, the average is highly skewed.  It would be better to use
the median feature size, but it's quite difficult to calculate.
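
As a rough illustration of the "whichever is smaller" rule, here is a small
Python sketch using the numbers quoted above.  It is not the PostGIS code: the
function name is hypothetical, and the toy feature areas at the end are made up
purely to show how a few large features pull the average far above the median.

```python
import statistics

# Figures quoted in the post (metres and square metres).
CELL_WIDTH_M = 36_000
CELL_HEIGHT_M = 34_000
AVG_FEATURE_AREA_M2 = 1_236_765


def small_box_area(avg_feature_area, cell_width, cell_height):
    """Return the smaller of the average feature area and 10% of a cell's area.

    Hypothetical helper: how the estimator then applies this value is not
    described in the post.
    """
    ten_percent_of_cell = 0.10 * cell_width * cell_height
    return min(avg_feature_area, ten_percent_of_cell)


chosen_area = small_box_area(AVG_FEATURE_AREA_M2, CELL_WIDTH_M, CELL_HEIGHT_M)

# Made-up data: nine small features and one very large one.
toy_areas = [2_500] * 9 + [10_000_000]
toy_median = statistics.median(toy_areas)
toy_mean = statistics.mean(toy_areas)
```

With these figures, 10% of a cell is about 122,400,000 square meters, so the
average feature area (1236765) is the smaller value and is the one chosen.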

I made 300,000 randomly placed boxes with random edge lengths between 1m and
10000m.
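
A test set like this could be generated along the following lines.  This is
only a sketch: the function name, the extent used in the example, and the
choice to draw width and height independently from a uniform distribution are
all assumptions, not details taken from the actual test.

```python
import random


def random_boxes(n, extent, min_edge=1.0, max_edge=10_000.0, seed=None):
    """Generate n boxes (xmin, ymin, xmax, ymax) placed inside extent.

    Assumption: width and height are drawn independently and uniformly
    from [min_edge, max_edge].
    """
    rng = random.Random(seed)
    ext_xmin, ext_ymin, ext_xmax, ext_ymax = extent
    boxes = []
    for _ in range(n):
        width = rng.uniform(min_edge, max_edge)
        height = rng.uniform(min_edge, max_edge)
        # Pick the lower-left corner so the whole box stays inside the extent.
        x = rng.uniform(ext_xmin, ext_xmax - width)
        y = rng.uniform(ext_ymin, ext_ymax - height)
        boxes.append((x, y, x + width, y + height))
    return boxes


# Hypothetical extent: a 40*40 grid of 36,000m * 34,000m cells at the origin.
example_extent = (0.0, 0.0, 40 * 36_000.0, 40 * 34_000.0)
sample = random_boxes(1_000, example_extent, seed=1)
```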

I then did a normal "&&" search to find the true number of overlapping features
and also used the histogram to estimate the number of overlapping features.
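
To make the comparison concrete, here is a minimal Python sketch of both
sides: a brute-force bounding-box overlap count (what "&&" tests) and a
grid-histogram estimate.  The estimate shown here assumes features are spread
evenly within each cell and scales each cell's count by the fraction of the
cell the query box covers.  That is a common approach, but it is an assumption
and may differ from the actual histogram2d estimator; all function names are
hypothetical.

```python
def boxes_overlap(a, b):
    """True if boxes a and b (xmin, ymin, xmax, ymax) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def true_hits(query, feature_boxes):
    """Exact count of feature bounding boxes overlapping the query box."""
    return sum(1 for f in feature_boxes if boxes_overlap(query, f))


def build_histogram(feature_boxes, extent, nx, ny):
    """Count features per cell, assigning each feature by its box centre."""
    xmin, ymin, xmax, ymax = extent
    cell_w = (xmax - xmin) / nx
    cell_h = (ymax - ymin) / ny
    hist = [[0] * nx for _ in range(ny)]
    for fx0, fy0, fx1, fy1 in feature_boxes:
        col = int(((fx0 + fx1) / 2 - xmin) / cell_w)
        row = int(((fy0 + fy1) / 2 - ymin) / cell_h)
        # Clamp so features on the far edge land in the last cell.
        col = min(nx - 1, max(0, col))
        row = min(ny - 1, max(0, row))
        hist[row][col] += 1
    return hist


def estimate_hits(query, hist, extent):
    """Sum each cell's count weighted by the fraction of the cell covered."""
    xmin, ymin, xmax, ymax = extent
    ny, nx = len(hist), len(hist[0])
    cell_w = (xmax - xmin) / nx
    cell_h = (ymax - ymin) / ny
    qx0, qy0, qx1, qy1 = query
    total = 0.0
    for row in range(ny):
        for col in range(nx):
            cx0 = xmin + col * cell_w
            cy0 = ymin + row * cell_h
            overlap_w = min(qx1, cx0 + cell_w) - max(qx0, cx0)
            overlap_h = min(qy1, cy0 + cell_h) - max(qy0, cy0)
            if overlap_w > 0 and overlap_h > 0:
                fraction = (overlap_w * overlap_h) / (cell_w * cell_h)
                total += hist[row][col] * fraction
    return total


# Tiny made-up example: two features clustered in one corner of a cell.
features = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.6, 0.6), (3.0, 3.0, 3.1, 3.1)]
extent = (0.0, 0.0, 4.0, 4.0)
hist = build_histogram(features, extent, nx=2, ny=2)
empty_corner = (1.0, 1.0, 2.0, 2.0)
```

In this toy data, the query box `empty_corner` overlaps no features, yet the
estimate for it is nonzero because the cell's count is spread evenly over the
whole cell.  This is the kind of error that skewed data produces.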

The average error (estimated hits vs actual hits) was 3.  For the boxes where
there was an error (49,127 of 300,000), the average error was 20.  Only 34,062 of
the 300,000 boxes actually intersected a feature.  The largest error was 4,935, but
99% were <325, 95% were <70, and 90% were <30.  The larger errors were in areas of
highly skewed data (ie. near Vancouver).
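
Summary figures of this kind could be computed with something like the sketch
below.  It assumes the error is the absolute difference between estimated and
actual hits and uses a nearest-rank percentile; neither detail is stated in
the post, and the function name is hypothetical.

```python
import math


def error_summary(estimated, actual, percentiles=(90, 95, 99)):
    """Summarise absolute estimate errors over paired hit counts."""
    errors = [abs(e - a) for e, a in zip(estimated, actual)]
    if not errors:
        raise ValueError("no samples to summarise")
    nonzero = [err for err in errors if err > 0]
    ranked = sorted(errors)

    def nearest_rank(p):
        # Smallest value with at least p% of the errors at or below it.
        k = max(1, math.ceil(p / 100 * len(ranked)))
        return ranked[k - 1]

    return {
        "mean_error": sum(errors) / len(errors),
        "boxes_with_error": len(nonzero),
        "mean_error_where_nonzero": sum(nonzero) / len(nonzero) if nonzero else 0.0,
        "max_error": ranked[-1],
        "percentiles": {p: nearest_rank(p) for p in percentiles},
    }


# Made-up example values, not the real test data.
summary = error_summary(estimated=[0, 0, 5, 10], actual=[0, 0, 3, 0])
```

As a sanity check on the reported figures: 49,127 boxes with an average error
of 20 spread over all 300,000 boxes gives roughly 3.3, which is consistent with
the overall average of about 3.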

The attached picture has examples of errors.  You can see the histogram grid cells
(the very big black boxes), Vancouver roads (purple), and the smaller boxes that
represent sample boxes.  The sample boxes are coloured according to the size of
the error - the larger errors are darker.  These boxes represent areas that
actually had zero hits, but the estimate said there should be several.  Notice
that the data skew causes almost all these errors.

I've also attached a picture of the error queries for everything in the Vancouver
area.

The next step was to look at larger query boxes - between 10km and 500km on each
side.  I only looked at 150,000 of these because it takes a LONG time to do the
"&&" queries for larger areas.




-------------- next part --------------
A non-text attachment was scrubbed...
Name: hitogram_zerohits.gif
Type: image/gif
Size: 54154 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20021009/0d00e2fa/attachment.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: histo_all.gif
Type: image/gif
Size: 97540 bytes
Desc: not available
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20021009/0d00e2fa/attachment-0001.gif>

