[Mapserver-users] Query: sorthshp, spatially clustering shapefiles on disk

richard.roger at agric.nsw.gov.au richard.roger at agric.nsw.gov.au
Thu Jul 22 18:21:37 PDT 2004


This is a multipart message in MIME format.
--=_alternative 00077964CA256EDA_=
Content-Type: text/plain; charset="us-ascii"

Hello,

I have a query relating to the use of "sortshp", spatial indexing and 
spatial clustering of data on disk.

I have a large shapefile (1.3 GB) showing the cadastre for my state (New 
South Wales in Australia).  Although the spatial indexing of the shapefile 
makes retrieval of sections of the data fairly efficient, I believe it 
could be improved by "clustering" the data spatially on disk.  Can I use 
"sortshp" to do this? 

The symptom of the lack of efficiency is that when I display the complete 
shapefile using, e.g., ArcView, the data comes up in patches all over the 
place, rather randomly.  If the data were laid out efficiently on disk, I 
would expect the data to be displayed in some systematic manner.  Another 
symptom of inefficiency is that when I display only part of the state's 
data, most of the data comes up quickly (because of the good indexing) but 
it takes a while to fill in all the gaps (it seems it has to seek across a 
large part of the file on disk to get all relevant data, because the data 
are not spatially clustered on disk).  [I do not wish to tile the data - 
see below.]

For background information on this topic, I'd recommend this paper : 

Fangju Wang and Yunli Sun, "Spatial Object Clustering and Buffering", IEEE 
Multimedia, Jan-March 2002, Vol. 9, no. 1, pp. 26-42.

The key point here is that spatial indexing and the layout of the data on 
disk ("clustering") are recognized as separate issues.  For those who use 
Informix and the Spatial Data Blade, the indexing is done using an R-Tree, 
while you can cluster the data spatially using a sort key which "is 
computed by applying the Hilbert space-filling curve algorithm to the 
center point of an object's bounding box" (quote from the User's Guide, 
Version 8.11, Aug. 2001, p. 8-160).  I do not have appropriate access to 
our Informix systems to allow me to spatially sort the data within 
Informix, but I could sort the data outside Informix and re-insert it.

In essence, what I want to do is:
(1) add a "spatial sort key" to each shape, by writing the key value into 
a new field in the attribute data for each shape, so that nearby 
objects/shapes have nearby values of the spatial sort key (as with the 
Hilbert curve);
(2) sort the shapefile by the spatial sort key using "sortshp", and write 
it back to disk in sorted order.

Is this feasible?

The shapefile indexing mechanism is rather opaque to me, given the slim 
documentation that I have been able to find.  Are the spatial indexes 
stored in order of shape object number?  Do consecutive index values 
actually retrieve objects which are (in general) nearby?  I presume the 
answer to this last point ought to be 'yes', given the quadtree indexing 
is used.

I'd appreciate advice on how to go about doing this.  I do have some 
programming skills in C.  By the way, my work would be done on a Sun 
machine, using Solaris 8, and it's got 16 GB of RAM, so reading the entire 
1.3 GB shapefile into memory is not out of the question, I think.

Advice gratefully received, and thanks to everyone who has contributed to 
the Mapserver project,

Richard E. Roger

Dr. R. E. Roger                             NSW Department of 
                                         Primary Industries
Spatial Information Officer Systems                    161 Kite St
Resource Information Unit                     Locked Bag 21
                                           ORANGE NSW 2800
ph: (02) 6391 3697                               fax: (02) 6391 3740

This message is intended for the addressee named and may contain 
confidential information. If you are not the intended recipient or 
received it in error, please delete the message and notify sender. Views 
expressed are those of the individual sender and are not necessarily the 
views of their organisation.
--=_alternative 00077964CA256EDA_=
Content-Type: text/html; charset="us-ascii"


<br><font size=2 face="sans-serif">Hello,</font>
<br>
<br><font size=2 face="sans-serif">I have a query relating to the use of "sortshp", spatial indexing and spatial clustering of data on disk.</font>
<br>
<br><font size=2 face="sans-serif">I have a large shapefile (1.3 GB) showing the cadastre for my state (New South Wales in Australia).  Although the spatial indexing of the shapefile makes retrieval of sections of the data fairly efficient, I believe it could be improved by "clustering" the data spatially on disk.  Can I use "sortshp" to do this?  </font>
<br>
<br><font size=2 face="sans-serif">The symptom of the lack of efficiency is that when I display the complete shapefile using, e.g., ArcView, the data comes up in patches all over the place, rather randomly.  If the data were laid out efficiently on disk, I would expect the data to be displayed in some systematic manner.  Another symptom of inefficiency is that when I display only part of the state's data, most of the data comes up quickly (because of the good indexing) but it takes a while to fill in all the gaps (it seems it has to seek across a large part of the file on disk to get all relevant data, because the data are not spatially clustered on disk).  [I do not wish to tile the data - see below.]</font>
<br>
<br><font size=2 face="sans-serif">For background information on this topic, I'd recommend this paper : </font>
<br>
<br><font size=2 face="sans-serif">Fangju Wang and Yunli Sun, "Spatial Object Clustering and Buffering", IEEE Multimedia, Jan-March 2002, Vol. 9, no. 1, pp. 26-42.</font>
<br>
<br><font size=2 face="sans-serif">The key point here is that spatial indexing and the layout of the data on disk ("clustering") are recognized as separate issues.  For those who use Informix and the Spatial Data Blade, the indexing is done using an R-Tree, while you can cluster the data spatially using a sort key which "is computed by applying the Hilbert space-filling curve algorithm to the center point of an object's bounding box" (quote from the User's Guide, Version 8.11, Aug. 2001, p. 8-160).  I do not have appropriate access to our Informix systems to allow me to spatially sort the data within Informix, but I could sort the data outside Informix and re-insert it.</font>
<br>
<br><font size=2 face="sans-serif">In essence, what I want to do is:</font>
<br><font size=2 face="sans-serif">(1) add a "spatial sort key" to each shape, by writing the key value into a new field in the attribute data for each shape, so that nearby objects/shapes have nearby values of the spatial sort key (as with the Hilbert curve);</font>
<br><font size=2 face="sans-serif">(2) sort the shapefile by the spatial sort key using "sortshp", and write it back to disk in sorted order.</font>
<br>
<br><font size=2 face="sans-serif">Is this feasible?</font>
<br>
<br><font size=2 face="sans-serif">The shapefile indexing mechanism is rather opaque to me, given the slim documentation that I have been able to find.  Are the spatial indexes stored in order of shape object number?  Do consecutive index values actually retrieve objects which are (in general) nearby?  I presume the answer to this last point ought to be 'yes', given the quadtree indexing is used.</font>
<br>
<br><font size=2 face="sans-serif">I'd appreciate advice on how to go about doing this.  I do have some programming skills in C.  By the way, my work would be done on a Sun machine, using Solaris 8, and it's got 16 GB of RAM, so reading the entire 1.3 GB shapefile into memory is not out of the question, I think.</font>
<br>
<br><font size=2 face="sans-serif">Advice gratefully received, and thanks to everyone who has contributed to the Mapserver project,</font>
<br>
<br><font size=2 face="sans-serif">Richard E. Roger</font>
<br>
<br><font size=2 face="sans-serif">Dr. R. E. Roger                                    NSW Department of <br>
                                         Primary Industries<br>
Spatial Information Officer Systems                       161 Kite St<br>
Resource Information Unit                      Locked Bag 21<br>
                                           ORANGE NSW 2800<br>
ph: (02) 6391 3697                                 fax: (02) 6391 3740<br>
<br>
This message is intended for the addressee named and may contain confidential information. If you are not the intended recipient or received it in error, please delete the message and notify sender. Views expressed are those of the individual sender and are not necessarily the views of their organisation.</font>
--=_alternative 00077964CA256EDA_=--



More information about the MapServer-users mailing list