[Mapserver-users] Query: sorthshp, spatially clustering shapefiles on disk
richard.roger at agric.nsw.gov.au
richard.roger at agric.nsw.gov.au
Thu Jul 22 18:21:37 PDT 2004
This is a multipart message in MIME format.
--=_alternative 00077964CA256EDA_=
Content-Type: text/plain; charset="us-ascii"
Hello,
I have a query relating to the use of "sortshp", spatial indexing and
spatial clustering of data on disk.
I have a large shapefile (1.3 GB) showing the cadastre for my state (New
South Wales in Australia). Although the spatial indexing of the shapefile
makes retrieval of sections of the data fairly efficient, I believe it
could be improved by "clustering" the data spatially on disk. Can I use
"sortshp" to do this?
The symptom of the lack of efficiency is that when I display the complete
shapefile using, e.g., ArcView, the data comes up in patches all over the
place, rather randomly. If the data were laid out efficiently on disk, I
would expect the data to be displayed in some systematic manner. Another
symptom of inefficiency is that when I display only part of the state's
data, most of the data comes up quickly (because of the good indexing) but
it takes a while to fill in all the gaps (it seems it has to seek across a
large part of the file on disk to get all relevant data, because the data
are not spatially clustered on disk). [I do not wish to tile the data -
see below.]
For background information on this topic, I'd recommend this paper :
Fangju Wang and Yunli Sun, "Spatial Object Clustering and Buffering", IEEE
Multimedia, Jan-March 2002, Vol. 9, no. 1, pp. 26-42.
The key point here is that spatial indexing and the layout of the data on
disk ("clustering") are recognized as separate issues. For those who use
Informix and the Spatial Data Blade, the indexing is done using an R-Tree,
while you can cluster the data spatially using a sort key which "is
computed by applying the Hilbert space-filling curve algorithm to the
center point of an object's bounding box" (quote from the User's Guide,
Version 8.11, Aug. 2001, p. 8-160). I do not have appropriate access to
our Informix systems to allow me to spatially sort the data within
Informix, but I could sort the data outside Informix and re-insert it.
In essence, what I want to do is:
(1) add a "spatial sort key" to each shape, by writing the key value into
a new field in the attribute data for each shape, so that nearby
objects/shapes have nearby values of the spatial sort key (as with the
Hilbert curve);
(2) sort the shapefile by the spatial sort key using "sortshp", and write
it back to disk in sorted order.
Is this feasible?
The shapefile indexing mechanism is rather opaque to me, given the slim
documentation that I have been able to find. Are the spatial indexes
stored in order of shape object number? Do consecutive index values
actually retrieve objects which are (in general) nearby? I presume the
answer to this last point ought to be 'yes', given the quadtree indexing
is used.
I'd appreciate advice on how to go about doing this. I do have some
programming skills in C. By the way, my work would be done on a Sun
machine, using Solaris 8, and it's got 16 GB of RAM, so reading the entire
1.3 GB shapefile into memory is not out of the question, I think.
Advice gratefully received, and thanks to everyone who has contributed to
the Mapserver project,
Richard E. Roger
Dr. R. E. Roger NSW Department of
Primary Industries
Spatial Information Officer Systems 161 Kite St
Resource Information Unit Locked Bag 21
ORANGE NSW 2800
ph: (02) 6391 3697 fax: (02) 6391 3740
This message is intended for the addressee named and may contain
confidential information. If you are not the intended recipient or
received it in error, please delete the message and notify sender. Views
expressed are those of the individual sender and are not necessarily the
views of their organisation.
--=_alternative 00077964CA256EDA_=
Content-Type: text/html; charset="us-ascii"
<br><font size=2 face="sans-serif">Hello,</font>
<br>
<br><font size=2 face="sans-serif">I have a query relating to the use of "sortshp", spatial indexing and spatial clustering of data on disk.</font>
<br>
<br><font size=2 face="sans-serif">I have a large shapefile (1.3 GB) showing the cadastre for my state (New South Wales in Australia). Although the spatial indexing of the shapefile makes retrieval of sections of the data fairly efficient, I believe it could be improved by "clustering" the data spatially on disk. Can I use "sortshp" to do this? </font>
<br>
<br><font size=2 face="sans-serif">The symptom of the lack of efficiency is that when I display the complete shapefile using, e.g., ArcView, the data comes up in patches all over the place, rather randomly. If the data were laid out efficiently on disk, I would expect the data to be displayed in some systematic manner. Another symptom of inefficiency is that when I display only part of the state's data, most of the data comes up quickly (because of the good indexing) but it takes a while to fill in all the gaps (it seems it has to seek across a large part of the file on disk to get all relevant data, because the data are not spatially clustered on disk). [I do not wish to tile the data - see below.]</font>
<br>
<br><font size=2 face="sans-serif">For background information on this topic, I'd recommend this paper : </font>
<br>
<br><font size=2 face="sans-serif">Fangju Wang and Yunli Sun, "Spatial Object Clustering and Buffering", IEEE Multimedia, Jan-March 2002, Vol. 9, no. 1, pp. 26-42.</font>
<br>
<br><font size=2 face="sans-serif">The key point here is that spatial indexing and the layout of the data on disk ("clustering") are recognized as separate issues. For those who use Informix and the Spatial Data Blade, the indexing is done using an R-Tree, while you can cluster the data spatially using a sort key which "is computed by applying the Hilbert space-filling curve algorithm to the center point of an object's bounding box" (quote from the User's Guide, Version 8.11, Aug. 2001, p. 8-160). I do not have appropriate access to our Informix systems to allow me to spatially sort the data within Informix, but I could sort the data outside Informix and re-insert it.</font>
<br>
<br><font size=2 face="sans-serif">In essence, what I want to do is:</font>
<br><font size=2 face="sans-serif">(1) add a "spatial sort key" to each shape, by writing the key value into a new field in the attribute data for each shape, so that nearby objects/shapes have nearby values of the spatial sort key (as with the Hilbert curve);</font>
<br><font size=2 face="sans-serif">(2) sort the shapefile by the spatial sort key using "sortshp", and write it back to disk in sorted order.</font>
<br>
<br><font size=2 face="sans-serif">Is this feasible?</font>
<br>
<br><font size=2 face="sans-serif">The shapefile indexing mechanism is rather opaque to me, given the slim documentation that I have been able to find. Are the spatial indexes stored in order of shape object number? Do consecutive index values actually retrieve objects which are (in general) nearby? I presume the answer to this last point ought to be 'yes', given the quadtree indexing is used.</font>
<br>
<br><font size=2 face="sans-serif">I'd appreciate advice on how to go about doing this. I do have some programming skills in C. By the way, my work would be done on a Sun machine, using Solaris 8, and it's got 16 GB of RAM, so reading the entire 1.3 GB shapefile into memory is not out of the question, I think.</font>
<br>
<br><font size=2 face="sans-serif">Advice gratefully received, and thanks to everyone who has contributed to the Mapserver project,</font>
<br>
<br><font size=2 face="sans-serif">Richard E. Roger</font>
<br>
<br><font size=2 face="sans-serif">Dr. R. E. Roger NSW Department of <br>
Primary Industries<br>
Spatial Information Officer Systems 161 Kite St<br>
Resource Information Unit Locked Bag 21<br>
ORANGE NSW 2800<br>
ph: (02) 6391 3697 fax: (02) 6391 3740<br>
<br>
This message is intended for the addressee named and may contain confidential information. If you are not the intended recipient or received it in error, please delete the message and notify sender. Views expressed are those of the individual sender and are not necessarily the views of their organisation.</font>
--=_alternative 00077964CA256EDA_=--
More information about the MapServer-users
mailing list