[Mapserver-users] Query: sorthshp, spatially clustering shapefiles on disk

Thu Jul 22 20:41:38 PDT 2004

Hello Richard,

This is a very interesting idea. sortshp works by read the field you 
specify and build an array of struct {index, value} then sorting the 
array based on value and finally copies all the records from the infile 
to the outfile based on the sorted order. So I would think that the 
resulting shapefile will be as spatially contiguous as you can get it in 
a shapefile.

The other thing the will speed up performance some id to eliminate as 
many DBF column as you can to keep the file small as possible.

The spatial index is a quad tree index. I haven't looked the the 
specific details of the implementation, others may elaborate. It works 
by taking the extents of the shapefile and making spatial buckets that are
1) the extents
2) the extents quartered
3) each quarter quartered
4) each of those quartered again
5) and so forth to some depth or so that there are no more then n 
objects per bucket

and the objects are sorted into the buckets based fitting into the 
smallest spatial bucket that it will fit in. Then when the map is being 
drawn the extents of the map are overlaid on the spatial index buckets 
and the object associated with the spatial buckets.

I would be interested in the algorithm details for computing the spatial 
key and would consider making a sortshp based on the algorithm. There is 
no reason that it could not be computed on the fly to avoid having to 
add it to the shapefile.

Regards,
   -Stephen Woodbridge

richard.roger at agric.nsw.gov.au wrote:

> Hello,
> 
> I have a query relating to the use of "sortshp", spatial indexing and 
> spatial clustering of data on disk.
> 
> I have a large shapefile (1.3 GB) showing the cadastre for my state (New 
> South Wales in Australia).  Although the spatial indexing of the shapefile 
> makes retrieval of sections of the data fairly efficient, I believe it 
> could be improved by "clustering" the data spatially on disk.  Can I use 
> "sortshp" to do this? 
> 
> The symptom of the lack of efficiency is that when I display the complete 
> shapefile using, e.g., ArcView, the data comes up in patches all over the 
> place, rather randomly.  If the data were laid out efficiently on disk, I 
> would expect the data to be displayed in some systematic manner.  Another 
> symptom of inefficiency is that when I display only part of the state's 
> data, most of the data comes up quickly (because of the good indexing) but 
> it takes a while to fill in all the gaps (it seems it has to seek across a 
> large part of the file on disk to get all relevant data, because the data 
> are not spatially clustered on disk).  [I do not wish to tile the data - 
> see below.]
> 
> For background information on this topic, I'd recommend this paper : 
> 
> Fangju Wang and Yunli Sun, "Spatial Object Clustering and Buffering", IEEE 
> Multimedia, Jan-March 2002, Vol. 9, no. 1, pp. 26-42.
> 
> The key point here is that spatial indexing and the layout of the data on 
> disk ("clustering") are recognized as separate issues.  For those who use 
> Informix and the Spatial Data Blade, the indexing is done using an R-Tree, 
> while you can cluster the data spatially using a sort key which "is 
> computed by applying the Hilbert space-filling curve algorithm to the 
> center point of an object's bounding box" (quote from the User's Guide, 
> Version 8.11, Aug. 2001, p. 8-160).  I do not have appropriate access to 
> our Informix systems to allow me to spatially sort the data within 
> Informix, but I could sort the data outside Informix and re-insert it.
> 
> In essence, what I want to do is:
> (1) add a "spatial sort key" to each shape, by writing the key value into 
> a new field in the attribute data for each shape, so that nearby 
> objects/shapes have nearby values of the spatial sort key (as with the 
> Hilbert curve);
> (2) sort the shapefile by the spatial sort key using "sortshp", and write 
> it back to disk in sorted order.
> 
> Is this feasible?
> 
> The shapefile indexing mechanism is rather opaque to me, given the slim 
> documentation that I have been able to find.  Are the spatial indexes 
> stored in order of shape object number?  Do consecutive index values 
> actually retrieve objects which are (in general) nearby?  I presume the 
> answer to this last point ought to be 'yes', given the quadtree indexing 
> is used.
> 
> I'd appreciate advice on how to go about doing this.  I do have some 
> programming skills in C.  By the way, my work would be done on a Sun 
> machine, using Solaris 8, and it's got 16 GB of RAM, so reading the entire 
> 1.3 GB shapefile into memory is not out of the question, I think.
> 
> Advice gratefully received, and thanks to everyone who has contributed to 
> the Mapserver project,
> 
> Richard E. Roger
> 
> Dr. R. E. Roger                             NSW Department of 
>                                          Primary Industries
> Spatial Information Officer Systems                    161 Kite St
> Resource Information Unit                     Locked Bag 21
>                                            ORANGE NSW 2800
> ph: (02) 6391 3697                               fax: (02) 6391 3740
> 
> This message is intended for the addressee named and may contain 
> confidential information. If you are not the intended recipient or 
> received it in error, please delete the message and notify sender. Views 
> expressed are those of the individual sender and are not necessarily the 
> views of their organisation.