[Gdal-dev] Content length field mismatch in shapefiles

Roger Bivand Roger.Bivand at nhh.no
Sun Apr 30 11:06:34 EDT 2006


On Sat, 29 Apr 2006, Roger Bivand wrote:

> I have a question, about shapefiles - specifically Geolytics seem to
> provide US subscribers with shapefiles with the content length of the
> *.shx 6 decimal above the *.shp content length, and 4 decimal above what
> it should be (after checking by creating a new *.shp and *.shx in
> shapelib), which throws shapelib (usually on the final geometry). This is
> generalising from a small sample, but the user who contacted me reported
> needing to use special treatment on the Geolytics files found at his
> university that he tried to read using shapelib-based software.

The difference of 4 is in 16-bit words, so 8 bytes: the same 8 bytes that
SHPReadObject() adds to the object size taken from the *.shx. Geolytics
construct the *.shx such that offset[i] == offset[i-1]+size[i-1], which
only works if the software doesn't use or doesn't trust the *.shx. They
don't get the content length right in the main file record header either,
and ignore the last paragraph of the main file record header section in
the ESRI specs (p. 5): "the content length for a record is the length of
the record *contents* section measured in 16-bit words". They include the
8-byte main file record header too, hence the wrong *.shx.
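
To make the word/byte arithmetic concrete, here is a minimal sketch of
decoding one 8-byte *.shx index record. Per the ESRI specs, each index
record holds the offset and the content length as big-endian 32-bit
integers counted in 16-bit words. The names shx_record, read_be32() and
decode_shx_record() are hypothetical helpers for illustration and are not
part of shapelib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded *.shx index record, converted from 16-bit words to bytes.
   Hypothetical type for illustration; not a shapelib structure. */
struct shx_record {
    uint32_t offset_bytes;   /* where the record header starts in the *.shp */
    uint32_t content_bytes;  /* length of the record contents only */
};

/* Assemble a big-endian 32-bit integer from four bytes. */
uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode the 8 raw bytes of one index record; both fields are stored
   in 16-bit words, so multiply by 2 to get bytes. */
struct shx_record decode_shx_record(const unsigned char *raw)
{
    struct shx_record rec;
    assert(raw != NULL);
    rec.offset_bytes  = read_be32(raw) * 2u;
    rec.content_bytes = read_be32(raw + 4) * 2u;
    return rec;
}
```

As a usage example, an index record with the made-up bytes
{0, 0, 0, 50, 0, 0, 0x01, 0x84} decodes to an offset of 50 words (100
bytes, just past the file header) and a content length of 388 words (776
bytes), which is the first *.shx value in the sample listing quoted below.
The value that listing should have had is 4 words, or 8 bytes, smaller.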

For my purposes (and for the record) the following fixes this case:

    hSHP = SHPOpen(CHAR(STRING_ELT(shpnm,0)), "rb" );
    if( hSHP == NULL )    
	error("unable to open file");

    qRep = LOGICAL_POINTER(repair)[0];

    nEntities = hSHP->nRecords;
    /* file length implied by *.shx */
    nImpliedEOF = hSHP->panRecOffset[hSHP->nRecords-1] +
        hSHP->panRecSize[hSHP->nRecords-1] + 8;
    if (nImpliedEOF > hSHP->nFileSize && qRep == 0) {
        /* implied file length greater than file size */
        error("File size and implied file size differ");
    }

    if (qRep == 1 && nImpliedEOF > hSHP->nFileSize) {
	for (i=1, j=0; i < hSHP->nRecords; i++)
	    if (hSHP->panRecOffset[i] != (hSHP->panRecOffset[i-1] + 
	        hSHP->panRecSize[i-1])) j++;
	if (j > 0) error("Cannot repair file size error");
	if (j == 0) {/* Geolytics size + 8 bug */
	    /* every record, including the first, carries the extra 8 bytes */
	    for (i=0; i < hSHP->nRecords; i++) 
	        hSHP->panRecSize[i] = hSHP->panRecSize[i] - 8;
	    warning("SHX object size off by 8 bug repaired");
	}
    }

This does not seem to cause problems on shapefiles following ESRI specs, 
and will only correct geometry sizes from *.shx if all the geometries are 
off by 8 bytes - the symptom of this sloppiness with the specs.
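
For anyone wanting to try the check outside R and shapelib, the same
logic can be sketched on plain arrays. This is only an illustration under
stated assumptions: check_and_repair_shx() and the shx_status values are
hypothetical names, offsets and sizes are in bytes as shapelib holds them
in panRecOffset and panRecSize, and it adjusts every record's size,
including the first, on the assumption that all sizes carry the same
8-byte excess.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch. */
enum shx_status { SHX_OK = 0, SHX_REPAIRED = 1, SHX_UNREPAIRABLE = -1 };

/* rec_offset and rec_size are in bytes; file_size is the *.shp size.
   rec_size is modified in place only when the repair is applied. */
enum shx_status check_and_repair_shx(const int *rec_offset, int *rec_size,
                                     int n_records, int file_size)
{
    int i, implied_eof;

    assert(rec_offset != NULL && rec_size != NULL && n_records > 0);

    /* File length implied by the index: last offset + last size + the
       8-byte record header. */
    implied_eof = rec_offset[n_records - 1] + rec_size[n_records - 1] + 8;
    if (implied_eof <= file_size)
        return SHX_OK;               /* index agrees with the file */

    /* The repairable symptom: each offset equals the previous offset
       plus the previous size, i.e. the size already includes the header. */
    for (i = 1; i < n_records; i++)
        if (rec_offset[i] != rec_offset[i - 1] + rec_size[i - 1])
            return SHX_UNREPAIRABLE; /* some other kind of damage */

    /* Remove the 8-byte header from every recorded size. */
    for (i = 0; i < n_records; i++)
        rec_size[i] -= 8;
    return SHX_REPAIRED;
}
```

With three records of true content sizes 20, 30 and 40 bytes in a
214-byte file, an index recording offsets 100, 128, 166 and sizes 28, 38,
48 is reported as repaired and its sizes become 20, 30, 40; an index that
already follows the specs is left untouched.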

If anyone can see errors of interpretation here, I'd be grateful. I don't 
think this fix can be applied to OGR/shapelib from the API, though, is 
that right?

Best wishes,

Roger


> 
> Has anyone ever heard of this? The files will read into ArcGIS, and in R
> the shapefiles package functions read.shp() and read.shx(), which only use
> native R binary reads, can read them sequentially, because they don't try
> to do random access on the *.shp. ArcGIS seems to spend more time than
> usual for files of that complexity, but gets round the problem; v.in.ogr
> in GRASS says that no geometry is available for one DBF record, but
> processes all but the last geometry.
> 
> The Geolytics problem seems to be that the length values in the *.shx file
> don't agree with the *.shp. ESRI say "The content length stored in the
> index record is the same as the value stored in the main file record
> header", but for a sample file:
> 
> > library(shapefiles)
> > geolytics <- read.shp("jw_wacounty.shp") 
> > geolytics_content.length <- sapply(geolytics$shp, function(x) 
>    x$content.length)
> > geolytics_content.length
>  [1]  382  542  726 3574  750  846  398  806 1550 1438  878  646  902  590  960
> [16]  710 2190  534 2374  582  982  854  438 2446  750  158 2390  414 1422  430
> [31]  998  342 1782 1094  254  574 1096 1182 1558
> > geoshx <- read.shx("jw_wacounty.shx")
> > geoshx$index[,2]
>  [1]  388  548  732 3580  756  852  404  812 1556 1444  884  652  908  596  966
> [16]  716 2196  540 2380  588  988  860  444 2452  756  164 2396  420 1428  436
> [31] 1004  348 1788 1100  260  580 1102 1188 1564
> 
> I was sent the sample file by a user unable to read it into R using the 
> shapelib-based packages, but because it is Geolytics, I can't post it. I 
> can ask for permission to email a copy.
> 
> Roger
> 
> 

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no



