[Gdal-dev] Content length field mismatch in shapefiles
Roger Bivand
Roger.Bivand at nhh.no
Sun Apr 30 11:06:34 EDT 2006
On Sat, 29 Apr 2006, Roger Bivand wrote:
> I have a question, about shapefiles - specifically Geolytics seem to
> provide US subscribers with shapefiles with the content length of the
> *.shx 6 decimal above the *.shp content length, and 4 decimal above what
> it should be (after checking by creating a new *.shp and *.shx in
> shapelib), which throws shapelib (usually on the final geometry). This is
> generalising from a small sample, but the user who contacted me reported
> needing to use special treatment on the Geolytics files found at his
> university that he tried to read using shapelib-based software.
The 4 decimal is in 16-bit words, so 8 bytes: the same 8 that
SHPReadObject() adds to the object size taken from the *.shx. Geolytics
construct the *.shx such that offset[i] == offset[i-1]+size[i-1], which
only works if the software doesn't use or doesn't trust the *.shx. They
don't get the content length right in the main file record header either,
and ignore the last paragraph of the main file record header section in
the ESRI specs (p. 5): "the content length for a record is the length of
the record *contents* section measured in 16-bit words". They include the
8-byte main file record header too, hence the wrong *.shx.
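To make the units concrete, here is a minimal sketch of decoding one 8-byte
*.shx index record and converting its fields to bytes. It assumes only what
the ESRI specs state (big-endian 32-bit offset and content length, both in
16-bit words); the function names are hypothetical and are not part of
shapelib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size of the main file record header (record number + content length). */
enum { SHP_RECORD_HEADER_BYTES = 8 };

/* Decode a big-endian 32-bit integer, the byte order the ESRI specs use
   for the offset and content length fields. */
int32_t read_be32(const unsigned char *p)
{
    assert(p != NULL);
    return (int32_t)(((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                     ((uint32_t)p[2] << 8)  |  (uint32_t)p[3]);
}

/* Offsets and content lengths are stored in 16-bit words. */
long words_to_bytes(int32_t words)
{
    return 2L * words;
}

/* Byte offset at which the following record starts when the content
   length excludes the 8-byte record header, as the specs require. */
long next_record_offset_bytes(int32_t offset_words, int32_t content_words)
{
    return words_to_bytes(offset_words) + SHP_RECORD_HEADER_BYTES +
           words_to_bytes(content_words);
}
```

For the first record of the quoted sample, the *.shx holds 388 words; if the
correct value is 4 words lower, as reported, the excess is
words_to_bytes(4), i.e. 8 bytes, exactly the size of the record header.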
For my purposes (and for the record) the following fixes this case:
    hSHP = SHPOpen(CHAR(STRING_ELT(shpnm,0)), "rb" );
    if( hSHP == NULL )
        error("unable to open file");

    qRep = LOGICAL_POINTER(repair)[0];
    nEntities = hSHP->nRecords;

    /* file length implied by *.shx: end of the last record */
    nImpliedEOF = hSHP->panRecOffset[hSHP->nRecords-1] +
        hSHP->panRecSize[hSHP->nRecords-1] + 8;

    if (nImpliedEOF > hSHP->nFileSize && qRep == 0) {
        /* implied file length greater than file size */
        error("File size and implied file size differ");
    }

    if (qRep == 1 && nImpliedEOF > hSHP->nFileSize) {
        /* count records that do NOT follow the Geolytics pattern
           offset[i] == offset[i-1] + size[i-1] */
        for (i=1, j=0; i < hSHP->nRecords; i++)
            if (hSHP->panRecOffset[i] != (hSHP->panRecOffset[i-1] +
                hSHP->panRecSize[i-1])) j++;
        if (j > 0) error("Cannot repair file size error");
        if (j == 0) { /* Geolytics size + 8 bug */
            /* start at 0: the first record is 8 bytes too long as well */
            for (i=0; i < hSHP->nRecords; i++)
                hSHP->panRecSize[i] = hSHP->panRecSize[i] - 8;
            warning("SHX object size off by 8 bug repaired");
        }
    }
This does not seem to cause problems on shapefiles following ESRI specs,
and will only correct geometry sizes from *.shx if all the geometries are
off by 8 bytes - the symptom of this sloppiness with the specs.
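The same check can be tried without shapelib or R. The following is only a
sketch on plain arrays, assuming offsets and sizes are held in bytes (as the
"- 8" in the fragment implies); the function and enum names are hypothetical.
It adjusts every record, including the first.

```c
#include <assert.h>
#include <stddef.h>

enum shx_status { SHX_OK = 0, SHX_REPAIRED = 1, SHX_UNREPAIRABLE = 2 };

/* offset[] and size[] are per-record values from the *.shx, in bytes;
   file_size is the length of the *.shp in bytes. Sizes are modified
   in place only when the whole index shows the size + 8 pattern. */
enum shx_status check_and_repair_sizes(const long *offset, long *size,
                                       int n, long file_size)
{
    int i;

    assert(offset != NULL && size != NULL);
    if (n <= 0)
        return SHX_OK;

    /* End of the last record as implied by the index; consistent
       files never point past the end of the *.shp. */
    if (offset[n - 1] + size[n - 1] + 8 <= file_size)
        return SHX_OK;

    /* Repair only if every record follows the pattern
       offset[i] == offset[i-1] + size[i-1]. */
    for (i = 1; i < n; i++)
        if (offset[i] != offset[i - 1] + size[i - 1])
            return SHX_UNREPAIRABLE;

    /* Each size wrongly includes the 8-byte record header. */
    for (i = 0; i < n; i++)
        size[i] -= 8;
    return SHX_REPAIRED;
}
```

With two records whose true contents are 20 and 12 bytes in a 148-byte
*.shp, index sizes of 28 and 20 are repaired back to 20 and 12, while an
index that already holds 20 and 12 is left untouched.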
If anyone can see errors of interpretation here, I'd be grateful to hear
of them. I don't think this fix can be applied to OGR/shapelib from the
API, though, is that right?
Best wishes,
Roger
>
> Has anyone ever heard of this? The files will read into ArcGIS, and in R
> the shapefiles package functions read.shp() and read.shx(), which only
> use native R binary reads, can read them sequentially, because they don't
> try to do random access on the *.shp. ArcGIS seems to spend more time
> than usual for files of that complexity, but gets round the problem;
> v.in.ogr in GRASS says that no geometry is available for one DBF record,
> but processes all but the last geometry.
>
> The Geolytics problem seems to be that the length values in the *.shx file
> don't agree with the *.shp. ESRI say "The content length stored in the
> index record is the same as the value stored in the main file record
> header", but for a sample file:
>
> > library(shapefiles)
> > geolytics <- read.shp("jw_wacounty.shp")
> > geolytics_content.length <- sapply(geolytics$shp, function(x)
> x$content.length)
> > geolytics_content.length
> [1] 382 542 726 3574 750 846 398 806 1550 1438 878 646 902 590 960
> [16] 710 2190 534 2374 582 982 854 438 2446 750 158 2390 414 1422 430
> [31] 998 342 1782 1094 254 574 1096 1182 1558
> > geoshx <- read.shx("jw_wacounty.shx")
> > geoshx$index[,2]
> [1] 388 548 732 3580 756 852 404 812 1556 1444 884 652 908 596 966
> [16] 716 2196 540 2380 588 988 860 444 2452 756 164 2396 420 1428 436
> [31] 1004 348 1788 1100 260 580 1102 1188 1564
>
> I was sent the sample file by a user unable to read it into R using the
> shapelib-based packages, but because it is Geolytics, I can't post it. I
> can ask for permission to email a copy.
>
> Roger
>
>
--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no