[gdal-dev] Interoperability issues with deleted features in shapefiles

Even Rouault even.rouault at spatialys.com
Wed Jan 20 15:48:03 PST 2016


On Wednesday 20 January 2016 20:56:27, Jan Heckman wrote:
> Hi Even, everyone,
> Sorry for not including the list - my mistake.
> I've experimented with larger shapefiles than 2 GB but not necessarily in
> combination with editing.
> I'll do a few tests when I get around to it. Doesn't the .shx file get
> rewritten anyway? 

If there's any change, yes, the .shx is entirely rewritten (but for deleted 
features, it still points to the shape location in the .shp). I'm not sure why 
the .shx isn't updated on the fly when shapes are created/moved. Probably 
because this was the most straightforward to implement, and .shx files are 
generally rather small compared to .shp/.dbf, so ingesting them and rewriting 
them completely is generally not an issue.
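
To illustrate why rewriting the .shx wholesale is cheap: per the ESRI
shapefile specification, the index is a 100-byte header followed by one 8-byte
entry per shape (offset and content length, both big-endian 32-bit integers
counted in 16-bit words). A minimal Python sketch, where read_shx_entries is a
name made up for this example and not part of shapelib or OGR:

```python
import struct

SHX_HEADER_SIZE = 100   # fixed-size file header, per the ESRI whitepaper
SHX_RECORD_SIZE = 8     # offset (4 bytes) + content length (4 bytes)

def read_shx_entries(shx_bytes):
    """Return a list of (offset_in_bytes, content_length_in_bytes) tuples.

    Hypothetical helper for illustration only. Offsets and lengths are
    stored as big-endian 32-bit integers counted in 16-bit words, so
    they are doubled here to get byte values.
    """
    body = shx_bytes[SHX_HEADER_SIZE:]
    entries = []
    for pos in range(0, len(body) - SHX_RECORD_SIZE + 1, SHX_RECORD_SIZE):
        offset_words, length_words = struct.unpack_from(">ii", body, pos)
        entries.append((offset_words * 2, length_words * 2))
    return entries
```

At 8 bytes per shape, a million features take about 8 MB of index, which is
small next to the matching .shp and .dbf.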

> There could be some time-consuming actions at closing
> time partially masking the additional time needed for shp.
> Time needed for compacting the .shp would have a considerable potential
> variation depending on the extent of editing and the displacement caused in
> the shapefile.

Actually the current REPACK implementation is quite brutal: it creates fresh 
temporary .shp and .dbf files and, if things are OK, renames them.
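
Roughly, the temporary-file-then-rename approach looks like the sketch below,
shown for the .dbf only and in Python rather than the driver's own code.
repack_dbf is a hypothetical name, not the OGR implementation; it assumes the
standard dBase header layout (record count at bytes 4-7, header length at
bytes 8-9, record length at bytes 10-11, all little-endian) and ignores the
.shp/.shx, which would need the same treatment.

```python
import os
import struct
import tempfile

def repack_dbf(path):
    """Rewrite the .dbf at `path` without its deleted records.

    Illustration only: returns the number of records dropped.
    """
    with open(path, "rb") as src:
        data = src.read()
    # Standard dBase header: record count, header length, record length.
    n_records, header_len, record_len = struct.unpack_from("<IHH", data, 4)
    kept = []
    for i in range(n_records):
        start = header_len + i * record_len
        record = data[start:start + record_len]
        if record[:1] != b"*":          # '*' (0x2A) marks a deleted record
            kept.append(record)
    header = bytearray(data[:header_len])
    struct.pack_into("<I", header, 4, len(kept))   # update the record count
    # Write to a temporary file in the same directory, then rename it over
    # the original only once everything has been written successfully.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "wb") as dst:
            dst.write(bytes(header))
            dst.write(b"".join(kept))
            dst.write(b"\x1a")          # end-of-file marker
        os.replace(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)
        raise
    return n_records - len(kept)
```

The cost is proportional to the file size whatever the number of deletions,
which is why it can be noticeable on very large files.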

> My first idea was not to compact the shapefile (automatically), but do the
> .shx only (leaving out indexes of deleted shp records or setting their
> length in the shx and/or shp to zero). But there are some programs which do
> not pay much attention to the .shx anyway.

Yes, OpenJump is one of them, I believe. Or perhaps it uses it if present and 
otherwise works only with the .shp. Jukka would know.

> If we can discount such
> behaviour, the shx route is ok, 

I'm not sure I follow what you mean by "the shx route".

> and a compact can be done as a separate
> action, like PACK in good old Dbase.
> Jan
> 
> 
> On Wed, Jan 20, 2016 at 11:51 AM, Even Rouault
> <even.rouault at spatialys.com> wrote:
> > Jan,
> > 
> > Do you mind sharing your opinion with the list too ?
> > 
> > > Hi,
> > > I started a bit of a lib years and years ago when the shapelib code
> > > didn't have delete.
> > > I implemented my own delete much as you mention (both dbf and shp/shx),
> > > with a repack at closing.
> > > Repack at closing never bothered me in the sense of any (very)
> > > noticeable delay.
> > > So I think it's indeed the best solution and the price is not high.
> > 
> > Depends on the size of shapefiles. For people with 2 GB shapefiles, that
> > might be noticeable. But editing operations on such shapefiles aren't
> > necessarily very common, admittedly.
> > 
> > > Regards,
> > > Jan
> > > 
> > > On Wed, Jan 20, 2016 at 12:42 AM, Even Rouault
> > > <even.rouault at spatialys.com> wrote:
> > > > Hi,
> > > > 
> > > > There has been some recent discussion on the qgis list about an old
> > > > ticket https://hub.qgis.org/issues/11007
> > > > 
> > > > Basically the issue seems to be that a lot / most non-shapelib /
> > > > non-OGR based shapefile readers don't understand the way OGR deletes
> > > > features in shapefiles.
> > > > 
> > > > When OGR/shapelib deletes a feature, it simply marks the
> > > > corresponding record in the DBF as deleted (technically putting a
> > > > '*' character in the first byte of the DBF record) and that's all.
> > > > Very fast, and OGR handles that consistently (with the small
> > > > restriction that the feature count reports the deleted features as
> > > > still existing, whereas iteration and getting features by id do not
> > > > return them).
> > > > 
> > > > This way of deleting a DBF record is the documented one:
> > > > http://www.clicketyclick.dk/databases/xbase/format/dbf.html#DBF_STRUCT
> > > > """
> > > > Deleted flag:
> > > > Value           Description
> > > > 2Ah (*)         Record is deleted
> > > > 20h (blank)     Record is valid
> > > > """
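
To make the flag concrete, here is a small Python sketch of what "marking as
deleted" amounts to on an in-memory copy of a .dbf. mark_deleted is a name
invented for this example (it is not a shapelib function) and it assumes the
standard dBase header layout: record count at bytes 4-7, header length at
bytes 8-9, record length at bytes 10-11, little-endian.

```python
import struct

def mark_deleted(dbf, index):
    """Flag record `index` (0-based) as deleted in a mutable .dbf buffer.

    Only one byte changes; the record count in the header and all other
    bytes stay as they are.
    """
    n_records, header_len, record_len = struct.unpack_from("<IHH", dbf, 4)
    if not 0 <= index < n_records:
        raise IndexError("no such record")
    dbf[header_len + index * record_len] = 0x2A   # '*': record is deleted
```

Since the header count is untouched, a reader that ignores the flag still
sees every record, deleted or not.
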
> > > > 
> > > > However other GIS packages, and among others a famous proprietary
> > > > one - let's call it "LineGIS" - do not recognize the deleted feature
> > > > as deleted when reading such shapefiles, and display both the
> > > > geometry and attributes. More annoying, when "LineGIS" deletes
> > > > another record in such a shapefile and saves the result, the
> > > > shapefile can no longer be opened afterwards, with an error message
> > > > reporting an inconsistency in the number of shapes w.r.t. the number
> > > > of records (and on inspection, the shp/shx indeed contain N - 1
> > > > records and the dbf N - 2, so it looks like it would be semi-aware
> > > > of deleted DBF records).
> > > > When "LineGIS" starts with a "clean" shapefile and deletes a record
> > > > in it, it removes the corresponding entries in the .dbf, .shp and
> > > > .shx files, which is the result of the REPACK operation the
> > > > shapefile driver can do if explicitly asked.
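
This kind of mismatch can be spotted without any GIS package by comparing the
number of index entries with the DBF record count. A rough Python sketch, with
record_counts being an invented helper name; it assumes the 100-byte header
and 8-byte entries of the .shx, and the standard dBase header fields:

```python
import struct

def record_counts(shx_size, dbf_bytes):
    """Return (shx_count, dbf_total, dbf_live) for a shapefile pair.

    shx_size is the size of the .shx file in bytes; dbf_bytes is the
    content of the .dbf file.
    """
    shx_count = (shx_size - 100) // 8          # 100-byte header, 8 bytes/shape
    dbf_total, header_len, record_len = struct.unpack_from("<IHH", dbf_bytes, 4)
    dbf_live = sum(
        1
        for i in range(dbf_total)
        if dbf_bytes[header_len + i * record_len] != 0x2A   # not flagged '*'
    )
    return shx_count, dbf_total, dbf_live
```

A pair is consistent in the "one record per shape" sense when shx_count equals
dbf_total; dbf_live is lower than dbf_total whenever flagged records are
present.
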
> > > > 
> > > > "LineGIS" isn't the only one to have trouble with deleted DBF
> > > > records. From what I can see, GeoTools (just picking a random
> > > > example) has only fully handled them since 2014:
> > > > https://osgeo-org.atlassian.net/browse/GEOT-4539
> > > > https://github.com/geotools/geotools/commit/e7333ccb284d137f3240ce5d0d09b3d7195f1890
> > > > 
> > > > The shapefile specification (
> > > > http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf ) doesn't
> > > > mention how deleted records should be handled, in particular
> > > > whether the requirement "The table must contain one record per
> > > > shape feature" (page 25) allows DBF records marked as deleted...
> > > > Anyway the theory/spec and the practice are 2 different things.
> > > > 
> > > > What surprises me is that such an issue didn't raise louder
> > > > complaints before, as the OGR / shapelib behaviour has been the
> > > > same since forever AFAIK.
> > > > 
> > > > I'm wondering if OGR shouldn't automatically run REPACK when closing
> > > > a shapefile when deletions (as well as edit operations of existing
> > > > features leading to holes in the .shp) have happened. The side
> > > > effects of this would be a slower closing (creation-only scenarios
> > > > wouldn't be affected) and a renumbering of the FID of features
> > > > after the deleted feature(s).
> > > > 
> > > > Thoughts ?
> > > > 
> > > > (Regarding the QGIS issue, as QGIS explicitly runs REPACK after
> > > > editing/deleting, it is not clear why the issue would persist. But
> > > > some reports might be with older QGIS/GDAL versions.)
> > > > 
> > > > Even
> > > > 
> > > > --
> > > > Spatialys - Geospatial professional services
> > > > http://www.spatialys.com
> > > > _______________________________________________
> > > > gdal-dev mailing list
> > > > gdal-dev at lists.osgeo.org
> > > > http://lists.osgeo.org/mailman/listinfo/gdal-dev
> > 

-- 
Spatialys - Geospatial professional services
http://www.spatialys.com

