[QGIS-Developer] GPKG and FID -- can we fix this mess?

Nyall Dawson nyall.dawson at gmail.com
Tue Oct 13 15:54:02 PDT 2020


On Wed, 14 Oct 2020 at 08:23, Even Rouault <even.rouault at spatialys.com> wrote:
>
> Hi Nyall,
>
>
>
> > - The type constraint on the fid column makes it really hard to
> > translate datasets with an existing, non-numeric "fid" column into
> > geopackage. Eg. GML files often have a textual fid string, and
> > attempting to convert these to gpkg results in a string of errors
> > about string values not being usable as fid values, and an empty
> > result layer. The only workaround here is to translate first to an
> > alternative format (such as shp!), delete the fid column, and THEN
> > save as gpkg.
>
>
>
> What exactly do you do to get such issues? If you open a GML file, you'll get a 'gml_id' string column, so when saving that to GPKG or whatever, you'll also get a regular 'gml_id' column. This has nothing to do with the GPKG fid column. Or do you do something to inject the content of the 'gml_id' into the GPKG 'fid' column? I can't reproduce a problem with a plain ogr2ogr or with Export/Save Features As in QGIS (with default settings at least).

Well -- here's an example file:
https://github.com/qgis/QGIS/blob/master/python/plugins/processing/tests/testdata/dissolve_polys.gml

Not sure how that file was created in the first place, but I've seen
many like it!
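
To make the failure mode concrete, here's a rough sketch (plain Python, not
GDAL or QGIS code) of the check that such a file trips over: a GeoPackage fid
is an integer primary key, so a textual "fid" attribute can't be carried
across as-is. The function names and the sample values are hypothetical and
are not taken from the file above.

```python
def usable_as_gpkg_fid(value):
    """Return True if value could be stored as an integer fid."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; treat it as not a valid fid
        return False
    if isinstance(value, int):
        return True
    if isinstance(value, str):
        try:
            int(value.strip())
        except ValueError:
            return False
        return True
    return False


def unusable_fids(values):
    """Collect the source 'fid' values that cannot become integer fids."""
    return [v for v in values if not usable_as_gpkg_fid(v)]


# Hypothetical "fid" attribute values from a source layer
sample_fids = ["polys.0", "polys.1", "7", 8]
rejected = unusable_fids(sample_fids)
```

Any value that lands in `rejected` would need the source "fid" column to be
renamed or dropped before the layer can be written out.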

>
>
>
> > - The fid unique constraint, while understandable, results in a HUGE
> > raft of issues while working with these. It's SO easy to get a
> > situation where you have duplicate fids in an edit buffer, and no way
> > to save these features back to the gpkg. You get a series of 1000s of
> > errors about duplicate fid, and then an ambiguous state where you're
> > completely unsure exactly what's been saved and what's about to be
> > lost. This isn't just attributable to a single tool in QGIS -- it's
> > possible to end up with duplicate fids through so many different
> > operations, including really simple stuff like copying and pasting
> > features!
>
>
>
> Isn't the main issue here that we expose the fid column as a regular QGIS field, instead of keeping it as the fid specific property of a QgsFeature, as it should probably have remained? That's really the main peculiarity of how the GPKG format is handled in the OGR provider.

Yes, ideally. But at this stage we can't completely hide the fid field
without breaking existing QGIS projects. Breaking scripts is bad, but
breaking projects is a complete no-go!

>
>
>
> > I propose that we
> >
> > 1. demote fids to being only a "semi-permanent" row identifier, with
> > the message being that "sometimes these WILL change and you can't rely
> > on them as a permanent id field for joins and row identification". If
> > users require a permanent unique identifier (i.e. a primary key) on
> > their table then THEY have to make and manage that themselves, just
> > like shapefiles etc.
>
>
>
> Why don't we just treat the fid as the regular FID returned by OGR for other drivers?

That would also work, but we'd still need to expose these as an "fid"
attribute (for project compatibility, as noted above)

>
> I'm not familiar with the join functionality in QGIS, but isn't there a way to use the QgsFeature.id() to do a join? That could be a solution to have a permanent stable id. But if not, yes, requiring users to create their own managed unique identifier would be understandable if they want to have control over the value of the identifier.

QgsFeature::id() isn't intended to be even semi-permanent. Just
"mostly constant for the duration of a single data provider's
lifetime" (i.e. a QGIS session). I don't think we can use it as a
permanent way of joining features.
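
For what a user-managed identifier could look like, here's a minimal sketch
in plain Python, with dicts standing in for features. The "stable_id" field
name and the helper are hypothetical, and UUIDs are just one possible choice
of key. The point is only that a key the user owns survives any renumbering
of fids.

```python
import uuid


def ensure_stable_keys(rows, key_field="stable_id"):
    """Give every row a user-managed key, keeping any key already present."""
    for row in rows:
        if not row.get(key_field):
            row[key_field] = str(uuid.uuid4())
    return rows


# Dicts standing in for features; the second one already carries a key
rows = [
    {"fid": 1, "name": "a"},
    {"fid": 2, "name": "b", "stable_id": "keep-me"},
]
ensure_stable_keys(rows)
keys_before = [row["stable_id"] for row in rows]

# Simulate the backend handing out completely new fids
for new_fid, row in enumerate(rows, start=100):
    row["fid"] = new_fid

keys_after = [row["stable_id"] for row in rows]
```

Joins made on "stable_id" keep working after the fids change, which is
exactly what a join on the fid (or on QgsFeature::id()) can't promise.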

>
>
>
> >
>
> > 2. expose fids as a read-only field. Users can still see them if they
> > want, but they cannot edit them.
>
>
>
> Sounds reasonable. But perhaps not exposing them as a column at all (and thus content that can be duplicated by error), and keeping it as the QgsFeature.id(), would be even safer.
>
>
>
> >
>
> > 3. make QGIS (or GDAL?) ALWAYS generate a completely new fid whenever
> > a row is changed or added. Throw away the old value, make a new one on
> > EVERY edit/addition.
>
>
>
> I'd be -1 on that, at least on the GDAL side. That would break an important and reasonable assumption of the format. That's how a row is identified... Why would we do that specifically on GPKG and not Postgres or other databases?

Because, for fid at least, it's just an "internal detail" that we're
showing. To use the postgres analogy we don't manage internal record
identifiers with the returned features, just the actual exposed
columns themselves and leave the rest to the backend. And here I think
the backend (GDAL, or QGIS' OGR provider) should manage fids
transparently from the client (the QgsVectorLayer).

>
>
>
> > Yes, these changes will break existing workflows, and possibly break
> > existing tools/scripts. But honestly, in my experience and the
> > experience of my customers, there's a COMPLETE lack of faith and trust
> > in GPKG at this stage. EVERYONE has their horror stories of lost data
> > and mangled datasets. We've got to do something drastic, and we've got
> > to do it sooner rather than later to salvage what little hope does
> > remain for this format.
>
>
>
> To sum up my understanding of the problem: it seems to me that all the issues originate from exposing the OGRFeature.GetFID() content as a QGIS 'fid' column instead of just putting it in QgsFeature.id(). Otherwise we'd have problems with many other OGR formats. Maybe I'm missing something.

That's correct... but now we need to find a way to resolve this
situation without breaking existing projects. So (as noted above) I
think we still need to expose an "fid" column, but just ignore any
values in it for the purposes of saving records to the GPKG itself.
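
Roughly what I have in mind, as a sketch only: this uses the sqlite3 module
from the Python standard library against a plain SQLite table, not a real
GeoPackage, and it is not how the OGR provider is written today. The table
layout and the save_new_features() helper are hypothetical. Whatever "fid"
the client sends for a new feature is ignored, and the database hands out the
real one.

```python
import sqlite3


def create_layer(conn):
    # Plain SQLite table with an integer primary key standing in for the fid
    conn.execute(
        "CREATE TABLE layer (fid INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
    )


def save_new_features(conn, features):
    """Insert new features, ignoring any client-supplied 'fid' value."""
    assigned = []
    for attrs in features:
        # Only the real attributes are written; the fid comes from SQLite
        cur = conn.execute(
            "INSERT INTO layer (name) VALUES (?)", (attrs.get("name"),)
        )
        assigned.append(cur.lastrowid)
    conn.commit()
    return assigned


conn = sqlite3.connect(":memory:")
create_layer(conn)

# An edit buffer where a pasted copy carries the same fid as its source
edit_buffer = [
    {"fid": 1, "name": "original"},
    {"fid": 1, "name": "pasted copy"},
]
assigned_fids = save_new_features(conn, edit_buffer)
saved = conn.execute("SELECT fid, name FROM layer ORDER BY fid").fetchall()
```

Both rows are saved despite the clashing values in the buffer, and the client
can read the assigned fids back afterwards if it still wants to display them.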

Nyall

>
> Even
>
> --
> Spatialys - Geospatial professional services
> http://www.spatialys.com

