[QGIS-Developer] GPKG and FID -- can we fix this mess?

Nyall Dawson nyall.dawson at gmail.com
Tue Oct 13 14:44:49 PDT 2020


Hi list,

(Linus Torvalds-style harsh truths incoming, read only after coffee/alcohol!)

Having spent an incredibly frustrating day fighting with the
limitations of GPKG and the horrible workflow that they mandate, I'd
love to start brainstorming on how we can fix this.

While previous discussions have related to the GPKG SQLite WAL mess,
that has (to the extent of my experience) been resolved in the latest
release. So I'd like to focus on what I see as the biggest pain point
of GPKG: the FID column.

This is a pain point for numerous reasons:

- The type constraint on the fid column makes it really hard to
translate datasets with an existing, non-numeric "fid" column into
GeoPackage. E.g. GML files often have a textual fid string, and
attempting to convert these to gpkg results in a string of errors
about string values not being usable as fid values, and an empty
result layer. The only workaround here is to translate first to an
alternative format (such as shp!), delete the fid column, and THEN
save as gpkg.

- The fid unique constraint, while understandable, results in a HUGE
raft of issues while working with these. It's SO easy to get a
situation where you have duplicate fids in an edit buffer, and no way
to save these features back to the gpkg. You get thousands of
errors about duplicate fids, and then an ambiguous state where you're
completely unsure exactly what's been saved and what's about to be
lost. This isn't just attributable to a single tool in QGIS -- it's
possible to end up with duplicate fids through so many different
operations, including really simple stuff like copying and pasting
features!
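
Both failure modes can be reproduced outside QGIS. The following is a
minimal sketch, assuming a plain SQLite table standing in for a GPKG
feature table (GPKG feature tables use an INTEGER PRIMARY KEY
AUTOINCREMENT column). The table name "features", the "name" column and
the helper try_insert are hypothetical, and this is not how QGIS or GDAL
actually write features:

```python
import sqlite3


def try_insert(conn, fid, name):
    """Insert one row; return None on success, else the SQLite error text."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO features (fid, name) VALUES (?, ?)", (fid, name)
            )
        return None
    except sqlite3.Error as exc:
        return str(exc)


conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for a GPKG feature table (no geometry column here).
conn.execute(
    "CREATE TABLE features (fid INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)

ok = try_insert(conn, 1, "first")                            # accepted
text_fid_error = try_insert(conn, "gml.road.17", "from GML")  # textual fid rejected
duplicate_error = try_insert(conn, 1, "pasted copy")          # duplicate fid rejected
row_count = conn.execute("SELECT COUNT(*) FROM features").fetchone()[0]

print(ok, text_fid_error, duplicate_error, row_count)
```

Only the first insert succeeds; the textual fid and the duplicate fid are
both rejected by SQLite itself, so neither row reaches the table.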

I've fought with this ever since we really started to push GPKG and,
frankly, I've given up. I don't think there's any way to fix the
current situation and leave fids as they currently behave.

So what I propose is a radical re-think about how GPKG fids are
handled and exposed by QGIS (and by GDAL).

I propose that we

1. demote fids to being only a "semi-permanent" row identifier, with
the message being that "sometimes these WILL change and you can't rely
on them as a permanent id field for joins and row identification". If
users require a permanent unique identifier (i.e. a primary key) on
their table then THEY have to make and manage that themselves, just
like shapefiles etc.

2. expose fids as a read-only field. Users can still see them if they
want, but they cannot edit them.

3. make QGIS (or GDAL?) ALWAYS generate a completely new fid whenever
a row is changed or added. Throw away the old value, make a new one on
EVERY edit/addition.

4. COMPLETELY ignore any existing fid value set for features added
to a GPKG layer. I.e. in the case of translating a GML with a text fid
field, we completely ignore the incoming GML fid values and instead
use the "always generate a new fid" rule.
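
To illustrate the "ignore the incoming fid, always generate a new one"
idea for added features, here is a minimal sketch, again assuming a
plain SQLite table in place of a real GPKG layer. The function
add_features, the table "features" and the "name" column are
hypothetical names for illustration only, not existing QGIS or GDAL API:

```python
import sqlite3


def add_features(conn, features):
    """Insert features, discarding any incoming 'fid' value.

    SQLite assigns a fresh fid to every row because the column is omitted
    from the INSERT. Returns the newly assigned fids in insertion order.
    """
    new_fids = []
    with conn:  # one transaction for the whole batch
        for feature in features:
            # Drop whatever fid the source data carried (text, duplicate, ...).
            attrs = {k: v for k, v in feature.items() if k != "fid"}
            cur = conn.execute(
                "INSERT INTO features (name) VALUES (:name)", attrs
            )
            new_fids.append(cur.lastrowid)
    return new_fids


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features (fid INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)

# Incoming features with a textual fid and two duplicate numeric fids.
incoming = [
    {"fid": "gml.1", "name": "a"},
    {"fid": 7, "name": "b"},
    {"fid": 7, "name": "c"},
]
new_fids = add_features(conn, incoming)
stored = conn.execute("SELECT COUNT(*) FROM features").fetchone()[0]
print(new_fids, stored)
```

All three rows are stored with distinct integer fids, whatever fid value
they arrived with.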

Yes, these changes will break existing workflows, and possibly break
existing tools/scripts. But honestly, in my experience and the
experience of my customers, there's a COMPLETE lack of faith and trust
in GPKG at this stage. EVERYONE has their horror stories of lost data
and mangled datasets. We've got to do something drastic, and we've got
to do it sooner rather than later to salvage what little hope does
remain for this format.

Thoughts?

Nyall

