Updating Shapefiles, and data integrity
Stephen Lime
steve.lime at dnr.state.mn.us
Tue Oct 12 14:36:38 PDT 1999
The more I think about this, the more I understand why I never attempted it.
Locking is a real pain in the CGI world. When do you lock: when a record
is requested, or when edits are submitted? If the latter, there is a chance
more than one person could request the same shape. I don't think that
on-the-fly edits are possible robustly. Somehow I think edits need to be cached
and committed "behind the scenes". It's essential that the shp, shx and dbf
records remain in sync. What about something like this:
Assume there is a mechanism to request a shape and its attributes (and a checkout
time) and make changes. A user now sends back some edits. This causes a record to
be written to a "pending" database. What actually gets saved are things like source
shapefile, feature id (-1 for new), timestamp, etc. The actual edits get saved in some format
(shapefile) as a file whose name can be reconstructed from elements in the pending database.
Now, periodically a process could go through and commit the edits in the pending database
to production (not web accessible) versions. When this is finished, the updated stuff
could be swapped in for the old stuff and the pending database purged (in part). The
committing of the shapes would essentially involve rebuilding the shapefile from the user
edits and the production version (i.e. pick the edited version if it exists). Put a lock in
place while versions are being swapped and remove it when done, probably only a
few seconds. You could even maintain a history by saving previous versions for
some period of time or retiring shapes to some external format (shapefile).
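The swap-under-a-lock step might look something like this; a minimal sketch in Python (the message suggests Perl, but the idea is language-neutral), with illustrative function and path names:

```python
import os

def swap_in(new_path, live_path, lock_path):
    """Swap a freshly rebuilt file in for the live (web accessible)
    version, holding a lock file for the few seconds the swap takes.
    Readers are expected to check for the lock before opening data."""
    open(lock_path, "w").close()          # put the lock in place
    try:
        os.replace(new_path, live_path)   # atomic within one filesystem
    finally:
        os.remove(lock_path)              # remove it when done
```

In practice the shp, shx and dbf files would all be swapped under the same lock so the three stay in sync.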
As requests for shapes come in, a quick check of the pending database could be used
to identify re-edits. If a timestamp is set when a shape is requested, then it could be
compared against edits in the pending database to identify possible problems. If a user
requests an edited shape, just send the pending edits as if they were part of the current
shapefile. New shapes are just added to the pending database and make their way
into the main database as part of the update process.
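The pending database and the re-edit check could be sketched like this (all names are illustrative, and a real version would persist the records rather than keep them in memory):

```python
import time

PENDING = []  # stands in for the pending database

def submit_edit(source_shapefile, feature_id, checkout_time):
    """Record a submitted edit.  Returns the name of the per-edit
    shapefile to write, or None if the feature was edited by someone
    else after this user's checkout (the 'record has changed' case)."""
    for rec in PENDING:
        if (rec["source"] == source_shapefile
                and rec["feature_id"] == feature_id
                and rec["timestamp"] > checkout_time):
            return None                   # possible problem: a re-edit
    rec = {"source": source_shapefile,
           "feature_id": feature_id,      # -1 for a brand-new shape
           "timestamp": time.time()}
    # The edit file's name can be reconstructed from the record:
    rec["edit_file"] = "%s.%d.%d.shp" % (
        source_shapefile, feature_id, int(rec["timestamp"]))
    PENDING.append(rec)
    return rec["edit_file"]
```

A None return is where the "record has changed since you requested it" message would be shown.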
Sounds complicated, but really it's only 2 processes, 1 database and a bunch of cached
edits. Timestamps can help alleviate simultaneous edits, and the worst thing a user
would see would be a message like "The record you're submitting has changed since
you requested it, cannot process the edit. Would you like to work from the edited version?".
Again, without some sort of persistent connection I doubt that real-time editing is possible.
One could shorten the commit interval to a few minutes or even seconds, though, so it would
certainly seem real-time.
(Cameron, this approach would involve no editing of Frank's shapelib at all since all you're
doing is reading and writing individual records. The effort goes into getting all the communication
working right. It could even be a Perl script with system calls to the shapelib utils for creating
files and adding info.)
Steve
Stephen Lime
Internet Applications Analyst
MIS Bureau - MN DNR
(651) 297-2937
steve.lime at dnr.state.mn.us
>>> "bfraser" <bfraser at geoanalytic.ab.ca> 10/12 11:00 AM >>>
see my comments below...
Brent Fraser
----- Original Message -----
From: Cameron Shorter <cshorter at optusnet.com.au>
To: mapserver <mapserver-users at lists.gis.umn.edu>
Sent: Sunday, October 10, 1999 4:14 AM
Subject: Updating Shapefiles, and data integrity
>
>
> -------- Original Message --------
> Subject: Re: request for comments...
> Date: Sun, 10 Oct 1999 19:16:24 +1000
> From: Cameron Shorter <cshorter at optusnet.com.au>
> Reply-To: camerons at cat.org.au
> To: Stephen Lime <steve.lime at dnr.state.mn.us>
> References: <s7ff017e.048 at smtp.dnr.state.mn.us>
>
>
>
> Stephen Lime wrote:
> >
> > Maintaining data integrity is going to be a big issue. I was at our
> > state GIS conference and got to chat with Jack Dangermond from ESRI
> > about the MapServer and their new ArcIMS product. Seems they're having
> > trouble with this editing stuff. Shapefiles just aren't a transactional
> > environment, so unless you can assure yourself of single user access
> > there's always the potential for multiple concurrent edits. Then
> > there's the issue of quality control. I think the solution needs to
> > offer immediate update and delayed update. ArcIMS, as I understand it,
> > caches updates until an operator on the server site commits the edits
> > to the main database. This operator could be a cron process, I suppose,
> > that could handle locking while edits are processed. I think this may
> > be a good approach as you could do some simple transaction management -
> > review, edit and delete - once the initial work was done. Edits could
> > be stored in a shapefile along with attributes and enough additional
> > information to commit the shape: source shapefile, shape (or is a new
> > one), type of edit (replace, attribute change) etc.
> >
> > Anyway, just my thoughts...
> >
> > Steve
>
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Enterprise-wide editing can require a lot of infrastructure to support it.
A large-scale implementation might include (this is only one scenario):
 o a data repository / warehouse / database
 o project workspaces for editing
 o a "view-only" copy of the data
Typical workflows would include:
1. Edit: an operator identifies features in the warehouse for editing,
   locks them, and extracts them to the project workspace. The features
   are edited, possibly reviewed, then checked back into the warehouse.
   This is sometimes known as a "long transaction".
   Some things that may be important:
     1. feature-level locking (as opposed to file locking) to prevent
        simultaneous editing
     2. feature lineage tracking: timestamps, feature "retirement"
        instead of deletion
     3. theme security: certain departments can edit only specific themes
2. Copy: at a pre-determined schedule, the warehouse is copied to the
   "view-only" database. This may include re-formatting, indexing and
   distributing the data to get better performance for viewing.
   Depending on the edits, the copy could be once a day, once a month,
   etc. The good thing about this approach is that the user
   (viewer/querier) has a stable data set to operate on. The bad thing
   is it might not be up to date.
3. Viewing: the data is queried and rendered for thick and thin client apps.
Of course all this might be unnecessary if you only have occasional edits
and a few viewers....
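The feature-level locking in step 1 could be sketched like this (a minimal illustration; the class and method names are invented for the example):

```python
class Warehouse:
    """Tracks which operator has which features checked out,
    so a 'long transaction' can hold feature-level locks."""

    def __init__(self):
        self.locks = {}  # feature id -> operator holding the lock

    def checkout(self, operator, feature_ids):
        """Lock the features and extract them to a project workspace.
        Refuses the whole request if any feature is already locked."""
        if any(fid in self.locks for fid in feature_ids):
            return None  # someone else is editing one of these
        for fid in feature_ids:
            self.locks[fid] = operator
        return list(feature_ids)

    def checkin(self, operator, feature_ids):
        """Release the locks held by this operator after check-in."""
        for fid in feature_ids:
            if self.locks.get(fid) == operator:
                del self.locks[fid]
```

Lineage tracking and theme security would hang off the same checkout/checkin points.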
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> I'm glad to hear I'm not the only one having problems with updating
> shapefiles. :)
>
> From looking at the shapefile definition paper, you can see that there
> is an index file .SHX which points to a .SHP file which has variable
> length records.
>
> There are a few problems that I can see. Please verify if any of these
> are correct or not.
> 1. Deleting an old object. I think this can be handled by setting the
> shapetype to a NULL shape.
>
> 2. Increasing the number of vertices of a shape, and hence increasing
> the record size. I think the best way to handle this is to remove the
> old shape by setting its shapetype to NULL, and to add a new shape to
> the end of the .SHP file. The pointer in the .SHX file will now have to
> be redirected to the end of the .SHP file. This now means that the
> order of the .SHP file and the .SHX file will not match, which will
> reduce query speeds, so periodically the datafiles would need to be
> rebuilt.
>
> 3. There is an issue with the .SHP file and .SHX file becoming out of
> sync. Basically, when a shape is updated, first the .SHP file will need
> to be updated, and some time later the .SHX file will be updated. There
> is a window of opportunity where the files will be out of sync. I was
> planning to address this by either putting in a lock file, or changing
> read/write permissions to lock the files while the database is out of
> sync. This means that some reads of the database will fail because the
> database is updating.
>
> 4. I'm not sure what the best way is to link into a SQL database. If
> the shapefile is only added to, then the best way to reference an
> object is by using the index in the .SHX file. However, if you delete
> an object, should you rebuild the .SHX file? This will keep the index
> file from blowing out, but all the indexes will change and hence the
> SQL database will reference the wrong indices.
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
How about a unique key stored in the dbf file used to join to the SQL
database?
This would allow for many shapefiles joining to a single SQL table (might be
useful if the data is tiled.)
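A sketch of the key-join idea, with sqlite3 standing in for whatever SQL database is actually in use and illustrative table and column names:

```python
import sqlite3

# One attribute table serves many tiled shapefiles; each tile's dbf
# carries a unique feature key that joins back to this table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attributes"
             " (feature_key TEXT PRIMARY KEY, owner TEXT)")
conn.execute("INSERT INTO attributes VALUES ('A17', 'Parks Dept')")

def attributes_for(feature_key):
    """Look up the SQL attributes for a key read from a dbf record."""
    row = conn.execute("SELECT owner FROM attributes"
                       " WHERE feature_key = ?", (feature_key,)).fetchone()
    return row[0] if row else None
```

Because the join key lives in the dbf rather than the .SHX position, rebuilding or re-ordering the shapefile never invalidates the SQL references.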
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>
> Happy for any advice.
>
> Cameron.
>