Updating Shapefiles, and data integrity
Sullivan, James R.
SullivanJ at nima.mil
Wed Oct 13 05:51:32 PDT 1999
I believe a better approach would be to develop a MapServer interface
to Oracle Spatial or ESRI SDE, and let the database do all the locking,
etc. This would answer the mail on a number of other issues, too. Both
have an open C API that can be downloaded from the net.
Jim Sullivan
NIMA / TES
-----Original Message-----
From: Stephen Lime [SMTP:steve.lime at dnr.state.mn.us]
Sent: Tuesday, October 12, 1999 5:37 PM
To: camerons at cat.org.au; bfraser at geoanalytic.ab.ca;
mapserver-users at lists.gis.umn.edu
Subject: Re: Updating Shapefiles, and data integrity
The more I think about this, the more I understand why I never
attempted it.

Locking is a real pain in the CGI world. When do you lock: when a
record is requested, or when edits are submitted? If the latter, then
there is a chance more than one person could request the same shape. I
don't think that on-the-fly edits are possible robustly. Somehow I
think edits need to be cached and committed "behind the scenes". It's
essential that the shp, shx and dbf records remain in sync. What about
something like this:
Assume there is a mechanism to request a shape and its attributes (and
a checkout time) and make changes. A user now sends back some edits.
This causes a record to be written to a "pending" database. What
actually gets saved are things like source shapefile, feature id (-1
for new), timestamp, etc. The actual edits get saved in some format
(shapefile) as a file whose name can be reconstructed from elements in
the pending database.
Now, periodically a process could go through and commit the edits in
the pending database to production (not web accessible) versions. When
this is finished the updated stuff could be swapped in for the old
stuff and the pending database purged (in part). The committing of the
shapes would essentially involve rebuilding the shapefile from the
user edits and the production version (i.e. pick the edited version if
it exists). Put a lock in place while versions are being swapped and
remove it when done, probably only a few seconds. You could even
maintain a history by saving previous versions for some period of time
or retiring shapes to some external format (shapefile).
As requests for shapes come in, a quick check of the pending database
could be used to identify re-edits. If a timestamp is set when a shape
is requested, then it could be compared against edits in the pending
database to identify possible problems. If a user requests an edited
shape, just send the pending edits as if they were part of the current
shapefile. New shapes are just added to the pending database and make
their way into the main database as part of the update process.
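As a rough illustration, the pending database and the timestamp check
described above could be sketched like this (a minimal sketch only; the
table, column and function names are all hypothetical, and sqlite3
stands in for whatever database is actually used):

```python
import sqlite3
import time

# A minimal "pending edits" table: the edited geometry itself would live
# in a per-edit shapefile whose name can be reconstructed from edit_id.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pending (
        edit_id     INTEGER PRIMARY KEY,
        source_shp  TEXT NOT NULL,   -- source shapefile
        feature_id  INTEGER,         -- -1 for a new feature
        checkout    REAL,            -- timestamp when the shape was requested
        submitted   REAL             -- timestamp when the edit came back
    )
""")

def submit_edit(source_shp, feature_id, checkout):
    """Record an edit; reject it if the feature was re-edited since checkout."""
    row = conn.execute(
        "SELECT MAX(submitted) FROM pending WHERE source_shp=? AND feature_id=?",
        (source_shp, feature_id)).fetchone()
    if row[0] is not None and row[0] > checkout:
        return None  # "the record has changed since you requested it"
    cur = conn.execute(
        "INSERT INTO pending (source_shp, feature_id, checkout, submitted) "
        "VALUES (?, ?, ?, ?)",
        (source_shp, feature_id, checkout, time.time()))
    return cur.lastrowid
```

The periodic commit process would then walk this table, rebuild the
production shapefile (preferring pending versions), swap files, and
purge the committed rows.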
Sounds complicated, but really it is only 2 processes, 1 database and
a bunch of cached edits. Timestamps can help alleviate simultaneous
edits, and the worst thing a user would see would be a message like
"The record you're submitting has changed since you requested it,
cannot process the edit. Would you like to work from the edited
version?".
Again, without some sort of a persistent connection I doubt that
real-time editing is possible. One could bump the commit time up to a
few minutes or even seconds though, so it would certainly seem
real-time.
(Cameron, this approach would involve no editing of Frank's shapelib
at all, since all you're doing is reading and writing individual
records. The effort goes into getting all the communication working
right. Could even be a perl script with system calls to the shapelib
utils for creating files and adding info.)
Steve
Stephen Lime
Internet Applications Analyst
MIS Bureau - MN DNR
(651) 297-2937
steve.lime at dnr.state.mn.us
>>> "bfraser" <bfraser at geoanalytic.ab.ca> 10/12 11:00 AM >>>
see my comments below...
Brent Fraser
----- Original Message -----
From: Cameron Shorter <cshorter at optusnet.com.au>
To: mapserver <mapserver-users at lists.gis.umn.edu>
Sent: Sunday, October 10, 1999 4:14 AM
Subject: Updating Shapefiles, and data integrity
>
>
> -------- Original Message --------
> Subject: Re: request for comments...
> Date: Sun, 10 Oct 1999 19:16:24 +1000
> From: Cameron Shorter <cshorter at optusnet.com.au>
> Reply-To: camerons at cat.org.au
> To: Stephen Lime <steve.lime at dnr.state.mn.us>
> References: <s7ff017e.048 at smtp.dnr.state.mn.us>
>
>
>
> Stephen Lime wrote:
> >
> > Maintaining data integrity is going to be a big issue. I was at our
> > state GIS conference and got to chat with Jack Dangermond from ESRI
> > about the MapServer and their new ArcIMS product. Seems they're
> > having trouble with this editing stuff. Shapefiles just aren't a
> > transactional environment, so unless you can assure yourself of
> > single user access there's always the potential for multiple
> > concurrent edits. Then there's the issue of quality control. I
> > think the solution needs to offer immediate update and delayed
> > update. ArcIMS, as I understand it, caches updates until an
> > operator on the server site commits the edits to the main database.
> > This operator could be a cron process, I suppose, that could handle
> > locking while edits are processed. I think this may be a good
> > approach as you could do some simple transaction management -
> > review, edit and delete - once the initial work was done. Edits
> > could be stored in a shapefile along with attributes and enough
> > additional information to commit the shape - source shapefile,
> > shape id (or whether it's a new one), type of edit (replace,
> > attribute change), etc.
> >
> > Anyway, just my thoughts...
> >
> > Steve
>
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Enterprise-wide editing can require a lot of infrastructure to support
it. A large-scale implementation might include (this is only one
scenario):
  o a data repository / warehouse / database
  o project workspaces for editing
  o a "view-only" copy of the data
Typical workflows would include:
1. Edit: the operator identifies features in the warehouse for
   editing, locks them, and extracts them to the project workspace.
   The features are edited, possibly reviewed, then checked back into
   the warehouse. This is sometimes known as a "long transaction".
   Some things that may be important:
     1. feature-level locking (as opposed to file locking) to prevent
        simultaneous editing
     2. feature lineage tracking: timestamps, feature "retirement"
        instead of deletion
     3. theme security: certain departments can edit only specific
        themes
2. Copy: on a pre-determined schedule, the warehouse is copied to the
   "view-only" database. This may include re-formatting, indexing and
   distributing the data to get better performance for viewing.
   Depending on the edits, the copy could be once a day, once a month,
   etc. The good thing about this approach is that the user
   (viewer/querier) has a stable data set to operate on. The bad thing
   is it might not be up to date.
3. Viewing: the data is queried and rendered for thick and thin client
   apps.
Of course, all this might be unnecessary if you only have occasional
edits and a few viewers....
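The check-out/check-in "long transaction" in step 1, with feature-level
locking and lineage tracking, might look roughly like this (a sketch
under assumed names only; a real warehouse would persist this state in
a database rather than in memory):

```python
import time

class Warehouse:
    """Sketch of feature-level check-out/check-in (a "long transaction")."""

    def __init__(self):
        self.locks = {}     # (theme, feature_id) -> operator holding the lock
        self.retired = []   # (theme, feature_id, retired_at) lineage records

    def checkout(self, theme, feature_id, operator):
        """Lock one feature for editing; fail if someone else holds it."""
        key = (theme, feature_id)
        if key in self.locks:
            return False            # simultaneous editing prevented
        self.locks[key] = operator
        return True

    def checkin(self, theme, feature_id, operator, retire=False):
        """Release the lock; "retire" a feature instead of deleting it."""
        key = (theme, feature_id)
        if self.locks.get(key) != operator:
            raise RuntimeError("feature not checked out by this operator")
        if retire:
            self.retired.append((theme, feature_id, time.time()))
        del self.locks[key]
```

Theme security (point 3) would be an additional check on `theme`
against the operator's department before `checkout` succeeds.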
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> I'm glad to hear I'm not the only one having problems with updating
> shapefiles. :)
>
> From looking at the shapefile definition paper, you can see that
> there is an index file .SHX which points to a .SHP file which has
> variable length records.
>
> There are a few problems that I can see. Please verify whether any
> of these are correct or not.
> 1. Deleting an old object. I think this can be handled by setting
> the shapetype to a NULL shape.
>
> 2. Increasing the number of vertices of a shape, and hence
> increasing the record size. I think the best way to handle this is
> to remove the old shape by setting its shapetype to NULL, and to add
> a new shape to the end of the .SHP file. The pointer in the .SHX
> file will now have to be redirected to the end of the .SHP file.
> This now means that the order of the .SHP file and the .SHX file
> will not match, which will reduce query speeds, so periodically the
> datafiles would need to be rebuilt.
>
> 3. There is an issue with the .SHP file and .SHX file becoming out
> of sync. Basically, when a shape is updated, first the .SHP file
> will need to be updated, and some time later the .SHX file will be
> updated. There is a window of opportunity where the files will be
> out of sync. I was planning to address this by either putting in a
> lock file, or changing read/write permissions to lock the files
> while the database is out of sync. This means that some reads of the
> database will fail because the database is updating.
>
> 4. I'm not sure what the best way is to link into a SQL database. If
> the shapefile is only added to, then the best way to reference an
> object is by using the index in the .SHX file. However, if you
> delete an object, should you rebuild the .SHX file? This will keep
> the index file from blowing out, but all the indexes will change and
> hence the SQL database will reference the wrong indices.
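For reference, points 1 and 2 come down to small byte-level operations
on the .shp/.shx records. A sketch of the two record builders involved,
following the published shapefile spec (record headers are big-endian,
shape content is little-endian, and offsets/lengths are counted in
16-bit words); function names here are made up:

```python
import struct

NULL_SHAPE = 0  # shape type 0 is the Null Shape in the spec

def null_shape_record(record_number):
    """Build a .shp record whose content is a Null Shape (point 1:
    overwrite the old record's content to mark the feature deleted)."""
    content = struct.pack("<i", NULL_SHAPE)   # shape type, little-endian
    length_words = len(content) // 2          # content length in 16-bit words
    header = struct.pack(">ii", record_number, length_words)  # big-endian
    return header + content

def shx_entry(offset_bytes, content_length_bytes):
    """Build the fixed 8-byte .shx entry for a record (point 2: redirect
    the index entry to a replacement record appended at end of .shp)."""
    return struct.pack(">ii", offset_bytes // 2, content_length_bytes // 2)
```

Updating the .shx entry is a fixed-size in-place write, which is what
makes the "append to .shp, then repoint .shx" approach workable without
rewriting the whole file.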
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
How about a unique key stored in the dbf file, used to join to the SQL
database? This would allow for many shapefiles joining to a single SQL
table (might be useful if the data is tiled).
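That join could look something like the sketch below, assuming a
hypothetical UNIQUE_ID column carried in each tile's dbf (sqlite3 and
all names here are stand-ins, not part of the original proposal):

```python
import sqlite3

# One shared attribute table, keyed by a stable per-feature id that each
# tiled shapefile carries in its dbf. Unlike the .shx record number, the
# key survives rebuilding/re-indexing the shapefile.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attributes (unique_id INTEGER PRIMARY KEY, owner TEXT);
    INSERT INTO attributes VALUES (1001, 'parks dept');
""")

def attributes_for(dbf_rows):
    """dbf_rows: (unique_id, shx_index) pairs read from one tile's dbf.
    Returns a map from the tile-local .shx index to the SQL attributes."""
    out = {}
    for unique_id, shx_index in dbf_rows:
        row = conn.execute(
            "SELECT owner FROM attributes WHERE unique_id=?",
            (unique_id,)).fetchone()
        out[shx_index] = row[0] if row else None
    return out
```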
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>
> Happy for any advice.
>
> Cameron.
>