FW: Updating Shapefiles, and data integrity

Fawcett, David david.fawcett at moea.state.mn.us
Wed Oct 13 06:19:18 PDT 1999


This may be fine as an "extension", but if it became integral, it would certainly change the product from a free program to a pretty expensive one.

Even if a non-commercial solution to this issue is a little kludgy and slower than real-time, it may serve most people's needs, and will avoid spendy Oracle or SDE licenses.

David Fawcett
Minnesota Office of Environmental Assistance

> ----------
> From: 	Sullivan, James R.[SMTP:SullivanJ at nima.mil]
> Sent: 	Wednesday, October 13, 1999 7:51 AM
> To: 	'Stephen Lime'; camerons at cat.org.au; bfraser at geoanalytic.ab.ca; mapserver-users at lists.gis.umn.edu
> Subject: 	RE: Updating Shapefiles, and data integrity
> 
> I believe a better approach would be to develop a MapServer interface to
> Oracle Spatial or ESRI SDE. Let the database do all the locking, etc.
> This would answer the mail on a number of other issues, too. Both have
> open C APIs that can be downloaded via the net.
> 
> 
> Jim Sullivan
> NIMA / TES
> 
> 	-----Original Message-----
> 	From:	Stephen Lime [SMTP:steve.lime at dnr.state.mn.us]
> 	Sent:	Tuesday, October 12, 1999 5:37 PM
> 	To:	camerons at cat.org.au; bfraser at geoanalytic.ab.ca; mapserver-users at lists.gis.umn.edu
> 	Subject:	Re: Updating Shapefiles, and data integrity
> 
> 	The more I think about this, the more I understand why I never
> 	attempted it. Locking is a real pain in the CGI world. When do you
> 	lock: when a record is requested, or when edits are submitted? If the
> 	latter, then there is a chance more than one person could request the
> 	same shape. I don't think on-the-fly edits are possible robustly.
> 	Somehow I think edits need to be cached and committed "behind the
> 	scenes". It's essential that the shp, shx and dbf records remain in
> 	sync. What about something like this:
> 
> 	Assume there is a mechanism to request a shape and attributes (and a
> 	checkout time) and make changes. A user now sends back some edits.
> 	This causes a record to be written to a "pending" database. What
> 	actually gets saved are things like source shapefile, feature id (-1
> 	for new), timestamp, etc. The actual edits get saved in some format
> 	(shapefile) as a file whose name can be reconstructed from elements
> 	in the pending database. Now, periodically a process could go through
> 	and commit the edits in the pending database to production (not web
> 	accessible) versions. When this is finished the updated stuff could
> 	be swapped in for the old stuff and the pending database purged (in
> 	part). The committing of the shapes would essentially involve
> 	rebuilding the shapefile from the user edits and the production
> 	version (i.e. pick the edited version if it exists). Put a lock in
> 	place while versions are being swapped and remove it when done; the
> 	swap would probably take only a few seconds. You could even maintain
> 	a history by saving previous versions for some period of time or
> 	retiring shapes to some external format (shapefile).
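> 
> 	In pseudo-code, one pending record and the edit-file naming might
> 	look something like this (everything here is illustrative; none of
> 	these names exist in MapServer):
> 
> 	import time
> 
> 	def pending_record(source_shapefile, feature_id, user):
> 	    # Metadata cached for one submitted edit; a feature_id of -1
> 	    # means a brand-new shape.
> 	    return {
> 	        "source": source_shapefile,    # shapefile the edit applies to
> 	        "feature_id": feature_id,      # -1 for new shapes
> 	        "timestamp": int(time.time()), # when the edit was submitted
> 	        "user": user,
> 	    }
> 
> 	def edit_filename(rec):
> 	    # The edited geometry is stored as its own small shapefile whose
> 	    # name can be rebuilt from elements of the pending record.
> 	    return "%s.%d.%d" % (rec["source"], rec["feature_id"], rec["timestamp"])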
> 
> 	As requests for shapes come in, a quick check of the pending database
> 	could be used to identify re-edits. If a timestamp is set when a
> 	shape is requested, then it could be compared against edits in the
> 	pending database to identify possible problems. If a user requests an
> 	edited shape, just send the pending edits as if they were part of the
> 	current shapefile. New shapes are just added to the pending database
> 	and make their way into the main database as part of the update
> 	process.
> 
> 	Sounds complicated, but really it is only 2 processes, 1 database and
> 	a bunch of cached edits. Timestamps can help alleviate simultaneous
> 	edits, and the worst thing a user would see would be a message like
> 	"The record you're submitting has changed since you requested it,
> 	cannot process the edit. Would you like to work from the edited
> 	version?".
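> 
> 	The timestamp check could be as simple as this sketch (the pending
> 	database is shown as a plain dictionary; the names are illustrative):
> 
> 	def check_conflict(pending_db, submission):
> 	    # pending_db maps (source, feature_id) to the newest pending
> 	    # record for that feature; submission carries the checkout time
> 	    # handed out when the shape was requested.
> 	    key = (submission["source"], submission["feature_id"])
> 	    newest = pending_db.get(key)
> 	    if newest and newest["timestamp"] > submission["checkout_time"]:
> 	        return ("The record you're submitting has changed since you "
> 	                "requested it, cannot process the edit. Would you "
> 	                "like to work from the edited version?")
> 	    return None  # no conflict; go ahead and cache the edit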
> 
> 	Again, without some sort of a persistent connection I doubt that
> 	real-time editing is possible. One could shorten the commit interval
> 	to a few minutes or even seconds, though, so it would certainly seem
> 	real-time.
> 
> 	(Cameron, this approach would involve no editing of Frank's shapelib
> 	at all, since all you're doing is reading and writing individual
> 	records. The effort goes into getting all the communication working
> 	right. Could even be a perl script with system calls to the shapelib
> 	utils for creating files and adding info.)
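> 
> 	For instance, caching one point edit could shell out to the shapelib
> 	utilities roughly like this (untested; double-check the utilities'
> 	exact arguments against the shapelib documentation):
> 
> 	import subprocess
> 
> 	def cache_point_edit(basename, x, y, feature_key):
> 	    # Write the edit as its own tiny shapefile, plus a dbf carrying
> 	    # the feature key, using the shapelib command-line tools.
> 	    subprocess.run(["shpcreate", basename, "point"], check=True)
> 	    subprocess.run(["shpadd", basename, str(x), str(y)], check=True)
> 	    subprocess.run(["dbfcreate", basename, "-n", "KEY", "10", "0"], check=True)
> 	    subprocess.run(["dbfadd", basename, str(feature_key)], check=True)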
> 
> 	Steve
> 
> 	Stephen Lime
> 	Internet Applications Analyst
> 	MIS Bureau - MN DNR
> 
> 	(651) 297-2937
> 	steve.lime at dnr.state.mn.us
> 
> 	>>> "bfraser" <bfraser at geoanalytic.ab.ca> 10/12 11:00 AM >>>
> 	see my comments below...
> 
> 	Brent Fraser
> 
> 
> 	----- Original Message -----
> 	From: Cameron Shorter <cshorter at optusnet.com.au>
> 	To: mapserver <mapserver-users at lists.gis.umn.edu>
> 	Sent: Sunday, October 10, 1999 4:14 AM
> 	Subject: Updating Shapefiles, and data integrity
> 
> 
> 	>
> 	>
> 	> -------- Original Message --------
> 	> Subject: Re: request for comments...
> 	> Date: Sun, 10 Oct 1999 19:16:24 +1000
> 	> From: Cameron Shorter <cshorter at optusnet.com.au>
> 	> Reply-To: camerons at cat.org.au 
> 	> To: Stephen Lime <steve.lime at dnr.state.mn.us>
> 	> References: <s7ff017e.048 at smtp.dnr.state.mn.us>
> 	>
> 	>
> 	>
> 	> Stephen Lime wrote:
> 	> >
> 	> > Maintaining data integrity is going to be a big issue. I was at
> 	> > our state GIS conference and got to chat with Jack Dangermond
> 	> > from ESRI about the MapServer and their new ArcIMS product. Seems
> 	> > they're having trouble with this editing stuff. Shapefiles just
> 	> > aren't a transactional environment, so unless you can assure
> 	> > yourself of single-user access there's always the potential for
> 	> > multiple concurrent edits. Then there's the issue of quality
> 	> > control. I think the solution needs to offer immediate update and
> 	> > delayed update. ArcIMS, as I understand it, caches updates until
> 	> > an operator on the server site commits the edits to the main
> 	> > database. This operator could be a cron process, I suppose, that
> 	> > could handle locking while edits are processed. I think this may
> 	> > be a good approach, as you could do some simple transaction
> 	> > management - review, edit and delete - once the initial work was
> 	> > done. Edits could be stored in a shapefile along with attributes
> 	> > and enough additional information to commit the shape: source
> 	> > shapefile, shape id (or whether it is a new one), type of edit
> 	> > (replace, attribute change), etc.
> 	> >
> 	> > Anyway, just my thoughts...
> 	> >
> 	> > Steve
> 	>
> 
> 	vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> 
> 	Enterprise-wide editing can require a lot of infrastructure to
> 	support it. A large-scale implementation might include (this is only
> 	one scenario):
> 
> 	 o  a data repository / warehouse / database
> 	 o  project workspaces for editing
> 	 o  a "view-only" copy of the data
> 
> 	Typical workflows would include:
> 
> 	1. Edit: an operator identifies features in the warehouse for
> 	    editing, locks them, and extracts them to the project workspace.
> 	    The features are edited, possibly reviewed, then checked back
> 	    into the warehouse. This is sometimes known as a "long
> 	    transaction". Some things that may be important (a locking
> 	    sketch follows this list):
> 	        1. feature-level locking (as opposed to file locking) to
> 	           prevent simultaneous editing
> 	        2. feature lineage tracking: timestamps, feature
> 	           "retirement" instead of deletion
> 	        3. theme security: certain departments can edit only
> 	           specific themes
> 
> 	2. Copy: at a pre-determined schedule, the warehouse is copied to
> 	    the "view-only" database. This may include re-formatting,
> 	    indexing and distributing the data to get better performance for
> 	    viewing. Depending on the edits, the copy could be once a day,
> 	    once a month, etc. The good thing about this approach is that
> 	    the user (viewer/querier) has a stable data set to operate on.
> 	    The bad thing is it might not be up to date.
> 
> 	3. Viewing: the data is queried and rendered for thick and thin
> 	    client apps.
> 
> 	Of course, all this might be unnecessary if you only have occasional
> 	edits and a few viewers....
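> 
> 	A minimal sketch of the feature-level locking in point 1 (purely
> 	illustrative; a real warehouse would keep the lock table in the
> 	database itself):
> 
> 	import time
> 
> 	_locks = {}  # (theme, feature_id) -> {"operator": ..., "since": ...}
> 
> 	def checkout(theme, feature_id, operator):
> 	    # Lock one feature for a "long transaction"; fail if another
> 	    # operator already holds it.
> 	    key = (theme, feature_id)
> 	    holder = _locks.get(key)
> 	    if holder and holder["operator"] != operator:
> 	        raise RuntimeError("feature %r locked by %s"
> 	                           % (key, holder["operator"]))
> 	    _locks[key] = {"operator": operator, "since": time.time()}
> 
> 	def checkin(theme, feature_id, operator):
> 	    # Release the lock when the edit is checked back in.
> 	    key = (theme, feature_id)
> 	    if _locks.get(key, {}).get("operator") == operator:
> 	        del _locks[key]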
> 
> 	vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> 
> 
> 	> I'm glad to hear I'm not the only one having problems with
> 	> updating shapefiles. :)
> 	>
> 	> From looking at the shapefile definition paper, you can see that
> 	> there is an index file .SHX which points to a .SHP file which has
> 	> variable-length records.
> 	>
> 	> There are a few problems that I can see (a byte-level sketch of
> 	> points 1 and 3 follows the list). Please verify whether any of
> 	> these are correct:
> 	>
> 	> 1. Deleting an old object. I think this can be handled by setting
> 	> the shapetype to a NULL shape.
> 	>
> 	> 2. Increasing the number of vertices of a shape, and hence
> 	> increasing the record size. I think the best way to handle this is
> 	> to remove the old shape by setting its shapetype to NULL, and to
> 	> add a new shape to the end of the .SHP file. The pointer in the
> 	> .SHX file will now have to be redirected to the end of the .SHP
> 	> file. This means that the order of the .SHP file and the .SHX file
> 	> will no longer match, which will reduce query speeds, so
> 	> periodically the datafiles would need to be rebuilt.
> 	>
> 	> 3. There is an issue with the .SHP file and .SHX file becoming out
> 	> of sync. Basically, when a shape is updated, first the .SHP file
> 	> will need to be updated, and some time later the .SHX file will be
> 	> updated. There is a window of opportunity where the files will be
> 	> out of sync. I was planning to address this by either putting in a
> 	> lock file, or changing read/write permissions to lock the files
> 	> while the database is out of sync. This means that some reads of
> 	> the database will fail because the database is updating.
> 	>
> 	> 4. I'm not sure what the best way is to link into a SQL database.
> 	> If the shapefile is only added to, then the best way to reference
> 	> an object is by using the index in the .SHX file. However, if you
> 	> delete an object, should you rebuild the .SHX file? This will keep
> 	> the index file from blowing out, but all the indexes will change
> 	> and hence the SQL database will reference the wrong indices.
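> 	>
> 	> (Rough byte-level sketch of points 1 and 3, per the ESRI shapefile
> 	> spec - untested, so please verify: the .shx holds one 8-byte entry
> 	> per shape after a 100-byte header, big-endian offset and length in
> 	> 16-bit words; a .shp record has an 8-byte header followed by a
> 	> little-endian shape type, where 0 means NULL.)
> 	>
> 	> import os, struct
> 	>
> 	> def delete_shape(basename, shape_index):
> 	>     lockfile = basename + ".lock"
> 	>     fd = os.open(lockfile, os.O_CREAT | os.O_EXCL)  # fails if locked
> 	>     os.close(fd)
> 	>     try:
> 	>         with open(basename + ".shx", "rb") as shx:
> 	>             shx.seek(100 + 8 * shape_index)
> 	>             offset_words, _length = struct.unpack(">ii", shx.read(8))
> 	>         with open(basename + ".shp", "r+b") as shp:
> 	>             shp.seek(offset_words * 2 + 8)   # skip the record header
> 	>             shp.write(struct.pack("<i", 0))  # 0 = NULL shape type
> 	>             # The record's content length is left untouched; check
> 	>             # that readers tolerate a NULL shape with trailing bytes.
> 	>     finally:
> 	>         os.remove(lockfile)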
> 
> 	vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> 	How about a unique key stored in the dbf file, used to join to the
> 	SQL database?
> 
> 	This would allow many shapefiles to join to a single SQL table
> 	(might be useful if the data is tiled).
> 	vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
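> 
> 	A sketch of that join (table and column names are made up; shown
> 	with sqlite3 purely for illustration):
> 
> 	import sqlite3
> 
> 	conn = sqlite3.connect("attributes.db")
> 	conn.execute("""CREATE TABLE IF NOT EXISTS parcel_attrs (
> 	                    feature_key TEXT PRIMARY KEY, -- matches the dbf KEY field
> 	                    owner TEXT,
> 	                    zoning TEXT)""")
> 
> 	def attrs_for(feature_key):
> 	    # Look up attributes by the unique key read from the .dbf record,
> 	    # independent of the shape's position in the .shx - so many tiled
> 	    # shapefiles can share the one table.
> 	    return conn.execute(
> 	        "SELECT owner, zoning FROM parcel_attrs WHERE feature_key = ?",
> 	        (feature_key,)).fetchone()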
> 
> 
> 	>
> 	> Happy for any advice.
> 	>
> 	> Cameron.
> 	>
> 


