Updating Shapefiles, and data integrity
Sullivan, James R.
SullivanJ at nima.mil
Wed Oct 13 05:51:32 PDT 1999
I believe a better approach would be to develop a MapServer interface
to Oracle Spatial or ESRI SDE, and let the database do all the locking,
etc. This would answer the mail on a number of other issues, too. Both
have an open C API that can be downloaded from the net.
Jim Sullivan
NIMA / TES
-----Original Message-----
From: Stephen Lime [SMTP:steve.lime at dnr.state.mn.us]
Sent: Tuesday, October 12, 1999 5:37 PM
To: camerons at cat.org.au; bfraser at geoanalytic.ab.ca;
mapserver-users at lists.gis.umn.edu
Subject: Re: Updating Shapefiles, and data integrity
The more I think about this, the more I understand why I never
attempted it.

Locking is a real pain in the CGI world. When do you lock: when a
record is requested, or when edits are submitted? If the latter, then
there is a chance more than one person could request the same shape. I
don't think that on-the-fly edits are possible robustly. Somehow I
think edits need to be cached and committed "behind the scenes". It's
essential that the shp, shx and dbf records remain in sync. What about
something like this:
Assume there is a mechanism to request a shape and its attributes (and
a checkout time) and make changes. A user now sends back some edits.
This causes a record to be written to a "pending" database. What
actually gets saved are things like source shapefile, feature id (-1
for new), timestamp, etc. The actual edits get saved in some format
(shapefile) as a file whose name can be reconstructed from elements in
the pending database.
Now, periodically a process could go through and commit the edits in
the pending database to production (not web accessible) versions. When
this is finished the updated stuff could be swapped in for the old
stuff and the pending database purged (in part). The committing of the
shapes would essentially involve rebuilding the shapefile from the
user edits and the production version (i.e. pick the edited version if
it exists). Put a lock in place while versions are being swapped and
remove it when done, probably only a few seconds. You could even
maintain a history by saving previous versions for some period of time
or retiring shapes to some external format (shapefile).
As requests for shapes come in, a quick check of the pending database
could be used to identify re-edits. If a timestamp is set when a shape
is requested, then it could be compared against edits in the pending
database to identify possible problems. If a user requests an edited
shape, just send the pending edits as if they were part of the current
shapefile. New shapes are just added to the pending database and make
their way into the main database as part of the update process.
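As a rough illustration, the pending database and the timestamp check
described above could be sketched like this (a minimal sketch only; the
table, column and function names are all hypothetical, and sqlite3
stands in for whatever database is actually used):

```python
import sqlite3
import time

# A minimal "pending edits" table: the edited geometry itself would live
# in a per-edit shapefile whose name can be reconstructed from edit_id.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pending (
        edit_id     INTEGER PRIMARY KEY,
        source_shp  TEXT NOT NULL,   -- source shapefile
        feature_id  INTEGER,         -- -1 for a new feature
        checkout    REAL,            -- timestamp when the shape was requested
        submitted   REAL             -- timestamp when the edit came back
    )
""")

def submit_edit(source_shp, feature_id, checkout):
    """Record an edit; reject it if the feature was re-edited since checkout."""
    row = conn.execute(
        "SELECT MAX(submitted) FROM pending WHERE source_shp=? AND feature_id=?",
        (source_shp, feature_id)).fetchone()
    if row[0] is not None and row[0] > checkout:
        return None  # "the record has changed since you requested it"
    cur = conn.execute(
        "INSERT INTO pending (source_shp, feature_id, checkout, submitted) "
        "VALUES (?, ?, ?, ?)",
        (source_shp, feature_id, checkout, time.time()))
    return cur.lastrowid
```

The periodic commit process would then walk this table, rebuild the
production shapefile (preferring pending versions), swap files, and
purge the committed rows.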
Sounds complicated, but really it is only 2 processes, 1 database and
a bunch of cached edits. Timestamps can help alleviate simultaneous
edits, and the worst thing a user would see would be a message like
"The record you're submitting has changed since you requested it,
cannot process the edit. Would you like to work from the edited
version?".
Again, without some sort of a persistent connection I doubt that
real-time editing is possible. One could bump the commit time up to a
few minutes or even seconds though, so it would certainly seem
real-time.
(Cameron, this approach would involve no editing of Frank's shapelib
at all, since all you're doing is reading and writing individual
records. The effort goes into getting all the communication working
right. Could even be a perl script with system calls to the shapelib
utils for creating files and adding info.)
Steve
Stephen Lime
Internet Applications Analyst
MIS Bureau - MN DNR
(651) 297-2937
steve.lime at dnr.state.mn.us
>>> "bfraser" <bfraser at geoanalytic.ab.ca> 10/12 11:00 AM >>>
see my comments below...
Brent Fraser
----- Original Message -----
From: Cameron Shorter <cshorter at optusnet.com.au>
To: mapserver <mapserver-users at lists.gis.umn.edu>
Sent: Sunday, October 10, 1999 4:14 AM
Subject: Updating Shapefiles, and data integrity
>
>
> -------- Original Message --------
> Subject: Re: request for comments...
> Date: Sun, 10 Oct 1999 19:16:24 +1000
> From: Cameron Shorter <cshorter at optusnet.com.au>
> Reply-To: camerons at cat.org.au
> To: Stephen Lime <steve.lime at dnr.state.mn.us>
> References: <s7ff017e.048 at smtp.dnr.state.mn.us>
>
>
>
> Stephen Lime wrote:
> >
> > Maintaining data integrity is going to be a big issue. I was at our
> > state GIS conference and got to chat with Jack Dangermond from ESRI
> > about the MapServer and their new ArcIMS product. Seems they're
> > having trouble with this editing stuff. Shapefiles just aren't a
> > transactional environment, so unless you can assure yourself of
> > single user access there's always the potential for multiple
> > concurrent edits. Then there's the issue of quality control. I
> > think the solution needs to offer immediate update and delayed
> > update. ArcIMS, as I understand it, caches updates until an
> > operator on the server site commits the edits to the main database.
> > This operator could be a cron process, I suppose, that could handle
> > locking while edits are processed. I think this may be a good
> > approach as you could do some simple transaction management -
> > review, edit and delete - once the initial work was done. Edits
> > could be stored in a shapefile along with attributes and enough
> > additional information to commit the shape - source shapefile,
> > shape id (or whether it's a new one), type of edit (replace,
> > attribute change), etc.
> >
> > Anyway, just my thoughts...
> >
> > Steve
>
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
Enterprise-wide editing can require a lot of infrastructure to support
it. A large-scale implementation might include (this is only one
scenario):
  o a data repository / warehouse / database
  o project workspaces for editing
  o a "view-only" copy of the data
Typical workflows would include:
1. Edit: the operator identifies features in the warehouse for
   editing, locks them, and extracts them to the project workspace.
   The features are edited, possibly reviewed, then checked back into
   the warehouse. This is sometimes known as a "long transaction".
   Some things that may be important:
     1. feature-level locking (as opposed to file locking) to prevent
        simultaneous editing
     2. feature lineage tracking: timestamps, feature "retirement"
        instead of deletion
     3. theme security: certain departments can edit only specific
        themes
2. Copy: on a pre-determined schedule, the warehouse is copied to the
   "view-only" database. This may include re-formatting, indexing and
   distributing the data to get better performance for viewing.
   Depending on the edits, the copy could be once a day, once a month,
   etc. The good thing about this approach is that the user
   (viewer/querier) has a stable data set to operate on. The bad thing
   is it might not be up to date.
3. Viewing: the data is queried and rendered for thick and thin client
   apps.
Of course, all this might be unnecessary if you only have occasional
edits and a few viewers....
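The check-out/check-in "long transaction" in step 1, with feature-level
locking and lineage tracking, might look roughly like this (a sketch
under assumed names only; a real warehouse would persist this state in
a database rather than in memory):

```python
import time

class Warehouse:
    """Sketch of feature-level check-out/check-in (a "long transaction")."""

    def __init__(self):
        self.locks = {}     # (theme, feature_id) -> operator holding the lock
        self.retired = []   # (theme, feature_id, retired_at) lineage records

    def checkout(self, theme, feature_id, operator):
        """Lock one feature for editing; fail if someone else holds it."""
        key = (theme, feature_id)
        if key in self.locks:
            return False            # simultaneous editing prevented
        self.locks[key] = operator
        return True

    def checkin(self, theme, feature_id, operator, retire=False):
        """Release the lock; "retire" a feature instead of deleting it."""
        key = (theme, feature_id)
        if self.locks.get(key) != operator:
            raise RuntimeError("feature not checked out by this operator")
        if retire:
            self.retired.append((theme, feature_id, time.time()))
        del self.locks[key]
```

Theme security (point 3) would be an additional check on `theme`
against the operator's department before `checkout` succeeds.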
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> I'm glad to hear I'm not the only one having problems with updating
> shapefiles. :)
>
> From looking at the shapefile definition paper, you can see that
> there is an index file .SHX which points to a .SHP file which has
> variable length records.
>
> There are a few problems that I can see. Please verify whether any
> of these are correct or not.
> 1. Deleting an old object. I think this can be handled by setting
> the shapetype to a NULL shape.
>
> 2. Increasing the number of vertices of a shape, and hence
> increasing the record size. I think the best way to handle this is
> to remove the old shape by setting its shapetype to NULL, and to add
> a new shape to the end of the .SHP file. The pointer in the .SHX
> file will now have to be redirected to the end of the .SHP file.
> This now means that the order of the .SHP file and the .SHX file
> will not match, which will reduce query speeds, so periodically the
> datafiles would need to be rebuilt.
>
> 3. There is an issue with the .SHP file and .SHX file becoming out
> of sync. Basically, when a shape is updated, first the .SHP file
> will need to be updated, and some time later the .SHX file will be
> updated. There is a window of opportunity where the files will be
> out of sync. I was planning to address this by either putting in a
> lock file, or changing read/write permissions to lock the files
> while the database is out of sync. This means that some reads of the
> database will fail because the database is updating.
>
> 4. I'm not sure what the best way is to link into a SQL database. If
> the shapefile is only added to, then the best way to reference an
> object is by using the index in the .SHX file. However, if you
> delete an object, should you rebuild the .SHX file? This will keep
> the index file from blowing out, but all the indexes will change and
> hence the SQL database will reference the wrong indices.
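For reference, points 1 and 2 come down to small byte-level operations
on the .shp/.shx records. A sketch of the two record builders involved,
following the published shapefile spec (record headers are big-endian,
shape content is little-endian, and offsets/lengths are counted in
16-bit words); function names here are made up:

```python
import struct

NULL_SHAPE = 0  # shape type 0 is the Null Shape in the spec

def null_shape_record(record_number):
    """Build a .shp record whose content is a Null Shape (point 1:
    overwrite the old record's content to mark the feature deleted)."""
    content = struct.pack("<i", NULL_SHAPE)   # shape type, little-endian
    length_words = len(content) // 2          # content length in 16-bit words
    header = struct.pack(">ii", record_number, length_words)  # big-endian
    return header + content

def shx_entry(offset_bytes, content_length_bytes):
    """Build the fixed 8-byte .shx entry for a record (point 2: redirect
    the index entry to a replacement record appended at end of .shp)."""
    return struct.pack(">ii", offset_bytes // 2, content_length_bytes // 2)
```

Updating the .shx entry is a fixed-size in-place write, which is what
makes the "append to .shp, then repoint .shx" approach workable without
rewriting the whole file.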
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
How about a unique key stored in the dbf file, used to join to the SQL
database? This would allow for many shapefiles joining to a single SQL
table (might be useful if the data is tiled).
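That join could look something like the sketch below, assuming a
hypothetical UNIQUE_ID column carried in each tile's dbf (sqlite3 and
all names here are stand-ins, not part of the original proposal):

```python
import sqlite3

# One shared attribute table, keyed by a stable per-feature id that each
# tiled shapefile carries in its dbf. Unlike the .shx record number, the
# key survives rebuilding/re-indexing the shapefile.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attributes (unique_id INTEGER PRIMARY KEY, owner TEXT);
    INSERT INTO attributes VALUES (1001, 'parks dept');
""")

def attributes_for(dbf_rows):
    """dbf_rows: (unique_id, shx_index) pairs read from one tile's dbf.
    Returns a map from the tile-local .shx index to the SQL attributes."""
    out = {}
    for unique_id, shx_index in dbf_rows:
        row = conn.execute(
            "SELECT owner FROM attributes WHERE unique_id=?",
            (unique_id,)).fetchone()
        out[shx_index] = row[0] if row else None
    return out
```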
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>
> Happy for any advice.
>
> Cameron.
>