[postgis-devel] Removing geometry parsing checks for data loading
David Fuhry
dfuhry at acm.org
Fri Jul 4 13:51:42 PDT 2008
Mark,
Thanks for clarifying; I understood the scope of your patch when I
sent yesterday's email, but not when I sent the previous one a few weeks
ago. And thanks for the patch; I'll use it.
I agree that the shapefiles with unclosed polygons are totally
invalid. But I need those records' attributes, and their siblings
records' valid geometries in the db. Besides shapefile preprocessing or
loading all geom records as "bytea"s, relaxing the geometry parser is
the least-hackish way I see to bulk load them.
From a performance perspective, it's very clear why shp2pgsql
performs the checks it does (and no others): they're cheap. But from a
consistency perspective, one could argue that it's inconsistent that the
loader enforces some validity checks (closure, # points) but not others
(self-intersection, counterclockwise-ordered boundary, etc.). I guess
that's why some people expect that the (E)WKB parser do nothing but
decode the base64 string into a byte array, and other people expect that
if a load succeeds, that st_area2d(), st_perimeter(), etc. will always
return the perceived "correct" value.
Paul: Manually closing the polygons is a great suggestion for data
cleanup. But for instance, I'd rather NULL bad geometries for this app.
Given this, if the loader auto-closed, I'd have no way of knowing
which polygons were previously unclosed (in the shapefile) by the time
the data hit the db. Conversely, if the loader auto-NULLed bad
geometries, I'd have no way to know which polygons in the shapefile were
originally NULL, and which the loader had NULLed because of geometry
errors. (You're right; loading polygons as MULTILINESTRINGs seems the
only way to load such data in a valid way. Ugly though!)
Similar ambiguities exist for the "num_points % num_dimensions == 0" and
"num_poly_vertices >= 3" checks. And shp2pgsql's logic would always
have to be synced with the (E)WKB parser's. (To me) seems to be a
slippery slope to add geometry validation/modification to shp2pgsql.
Thanks,
Dave
Paul Ramsey wrote:
> My experience has been that in the case of non-closed polygons, the
> final duplicate vertex is often "implied". So one way to "have our
> cake and eat it too" would be to add the duplicate vertex at load
> time. If that sounds overly harsh, than probably the things you are
> loading aren't actually polygons, so you should go back and call them
> linestrings and start again.
>
> P.
>
> On Fri, Jul 4, 2008 at 2:23 AM, Mark Cave-Ayland
> <mark.cave-ayland at siriusit.co.uk> wrote:
>> David Fuhry wrote:
>>> $ shp2pgsql -D poly_1.shp poly_1 | psql
>>> ...
>>> ERROR: geometry contains non-closed rings
>>> CONTEXT: COPY poly_1, line 389, column the_geom:
>>>
>>> "01060000000100000001030000000100000007000000F06044A5C0B53E41086D2D657C8C23413D5FFD71BFB53E4191925462..."
>>> ROLLBACK
>>>
>>> What's the best way to drop these geometry-parsing checks (from
>>> lwgparse.c, I assume)? Is commenting out the if (...) error(...); lines
>>> in popc() sufficient and safe?
>>>
>>> Obviously as part of my import routine, I check that the geometries are
>>> isvalid() later. For my uses, geometry errors are best resolved with
>>> custom logic /after/ the data is in the db.
>>>
>>> If Mark and others have serious concerns about people naively using
>>> non-isvalid() geometries, then I suggest a configurable parameter to
>>> shp2pgsql (on by default) which adds an isvalid() check constraint to
>>> the table. The call to AddGeometryColumn() already adds ndims,
>>> geometrytype, and srid checks to the table; why not the equally
>>> important isvalid() too?
>>>
>>> Thanks,
>>>
>>> Dave Fuhry
>>
>> Hi Dave,
>>
>> I think I may not have explained myself clearly enough here. When I fixed
>> the WKB parser, I didn't add any additional calls into isvalid(). IsValid()
>> is a particularly heavyweight function which is why currently this is left
>> up to the user to decide whether they wish to use it or not. This is because
>> it performs various intersection and repeated point tests which can involve
>> lots of time-consuming matrix maths.
>>
>> Now the patch I added to the 1.3 series was to make the WKT and WKB parsers
>> use the same validation code. The only checks that the WK(T/B) parser makes
>> for polygons are i) ensure that polygon rings have a minimum of 3 points and
>> ii) ensure that polygon rings are closed (i.e. the first point equals the
>> last point). Both of these do seem entirely sane checks to be making.
>>
>> I've done a quick check of the SFS specification
>> (http://www.opengis.org/docs/99-049.pdf) and the ESRI shapefile
>> specification (http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf)
>> and both of these mandate that polygon rings *must* be closed. And to be
>> honest, I can see that this makes sense - for example, think about how
>> should things behave if you try to calculate the area of an open polygon,
>> for example.
>>
>> So in other words, whoever (or whatever) is generating your shapefiles is
>> generating invalid shapefiles according to spec, and I'm not yet convinced
>> that this is a good enough reason to start allowing geometries with open
>> rings into the database.
>>
>> If you really are stuck, I've attached a patch against 1.3 series that will
>> turn off the check for WKB, but it will take a lot more convincing in order
>> for this (or other relevant fix) to get into SVN.
>>
>>
>> ATB,
>>
>> Mark.
>>
>> --
>> Mark Cave-Ayland
>> Sirius Corporation - The Open Source Experts
>> http://www.siriusit.co.uk
>> T: +44 870 608 0063
>>
>> _______________________________________________
>> postgis-devel mailing list
>> postgis-devel at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-devel
>>
>>
> _______________________________________________
> postgis-devel mailing list
> postgis-devel at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-devel
More information about the postgis-devel
mailing list