[gdal-dev] LDID and .CPG in OGR shapefile driver

Frank Warmerdam warmerdam at pobox.com
Wed Jul 14 09:50:50 EDT 2010


Francis Markham wrote:
> As discussed in
> http://lists.osgeo.org/pipermail/gdal-dev/2010-May/024619.html and
> http://lists.osgeo.org/pipermail/gdal-dev/2010-July/025192.html OGR's
> shapefile driver does not allow the shapefile's codepage to be set or
> retrieved using the DBF LDID byte or an *.cpg file.
> 
> This functionality is implemented in recent shapelib releases, when
> creating a new shapefile.
> 
> Issue #882 http://trac.osgeo.org/gdal/ticket/882 addresses this issue,
> but the discussion there largely predates RFCs 5 and 23 (
> http://trac.osgeo.org/gdal/wiki/rfc5_unicode and
> http://trac.osgeo.org/gdal/wiki/rfc23_ogr_unicode ).
> 
> I would be interested in exposing this shapelib feature in OGR.
> However, there are a number of design decisions to make:
> 
> 1) Should encoding retrieval and setting be an OGR wide feature, or
> one specific to the shapefile driver?

Francis,

Note that RFC 23 mandates that OGR layers return attributes in UTF-8,
so on "read" the expected action would be for the shapefile driver
to use the .cpg file or the DBF LDID byte to identify the incoming
encoding and then use CPLRecode to convert to UTF-8.  So on read there
is no need for an OGR wide change.
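
As a rough illustration of that read path, here is a minimal Python
sketch (a stand-in, not the driver's C++ code and not CPLRecode itself).
The names detect_encoding and to_utf8 are hypothetical, and the LDID
table is partial, listing only a few commonly documented dBase language
driver IDs. Preferring the .cpg content over the LDID byte, and falling
back to ISO-8859-1, are assumptions made for the sketch.

```python
import codecs

# Partial, illustrative LDID -> Python codec table (assumption: only a
# few commonly documented dBase language driver IDs are listed here).
LDID_TO_CODEC = {
    0x01: "cp437",   # US MS-DOS
    0x02: "cp850",   # International MS-DOS
    0x03: "cp1252",  # Windows ANSI
    0xC8: "cp1250",  # Eastern European Windows
    0xC9: "cp1251",  # Russian Windows
}


def detect_encoding(ldid, cpg_text=None, default="iso-8859-1"):
    """Return a Python codec name, trying the .cpg text before the LDID."""
    if cpg_text:
        try:
            return codecs.lookup(cpg_text.strip()).name
        except LookupError:
            pass  # unrecognised .cpg label: fall through to the LDID byte
    return LDID_TO_CODEC.get(ldid, default)


def to_utf8(raw, ldid, cpg_text=None):
    """Recode raw DBF attribute bytes to UTF-8 bytes."""
    encoding = detect_encoding(ldid, cpg_text)
    return raw.decode(encoding).encode("utf-8")
```

With this sketch, to_utf8(b"caf\xe9", 0x03) decodes the bytes as cp1252
and returns the UTF-8 bytes for "café".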

On write I would anticipate the output encoding being set with
a layer creation option.  Ideally this layer creation option would be
shared by any other driver that needs the ability to set the
encoding on export, but there is no need for any implementation
beyond the shapefile driver for now.
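
To make the write side concrete, here is a small Python sketch under
stated assumptions: the option name ENCODING, the ISO-8859-1 default,
and the helpers get_option and encode_attribute are all hypothetical
and chosen only for illustration. It treats creation options as a list
of "KEY=VALUE" strings and encodes a UTF-8 text value into a
fixed-width DBF field.

```python
def get_option(options, key, default=None):
    """Look up key in a list of "KEY=VALUE" strings (key case-insensitive)."""
    for item in options:
        name, sep, value = item.partition("=")
        if sep and name.strip().upper() == key.upper():
            return value.strip()
    return default


def encode_attribute(value, options, width):
    """Encode a text value into a space-padded, fixed-width DBF field."""
    # Hypothetical option name and default, chosen for this sketch only.
    encoding = get_option(options, "ENCODING", "iso-8859-1")
    # errors="replace" writes '?' for characters the code page lacks.
    raw = value.encode(encoding, errors="replace")
    # Byte-level truncation; a multi-byte encoding could be cut mid-character.
    return raw[:width].ljust(width, b" ")
```

For example, encode_attribute("café", ["ENCODING=CP1252"], 6) yields the
four cp1252 bytes followed by two padding spaces.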

> 2) Should encodings be specified as a string or an enumeration of
> well-known encodings?  If encoding retrieval and setting occurs only
> at the shapefile driver level, then a string that mimics shapelib's
> API might be sensible (if the codepage is set to "LDID/n" and -1 < n <
> 255 then the LDID byte of the DBF is set to n, otherwise the whole
> codepage string is written to the .CPG file).  Otherwise, commonsense
> would suggest a standardised enum of encodings might be the way to go.

They should be specified as strings, per RFC 23.  If there is no
apparent mapping to some shapefile output encoding, we might also
want to provide an extra mechanism to specify the encodings directly
as the codes used in the .cpg or LDID field.
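
The "LDID/n" string convention described in the quoted text could be
handled along these lines. This is a Python sketch only, not shapelib's
code; split_code_page is a hypothetical name, and the numeric range is
taken from the quoted message as written, so it should be checked
against shapelib before relying on it.

```python
def split_code_page(code_page):
    """Return (ldid, cpg_text); exactly one of the two is not None."""
    prefix = "LDID/"
    if code_page.startswith(prefix):
        try:
            n = int(code_page[len(prefix):])
        except ValueError:
            n = -1  # not a number: treat the string as .cpg content
        if -1 < n < 255:  # range as stated in the quoted message
            return n, None
    # Anything else is written verbatim to the .cpg file.
    return None, code_page
```

So "LDID/87" would set the LDID byte to 87, while "UTF-8" or an
out-of-range value such as "LDID/300" would go to the .cpg file as is.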

> 3) What should the API be?  A patch at issue #882 creates two new
> OGRLayer member functions, GetEncoding() and SetEncoding(), and a
> GetEncoding() implementation for shapefiles (although it fails to
> allow the encoding to be set, as far as I can see).  

In my opinion there is no need for GetEncoding() and SetEncoding()
methods in the OGR API.

> Is this the appropriate place to have this discussion?  I would be
> happy to provide a patch implementing this feature however it is
> deemed most appropriate.

This is a reasonable place to have the discussion.  If you can
provide code implementing RFC 23 for the shapefile driver, with some
test samples to help demonstrate it, that would be much appreciated.
I'm happy to have Chaitanya provide support as well.

Best regards,
-- 
---------------------------------------+--------------------------------------
I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam at pobox.com
light and sound - activate the windows | http://pobox.com/~warmerdam
and watch the world go round - Rush    | Geospatial Programmer for Rent


