[gdal-dev] Re: Handling CPG (encoding) file
Even Rouault
even.rouault at mines-paris.org
Tue May 25 19:31:19 EDT 2010
Alexander,
I'm cc'ing Gaige Paulsen, as he proposed in
http://trac.osgeo.org/gdal/ticket/3403 a patch with a similar approach to
yours, that is to say, providing a method at the OGRLayer level to return the
encoding.
The more I think about this issue, the more I recognize that "UTF-8 everywhere
internally" is probably not practical in all situations, or at least doesn't
leave enough control to the user. UTF-8 as a pivot is - conceptually - OK
for the read part, but it doesn't help for the write part when a driver
doesn't support UTF-8 (or when, for compatibility reasons with other
software, we must write data in a certain encoding).
My main remark about your patch is that I don't believe the enum approach to
listing the encodings is the best one. I'd rather be in favor of using a
string, possibly sticking to the names returned by 'iconv -l', so that the
return value of GetEncoding() can be fed directly into the converter through
CPLRecode(). I experimented with this some time ago and have some changes
ready in cpl_recode_stub.cpp & configure to plug iconv support into it, in
order to extend its scope beyond the current hardcoded support for UTF-8 and
ISO-8859-1.
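To illustrate what I have in mind (just a sketch; GetEncoding() below is of
course the method being discussed, not existing API), an application could
then recode field values itself:

    #include <cstdio>
    #include "cpl_conv.h"
    #include "cpl_string.h"
    #include "ogrsf_frmts.h"

    /* Sketch only: OGRLayer::GetEncoding() is the proposed method.
       CPLRecode() already exists, but without iconv the stub only knows
       UTF-8 and ISO-8859-1. */
    void DumpFieldAsUTF8(OGRLayer *poLayer, OGRFeature *poFeature, int iField)
    {
        const char *pszSrcEncoding = poLayer->GetEncoding();   /* proposed */
        const char *pszRaw = poFeature->GetFieldAsString(iField);

        /* iconv-style names make this a direct hand-off */
        char *pszUTF8 = CPLRecode(pszRaw, pszSrcEncoding, CPL_ENC_UTF8);
        printf("%s\n", pszUTF8);
        CPLFree(pszUTF8);
    }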
We could imagine -s_encoding, -t_encoding and -a_encoding switches for
ogr2ogr to let the user define the transcoding or the encoding assignment.
One of the difficulties raised by Gaige in #3403 is the meaning of the width
attribute of an OGRFieldDefn object (number of bytes or number of characters
in a given encoding), and how/whether it would be affected by an encoding
change.
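To make both points a bit more concrete: usage could look like
"ogr2ogr -s_encoding CP1251 -t_encoding UTF-8 out.shp in.shp" (switch names
and semantics purely tentative), and the width question shows up as soon as
we measure in bytes, since recoding changes the byte length even when the
number of characters stays the same. A small self-contained illustration
with the existing CPLRecode():

    #include <cstdio>
    #include <cstring>
    #include "cpl_conv.h"
    #include "cpl_string.h"

    int main()
    {
        /* "déjà vu": 7 characters, 7 bytes in ISO-8859-1 */
        const char *pszLatin1 = "d\xE9j\xE0 vu";

        /* ISO-8859-1 <-> UTF-8 is already handled by the stub */
        char *pszUTF8 = CPLRecode(pszLatin1, CPL_ENC_ISO8859_1, CPL_ENC_UTF8);

        /* Prints "7 bytes in ISO-8859-1, 9 bytes in UTF-8": same
           characters, different byte count, hence the OGRFieldDefn width
           question (bytes or characters?). */
        printf("%d bytes in ISO-8859-1, %d bytes in UTF-8\n",
               (int)strlen(pszLatin1), (int)strlen(pszUTF8));

        CPLFree(pszUTF8);
        return 0;
    }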
The other issues raised by Gaige in his last comment are still worth
considering. For the read part, what do we want?
1) that the driver returns the data in its "raw" encoding and reports that
encoding --> matches the approach of your proposal
2) that we ask it to return the data as UTF-8 when we don't care about the
data in its source encoding
3) that we can override its declared encoding when the source encoding is
believed to be incorrect, so that 2) can work properly
Approaches 1) and 2) clearly follow two different tracks. One way to
reconcile them would be to provide some configuration/opening option to
choose which behaviour is preferred. RFC23 currently chooses 2), as it
mandates that "Any driver which knows it's encoding should convert to UTF-8."
Well, probably not a big deal, since any change to how we deal with encoding
is likely to cause RFC23 to be amended anyway.
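A sketch of what such an option could look like from the application side
(the option name OGR_SOURCE_ENCODING_POLICY is invented here purely for
illustration; nothing of the sort exists today):

    #include <cstdio>
    #include "cpl_conv.h"
    #include "ogrsf_frmts.h"

    int main()
    {
        OGRRegisterAll();

        /* Hypothetical option: ask drivers to hand strings back untouched
           in their raw encoding (behaviour 1) instead of recoding them to
           UTF-8 (behaviour 2, the current RFC23 rule). */
        CPLSetConfigOption("OGR_SOURCE_ENCODING_POLICY", "RAW");

        OGRDataSource *poDS = OGRSFDriverRegistrar::Open("input.shp", FALSE);
        if (poDS == NULL)
        {
            printf("Open failed.\n");
            return 1;
        }

        /* ... read features, recode at the application level if wanted ... */

        OGRDataSource::DestroyDataSource(poDS);
        return 0;
    }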
Personally, I'm not sure which one is best. I'm wondering what the use cases
for 1) are: when do we really want the data to be returned in its source
encoding --> won't it be converted later to UTF-8 at the application level
anyway, after the user has potentially selected/overridden the source
encoding? In which case 3) would solve the problem. Just thinking out loud...
For the write part, an OGRSFDriver::GetSupportedEncodings() and an
OGRLayer::SetEncoding() could make sense (for the latter, whether it should
be exposed at the datasource or at the layer level is an open point, and a
slight difference between your approach and Gaige's).
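To sketch how the write side could look with those two methods (again, both
are the proposed additions, not existing API, and the example arbitrarily
puts SetEncoding() at the layer level):

    #include "cpl_string.h"
    #include "ogrsf_frmts.h"

    /* Sketch of the proposed write-side API. */
    bool PrepareLayerEncoding(OGRSFDriver *poDriver, OGRLayer *poLayer,
                              const char *pszWanted)
    {
        char **papszEncodings =
            poDriver->GetSupportedEncodings();                 /* proposed */
        bool bSupported = CSLFindString(papszEncodings, pszWanted) != -1;
        if (bSupported)
            poLayer->SetEncoding(pszWanted);                   /* proposed */
        CSLDestroy(papszEncodings);
        return bSupported;
    }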
Best regards
Even