[gdal-dev] Re: Handling CPG (encoding) file

Tue May 25 19:31:19 EDT 2010

Alexander,

I'm cc'ing Gaige Paulsen as he proposed in 
http://trac.osgeo.org/gdal/ticket/3403 a patch with a similar approach to 
yours, that is to say, providing a method at the OGRLayer level to return the 
encoding.

The more I think about this issue, the more I recognize that the "UTF-8 
everywhere internally" approach is probably not practical in all situations, 
or at least doesn't leave enough control to the user. UTF-8 as a pivot is 
- conceptually - OK for the read part, but it doesn't help for the write part 
when a driver doesn't support UTF-8 (or if, for compatibility with other 
software, we must write data in a certain encoding).

My main remark about your patch is that I don't believe the enum approach to 
listing the encodings is the best one. I'd rather be in favor of using a 
string, and possibly sticking to the names returned by 'iconv -l', so that we 
can easily feed the return value of GetEncoding() into the converter through 
CPLRecode(). I experimented with this some time ago and have some changes 
ready in cpl_recode_stub.cpp & configure to plug iconv support into it, in 
order to extend its scope beyond the current hardcoded support for UTF-8 and 
ISO-8859-1.
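
To illustrate why a string is convenient, here is a minimal sketch in Python, 
using only the standard codecs rather than iconv or CPLRecode(). The recode() 
helper and the layer_encoding value are hypothetical stand-ins for CPLRecode() 
and for what a string-returning GetEncoding() might report. The point is only 
that the reported name goes straight into the converter, with no enum-to-name 
table to maintain.

```python
def recode(data: bytes, src_encoding: str, dst_encoding: str) -> bytes:
    """Convert bytes between two named encodings (stand-in for CPLRecode())."""
    return data.decode(src_encoding).encode(dst_encoding)


# Hypothetical value that a string-returning GetEncoding() might report.
layer_encoding = "ISO-8859-1"

raw = b"caf\xe9"  # "café" encoded as ISO-8859-1
utf8 = recode(raw, layer_encoding, "UTF-8")
print(utf8)  # b'caf\xc3\xa9'
```

Python's codec names are not identical to the list printed by 'iconv -l', but 
common names such as UTF-8 and ISO-8859-1 are accepted by both.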

We could imagine -s_encoding, -t_encoding and -a_encoding switches for 
ogr2ogr to let the user define the transcoding or encoding assignment. One of 
the difficulties raised by Gaige in #3403 is the meaning of the width attribute 
of an OGRFieldDefn object (number of bytes or number of characters in a given 
encoding), and how/if it will be affected by an encoding change.
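
The width ambiguity can be shown with a small sketch (Python, standard library 
only; the sample string and the width of 6 are assumptions picked for 
illustration, not taken from the ticket). The same six-character value needs 6 
bytes in CP1251 but 12 bytes in UTF-8, so a width counted in bytes no longer 
fits once the data is transcoded.

```python
name = "Москва"  # six Cyrillic characters

as_cp1251 = name.encode("cp1251")  # one byte per character
as_utf8 = name.encode("utf-8")     # two bytes per character for Cyrillic

print(len(name), len(as_cp1251), len(as_utf8))  # 6 6 12

# If a field width of 6 is interpreted as bytes, the UTF-8 form is cut in half.
width = 6
truncated = as_utf8[:width].decode("utf-8")
print(truncated)  # Мос
```

With an odd byte width the cut would land in the middle of a character and 
the result would not even be valid UTF-8.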

The other issues raised by Gaige in his last comment are still worth 
considering. For the read part, what do we want:
1) that the driver returns the data in its "raw" encoding and mentions the 
encoding --> matches the approach of your proposal
2) that we ask it to return the data converted to UTF-8 when we don't care 
about the data in its source encoding
3) that we can override its encoding when the source encoding is believed to 
be incorrect, so that 2) can work properly
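
The three behaviours can be sketched as one function (Python, standard library 
only). Everything here is hypothetical: read_value(), its parameters and the 
sample data are illustration only and do not correspond to any existing OGR 
API.

```python
from typing import Optional, Tuple


def read_value(raw: bytes, declared: str, to_utf8: bool,
               override: Optional[str] = None) -> Tuple[bytes, str]:
    """Return (data, encoding_of_data) for one attribute value.

    raw      -- bytes as stored in the file
    declared -- encoding the driver believes the file uses (e.g. from a .cpg)
    to_utf8  -- False selects behaviour 1), True selects behaviour 2)
    override -- behaviour 3): user-supplied replacement for `declared`
    """
    source = override or declared
    if not to_utf8:
        # 1) hand back the stored bytes untouched, plus their encoding name
        return raw, source
    # 2) transcode to UTF-8, trusting `source` to describe the stored bytes
    return raw.decode(source).encode("utf-8"), "UTF-8"


raw = "Москва".encode("cp1251")

# 1) raw bytes with their label
print(read_value(raw, "cp1251", to_utf8=False))

# 2) with a wrong declared encoding, the conversion silently yields mojibake
wrong, _ = read_value(raw, "ISO-8859-1", to_utf8=True)
print(wrong.decode("utf-8"))

# 3) overriding the declared encoding makes 2) work properly
fixed, _ = read_value(raw, "ISO-8859-1", to_utf8=True, override="cp1251")
print(fixed.decode("utf-8"))
```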

Approaches 1) and 2) clearly follow two different tracks. One way to 
reconcile both would be to provide some configuration/opening option to 
choose which behaviour is preferred. RFC23 currently chooses 2) as it mandates 
that "Any driver which knows it's encoding should convert to UTF-8." Well, 
probably not a big deal, since any change related to how we deal with 
encoding is likely to cause RFC23 to be amended anyway.

Personally, I'm not sure which one is the best. I'm wondering what the 
use cases for 1) are: when do we really want the data to be returned in its 
source encoding --> will it not be converted later to UTF-8 at the 
application level, after the user has potentially selected/overridden the 
source encoding? In which case 3) would solve the problem. Just thinking 
out loud...

For the write part, an OGRSFDriver::GetSupportedEncodings() and an 
OGRLayer::SetEncoding() could make sense (for the latter, whether it must be 
exposed at the datasource or layer level is an open point and a slight 
difference between your approach and Gaige's).
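
A rough sketch of how such a pair of methods might interact (Python, standard 
library only). The SketchLayer class and its method names are hypothetical 
and only mirror the names proposed above; nothing like this exists in OGR 
today, and the layer-level placement is just one of the two options.

```python
class SketchLayer:
    """Toy layer that only accepts encodings its driver claims to support."""

    def __init__(self, supported_encodings):
        self._supported = tuple(supported_encodings)
        self._encoding = self._supported[0]  # driver default

    def get_supported_encodings(self):
        # Mirrors the proposed OGRSFDriver::GetSupportedEncodings()
        return self._supported

    def set_encoding(self, name: str) -> bool:
        # Mirrors the proposed OGRLayer::SetEncoding(); refuses unknown names
        if name not in self._supported:
            return False
        self._encoding = name
        return True

    def encode_for_write(self, utf8_text: str) -> bytes:
        # Transcode from the internal representation to the chosen encoding
        return utf8_text.encode(self._encoding)


layer = SketchLayer(["ISO-8859-1", "cp1251"])
print(layer.set_encoding("UTF-8"))   # False: this toy driver cannot write it
print(layer.set_encoding("cp1251"))  # True
print(len(layer.encode_for_write("Москва")))  # 6
```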

Best regards 

Even