[gdal-dev] Re: Handling CPG (encoding) file

Wed May 26 10:27:17 EDT 2010

Even,
   Thanks for bringing this to my attention. We're winding down a major release here (hence my relative absence from the mailing list, except as a lurker with occasional comments), but this is an issue I wanted to revisit this summer. Snips and comments below.

On May 25, 2010, at 7:31 PM, Even Rouault wrote:

> Alexander,
> 
> I'm cc'ing Gaige Paulsen as he proposed in 
> http://trac.osgeo.org/gdal/ticket/3403 a patch with a similar approach to 
> yours, that is to say provide a method at the OGRLayer level to return the 
> encoding.
> 
> The more I think about this issue, the more I recognize that "UTF-8 everywhere 
> internally" is probably not practical in all situations, or at least doesn't 
> give enough control to the user. UTF-8 as a pivot is - conceptually - OK 
> for the read part, but it doesn't help for the write part when a driver 
> doesn't support UTF-8 (or if, for compatibility with other 
> software, we must write data in a certain encoding).
> 
> My main remark about your patch is that I don't believe the enum approach to 
> listing the encodings is the best one. I'd rather be in favor of using a string, 
> and possibly sticking to the names returned by 'iconv -l' so that we can 
> easily feed the return value of GetEncoding() into the converter through 
> CPLRecode(). I experimented with it some time ago and have some changes 
> ready in cpl_recode_stub.cpp & configure to plug iconv support into it, in 
> order to extend its scope beyond the current hardcoded support for UTF8 and 
> ISO-8859-1.

I agree with using strings instead of enums for this.      
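
As a rough illustration of why string names are convenient here: the name reported for a layer can be passed straight to the converter, with no enum-to-name mapping in between. The sketch below uses Python's built-in codecs as a stand-in for CPLRecode() with iconv; the recode() helper is hypothetical and is not part of GDAL/OGR.

```python
import codecs


def recode(data: bytes, src_encoding: str, dst_encoding: str) -> bytes:
    """Transcode bytes between two encodings named by strings.

    Hypothetical helper, loosely analogous to a string-driven converter.
    """
    # codecs.lookup() raises LookupError for an unknown name, so a bad
    # encoding string fails early instead of silently producing garbage.
    codecs.lookup(src_encoding)
    codecs.lookup(dst_encoding)
    return data.decode(src_encoding).encode(dst_encoding)


# "caf\u00e9" is one byte per character in ISO-8859-1.
latin1_bytes = "caf\u00e9".encode("ISO-8859-1")

# The encoding name, as a layer might report it, feeds the converter directly.
utf8_bytes = recode(latin1_bytes, "ISO-8859-1", "UTF-8")
```

An enum would need a lookup table kept in sync with whatever the converter supports; with strings, any name the converter accepts works without further changes.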

> We could imagine -s_encoding, -t_encoding and -a_encoding switches for 
> ogr2ogr to let the user define the transcoding or encoding assignment. One of 
> the difficulties raised by Gaige in #3403 is the meaning of the width attribute 
> of an OGRFieldDefn object (number of bytes or number of characters in a given 
> encoding), and how/if it will be affected by an encoding change.
> 
> The other issues raised by Gaige in his last comment are still worth 
> considering. For the read part, what do we want?
> 1) that the driver returns the data in its "raw" encoding and reports the 
> encoding --> matches the approach of your proposal
> 2) that we ask it to return the data converted to UTF-8 when we don't care 
> about the data in its source encoding
> 3) that we can override its encoding when the source encoding is believed to 
> be incorrect so that 2) can work properly

I still think that UTF-8 as a pivot makes sense and works well for most cases (we tend to use it internally as well). And mostly I prefer option 3. I was specifically looking at these different problems:
- Situations where a format handles multiple encodings, but the encoding in the file being read is either ambiguous or incorrect. 
- Situations where there is a need to determine which of multiple encodings to use (sometimes necessary for compatibility reasons)
- Situations where storage space is very tight and field width must be intelligently degraded 
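
The field-width problem in the last item comes from the gap between character count and byte count. A minimal Python sketch, assuming the width is a byte budget and the target encoding is UTF-8; truncate_to_bytes() is a hypothetical helper, not an OGR function:

```python
def truncate_to_bytes(text: str, max_bytes: int, encoding: str = "utf-8") -> str:
    """Shorten text so its encoded form fits in max_bytes.

    Hypothetical helper. Assumes a self-synchronizing encoding such as
    UTF-8; stateful encodings would need more care.
    """
    encoded = text.encode(encoding)
    if len(encoded) <= max_bytes:
        return text
    # Cutting at a byte boundary may split a multibyte character;
    # errors="ignore" drops the incomplete trailing sequence.
    return encoded[:max_bytes].decode(encoding, errors="ignore")


name = "na\u00efve"                          # 5 characters
utf8_len = len(name.encode("utf-8"))         # 6 bytes: the i-diaeresis takes two
latin1_len = len(name.encode("iso-8859-1"))  # 5 bytes

short = truncate_to_bytes(name, 3)           # cut lands inside the 2-byte character
```

The same five-character value fits a 5-byte field in ISO-8859-1 but not in UTF-8, which is why a width defined in bytes changes meaning when the encoding changes.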

> Approaches 1) and 2) clearly follow two different tracks. One way to 
> reconcile both would be to provide some configuration/opening option to 
> choose which behaviour is preferred. RFC23 currently chooses 2) as it mandates 
> that "Any driver which knows it's encoding should convert to UTF-8." Well, 
> probably not a big deal, since any change related to how we deal with 
> encoding is likely to cause RFC23 to be amended anyway.
> 
> Personally, I'm not sure which one is the best. I'm wondering what the 
> use cases for 1) are: when do we really want the data to be returned in its 
> source encoding --> won't it be converted later to UTF-8 at the 
> application level after the user has potentially selected/overridden the 
> source encoding? In which case 3) would solve the problem. Just thinking 
> out loud...

I think option 3 could well solve the problem here. The only downside I see is that it might make auto-detection difficult to implement if someone were to attempt it in a manner that not everyone agreed on. However, the cases I listed above would be resolved by option 3.
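
To make option 3 concrete, here is a minimal Python sketch of a read path that converts to UTF-8 but lets the caller override a declared encoding believed to be wrong. The function and parameter names are hypothetical and do not correspond to any existing OGR API.

```python
from typing import Optional


def read_as_utf8(raw: bytes, declared_encoding: str,
                 override_encoding: Optional[str] = None) -> bytes:
    """Return raw field bytes converted to UTF-8 (hypothetical helper).

    The caller's override, if given, takes precedence over the
    encoding the file declares.
    """
    source = override_encoding or declared_encoding
    return raw.decode(source).encode("utf-8")


# Cyrillic text stored as CP1251 in a file that wrongly declares ISO-8859-1.
original = "\u041f\u0440\u0438\u0432\u0435\u0442"
raw = original.encode("cp1251")

garbled = read_as_utf8(raw, "ISO-8859-1").decode("utf-8")
fixed = read_as_utf8(raw, "ISO-8859-1", override_encoding="cp1251").decode("utf-8")
```

Without the override the conversion succeeds but yields mojibake, because ISO-8859-1 accepts any byte sequence; with it, the original text is recovered. Auto-detection, if added, would be one more way of choosing the source name in this scheme.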

> 
> For the write part, an OGRSFDriver::GetSupportedEncodings() and an 
> OGRLayer::SetEncoding() could make sense (for the latter, whether it must be 
> exposed at the datasource or layer level is an open point and a slight 
> difference between your approach and Gaige's)

Is there a need for a per-layer approach to this? I've yet to see a format that allows different encodings in different layers. Although, thinking about it, this might be a problem with some of the virtual datasets, since they hide some of this.

-Gaige
