[gdal-dev] Re: Handling CPG (encoding) file
Gaige B. Paulsen
osgeo at gbp.gaige.net
Wed May 26 10:27:17 EDT 2010
Even,
Thanks for bringing this to my attention. We're winding down a major release here (hence my relative absence from the mailing list except as a lurker with occasional comments), but this issue is one I wanted to revisit this summer. Snips and comments below:
On May 25, 2010, at 7:31 PM, Even Rouault wrote:
> Alexander,
>
> I'm cc'ing Gaige Paulsen as he proposed in
> http://trac.osgeo.org/gdal/ticket/3403 a patch with a similar approach to
> yours, that is to say provide a method at the OGRLayer level to return the
> encoding.
>
> The more I think about this issue, the more I recognize that "UTF-8 everywhere
> internally" is probably not practical in all situations, or at least doesn't
> give the user enough control. UTF-8 as a pivot is conceptually OK for the read
> part, but it doesn't help for the write part when a driver doesn't support
> UTF-8 (or when, for compatibility with other software, we must write data in a
> certain encoding).
>
> My main remark about your patch is that I don't believe the enum approach to
> listing the encodings is the best one. I'd rather be in favor of using a
> string, possibly sticking to the names returned by 'iconv -l', so that the
> return value of GetEncoding() can be fed straight into the converter through
> CPLRecode(). I experimented with this some time ago and have some changes
> ready in cpl_recode_stub.cpp and configure to plug iconv support into it, in
> order to extend its scope beyond the current hardcoded support for UTF-8 and
> ISO-8859-1.
I agree with using strings instead of enums for this.
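Just to make that concrete, here is a rough sketch of how a string-based encoding could be consumed on the read side. GetEncoding() is the proposed (not yet existing) layer method; CPLRecode(), CPL_ENC_UTF8 and friends are the existing CPL pieces, and the caller would CPLFree() the result:

    #include "cpl_conv.h"
    #include "cpl_string.h"
    #include "ogrsf_frmts.h"

    /* Sketch only: poLayer->GetEncoding() is the proposed method and does  */
    /* not exist yet; it would return an iconv-style name such as "CP1251"  */
    /* or "ISO-8859-1", or an empty string when the encoding is unknown.    */
    static char *GetFieldValueAsUTF8(OGRLayer *poLayer,
                                     OGRFeature *poFeature, int iField)
    {
        const char *pszRaw = poFeature->GetFieldAsString(iField);
        const char *pszEnc = poLayer->GetEncoding();   /* hypothetical */

        if (pszEnc == NULL || EQUAL(pszEnc, "") || EQUAL(pszEnc, CPL_ENC_UTF8))
            return CPLStrdup(pszRaw);   /* already UTF-8 or unknown: pass through */

        /* Existing API: recode from the advertised encoding to UTF-8. */
        return CPLRecode(pszRaw, pszEnc, CPL_ENC_UTF8);
    }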
> We could imagine -s_encoding, -t_encoding and -a_encoding switches to ogr2ogr
> to let the user define the transcoding or the encoding assignment. One of the
> difficulties raised by Gaige in #3403 is the meaning of the width attribute of
> an OGRFieldDefn object (number of bytes or number of characters in a given
> encoding), and how/whether it would be affected by an encoding change.
>
> The other issues raised by Gaige in his last comment are still worth
> considering. For the read part, what do we want?
> 1) that the driver returns the data in its "raw" encoding and mentions the
> encoding --> matches the approach of your proposal
> 2) that we ask it to return the data converted to UTF-8 when we don't care
> about the data in its source encoding
> 3) that we can override its encoding when the source encoding is believed to
> be incorrect, so that 2) can work properly
I still think that UTF-8 as a pivot makes sense and works well for most cases (we tend to use it internally as well), and for the most part I prefer option #3; a rough sketch of what that might look like follows the list below. I was specifically looking at these different problems:
- Situations where a format handles multiple encodings, but the encoding in the file being read is either ambiguous or incorrect.
- Situations where there is a need to determine which of multiple encodings to use (sometimes necessary for compatibility reasons)
- Situations where storage space is very tight and field width must be intelligently degraded
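For what it's worth, here is the rough sketch I mentioned of what #3 could look like from the application side. SetSourceEncoding() is invented purely for illustration; GetNextFeature(), GetFieldAsString() and DestroyFeature() are existing API:

    #include "cpl_error.h"
    #include "ogrsf_frmts.h"

    /* Hypothetical sketch of option 3: the .cpg file (or DBF LDID) claims  */
    /* ISO-8859-1, but the user knows the data is really CP1251, so the     */
    /* declared encoding is overridden before reading.                      */
    void ReadWithOverride(OGRLayer *poLayer)
    {
        poLayer->SetSourceEncoding("CP1251");   /* invented for illustration */

        OGRFeature *poFeature = NULL;
        while ((poFeature = poLayer->GetNextFeature()) != NULL)
        {
            /* With the override in place, the driver can recode field      */
            /* values to UTF-8 itself, as in behaviour 2).                  */
            const char *pszValue = poFeature->GetFieldAsString(0);
            CPLDebug("ENCODING", "%s", pszValue);
            OGRFeature::DestroyFeature(poFeature);
        }
    }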
> Approaches 1) and 2) clearly follow two different tracks. One way to reconcile
> them would be to provide some configuration/opening option to choose which
> behaviour is preferred. RFC23 currently chooses 2), as it mandates that "Any
> driver which knows it's encoding should convert to UTF-8." Well, probably not
> a big deal, since any change related to how we deal with encoding is likely to
> cause RFC23 to be amended anyway.
>
> Personally, I'm not sure which one is best. I'm wondering what the use cases
> for 1) are: when do we really want the data to be returned in its source
> encoding --> won't it be converted to UTF-8 later at the application level,
> after the user has potentially selected/overridden the source encoding? In
> that case 3) would solve the problem. Just thinking out loud...
I think #3 could well solve the problem here. The only downside I see is that it might make auto-detection difficult to implement if someone attempted it in a manner that not everyone agreed on. However, the cases I list above would all be resolved by #3.
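If we did end up wanting to expose the choice between 1) and 2) as well, the configuration/opening option you mention is probably the lightest way to do it. A minimal sketch, with the option name invented purely for illustration (CPLSetConfigOption(), OGRRegisterAll() and OGRSFDriverRegistrar::Open() are existing API):

    #include "cpl_conv.h"
    #include "ogrsf_frmts.h"

    void OpenWithRawEncoding(const char *pszDataSource)
    {
        /* Hypothetical: "OGR_RETURN_SOURCE_ENCODING" is an invented option  */
        /* name to choose between behaviour 1) (raw bytes plus an advertised */
        /* encoding) and behaviour 2) (always recode to UTF-8).              */
        CPLSetConfigOption("OGR_RETURN_SOURCE_ENCODING", "YES");

        OGRRegisterAll();
        OGRDataSource *poDS = OGRSFDriverRegistrar::Open(pszDataSource, FALSE);
        /* ... read layers in their source encoding ... */
        OGRDataSource::DestroyDataSource(poDS);
    }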
>
> For the write part, an OGRSFDriver::GetSupportedEncodings() and an
> OGRLayer::SetEncoding() could make sense (for the latter, whether it should be
> exposed at the datasource or the layer level is an open point, and a slight
> difference between your approach and Gaige's).
Is there a need for a per-layer approach to this? I've yet to see a format that allows different encodings in different layers. Although, thinking about it, it might be a problem with some of the virtual data sets, since they hide some of this.
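For the write side, assuming the string approach, I'd picture something roughly like the following. Both GetSupportedEncodings() and SetEncoding() are the proposed methods and do not exist today; the sketch arbitrarily puts SetEncoding() on the layer, which is exactly the open point about layer versus datasource level:

    #include "cpl_string.h"
    #include "ogrsf_frmts.h"

    /* Hypothetical write-side sketch: GetSupportedEncodings() (assumed to  */
    /* return a NULL-terminated string list) and SetEncoding() are the      */
    /* proposed methods and do not exist yet.                               */
    void ConfigureOutputEncoding(OGRSFDriver *poDriver, OGRLayer *poLayer)
    {
        char **papszEnc = poDriver->GetSupportedEncodings();

        if (CSLFindString(papszEnc, "CP1251") != -1)
            poLayer->SetEncoding("CP1251");        /* write in a legacy codepage */
        else
            poLayer->SetEncoding(CPL_ENC_UTF8);    /* fall back to UTF-8 */

        CSLDestroy(papszEnc);
    }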
-Gaige