<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Even,<div> Thanks for bringing this to my attention. We're winding down a major release here (hence my relative absence from the mailing list, except as a lurker with occasional comments), but this issue is one I wanted to revisit this summer. Snips and comments below.</div><div><br><div><div>On May 25, 2010, at 7:31 PM, Even Rouault wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>Alexander,<br><br>I'm cc'ing Gaige Paulsen as he proposed in <br><a href="http://trac.osgeo.org/gdal/ticket/3403">http://trac.osgeo.org/gdal/ticket/3403</a> a patch with a similar approach to <br>yours, that is to say providing a method at the OGRLayer level to return the <br>encoding.<br><br>The more I think about this issue, the more I recognize that "UTF-8 everywhere <br>internally" is probably not practical in all situations, or at least doesn't <br>leave enough control to the user. UTF-8 as a pivot is - conceptually - OK <br>for the read part, but it doesn't help for the write part when a driver <br>doesn't support UTF-8 (or if, for compatibility reasons with other <br>software, we must write data in a certain encoding).<br><br>My main remark about your patch is that I don't believe the enum approach to <br>listing the encodings is the best one. I'd rather be in favor of using a string, <br>possibly sticking to the names returned by 'iconv -l', so that we can <br>easily feed the return value of GetEncoding() into the converter through <br>CPLRecode(). I experimented with this some time ago and have some changes <br>ready in cpl_recode_stub.cpp & configure to plug iconv support into it, in <br>order to extend its scope beyond the current hardcoded support for UTF8 and <br>ISO-8859-1.<br></div></blockquote><div><br></div><div>I agree with using strings instead of enums for this.
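<br><br>As a quick illustration of the string approach (a minimal Python sketch, not the C++ API under discussion; get_encoding() and recode() here are hypothetical stand-ins), the appeal is that whatever name a driver reports can be handed directly to the converter, much as a GetEncoding() result could be fed straight into CPLRecode():<br>

```python
# Hypothetical sketch: the driver reports its encoding as a plain
# string, using the same names that iconv / Python codecs understand.
def get_encoding():
    return "iso-8859-1"

def recode(raw_bytes, src_encoding, dst_encoding="utf-8"):
    # Decode from the reported source encoding, then re-encode to the
    # UTF-8 pivot -- no enum-to-converter lookup table is needed.
    return raw_bytes.decode(src_encoding).encode(dst_encoding)

raw = b"Stra\xdfe"  # "Strasse" with a sharp s, in ISO-8859-1
utf8 = recode(raw, get_encoding())
print(utf8)  # b'Stra\xc3\x9fe'
```

<br>With an enum, every new encoding would need both a new enum value and an entry in a mapping table; a string simply reuses the names the converter already knows.<br>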
</div><br><blockquote type="cite"><div>We could imagine -s_encoding, -t_encoding and -a_encoding switches to <br>ogr2ogr to let the user define the transcoding or encoding assignment. One of <br>the difficulties raised by Gaige in #3403 is the meaning of the width attribute <br>of an OGRFieldDefn object (number of bytes or number of characters in a given <br>encoding), and how/whether it will be affected by an encoding change.<br><br>The other issues raised by Gaige in his last comment are still worth <br>considering. For the read part, what do we want?<br>1) that the driver returns the data in its "raw" encoding and mentions the <br>encoding --> matches the approach of your proposal<br>2) that we ask it to return the data in UTF-8 when we don't care about the <br>data in its source encoding<br>3) that we can override its encoding when the source encoding is believed to <br>be incorrect, so that 2) can work properly<br></div></blockquote><div><br></div><div>I still think that UTF-8 as a pivot makes sense and works well for most cases (we tend to use it internally as well). And mostly, I prefer #3. I was specifically looking at these problems: </div><div>- Situations where a format handles multiple encodings, but the encoding in the file being read is either ambiguous or incorrect. </div><div>- Situations where there is a need to determine which of multiple encodings to use (sometimes necessary for compatibility reasons). </div><div>- Situations where storage space is very tight and field width must be intelligently degraded. </div><div><br></div><blockquote type="cite"><div>Approaches 1) and 2) clearly follow two different tracks. One way to <br>reconcile them would be to provide some configuration/opening option to <br>choose which behaviour is preferred. RFC23 currently chooses 2), as it mandates <br>that "Any driver which knows it's encoding should convert to UTF-8."
Well, <br>probably not a big deal, since any change related to how we deal with <br>encoding is likely to cause RFC23 to be amended anyway.<br></div></blockquote><blockquote type="cite"><div><font class="Apple-style-span" color="#000000"><br></font>Personally, I'm not sure which one is best. I'm wondering what the <br>use cases for 1) are: when do we really want the data to be returned in its <br>source encoding --> won't it be converted later to UTF-8 at the <br>application level, after the user has potentially selected/overridden the <br>source encoding? In that case, 3) would solve the problem. Just thinking out <br>loud...<br></div></blockquote><div><br></div><div>I think 3 could well solve the problem here. The only downside I see is that it might make implementing auto-detection difficult if someone were to attempt it in a manner that not everyone agreed on. However, the cases I list above would be resolved by #3.</div><br><blockquote type="cite"><div><br>For the write part, an OGRSFDriver::GetSupportedEncodings() and <br>OGRLayer::SetEncoding() could make sense (for the latter, whether it should be <br>exposed at the datasource or layer level is an open point, and a slight <br>difference between your approach and Gaige's).<font class="Apple-style-span" color="#000000"><font class="Apple-style-span" color="#144FAE"><br></font></font></div></blockquote><br></div><div>Is there a need for a per-layer approach to this? I've yet to see a format that allows different encodings in different layers. Although, thinking about it, this might be a problem with some of the virtual data sets, since they hide some of these details.</div><div><br></div><div>-Gaige</div><div><br></div><div><br></div></div></body></html>