[Gdal-dev] Re: OGR - Character Encodings

Sat Oct 15 13:33:02 EDT 2005

Thinking about this a bit more, I'm uncomfortable hard-coding postgresql 
to LATIN1.  Sure that would work for TIGER data, but what if you open a 
datasource that is in Latin2, or Latin3?

Just to be clear, what is happening is that libpq is doing automatic 
encoding conversions between the client (the ogr library) and the server 
(postgresql).  Since libpq is not told otherwise, on my machine it is 
assuming the client encoding is UTF8.  That works fine for characters 1 
through 127, but fails above that.
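To illustrate why the high characters fail, here is a minimal sketch (my own helper, not part of OGR or libpq) that checks whether a byte string is structurally valid UTF8. A Latin1 byte above 127 that is declared to be UTF8 looks like the start of a multi-byte sequence with the wrong bytes after it, so a server expecting UTF8 cannot interpret it. The name LooksLikeUtf8 is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only (not an OGR or libpq function).
// Returns true if `s` has the byte structure of UTF-8. This simplified check
// does not reject overlong forms or surrogate code points.
bool LooksLikeUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        if (c < 0x80) {
            extra = 0;                       // 0xxxxxxx: plain ASCII
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1;                       // 110xxxxx: 2-byte sequence
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2;                       // 1110xxxx: 3-byte sequence
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3;                       // 11110xxx: 4-byte sequence
        } else {
            return false;                    // stray continuation or invalid lead byte
        }
        if (i + extra >= s.size()) {
            return false;                    // sequence runs past the end of the string
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80) {
                return false;                // continuation bytes must be 10xxxxxx
            }
        }
        i += extra + 1;
    }
    return true;
}
```

With this, a pure ASCII street name passes, while "Caf\xE9" (Latin1 for "Café") does not. The same word encoded as real UTF8 ("Caf\xC3\xA9") passes.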

One solution that we've discussed is for ogr to use UTF8 internally and 
then it's up to each datasource to convert to/from UTF8.  I would think 
that is the better "general" solution but it would be a bit more work. 
It would also require two encoding translations - source data source to 
UTF8, UTF8 to target (although as time goes on probably everything will 
support/use UTF8 so in the end you might not have to do any translations 
at all).
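As a rough idea of what the "source to UTF8" step could look like for a Latin1 source such as TIGER, here is a small sketch. The function name Latin1ToUtf8 is my own invention and not anything in OGR. It relies only on the fact that Latin1 code points 0 through 255 match the first 256 Unicode code points, so each byte above 127 becomes a two-byte UTF8 sequence.

```cpp
#include <string>

// Hypothetical helper for illustration only (not part of OGR).
// Converts ISO-8859-1 (Latin1) bytes to UTF-8.
std::string Latin1ToUtf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char ch : in) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            // ASCII is identical in both encodings.
            out.push_back(static_cast<char>(c));
        } else {
            // Code points 0x80..0xFF need two bytes: 110000xx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

A datasource in Latin2 or Latin3 would need a lookup table instead of this direct mapping, which is part of the extra work mentioned above.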

Perhaps a simpler approach for the time being is simply for OGR to 
inform the destination data source what the encoding of the source data 
source is.  Thus ogr wouldn't do any encoding translations.  For the 
postgresql data source that would work fine since it will take care of 
converting the encoding as described above - it just needs to know what 
the source encoding is.
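For the postgresql side, one way to pass the source encoding along would be to translate it into a PostgreSQL encoding name and issue a "SET client_encoding" statement on the connection. The sketch below only builds that statement. The helper names are hypothetical, the name table is deliberately tiny, and how the statement gets executed is left to the driver.

```cpp
#include <string>

// Hypothetical helper for illustration only: map a few common encoding
// names to the names PostgreSQL uses. Returns an empty string for any
// name this sketch does not know about.
std::string ToPostgresEncoding(const std::string& name) {
    if (name == "ISO-8859-1") return "LATIN1";
    if (name == "ISO-8859-2") return "LATIN2";
    if (name == "ISO-8859-3") return "LATIN3";
    if (name == "UTF-8") return "UTF8";
    return "";
}

// Build the SQL that tells the server what encoding the client is sending.
// Returns an empty string when the encoding is unknown, in which case the
// caller would leave the connection's default alone.
std::string ClientEncodingSql(const std::string& name) {
    const std::string pg = ToPostgresEncoding(name);
    return pg.empty() ? std::string()
                      : "SET client_encoding TO '" + pg + "'";
}
```

For a TIGER source this would yield SET client_encoding TO 'LATIN1', and for a Latin2 source 'LATIN2', so nothing is hard-coded in the postgresql driver.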

One way this could be done is implement a "GetDSEncoding" method on 
OGRDataSource which would return the encoding of a data source.  For 
TIGER, that method would return ISO-8859-1.  For other datasources, it 
could just return NULL or some such thing for "unknown" until otherwise 
implemented.

You would then need to add some sort of method on OGRDataSource like 
"SetDataEncoding" which would tell the datasource what encoding incoming 
data is in.

So something like:

pSrcDS = <open the source data source>
pDstDS = <open the destination data source>

// NULL means the source does not know (or does not report) its encoding
const char* pSourceEncoding = pSrcDS->GetDSEncoding();
if (pSourceEncoding)
	pDstDS->SetDataEncoding(pSourceEncoding);

Then proceed as normal.
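To make the proposal a little more concrete, here is a self-contained sketch using a stand-in class. Nothing here exists in OGR today: SketchDataSource, GetDSEncoding, SetDataEncoding, IncomingEncoding and PropagateEncoding are all hypothetical names, and a real version would live on OGRDataSource and be overridden per driver.

```cpp
#include <string>

// Hypothetical stand-in for OGRDataSource, for illustration only.
class SketchDataSource {
public:
    // `encoding` is the encoding of the data this source holds, or
    // nullptr when the driver does not know it.
    explicit SketchDataSource(const char* encoding = nullptr)
        : encoding_(encoding ? encoding : "") {}
    virtual ~SketchDataSource() = default;

    // Encoding of data read from this source, or nullptr for "unknown".
    virtual const char* GetDSEncoding() const {
        return encoding_.empty() ? nullptr : encoding_.c_str();
    }

    // Tell this data source what encoding incoming data is in.
    virtual void SetDataEncoding(const char* encoding) {
        incoming_ = encoding ? encoding : "";
    }

    // What the source was last told about incoming data (empty if nothing).
    const std::string& IncomingEncoding() const { return incoming_; }

private:
    std::string encoding_;
    std::string incoming_;
};

// Pass the source's encoding to the destination, if the source knows it.
void PropagateEncoding(const SketchDataSource& src, SketchDataSource& dst) {
    if (const char* enc = src.GetDSEncoding()) {
        dst.SetDataEncoding(enc);
    }
}
```

A TIGER source constructed with "ISO-8859-1" would hand that string to the destination, while a source that returns NULL leaves the destination untouched, so existing drivers keep their current behaviour until they implement the method.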

Charlie

Frank Warmerdam wrote:
> On 10/14/05, Charlie Savage <cfis at interserv.com> wrote:
>> This leads to several points.  First, is the assumed encoding always
>> ISO-8859-1?  In that case, the Postgresql call above is correct.
> 
> Charlie,
> 
> As you suspect, OGR is completely encoding-ignorant currently.
> I would encourage you to commit the change setting the encoding to
> LATIN1 for now, as I gather that is a more inclusive character set than
> the default UTF8.
> 
>> Second, what happens when you want to load maps for Asian countries?  Is
>> that a no-go at the moment?
> 
> OGR provides no special support for this.  In cases where double
> byte text has been encountered it is treated as if it were single byte
> which will presumably not work well with Postgres.
> 
>> Third, if OGR does not support encodings, are there any plans to add this
>> functionality?
> 
> There are no plans currently to support encoding-awareness in OGR.
> 
> /me buries his head in the sand for a couple more years...
> 
> Best regards,
> --
> ---------------------------------------+--------------------------------------
> I set the clouds in motion - turn up   | Frank Warmerdam, warmerdam at pobox.com
> light and sound - activate the windows | http://pobox.com/~warmerdam
> and watch the world go round - Rush    | Geospatial Programmer for Rent