[Gdal-dev] OGR - Character Encodings
Charlie Savage
cfis at interserv.com
Fri Oct 14 12:24:49 EDT 2005
If you try to load the Tiger data for Guam to Postgresql, the
entitynames table will fail. The first line in this file is:
C0604 2000 75O 36163 Hagåtña, GU
Tiger data is in ISO-8859-1 (LATIN1), while a default connection to
Postgresql seems to use UTF8 (at least on my machine, Windows XP). The
result is this error message from Postgresql:
Invalid UNICODE byte sequence detected near byte
The same problem occurs for a lot of the data for Puerto Rico, and one
file in Wisconsin.
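For what it's worth, here is a minimal standalone illustration (not OGR
or libpq code) of why the server rejects the row: in ISO-8859-1 the
characters å and ñ are the single bytes 0xE5 and 0xF1, but in UTF8 those
values are lead bytes of 3- and 4-byte sequences, so the raw LATIN1
string is not valid UTF8.

    /* Stand-alone illustration: "Hagåtña" as ISO-8859-1 bytes.  0xE5 (å)
       and 0xF1 (ñ) are not followed by UTF8 continuation bytes, which is
       exactly what the Postgresql error complains about. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned char hagatna_latin1[] =
            { 'H', 'a', 'g', 0xE5, 't', 0xF1, 'a', '\0' };

        /* Prints mojibake (or nothing useful) on a UTF8 terminal. */
        printf("%s\n", (const char *) hagatna_latin1);
        return 0;
    }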
The solution is both easy and hard. The easy bit is that after
connecting to Postgresql, OGR should call the PQsetClientEncoding
function. In this case:
const char *encoding = "LATIN1";

if (PQsetClientEncoding(hPGConn, encoding) == -1)
{
    CPLError( CE_Failure, CPLE_AppDefined,
              "PQsetClientEncoding failed. Encoding: %s", encoding );
    PQfinish(hPGConn);
    hPGConn = NULL;
    return FALSE;
}
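As a workaround until something like this is in OGR, the same effect can
be had without patching the driver: set the standard libpq
PGCLIENTENCODING environment variable before connecting, or issue a SET
statement right after the connection is made. A sketch of the latter,
assuming hPGConn is the open connection from the snippet above:

    /* Workaround sketch: set the client encoding with a plain SQL command
       instead of patching OGR to call PQsetClientEncoding. */
    PGresult *hResult = PQexec(hPGConn, "SET client_encoding TO 'LATIN1'");

    if (hResult == NULL || PQresultStatus(hResult) != PGRES_COMMAND_OK)
    {
        CPLError( CE_Failure, CPLE_AppDefined,
                  "SET client_encoding failed: %s",
                  PQerrorMessage(hPGConn) );
    }
    if (hResult != NULL)
        PQclear(hResult);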
The hard bit is knowing which character encoding to specify. The best
solution would be for OGR to report the encoding of each data source it
opens. Unfortunately, as far as I can see, OGR has no support for
different character encodings (either telling you what they are, or
working with multi-byte encodings).
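One possible shape for the Postgresql side, just as a sketch: let the
caller pick the encoding through a CPL configuration option instead of
hard-coding LATIN1. The option name "PGCLIENTENCODING" below is my own
invention, not an existing OGR option.

    /* Sketch only: read the desired client encoding from a CPL
       configuration option; leave the connection untouched if the
       option is not set. */
    const char *pszEncoding = CPLGetConfigOption( "PGCLIENTENCODING", NULL );

    if( pszEncoding != NULL
        && PQsetClientEncoding( hPGConn, pszEncoding ) == -1 )
    {
        CPLError( CE_Failure, CPLE_AppDefined,
                  "PQsetClientEncoding(%s) failed.", pszEncoding );
        PQfinish( hPGConn );
        hPGConn = NULL;
        return FALSE;
    }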
This leads to several points. First, is the assumed encoding always
ISO-8859-1? If so, the Postgresql call above is correct.
Second, what happens when you want to load maps for Asian countries? Is
that a no-go at the moment?
Third, if OGR does not currently support encodings, are there any plans
to add this functionality?
Thanks,
Charlie