[Gdal-dev] RFC DRAFT: Unicode support in GDAL

Mateusz Loskot mateusz at loskot.net
Thu Sep 21 16:41:50 EDT 2006


Frank Warmerdam wrote:
> I persume ArcGIS is using some custom flag to keep track of this. If
> we can figure out what they did, we could also honour it.

OK, I think I'm close to knowing how ArcGIS stores codepage
information in a Shapefile.

AFAIK so far, there are two variants:

1. Language driver ID (LDID) stored in the header of the DBF file.
It's a single byte at offset 29 (counting from 0).


2. There is an associated file with the same base name as the other
Shapefile files, but with a .CPG extension, e.g. countries.shp and
countries.cpg
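
To make the two variants concrete, here is a minimal Python sketch of
how both indicators could be read. The function names (read_ldid,
read_cpg) are mine, not from any library, and I assume the .cpg file
sits next to the .shp with a lowercase extension and holds a short
ASCII identifier. Treat it as an illustration, not as how ArcGIS or
Shapelib actually does it.

```python
from pathlib import Path

# Zero-based offset of the language driver ID in the dBASE header.
LDID_OFFSET = 29


def read_ldid(dbf_path):
    """Return the LDID byte as an int, or None if the header is too short."""
    with open(dbf_path, "rb") as f:
        header = f.read(32)  # the fixed part of the dBASE header is 32 bytes
    if len(header) <= LDID_OFFSET:
        return None
    return header[LDID_OFFSET]


def read_cpg(shp_path):
    """Return the stripped content of the sibling .cpg file, or None.

    Assumption: the file is named <basename>.cpg (lowercase) and
    contains a single short ASCII identifier.
    """
    cpg_path = Path(shp_path).with_suffix(".cpg")
    if not cpg_path.exists():
        return None
    text = cpg_path.read_text(encoding="ascii", errors="replace").strip()
    return text or None
```

On a case-sensitive filesystem a real implementation would probably
also have to look for the uppercase .CPG spelling.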

Here I found a sample Shapefile that includes a .cpg file:

http://www.unc.edu/courses/2006spring/geog/070/001/mkjohnso/Lab%209/?C=M;O=A


There are Shapefile.prj, Shapefile.shp, Shapefile.cpg, etc.

The .cpg file simply stores a codepage identifier.

Here is a list of possible (all?) codepage identifiers, plus some
helpful explanation:

http://www.forumsig.org/archive/index.php/t-439.html


How does ArcGIS handle the codepage using the two indicators above?

<http://support.esri.com/index.cfm?fa=knowledgebase.techArticles.articleShow&d=21106>

"When opening a shapefile and dBASE file in ArcGIS Desktop, the Desktop
programs look at the Language Driver ID (LDID) in the header of a dBASE
file, or an associated *.CPG file, which are both used to define the
code page, in order to determine the code page of the file that is read."
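
The quote does not say which indicator wins when both are present, so
the sketch below simply assumes: use the LDID if it is a known
non-zero value, otherwise fall back to the .cpg content, otherwise to
a default. The LDID table is a small illustrative subset taken from
commonly published dBASE tables, and the names (LDID_TO_CODEC,
resolve_codec) are hypothetical. The lookup order is my assumption,
not documented ArcGIS behaviour.

```python
import codecs

# Illustrative subset only; not a complete or authoritative LDID table.
LDID_TO_CODEC = {
    0x01: "cp437",   # DOS USA
    0x02: "cp850",   # DOS multilingual
    0x03: "cp1252",  # Windows ANSI
    0xC8: "cp1250",  # Windows Eastern European
    0xC9: "cp1251",  # Windows Russian
}


def codec_from_cpg(text):
    """Map a .cpg identifier such as '1252' or 'UTF-8' to a Python codec name."""
    text = text.strip()
    candidates = [text]
    if text.isdigit():
        # Assumption: a bare number is a Windows/DOS codepage number.
        candidates.insert(0, "cp" + text)
    for name in candidates:
        try:
            return codecs.lookup(name).name
        except LookupError:
            continue
    return None


def resolve_codec(ldid, cpg_text, default="latin-1"):
    """Pick a codec: known LDID first, then .cpg content, then the default."""
    if ldid in LDID_TO_CODEC:
        return LDID_TO_CODEC[ldid]
    if cpg_text:
        codec = codec_from_cpg(cpg_text)
        if codec is not None:
            return codec
    return default
```

With a codec name in hand, a raw attribute value read from the DBF
could then be converted with something like
raw_bytes.decode(resolve_codec(ldid, cpg_text)).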


I'm not 100% sure, but this information seems to make it possible to
implement codepage support for Shapefile, at least ;-)
So, UTF-8 may be supported as well.

Cheers
-- 
Mateusz Loskot
http://mateusz.loskot.net


