[Shapelib] Re: shapelib improvements

Wed Dec 5 10:04:33 PST 2007

Hi all,

Because I'm new here, I'll introduce myself first.  My name is Bram de
Greve and I'm doing some work on pyshapelib, the Python bindings of
shapelib.  This work is mainly motivated by Thuban
(http://thuban.intevation.org/), an open source interactive geographic
data viewer, of which pyshapelib is a part.  The improvements in
pyshapelib are focused on Unicode support, and to do that, some changes in
the shapelib code itself were necessary.  That's why I contacted Frank
to see if these developments on shapelib could be shared, to keep true
to the "one" shapelib implementation, so to say =)  Below, you can see
some of the communication I had with Frank, and by sending my reply
to this list, we are asking for feedback from the list =)

Warning: this is quite a long post =)

Regards,
Bram

Frank Warmerdam wrote:
> Bram de Greve wrote:
>> (1) add wide character support for windows:
>>
>> To get full unicode support for filenames on windows, we need to use the
>> wide character file APIs, like _wfopen instead of fopen.  This will
>> require additional functions next to SHPOpen and the like.  On windows
>> platforms, you would also have SHPOpenW.  Because SHPOpen and SHPOpenW
>> would contain the same code once the files are opened, this duplicate
>> code would be moved to a version of SHPOpen that takes open file
>> handles.  I don't know what a good name would be: SHPOpenHelper,
>> SHPOpenEx, SHPOpenX?
>
> Bram,
>
> I'm not the greatest fan of windows wide character functions, which has
> led me to put off responding.  Also, I've yet to find any credible
> explanation for why it is necessary.  Are there really file names I
> can't open with fopen()?
>
> But the other issue is that I have a pressing client need to support
> shapefiles (well more likely DBF files) larger than 2GB.  And on windows
> this means switching from fopen() to the win32 API.  I'm contemplating
> whether it is time for IO to be "hookable" in Shapelib where the caller
> can provide their own read/write functions.
>
> If I do that, it might take the form of SHPOpenLL() (LL = Low Level) which
> takes in a "file handle" (a void *) and pointers to read, write and seek
> functions.  SHPOpen() would just use fopen() and fread(), fwrite() and
> fseek().  But someone who wanted wide filenames could open the file
> themselves and call SHPOpenLL().  Or we could even provide a SHPOpenW().
>
> Does this approach make sense?
>
> Actually, on further consideration this does not take into account the
> need to manipulate filenames to derive the .shp, .shx, .dbf, .cpg.  Grr.
> I'm sort of back at square one, not clear on how to address wide
> character without undue complication.

Frank,

I'm not really a fan either, but I'm thinking of filenames with Japanese
characters, or symbols like pi, phi, ...  Granted, it's not something
*I* would be using (and I assume neither would you), but some people might beg
to differ.  So, the issue at hand is filenames with unicode symbols. 
The first thing I tried was to encode them into streams of narrow
characters (MBCS), but that failed to "cover all the unicode characters"
somehow.  So, then I decided to use the wide character functions instead. 

This was however quite a long time ago, and now I wonder.  See, I tried
that encoding using Python routines (we're talking about pyshapelib
here), since I was doing that successfully on Linux.  Python
somehow "knows" what encoding to use for different filesystems, like
UTF-8 on Apple systems, and whatever CODESET the user has set on Linux.
On Windows, Python says to use the MBCS codec, which is described as
"Windows only: Encode operand according to the ANSI codepage (CP_ACP)".
That means that depending on the code page the machine is using, it may
or may not support the Unicode characters in question.  So, to really
access those files with Unicode filenames, you need to use _wfopen.
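
For illustration only, here is a minimal sketch of what a wide-character
open helper could look like.  The name open_wide_sketch is hypothetical and
not part of shapelib.  _wfopen is the Microsoft CRT call mentioned above;
the non-Windows branch is an assumption added so the sketch compiles
elsewhere.  It falls back to wcstombs, which converts using the current C
locale and can therefore fail for characters that locale cannot represent,
much like the MBCS limitation described above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: open a file given a wide-character path and mode. */
FILE *open_wide_sketch(const wchar_t *path, const wchar_t *mode)
{
    assert(path != NULL && mode != NULL);
#ifdef _WIN32
    /* Windows: the wide API takes the name as-is, whatever the ANSI code page. */
    return _wfopen(path, mode);
#else
    {
        /* Elsewhere (assumption): convert to the locale's multibyte encoding. */
        char narrow_path[4096];
        char narrow_mode[16];
        size_t n;

        n = wcstombs(narrow_path, path, sizeof narrow_path);
        if (n == (size_t)-1 || n >= sizeof narrow_path)
            return NULL;  /* not representable in this locale, or too long */

        n = wcstombs(narrow_mode, mode, sizeof narrow_mode);
        if (n == (size_t)-1 || n >= sizeof narrow_mode)
            return NULL;

        return fopen(narrow_path, narrow_mode);
    }
#endif
}
```

A caller would use it like fopen, for example
open_wide_sketch(L"roads.shp", L"rb"), and check the result for NULL.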

Anyway, if SHPOpenLL would be implemented to allow the 2GB+ files, we
can easily provide SHPOpenW and shield it off with "windows only"
macros.  However, I'm not sure what SHPOpenLL would look like.  I assume
you're talking about things like CreateFile and CreateFileMapping?  If
so, you can no longer rely on fread and fseek and things like that.  So,
I'm not sure what a shared implementation would look like.  Helper
functions that call fread or ReadFile depending on some magical parameter?
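
One way to picture the "hookable" IO idea is a table of function pointers
that the library calls instead of fread and fseek directly.  The sketch
below is only an assumption about how that might look: IOHooksSketch,
MemFileSketch and read_be32_sketch are hypothetical names, and the write
hook is left out for brevity.  It shows a stdio backend and an in-memory
backend behind the same interface; a Win32 backend built on ReadFile would
be a third table of the same shape.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical hook table; the handle is opaque to the library code. */
typedef struct {
    size_t (*read)(void *buf, size_t size, size_t count, void *handle);
    int (*seek)(void *handle, long long offset, int whence);
} IOHooksSketch;

/* Backend 1: plain stdio.  fseek takes a long, so this one stays
 * limited to 2GB wherever long is 32 bits. */
static size_t stdio_read(void *buf, size_t size, size_t count, void *handle)
{
    return fread(buf, size, count, (FILE *)handle);
}

static int stdio_seek(void *handle, long long offset, int whence)
{
    return fseek((FILE *)handle, (long)offset, whence);
}

const IOHooksSketch STDIO_HOOKS = { stdio_read, stdio_seek };

/* Backend 2: a read-only memory buffer, standing in for any custom IO. */
typedef struct {
    const unsigned char *data;
    size_t size;
    size_t pos;
} MemFileSketch;

static size_t mem_read(void *buf, size_t size, size_t count, void *handle)
{
    MemFileSketch *m = (MemFileSketch *)handle;
    size_t items = (size == 0) ? 0 : (m->size - m->pos) / size;
    if (items > count)
        items = count;
    memcpy(buf, m->data + m->pos, items * size);
    m->pos += items * size;
    return items;  /* number of complete items read, like fread */
}

static int mem_seek(void *handle, long long offset, int whence)
{
    MemFileSketch *m = (MemFileSketch *)handle;
    /* Only SEEK_SET is supported in this sketch. */
    if (whence != SEEK_SET || offset < 0 || (unsigned long long)offset > m->size)
        return -1;
    m->pos = (size_t)offset;
    return 0;
}

const IOHooksSketch MEM_HOOKS = { mem_read, mem_seek };

/* "Library" code: reads a big-endian 32-bit value through the hooks only.
 * Returns 1 on success, 0 on failure. */
int read_be32_sketch(const IOHooksSketch *io, void *handle,
                     long long offset, unsigned long *value)
{
    unsigned char b[4];
    assert(io != NULL && value != NULL);
    if (io->seek(handle, offset, SEEK_SET) != 0)
        return 0;
    if (io->read(b, 1, 4, handle) != 4)
        return 0;
    *value = ((unsigned long)b[0] << 24) | ((unsigned long)b[1] << 16) |
             ((unsigned long)b[2] << 8) | (unsigned long)b[3];
    return 1;
}
```

With this shape there is no "magical parameter": the shared code never
knows which backend it is talking to, it just calls through the table it
was handed.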

> Actually, on further consideration this does not take into account the
> need to manipulate filenames to derive the .shp, .shx, .dbf, .cpg.  Grr.
> I'm sort of back at square one, not clear on how to address wide
> character without undue complication. 

Yes, here I had some code duplication.  The filename manipulations would
happen in SHPOpen and SHPOpenW, and both would call SHPOpenLL with two open
file handles.  I don't think there's a nice way around it.
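
To show the kind of filename manipulation in question, here is a minimal
sketch of deriving a sibling filename (.shx, .dbf, .cpg) from a given path.
replace_extension_sketch is a hypothetical name, not shapelib code.  It
works on narrow strings only; a wide-character variant would repeat the
same logic with wchar_t, which is exactly the duplication mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `path` into `out` with its extension replaced
 * by `new_ext` (which includes the dot).  If `path` has no extension,
 * `new_ext` is appended.  Returns 0 on success, -1 if `out` is too small. */
int replace_extension_sketch(const char *path, const char *new_ext,
                             char *out, size_t out_size)
{
    size_t len, base_len, i;

    assert(path != NULL && new_ext != NULL && out != NULL);

    len = strlen(path);
    base_len = len;

    /* Scan backwards for the last '.', but stop at a path separator so
     * that a dot in a directory name is not mistaken for an extension. */
    for (i = len; i > 0; --i) {
        char c = path[i - 1];
        if (c == '/' || c == '\\')
            break;
        if (c == '.') {
            base_len = i - 1;
            break;
        }
    }

    if (base_len + strlen(new_ext) + 1 > out_size)
        return -1;

    memcpy(out, path, base_len);
    strcpy(out + base_len, new_ext);
    return 0;
}
```

For example, calling it with "roads.shp" and ".shx" fills the buffer with
"roads.shx".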

>> (2) Language drivers and code pages DBF:
>>
>> (2.a) DBF Files have a language driver id (LDID) indicating the codec
>> used to store the content.  This is easy, we just need to read the field
>> (an integer) and add it to the struct, possibly also adding an accessor
>> function?
>
> Sounds reasonable.
>
>> (2.b) Shapefiles created by ESRI ArcGIS are sometimes accompanied by a
>> .CPG (codepage) file, indicating the codec when the LDID is incapable
>> of identifying it.  This is the only way to support unicode through the
>> UTF-8 codec.  This requires trying to read an additional .CPG file when
>> opening a DBF.  The code page string would also be added to the
>> struct, possibly with an accessor as well.
>
> Sounds reasonable.
>
> Is there any way to unify the LDID and CPG handling as far as the
> application is aware?  Instead of application developers having to
> be aware of the distinction?

OK, bear in mind however that supporting the .CPG files requires similar
filename manipulation as for the shapefiles (.SHP and .SHX), so again we
might have some code duplication here ;)

Any unified system would have to rely on strings I guess, as the CPG is
string based.  So we might embed a table that converts the LDID integers
to names, and always return a string: if the CPG is set, return that
one, otherwise return the name of the LDID.
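
A minimal sketch of that lookup, under stated assumptions: the function and
type names are hypothetical, the table holds only three commonly documented
LDID values and is far from complete, and the "CP..." spelling of the names
is just one possible choice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry mapping an LDID byte to a code page name. */
typedef struct {
    int ldid;
    const char *name;
} LdidEntrySketch;

/* Illustrative and incomplete; the naming scheme is an assumption. */
static const LdidEntrySketch LDID_TABLE_SKETCH[] = {
    { 0x01, "CP437" },
    { 0x02, "CP850" },
    { 0x03, "CP1252" },
};

/* Returns the .CPG string if one was read, otherwise the name for the
 * LDID, or NULL if the LDID is not in the table. */
const char *code_page_name_sketch(int ldid, const char *cpg)
{
    size_t i;

    assert(ldid >= 0 && ldid <= 0xFF);  /* the LDID is a single byte */

    if (cpg != NULL && cpg[0] != '\0')
        return cpg;

    for (i = 0; i < sizeof LDID_TABLE_SKETCH / sizeof LDID_TABLE_SKETCH[0]; ++i) {
        if (LDID_TABLE_SKETCH[i].ldid == ldid)
            return LDID_TABLE_SKETCH[i].name;
    }
    return NULL;
}
```

The application then only ever sees one string and does not need to know
whether it came from the header or from the .CPG file.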

>
>> (2.c) When creating a DBF, do we add two parameters to specify the LDID
>> or CodePage?
>>
>> int DBFCreate(const char* pszDBFFile, int iLDID = 0x03, const char*
>> pszCodePage = 0);
>>
>> (I'm assuming here default values are possible in C, I have more of a
>> C++ background =)
>
> The Shapelib API is C callable, so there is no such thing as default
> parameters.  I imagine we will want a DBFCreateEx() function with the
> code page parameters.  As mentioned, if practical, I'd like to unify
> the LDID / code page stuff for the application.
>

Passing unified code page names _towards_ shapelib might be tricky, as
several code pages go by many names.  A naive way to do it would be to
first check the name against the LDID table.  If there's a hit, use the
LDID to encode the code page.  If the code page is not recognized, use
the .CPG file.  However, if you would use a synonym of an LDID codepage,
it would not be recognized as such and thus be written to the .CPG file
instead of using the LDID.  Also, not all strings written to the .CPG
file will make sense.  I guess ArcGIS (if we want to use that as a
reference) will only recognize a few of them.  So, it might be
interesting to limit the possible code page names and return an error if
it is not recognized.
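
The restricted variant could be sketched as follows.  Everything here is an
assumption for illustration: the names are hypothetical, the LDID table is
the same tiny illustrative subset, and "UTF-8" is taken as the only
accepted .CPG string.  Anything else, including a synonym such as
"windows-1252", is rejected instead of being written to a .CPG file.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Where a requested code page name would end up being stored. */
typedef enum { TARGET_LDID, TARGET_CPG, TARGET_ERROR } TargetSketch;

typedef struct {
    const char *name;
    int ldid;
} NameToLdidSketch;

/* Illustrative and incomplete whitelists (assumptions). */
static const NameToLdidSketch KNOWN_LDID_NAMES[] = {
    { "CP437", 0x01 },
    { "CP850", 0x02 },
    { "CP1252", 0x03 },
};
static const char *const KNOWN_CPG_NAMES[] = { "UTF-8" };

/* ASCII case-insensitive string equality. */
static int equal_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        ++a;
        ++b;
    }
    return *a == *b;  /* equal only if both strings ended together */
}

/* Decide how `name` would be stored.  On TARGET_LDID, *ldid_out is set. */
TargetSketch resolve_code_page_sketch(const char *name, int *ldid_out)
{
    size_t i;

    assert(name != NULL && ldid_out != NULL);

    /* First choice: a name the LDID byte can express. */
    for (i = 0; i < sizeof KNOWN_LDID_NAMES / sizeof KNOWN_LDID_NAMES[0]; ++i) {
        if (equal_ignore_case(name, KNOWN_LDID_NAMES[i].name)) {
            *ldid_out = KNOWN_LDID_NAMES[i].ldid;
            return TARGET_LDID;
        }
    }

    /* Second choice: a whitelisted name for the .CPG file. */
    for (i = 0; i < sizeof KNOWN_CPG_NAMES / sizeof KNOWN_CPG_NAMES[0]; ++i) {
        if (equal_ignore_case(name, KNOWN_CPG_NAMES[i]))
            return TARGET_CPG;
    }

    /* Unknown names and synonyms are rejected. */
    return TARGET_ERROR;
}
```

Supporting synonyms would then be a matter of adding alias rows to the
tables rather than changing the lookup logic.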

>> This would be the only opportunity to set/change both parameters.  They
>> would be "read only" for the lifetime of the DBF itself.  If we don't
>> need to be able to alter both parameters, we don't need to keep a handle
>> to the code page file around, or even the filename itself (suppose we
>> create a dbf without a codepage first (no .cpg file), and want to set
>> the code page later on.  Then we would need to be able to create that
>> .cpg at the later moment, and thus we would need to know the filename).
>>
>>
>>
>> Of both issues, I need at least (1) and (2a) implemented to be able to
>> have full Unicode support in pyshapelib (forgive me for focusing on
>> pyshapelib, but that's my point of view here =)  Both are already
>> implemented in the shapelib branch inside the thuban source tree.
>> (2b) and (2c) I can work around by adding specific code to pyshapelib
>> only, which is not yet implemented.
>>
>>
>> If it is OK with you, I'll submit this plan to the shapelib mailing
>> list, and we'll see from there ...
>
> Feedback from the list could be valuable.
>
> If you are ok with focusing on the codepage stuff first, perhaps I can
> come up with a plan for IO (and perhaps error reporting) hook functions.
>
That's OK with me.

Bram