[gdal-dev] RFC 30: Unicode Filenames - call for discussion

Ari Jolma ari.jolma at gmail.com
Wed Sep 15 14:37:18 EDT 2010


On 09/15/2010 02:23 PM, Mark Overmeer wrote:
> * Ari Jolma (ari.jolma at gmail.com) [100915 10:49]:
>    
>> On 09/15/2010 06:22 AM, Frank Warmerdam wrote:
>>      
>>> A client has asked me to support unicode filenames on windows.  To that
>>> end I have constructed an RFC for migration to treating all filenames in
>>> the GDAL API as utf-8.
>>>
>>>   http://trac.osgeo.org/gdal/wiki/rfc30_utf8_filenames
>>>
>>> I'd appreciate review and comment.  If all is well I hope to call
>>> for a vote on this RFC late this week.
>>>        
>> My observation is that I can open data sources with non-ascii
>> filenames in Linux but not in Windows using the Perl bindings.
>> There's a bug in the Perl bindings to tell Perl that those same
>> filenames when read back from GDAL are (I guess) utf-8.
>>      
> In UNIX/Linux, the charset used for filenames can differ per filesystem.
> This means, in practice, that the charset is undefined; sometimes you
> can find the charset for a filesystem in /etc/fstab, sometimes only in the
> filesystem documentation. There is no system call which can tell you that.
> It would be a nice addition to statfs().
> What if you move a file between filesystems with different encodings?
>
>    
>> Does Windows use utf-8 for filenames? If so, then fixing the back to
>> utf-8 bug, would also work for windows, I guess.
>>      
> WINDOWS uses UTF16 with a subset of Unicode (all chars are two
> bytes). See http://en.wikipedia.org/wiki/NTFS
>
> Perl treats filenames as a sequence of bytes, where the characters [/\:]
> have a special meaning. You cannot convert filenames safely into utf8,
> because they may already be in utf8 (or something other than latin1).
>    

The idea of this RFC, as I understand it, is to build a layer into GDAL
that transparently converts between utf-8 and utf-16 on the Windows end,
so that Windows behaves like the current case of a utf-8 filesystem in
unix. Everything should keep working as it does now, but to be on the
safe side I'll add an explicit encode step (to utf8 by default) in the
Perl bindings.
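
As a rough illustration of the kind of layer I mean (a sketch only, not
GDAL's actual code; the function name to_native_filename and its windows
flag are made up for this example), in Python the conversion could look
like this:

```python
import sys


def to_native_filename(utf8_name: bytes,
                       windows: bool = (sys.platform == "win32")) -> bytes:
    """Hypothetical helper: map a UTF-8 filename to what the OS API expects."""
    # Decoding validates the input; invalid UTF-8 raises UnicodeDecodeError.
    text = utf8_name.decode("utf-8")
    if windows:
        # Assumption: the wide-character Windows calls take UTF-16
        # little-endian code units.
        return text.encode("utf-16-le")
    # On a utf-8 unix filesystem the bytes pass through unchanged.
    return utf8_name
```

The caller always hands over utf-8, and only the Windows branch changes
the representation.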

In the case of unix with a non-utf8 filesystem, determining the filename
encoding is left to the user. The encoding is utf8 by default but can
be changed.
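
A minimal sketch of that default-with-override idea, again in Python and
again with hypothetical names (filename_to_utf8 and fs_encoding are not
part of GDAL or the Perl bindings):

```python
def filename_to_utf8(raw: bytes, fs_encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: re-encode a filename read from disk as UTF-8.

    fs_encoding is whatever the user says the filesystem uses; it is
    assumed to be utf-8 unless the user overrides it.
    """
    # Decode with the user-supplied encoding, then normalise to UTF-8.
    return raw.decode(fs_encoding).encode("utf-8")


# A latin-1 filesystem stores "ä" as the single byte 0xE4.
latin1_name = b"\xe4.shp"
utf8_name = filename_to_utf8(latin1_name, fs_encoding="latin-1")
```

If the user names the wrong encoding the result is silently wrong, which
is why the choice has to stay with the user rather than being guessed.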

Ari


