[gdal-dev] Wrapper string encodings are inconsistent
Michael
mbucari1 at gmail.com
Wed Mar 25 12:44:42 PDT 2026
I'm mostly just talking about C#. I did verify that string arrays are UTF-8
for Python and Java, but I don't fully understand how single strings are
being marshalled in those languages.
C# is especially problematic because, while the runtime stores strings in
UTF-16, the default interop marshaller converts strings to ANSI, which
can result in data loss. What I would like, at least for C#, is for every
function which returns a string to be decoded with UTF-8, and for every
function which accepts a string argument to be encoded with UTF-8.
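To make the data-loss point concrete, here is a rough sketch in Python rather than C#. It assumes cp1252 as a stand-in for "the ANSI code page" (the real one depends on the system locale), and it uses Python's errors="replace" to approximate what a lossy conversion does. It is not the .NET marshaller itself, and the function names are made up for illustration.

```python
# Illustration only: cp1252 is an assumed stand-in for the ANSI code page.
ANSI_CODE_PAGE = "cp1252"


def ansi_round_trip(text: str) -> str:
    """Encode to the stand-in ANSI code page and decode back.

    Characters outside the code page are replaced with '?', so the
    original string cannot be recovered.
    """
    raw = text.encode(ANSI_CODE_PAGE, errors="replace")
    return raw.decode(ANSI_CODE_PAGE)


def utf8_round_trip(text: str) -> str:
    """Encode to UTF-8 and decode back; this is lossless for any str."""
    raw = text.encode("utf-8")
    return raw.decode("utf-8")


name = "Zürich 東京"
lossy = ansi_round_trip(name)   # the two CJK characters are replaced
intact = utf8_round_trip(name)  # identical to the input

print(lossy == name)
print(intact == name)
```

The "ü" survives because cp1252 happens to contain it; the CJK characters do not, which is the kind of silent corruption I want to avoid.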
There should be a typemap for "char *utf8_or_null" which is allowed to be
null, encodes inputs as UTF-8 and decodes outputs as UTF-8. I added that
in 08d514f25d447c43e0fcd152ae5c453a9b3e2551 for C#, and Python already has
"utf8_path_or_none" which is essentially the same thing. Someone needs to
add an equivalent typemap for Java, because the existing Java typemap for
"utf8_path" doesn't allow nulls. I would really appreciate help with adding
the Java equivalent of "utf8_or_null".
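The behaviour I have in mind for "utf8_or_null" can be sketched in plain Python. This is only a model of the semantics, not SWIG typemap code, and the helper names encode_utf8_or_none and decode_utf8_or_none are hypothetical. None plays the role of a NULL pointer.

```python
from typing import Optional


def encode_utf8_or_none(value: Optional[str]) -> Optional[bytes]:
    """Input direction: None passes through as NULL, str becomes UTF-8."""
    if value is None:
        return None
    return value.encode("utf-8")


def decode_utf8_or_none(raw: Optional[bytes]) -> Optional[str]:
    """Output direction: NULL comes back as None, bytes decode as UTF-8."""
    if raw is None:
        return None
    return raw.decode("utf-8")
```

The only difference from a typemap that rejects nulls is the early return for None in each direction.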
Once the "char *utf8_or_null" typemap exists for all three languages, it
can be applied to all string functions in the same way that "char **CSL" is
applied to all string array functions: manually, applied before the
definition and cleared afterwards.
> there is the possibility that some drivers might return strings in an
> unknown encoding.
My above-proposed change would not fix the issue of different/unknown
encodings, but it wouldn't make the problem worse, and it would not
preclude addition of functions to get/set raw binary values.
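For comparison, the tolerant Python behaviour Even describes in his message below (decode as UTF-8 when possible, otherwise hand back bytes, and accept either str or bytes as input) can be sketched roughly like this. It is my reading of his description, not the actual GDAL typemap code, and the helper names are hypothetical.

```python
from typing import Union


def decode_utf8_or_bytes(raw: bytes) -> Union[str, bytes]:
    """Return a str when raw is valid UTF-8, otherwise the raw bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Unknown or mislabelled encoding: give the caller the raw data.
        return raw


def to_bytes(value: Union[str, bytes]) -> bytes:
    """Accept either a str (encoded as UTF-8) or bytes (passed through)."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return value
```

A statically typed binding cannot return "str or bytes" from one method this way, which is why separate raw-binary accessors would be needed in C# and Java.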
How does that sound? Would you be open to that change? I think it would be
a great improvement on the current paradigm. I volunteer to do all the
changes to %apply and %clear the typemaps; I just need help from someone to
make a Java equivalent of "char *utf8_or_null".
On Wed, Mar 25, 2026 at 1:11 PM Even Rouault <even.rouault at spatialys.com>
wrote:
> Hi Michael,
>
> I assume you're talking about the C# or Java bindings. In the case of the
> Python bindings, given the dynamic typing of the language, the typemap code
> tries to convert to UTF-8 when possible or returns bytes if not, and it
> also tolerates receiving either a Unicode string or bytes as input.
>
> The issue is that even if *nominally* exchanges in the GDAL API are
> supposed to be in UTF-8, there is the possibility that some drivers might
> return strings in an unknown encoding. That could be CSV for example, or
> shapefiles or MapInfo files whose declared encoding is not understood by
> GDAL. I believe there is a ticket about the possibility of creating 2
> variants of the SWIG methods for which that could occur: one with UTF-8,
> one with a binary type. Actually that might be this PR
> https://github.com/OSGeo/gdal/pull/3825 that got stale.
>
> Even
> On 25/03/2026 at 20:00, Michael via gdal-dev wrote:
>
> Every function which returns char** has the "char **CSL" typemap applied,
> which causes strings in the returned array to be decoded with UTF-8.
>
> Every function which accepts a char** parameter has either the "char
> **options", "char **dict", or "char **dictAndCSLDestroy" typemap applied,
> which causes strings in the parameter's array to be encoded with UTF-8.
>
> However, many functions which return a single string value or accept
> single strings as arguments do not use UTF-8 encoding. This causes several
> inconsistencies in the wrapper's behavior.
>
> For example, string values taken from string arrays, which are UTF-8, are
> often passed to other functions which are not UTF-8.
>
> Some examples:
> - AlgorithmRegistry.GetAlgNames() returns a string array of algorithm
> names decoded with UTF-8, but AlgorithmRegistry.InstantiateAlg(string
> algName) does not encode algName with UTF-8.
> - Algorithm.GetArgNames() returns a string array of argument names decoded
> with UTF-8, but Algorithm.GetArg(string argName) does not encode argName
> with UTF-8.
> - GeomCoordinatePrecision.GetFormats() returns a string array of format
> names decoded with UTF-8, but
> GeomCoordinatePrecision.GetFormatSpecificOptions(string formatName) does
> not encode formatName with UTF-8.
>
> Also, some functions which return a string array have related functions
> which return a single string value, but the strings in the array are
> decoded with UTF-8 while the single string values are not. For example,
> AlgorithmArg.GetAsStringList() returns an array of strings decoded with
> UTF-8, but AlgorithmArg.GetAsString() does not decode its returned string
> with UTF-8.
>
> And finally, many other string functions which accept or return strings
> not encoded with UTF-8 probably _should be UTF-8_.
>
> Some examples:
> - Any "Get*Name" function or "name" property
> - Any "Get*Description" function
> - Any "Create*", "Delete*", or "Get*" function which accepts a "*name"
> parameter
>
> Really, are there _any_ strings which _shouldn't_ be encoded with UTF-8? I
> can't find a single reason why every string passed to the wrapper should
> not be encoded as UTF-8, and no reason why every string retrieved from the
> wrapper should not be decoded with UTF-8.
>
> --
> Michael Bucari
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev
>
> -- http://www.spatialys.com
> My software is free, but my time generally not.
>
>
--
Michael Bucari