<div dir="ltr"><div>I'm mostly just talking about C#. I did verify that string arrays are UTF-8 for Python and Java, but I don't fully understand how single strings are marshalled in those languages.</div><div><br></div><div>C# is especially problematic because, while the runtime stores strings in UTF-16, the default interop marshaller converts strings to ANSI, which can result in data loss. What I would like, at least for C#, is for every function which returns a string to have its result decoded as UTF-8, and for every function which accepts a string argument to have that argument encoded as UTF-8.</div><div><br></div><div>There should be a typemap for "char *utf8_or_null" which is allowed to be null, encodes inputs as UTF-8, and decodes outputs as UTF-8. I added that in 08d514f25d447c43e0fcd152ae5c453a9b3e2551 for C#, and Python already has "utf8_path_or_none", which is essentially the same thing. Someone needs to add an equivalent typemap for Java, because the existing Java typemap for "utf8_path" doesn't allow nulls. I would really appreciate help with adding the Java equivalent of "utf8_or_null".</div><div><br></div><div>Once the "char *utf8_or_null" typemap exists for all three languages, it can be applied to all string functions in the same way that "char **CSL" is applied to all string array functions: manually, applied before the definition and cleared afterwards.</div><div><br></div><div>> there is the possibility that some
drivers might return strings in an unknown encoding.</div><div><br></div><div>The change I proposed above would not fix the issue of different or unknown encodings, but it would not make the problem worse, and it would not preclude adding functions to get/set raw binary values.</div><div>
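<div><br></div><div>To illustrate the data-loss concern with ANSI marshalling mentioned above, here is a minimal Python sketch. It only simulates the effect: cp1252 is an assumed stand-in for an ANSI code page, and the round_trip helper and the sample name are hypothetical. It is not what the C# marshaller literally executes.</div>
<pre>
```python
# Sketch: compare a lossy "ANSI" round trip with a UTF-8 round trip.
# Assumption: cp1252 stands in for whatever ANSI code page is active.

def round_trip(text, encoding):
    """Encode text as a native layer would receive it, then decode it back."""
    # Characters the code page cannot represent are replaced with "?".
    raw = text.encode(encoding, errors="replace")
    return raw.decode(encoding)

name = "Zürich 東京"  # hypothetical name containing non-ASCII characters

ansi_result = round_trip(name, "cp1252")  # the two CJK characters are lost
utf8_result = round_trip(name, "utf-8")   # every character survives

print(ansi_result)
print(utf8_result)
```
</pre>
<div>With UTF-8 on both sides the string comes back unchanged, whereas the code page conversion silently replaces anything it cannot represent.</div>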
<div><br></div><div>How does that sound? Would you be open to that change? I think it would be a great improvement on the current paradigm. I volunteer to make all the changes to %apply and %clear the typemaps; I just need help from someone to write a Java equivalent of
"char *utf8_or_null".</div>
<br></div><br><div><br></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Wed, Mar 25, 2026 at 1:11 PM Even Rouault <<a href="mailto:even.rouault@spatialys.com">even.rouault@spatialys.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>
<div>
<p>Hi Michael,</p>
<p>I assume you're talking about the C# or Java bindings. In the
case of the Python bindings, given the dynamic typing of the
language, the typemap code tries to convert to UTF-8 when possible
or returns a bytes object if not, and it also tolerates receiving
either a Unicode string or a bytes object as input.</p>
<p>The issue is that even if exchanges in the GDAL API are
*nominally* supposed to be in UTF-8, there is the possibility that some
drivers might return strings in an unknown encoding. That could be
CSV for example, or shapefiles or MapInfo files whose declared
encoding is not understood by GDAL. I believe there is a ticket
about the possibility of creating 2 variants of the SWIG methods
for which that could occur: one with UTF-8, one with a binary
type. Actually that might be this PR
<a href="https://github.com/OSGeo/gdal/pull/3825" target="_blank">https://github.com/OSGeo/gdal/pull/3825</a>, which went stale.</p>
<p>Even</p>
<div>On 25/03/2026 at 20:00, Michael via
gdal-dev wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Every function which returns char** has the "char **CSL"
typemap applied, which causes strings in the returned array to
be decoded with UTF-8.</div>
<div><br>
</div>
<div>Every function which accepts a char** parameter has either
the "char **options", "char **dict", or "char
**dictAndCSLDestroy" typemap applied, which causes strings in
the parameter's array to be encoded with UTF-8.</div>
<div><br>
</div>
<div>However, many functions which return a single string value
or accept single strings as arguments do not use UTF-8
encoding. This causes several inconsistencies in the wrapper's
behavior.</div>
<div><br>
</div>
<div>For example, string values taken from string arrays, which
are decoded as UTF-8, are often passed to other functions which do
not encode their arguments as UTF-8.</div>
<div><br>
</div>
<div>Some examples:</div>
<div>- AlgorithmRegistry.GetAlgNames() returns a string array of
algorithm names decoded with UTF-8, but
AlgorithmRegistry.InstantiateAlg(string algName) does not
encode algName with UTF-8.</div>
<div>- Algorithm.GetArgNames() returns a string array of
argument names decoded with UTF-8, but Algorithm.GetArg(string
argName) does not encode argName with UTF-8.</div>
<div>- GeomCoordinatePrecision.GetFormats() returns a string
array of format names decoded with UTF-8, but
GeomCoordinatePrecision.GetFormatSpecificOptions(string
formatName) does not encode formatName with UTF-8.</div>
<div><br>
</div>
<div>Also, some functions which return a string array have
related functions which return a single string value, but the
strings in the array are decoded with UTF-8 while the single
string values are not. For example,
AlgorithmArg.GetAsStringList() returns an array of strings
decoded with UTF-8, but AlgorithmArg.GetAsString() does not
decode its returned string with UTF-8.</div>
<div><br>
</div>
<div>And finally, many other functions which accept or
return strings without UTF-8 encoding probably _should_ use
UTF-8.</div>
<div><br>
</div>
<div>Some examples:</div>
<div>- Any "Get*Name" function or "name" property</div>
<div>- Any "Get*Description" function</div>
<div>- Any "Create*", "Delete*", or "Get*" function which
accepts a "*name" parameter</div>
<div><br>
</div>
<div>Really, are there _any_ strings which _shouldn't_ be
encoded with UTF-8? I can't find a single reason why every
string passed to the wrapper should not be encoded as UTF-8,
and no reason why every string retrieved from the wrapper
should not be decoded with UTF-8.</div>
<div><br>
</div>
<span class="gmail_signature_prefix">-- </span><br>
<div dir="ltr" class="gmail_signature">Michael Bucari</div>
</div>
<br>
<pre>_______________________________________________
gdal-dev mailing list
<a href="mailto:gdal-dev@lists.osgeo.org" target="_blank">gdal-dev@lists.osgeo.org</a>
<a href="https://lists.osgeo.org/mailman/listinfo/gdal-dev" target="_blank">https://lists.osgeo.org/mailman/listinfo/gdal-dev</a>
</pre>
</blockquote>
<pre cols="72">--
<a href="http://www.spatialys.com" target="_blank">http://www.spatialys.com</a>
My software is free, but my time generally not.</pre>
</div>
</blockquote></div><div><br clear="all"></div><br><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature">Michael Bucari</div>