[gdal-dev] GDAL/OGR C# wrapper and UTF8

Dennis Gocke dengo at gmx.net
Tue Apr 2 07:41:53 PDT 2013


Thanks for the quick reply.

Let's back up a bit. Just take the following two methods of the Feature class in the OGR wrapper:
public string GetFieldAsString(int id);
public void SetField(int id, string value);

For these two the wrapper does not specify any encoding either, which means that the default ANSI encoding is used when strings are marshaled to and from unmanaged code.

As I understand it, this has been incorrect for some time now. As stated in http://trac.osgeo.org/gdal/wiki/rfc23_ogr_unicode:
"It is declared that OGR string attribute values will be in UTF-8. This means that OGR drivers are responsible for translating format specific representations to UTF-8 when reading, and back to the format specific representation when writing."

I created a workaround for these two methods in the wrapper for myself many months ago, which is why I forgot to mention them in my previous post.

It only makes sense to discuss the other methods if we agree that the wrapper is doing it wrong for these two essential methods.
For instance, take the attribute filter: assume you have a feature with the attribute Name='München'. Because GetFieldAsString uses the wrong encoding, you will get a garbled string such as 'MÃ¼nchen' (the UTF-8 bytes read as ANSI). Using that garbled string in an attribute filter "Name = 'MÃ¼nchen'" will work, because SetAttributeFilter also uses ANSI encoding, so the two errors cancel each other out. But if GetFieldAsString correctly used UTF-8 encoding, it would only make sense for SetAttributeFilter to use UTF-8 as well; otherwise "Name = 'München'" will obviously not work.
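To make the cancelling-out effect concrete, here is a minimal sketch that does not call OGR at all. It only mimics the two marshaling directions with System.Text.Encoding. The class and method names are made up for illustration, and ISO-8859-1 stands in for the ANSI code page (an assumption; for the bytes of 'ü' it behaves the same as Windows-1252).

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Stand-in for the ANSI code page the marshaler would use (assumption).
    static readonly Encoding Ansi = Encoding.GetEncoding("iso-8859-1");

    // Mimics reading: the native side hands over UTF-8 bytes,
    // but they are decoded as if they were ANSI.
    public static string ReadUtf8AsAnsi(string nativeValue)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(nativeValue);
        return Ansi.GetString(utf8);
    }

    // Mimics writing: the managed string is encoded as ANSI,
    // and the native side then interprets the bytes as UTF-8.
    public static string WriteAnsiAsUtf8(string managedValue)
    {
        byte[] ansiBytes = Ansi.GetBytes(managedValue);
        return Encoding.UTF8.GetString(ansiBytes);
    }

    static void Main()
    {
        string garbled = ReadUtf8AsAnsi("M\u00FCnchen");
        // 'ü' (UTF-8 bytes C3 BC) has turned into the two characters U+00C3 U+00BC.
        Console.WriteLine(garbled == "M\u00C3\u00BCnchen");
        // Sending the garbled string back through the ANSI path restores the original.
        Console.WriteLine(WriteAnsiAsUtf8(garbled) == "M\u00FCnchen");
    }
}
```

The round trip only works as long as both directions make the same mistake, which is why fixing one method without the other would break things.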

I somehow hoped or assumed that this 'redesign' mentioned in rfc23 has progressed since then, so that a consistent encoding is also used for the other methods.

And it seems to have, apart from the C# wrapper.

I might be wrong, but even if I am, I still think it would be greatly beneficial if a consistent encoding were also used for field names, attribute filters, SQL statements, etc.

Regarding marshaling strings: Marshal.Copy and explicit pinning are not needed, because managed arrays are pinned automatically while they are marshaled.
So just converting strings to UTF-8 encoded managed byte arrays (byte[]) and passing the array instead of the string will work fine.
One just needs to make sure that the byte array is zero-terminated.
This is what I use:
        // Requires: using System.Text;
        public static byte[] StringToUtf8Bytes(string str)
        {
            if (str == null)
            {
                return null;
            }

            Encoding encoder = Encoding.UTF8;
            // Worst-case size; the unused tail stays zero because new arrays are zero-initialized.
            int nativeLength = encoder.GetMaxByteCount(str.Length);
            byte[] bytes = new byte[nativeLength + 1]; // +1 guarantees a terminating zero byte
            encoder.GetBytes(str, 0, str.Length, bytes, 0);
            return bytes;
        }
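For the opposite direction (what GetFieldAsString would need), something along these lines should work. This is only a sketch: the helper name Utf8PtrToString is hypothetical and not part of the wrapper, and the Main method fakes the native side with Marshal.AllocHGlobal instead of calling OGR.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class Utf8Interop
{
    // Hypothetical helper: reads a zero-terminated UTF-8 string from native memory.
    public static string Utf8PtrToString(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero)
        {
            return null;
        }

        // Find the terminating zero byte to get the length in bytes.
        int len = 0;
        while (Marshal.ReadByte(ptr, len) != 0)
        {
            len++;
        }

        // Copy the bytes into managed memory and decode them as UTF-8.
        byte[] bytes = new byte[len];
        Marshal.Copy(ptr, bytes, 0, len);
        return Encoding.UTF8.GetString(bytes);
    }

    static void Main()
    {
        // Simulate a native buffer holding a zero-terminated UTF-8 string.
        byte[] src = Encoding.UTF8.GetBytes("M\u00FCnchen");
        IntPtr mem = Marshal.AllocHGlobal(src.Length + 1);
        try
        {
            Marshal.Copy(src, 0, mem, src.Length);
            Marshal.WriteByte(mem, src.Length, 0);
            Console.WriteLine(Utf8PtrToString(mem) == "M\u00FCnchen");
        }
        finally
        {
            Marshal.FreeHGlobal(mem);
        }
    }
}
```

Here a copy is unavoidable, since the bytes have to be decoded into a managed string anyway.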
		
As far as examples go: any Shapefile should do. Although the DBF might be in different encodings, OGR internally always converts the field attribute strings to UTF-8, but the C# wrapper then interprets the strings as being in ANSI.
(Funnily enough, when you have an ANSI-encoded DBF and set the config option SHAPE_ENCODING to UTF8, the C# wrapper will 'correct' the mistake made by the config option, which forces the wrong encoding in this case. Before I understood that the problem was actually the C# wrapper, this was driving me crazy.)
		
Best regards,
Dennis

