[gdal-dev] Assert due to stack corruption in FlatGeoBuf export

Tue Feb 20 13:12:23 PST 2024

(gdb) p m_fbb
$5 = (flatbuffers::FlatBufferBuilder &) @0x7fffffff9070: {
  static kFileIdentifierLength = 4,
  buf_ = {allocator_ = 0x0, own_allocator_ = false, initial_size_ = 1024,
    buffer_minalign_ = 8, reserved_ = 0, size_ = 0, buf_ = 0x0, cur_ = 0x0,
    scratch_ = 0x1000000000000 <error: Cannot access memory at address 0x1000000000000>},
  num_field_loc = 8, max_voffset_ = 0, nested = false, finished = false,
  minalign_ = 8, force_defaults_ = false, dedup_vtables_ = true,
  string_pool = 0x0}

On Tue, Feb 20, 2024 at 1:10 PM Simon Eves <simon.eves at heavy.ai> wrote:

> Here's the stack trace for the original assert. Something is stepping on
> scratch_, changing it from null (its value when the flatbuffer object is
> created) to 0x1000000000000, so by the time the code gets to allocating
> memory, it's broken.
>
> On Tue, Feb 20, 2024 at 1:05 PM Simon Eves <simon.eves at heavy.ai> wrote:
>
>> (starting a new thread to avoid derailing the static-build one any
>> further)
>>
>> Totally agreed on the mismatch idea, but the code in question is all
>> self-contained down in *ogr/ogrsf_frmts/flatgeobuf* and the *flatbuffers*
>> sub-project (which is a snapshot of a Google OSS project) so I'm struggling
>> to see how there could be a mismatch.
>>
>> Also, although we're building on CentOS 7, we're using relatively new
>> compilers (GCC 11.4 and Clang 14.0.6), and we bundle the matching newer
>> runtimes.
>>
>> We don't have a full static build stack on our normal dev platform
>> (Ubuntu 22.04) so I haven't been able to repro the problem there.
>>
>> I should have mentioned the first time that we have tried using ASAN, and
>> it definitely catches something wrong, but the behavior is different, and
>> varies if you add more debug printfs. For example:
>>
>> DEBUG: vector_downward::push() num = 16
>> DEBUG: about to reallocate, buf_ = 0, cur_ = 0, scratch = 0
>> DEBUG: reallocated, buf_ = 0x61900062d380, cur_ = 0x61900062cf80, scratch = 0
>> DEBUG: vector_downward::push() ptr = 0x61900062cf70, about to do memcpy
>> =================================================================
>> ==25459==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61900062cf70 at pc 0x7f8933eb87f6 bp 0x7fffa7aa0e70 sp 0x7fffa7aa0620
>> WRITE of size 16 at 0x61900062cf70 thread T0
>>
>> ...but it's still not obvious what exactly is going wrong. The code and
>> data flow make perfect sense when you step through them in a dynamic build
>> that doesn't fail.
>>
>> Like I said, the frustrating part is that a simple test program
>> (attached) compiled against the same set of static libs works fine.
>>
>> S
>>
>> On Tue, Feb 20, 2024 at 12:33 PM Robert Coup <robert.coup at koordinates.com>
>> wrote:
>>
>>> Hi Simon,
>>>
>>> On Tue, 20 Feb 2024 at 18:58, Simon Eves via gdal-dev <
>>> gdal-dev at lists.osgeo.org> wrote:
>>>
>>>> We still have one VERY strange issue whereby FlatGeoBuf export fails in
>>>> a very consistent and reproducible form down in the flatbuffer code, but
>>>> only in the static build, and only in the full system. I have written a
>>>> simple test harness that links the very same static libgdal and does a
>>>> simple GDAL startup and FGB export of a single feature and that works fine.
>>>> It's some kind of data/stack corruption when it first tries to write to the
>>>> flatbuffer on the first feature, which results in a pointer member of the
>>>> buffer class becoming 0x1000000000000 (always) instead of null, and then it
>>>> stops on an assert. There is also one private function in the
>>>> vector_downward class which the debugger won't even step into in that
>>>> build.  I can even put printfs in that function and they don't come out.
>>>> I've tried it on CentOS and on Ubuntu, with GCC and Clang, and it's always
>>>> the same. Everything else in GDAL works just fine (we have LOTS of
>>>> import/export unit tests). This makes zero sense as all the FGB code is
>>>> internal to GDAL and compiled together. I've been poking at it for over a
>>>> week and it's doing my head in.
>>>>
>>>
>>> One cause of this sort of crash is a header/library mismatch somewhere
>>> where a function is expecting different parameters/types than the caller is
>>> actually providing. Otherwise, maybe a bug in glibc/libstdc++/gcc/something
>>> that's been fixed in the intervening ten years since CentOS 7 was released?
>>>
>>>
>>> If you run your *build* on a modern distro/libc/gcc/etc does it change
>>> things? If it's the same, maybe hints more towards the former.
>>>
>>> ASAN (https://github.com/google/sanitizers/wiki/AddressSanitizer) might
>>> help track down stack/heap corruption.
>>>
>>> Rob :)
>>>
>>>