[gdal-dev] Assert due to stack corruption in FlatGeoBuf export

Simon Eves simon.eves at heavy.ai
Fri Feb 23 14:54:28 PST 2024


I was able to create a fork of 3.7.3 with just the *flatbuffers* sub-library
replaced with the pre-3.6.x version (2.0.0).

This seemed to only require changes to the version asserts and adding an
*align* parameter to *Table::VerifyField()* to match the newer API.

https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0
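
As a rough illustration only of the kind of *align* shim described above:
this is not the code in the branch, and the *Verifier* and *Table* types
below are hypothetical stand-ins rather than the real *flatbuffers* classes.
The idea is an extra overload that accepts the argument newer callers pass
and forwards to the older check. Ignoring *align* is an assumption of the
sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins, NOT the real flatbuffers types: just enough
// structure to show the shape of the shim.
namespace sketch {

using voffset_t = std::uint16_t;

struct Verifier {
  std::size_t buffer_size;  // bytes available in the (imaginary) buffer
};

struct Table {
  std::size_t field_offset;  // where the (imaginary) field starts

  // Older-style signature: no alignment argument.
  template <typename T>
  bool VerifyField(const Verifier &verifier, voffset_t /*field*/) const {
    return field_offset + sizeof(T) <= verifier.buffer_size;
  }

  // Added overload: accepts the extra `align` argument that newer callers
  // pass and forwards to the older check.
  // Assumption: ignoring `align` is acceptable for the caller.
  template <typename T>
  bool VerifyField(const Verifier &verifier, voffset_t field,
                   std::size_t /*align*/) const {
    return VerifyField<T>(verifier, field);
  }
};

}  // namespace sketch
```

With this in place, a call such as
*table.VerifyField<std::uint32_t>(verifier, 4, sizeof(std::uint32_t))*
compiles and gives the same answer as the two-argument form.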

Our system works correctly and passes all GDAL I/O tests with that version.
Obviously this isn't an ideal solution, but this is otherwise a release
blocker for us.

I would still very much like to discuss the original problem more deeply,
and hopefully come up with a better solution.

Yours hopefully,

Simon



On Thu, Feb 22, 2024 at 10:22 PM Simon Eves <simon.eves at heavy.ai> wrote:

> Thank you, Robert, for the RR tip. I shall try it.
>
> I have new findings to report, however.
>
> First of all, I confirmed that a build against GDAL 3.4.1 (the version we
> were on before) still works. I also confirmed that builds against 3.7.3 and
> 3.8.4 still fail even with no additional library dependencies (just
> sqlite3 and proj), in case it was a side-effect of us also adding more of
> those. I then tried 3.5.3, with the CMake build (same config as we use for
> 3.7.3) and that worked. I then tried 3.6.4 (again, same CMake config) and
> that failed. These were all from bundles.
>
> I then started delving through the GDAL repo itself. I found the common
> root commit of 3.5.3 and 3.6.4, and all the commits in the
> *ogr/ogrsf_frmts/flatgeobuf* sub-project between that one and the final
> commit of each. For 3.5.3, there were only two. I built and tested both, and they
> were fine. I then tried the very first one that was new in the 3.6.4 chain
> (not in the history of 3.5.3), which was actually a bulk update to the
> *flatbuffers* sub-library, committed by Bjorn Harrtell on May 8 2022 (SHA
> f7d8876). That one had the issue. I then tried the immediately-preceding
> commit (an unrelated docs change) and that one was fine.
>
> My current hypothesis, therefore, is that the *flatbuffers* update
> introduced the issue, or at least the susceptibility to it.
>
> I still cannot explain why it only occurs in an all-static build, and I am
> even less able to explain why it only occurs in our full system and not
> with the simple test program, which links against the very same static lib
> build and makes the very same sequence of GDAL API calls. However, I
> repeated the build tests of the commits either side, and of a few other
> random ones a bit further away in each direction, and the results were
> consistent. Again, it happens with both GCC 11 and Clang 14 builds, Debug
> or Release.
>
> I will continue tomorrow to look at the actual changes to *flatbuffers* in
> that update, although they are quite significant. Certainly, the
> *vector_downward* class, which is directly involved, was a new file in
> that update (although on inspection of that file's history in the
> *google/flatbuffers* repo, it seems it was just split out of another
> header).
>
> Bjorn, I don't mean to call you out directly, but I am CC'ing you to
> ensure you see this, as you appear to be a significant contributor to the
> *flatbuffers* project itself. Any insight you may have would be very
> welcome. I am of course happy to describe my debugging findings in more
> detail, privately if you wish, rather than spamming the list.
>
> Simon
>
> On Tue, Feb 20, 2024 at 1:49 PM Robert Coup <robert.coup at koordinates.com>
> wrote:
>
>> Hi,
>>
>> On Tue, 20 Feb 2024 at 21:44, Robert Coup <robert.coup at koordinates.com>
>> wrote:
>>
>>> Hi Simon,
>>>
>>> On Tue, 20 Feb 2024 at 21:11, Simon Eves <simon.eves at heavy.ai> wrote:
>>>
>>>> Here's the stack trace for the original assert. Something is stepping
>>>> on scratch_ to make it 0x1000000000 instead of null, which it starts out as
>>>> when the flatbuffer object is created, but by the time it gets to
>>>> allocating memory, it's broken.
>>>>
>>>
>>> What happens if you set a watchpoint in gdb when the flatbuffer is
>>> created?
>>>
>>> watch -l myfb->scratch
>>> or watch *0x1234c0ffee
>>>
>>
>> Or I've also had success with Mozilla's rr (https://rr-project.org/):
>> you can run to a point where scratch is wrong, set a watchpoint on it, and
>> then run the program backwards to find out what touched it.
>>
>> Rob :)
>>
>
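
The watchpoint and rr suggestions quoted above both aim to find the write
that changes *scratch_*. A cruder complement, sketched here under loud
assumptions, is a manual checkpoint that can be dropped between calls to
narrow down where the pointer stops being null. The *Buffer* struct and the
*scratch_still_null()* helper are hypothetical illustrations, not part of
*flatbuffers* or GDAL.

```cpp
#include <cstdint>
#include <cstdio>

namespace sketch {

// Hypothetical stand-in for an object whose scratch pointer is expected to
// stay null until the first allocation. NOT the flatbuffers class.
struct Buffer {
  std::uint8_t *buf = nullptr;
  std::uint8_t *scratch = nullptr;
};

// Returns true while the invariant holds; otherwise reports the checkpoint
// name and the unexpected pointer value, then returns false.
inline bool scratch_still_null(const Buffer &b, const char *where) {
  if (b.scratch == nullptr) return true;
  std::fprintf(stderr, "scratch is no longer null at %s: %p\n", where,
               static_cast<const void *>(b.scratch));
  return false;
}

}  // namespace sketch
```

Calling the helper after each step that should not touch the object gives a
coarse bracket around the offending write, which can then be handed to a
watchpoint or an rr replay for the exact instruction.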