[gdal-dev] Assert due to stack corruption in FlatGeoBuf export
Simon Eves
simon.eves at heavy.ai
Thu Feb 22 22:22:44 PST 2024
Thank you, Robert, for the rr tip. I shall try it.
I have new findings to report, however.
First of all, I confirmed that a build against GDAL 3.4.1 (the version we
were on before) still works. I also confirmed that builds against 3.7.3 and
3.8.4 still failed even with no additional library dependencies (just
sqlite3 and proj), in case it was a side-effect of us also adding more of
those. I then tried 3.5.3, with the CMake build (same config as we use for
3.7.3) and that worked. I then tried 3.6.4 (again, same CMake config) and
that failed. These were all from bundles.
I then started delving through the GDAL repo itself. I found the common
root commit of 3.5.3 and 3.6.4, and listed all the commits in the
*ogr/ogrsf_frmts/flatgeobuf* sub-project between that one and the final
commit of each. For 3.5.3, there were only two. I built and tested both, and they were
fine. I then tried the very first one that was new in the 3.6.4 chain (not
in the history of 3.5.3), which was actually a bulk update to the
*flatbuffers* sub-library, committed by Bjorn Harrtell on May 8 2022 (SHA
f7d8876). That one had the issue. I then tried the immediately-preceding
commit (an unrelated docs change) and that one was fine.
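
For anyone who wants to repeat the commit search described above, here is a
minimal sketch. It is not the exact set of commands I ran. The helper names
are made up for illustration, and the tag names in the usage note (v3.5.3 and
v3.6.4) are an assumption about how the releases are tagged in your checkout.
It finds the common ancestor of a good and a bad ref, then lists, oldest
first, the commits on each side that touch the flatgeobuf directory.

```python
import subprocess

# Directory named in this message; everything else here is illustrative.
FGB_DIR = "ogr/ogrsf_frmts/flatgeobuf"


def merge_base_cmd(good_ref, bad_ref):
    """Build the git command that prints the common ancestor of two refs."""
    return ["git", "merge-base", good_ref, bad_ref]


def path_log_cmd(base, tip, subdir=FGB_DIR):
    """Build the git command that lists commits in base..tip touching subdir,
    oldest first, one line per commit."""
    return [
        "git", "log", "--reverse",
        "--format=%h %ad %an: %s", "--date=short",
        f"{base}..{tip}", "--", subdir,
    ]


def run_git(cmd, repo):
    """Run a git command inside the given checkout and return its output."""
    result = subprocess.run(cmd, cwd=repo, check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def candidate_commits(repo, good_ref, bad_ref, subdir=FGB_DIR):
    """Return {ref: [commit lines]} for the commits unique to each ref's
    history since the common ancestor, restricted to subdir."""
    base = run_git(merge_base_cmd(good_ref, bad_ref), repo)
    return {
        ref: run_git(path_log_cmd(base, ref, subdir), repo).splitlines()
        for ref in (good_ref, bad_ref)
    }
```

With a GDAL checkout at a hypothetical path, something like
candidate_commits("/path/to/gdal", "v3.5.3", "v3.6.4") would give the two
lists of commits to build and test, starting with the oldest on the failing
side.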
My current hypothesis, therefore, is that the *flatbuffers* update
introduced the issue, or at least the susceptibility to it.
I still cannot explain why it only occurs in an all-static build, and I am
even less able to explain why it only occurs in our full system and not with
the simple test program, which links against the very same static lib build
and makes the very same sequence of GDAL API calls. However, I repeated the
build tests of the commits either side, and of a few other random ones a bit
further away in each direction, and the results were consistent. Again, it
happens with both GCC 11 and Clang 14 builds, Debug or Release.
I will continue tomorrow to look at the actual changes to *flatbuffers* in
that update, although they are quite significant. Certainly, the
*vector_downward* class, which is directly involved, was a new file in that
update (although on inspection of that file's history in the
*google/flatbuffers* repo, it seems it was just split out of another
header).
Bjorn, I don't mean to call you out directly, but I am CC'ing you to ensure
you see this, as you appear to be a significant contributor to the
*flatbuffers* project itself. Any insight you may have would be very
welcome. I am of course happy to describe my debugging findings in more
detail, privately if you wish, rather than spamming the list.
Simon
On Tue, Feb 20, 2024 at 1:49 PM Robert Coup <robert.coup at koordinates.com>
wrote:
> Hi,
>
> On Tue, 20 Feb 2024 at 21:44, Robert Coup <robert.coup at koordinates.com>
> wrote:
>
>> Hi Simon,
>>
>> On Tue, 20 Feb 2024 at 21:11, Simon Eves <simon.eves at heavy.ai> wrote:
>>
>>> Here's the stack trace for the original assert. Something is stepping on
>>> scratch_ to make it 0x1000000000 instead of null, which it starts out as
>>> when the flatbuffer object is created, but by the time it gets to
>>> allocating memory, it's broken.
>>>
>>
>> What happens if you set a watchpoint in gdb when the flatbuffer is
>> created?
>>
>> watch -l myfb->scratch
>> or watch *0x1234c0ffee
>>
>
> Or I've also had success with Mozilla's rr: https://rr-project.org/ — you
> can run to a point where scratch is wrong, set a watchpoint on it, and then
> run the program backwards to find out what touched it.
>
> Rob :)
>