[gdal-dev] Assert due to stack corruption in FlatGeoBuf export
Simon Eves
simon.eves at heavy.ai
Fri Feb 23 15:46:00 PST 2024
Our emails crossed. I am indeed testing with the latest flatbuffers now too.
Agreed on the rest.
On Fri, Feb 23, 2024 at 3:42 PM Even Rouault <even.rouault at spatialys.com>
wrote:
> Simon,
>
> did you try updating to the latest
> https://github.com/google/flatbuffers/releases to see if that solves the
> issue? If it does, that would be the best way forward...
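>
> For instance, roughly (assuming the vendored copy lives at
> ogr/ogrsf_frmts/flatgeobuf/flatbuffers/ in the GDAL tree; adjust to your
> layout):
>
>     # copy the headers from an upstream flatbuffers checkout over the
>     # vendored copy, then rebuild GDAL and re-run the failing export
>     cp /path/to/flatbuffers/include/flatbuffers/*.h \
>        ogr/ogrsf_frmts/flatgeobuf/flatbuffers/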
>
> Otherwise, if the issue persists with the latest flatbuffers release, an
> (admittedly rather tedious) option would be to git bisect the flatbuffers
> code to identify the culprit commit. With some luck, the root cause might
> be obvious once a single culprit commit is exhibited (perhaps some subtle
> C++ undefined behaviour is triggered? it is also a bit mysterious that it
> hits only for static builds); otherwise, raise it with the upstream
> flatbuffers project to ask for their expertise.
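>
> A minimal bisect session might look like this (run in a flatbuffers
> checkout; the test script is hypothetical and would rebuild GDAL against
> that checkout and run your failing export):
>
>     git bisect start
>     git bisect bad v23.5.26      # a recent release that still fails
>     git bisect good v2.0.0       # the version known to work
>     git bisect run ./test-gdal-flatgeobuf.sh
>     git bisect reset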
>
> Even
> On 23/02/2024 at 23:54, Simon Eves via gdal-dev wrote:
>
> I was able to create a fork of 3.7.3 with just the *flatbuffers* replaced
> with the pre-3.6.x version (2.0.0).
>
> This seemed to require only changes to the version asserts and the
> addition of an *align* parameter to *Table::VerifyField()* to match the
> newer API.
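>
> For reference, the call-site difference looks roughly like this (a sketch
> of the two APIs with a hypothetical field, not the exact GDAL diff):
>
>     // flatbuffers 2.0.0: no alignment argument
>     VerifyField<int32_t>(verifier, VT_SOMEFIELD)
>     // newer flatbuffers: alignment is verified too
>     VerifyField<int32_t>(verifier, VT_SOMEFIELD, 4)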
>
>
> https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0
>
> Our system works correctly and passes all GDAL I/O tests with that
> version. Obviously this isn't an ideal solution, but this is otherwise a
> release blocker for us.
>
> I would still very much like to discuss the original problem more deeply,
> and hopefully come up with a better solution.
>
> Yours hopefully,
>
> Simon
>
>
>
> On Thu, Feb 22, 2024 at 10:22 PM Simon Eves <simon.eves at heavy.ai> wrote:
>
>> Thank you, Robert, for the RR tip. I shall try it.
>>
>> I have new findings to report, however.
>>
>> First of all, I confirmed that a build against GDAL 3.4.1 (the version we
>> were on before) still works. I also confirmed that builds against 3.7.3
>> and 3.8.4 still fail even with no additional library dependencies (just
>> sqlite3 and proj), in case the failure was a side effect of the extra
>> dependencies we had added. I then tried 3.5.3 with the CMake build (the
>> same config we use for 3.7.3), and that worked. I then tried 3.6.4
>> (again, same CMake config), and that failed. These were all built from
>> release bundles.
>>
>> I then started delving through the GDAL repo itself. I found the common
>> root commit of 3.5.3 and 3.6.4, and listed all the commits touching the
>> *ogr/ogrsf_frmts/flatgeobuf* sub-project between that root and the tip of
>> each release. For 3.5.3 there were only two; I built and tested both, and
>> they were fine. I then tried the first commit unique to the 3.6.4 chain
>> (i.e. not in the history of 3.5.3), which was in fact a bulk update to
>> the *flatbuffers* sub-library, committed by Bjorn Harrtell on May 8 2022
>> (SHA f7d8876). That one had the issue. I then tried the
>> immediately-preceding commit (an unrelated docs change), and that one was
>> fine.
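>>
>> For anyone wanting to reproduce the archaeology, the commands were along
>> these lines (tag names as in the GDAL repo; the merge-base hash is
>> whatever the first command prints):
>>
>>     git merge-base v3.5.3 v3.6.4
>>     git log --oneline <merge-base>..v3.6.4 -- ogr/ogrsf_frmts/flatgeobuf
>>
>> followed by checking out, building, and testing each candidate commit in
>> turn.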
>>
>> My current hypothesis, therefore, is that the *flatbuffers* update
>> introduced the issue, or at least the susceptibility to it.
>>
>> I still cannot explain why it only occurs in an all-static build, and I
>> am even less able to explain why it only occurs in our full system and
>> not with the simple test program that makes the very same sequence of
>> GDAL API calls against the very same static library build. However, I
>> repeated the build tests on the commits either side, and on a few other
>> random commits a bit further away in each direction, and the results were
>> consistent. Again, it happens with both GCC 11 and Clang 14 builds, Debug
>> or Release.
>>
>> Tomorrow I will continue looking at the actual changes to *flatbuffers*
>> in that update, although they are quite extensive. Notably, the
>> *vector_downward* class, which is directly involved, arrived as a new
>> file in that update (although inspecting that file's history in the
>> *google/flatbuffers* repo suggests it was simply split out of another
>> header).
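>>
>> For context, the pointers involved look roughly like this (simplified
>> from the *vector_downward* declaration in the upstream header):
>>
>>     uint8_t *buf_;      // start of the single allocation
>>     uint8_t *cur_;      // serialized data grows downward from the end
>>     uint8_t *scratch_;  // scratch data grows upward from the start;
>>                         // initialized to null, and what gets stomped here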
>>
>> Bjorn, I don't mean to call you out directly, but I am CC'ing you to
>> ensure you see this, as you appear to be a significant contributor to the
>> *flatbuffers* project itself. Any insight you may have would be very
>> welcome. I am of course happy to describe my debugging findings in more
>> detail, privately if you wish, rather than spamming the list.
>>
>> Simon
>>
>>
>> On Tue, Feb 20, 2024 at 1:49 PM Robert Coup <robert.coup at koordinates.com>
>> wrote:
>>
>>> Hi,
>>>
>>> On Tue, 20 Feb 2024 at 21:44, Robert Coup <robert.coup at koordinates.com>
>>> wrote:
>>>
>>>> Hi Simon,
>>>>
>>>> On Tue, 20 Feb 2024 at 21:11, Simon Eves <simon.eves at heavy.ai> wrote:
>>>>
>>>>> Here's the stack trace for the original assert. Something is stepping
>>>>> on scratch_, changing it from null (its initial value when the
>>>>> flatbuffer object is created) to 0x1000000000; by the time the builder
>>>>> gets to allocating memory, it is already broken.
>>>>>
>>>>
>>>> What happens if you set a watchpoint in gdb when the flatbuffer is
>>>> created?
>>>>
>>>> watch -l myfb->scratch_
>>>> or watch *0x1234c0ffee
>>>>
>>>
>>> Or I've also had success with Mozilla's rr (https://rr-project.org/):
>>> you can run to the point where scratch_ is wrong, set a watchpoint on
>>> it, and then run the program backwards to find out what touched it.
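>>>
>>> Something like this (rr commands per https://rr-project.org/; the watch
>>> expression assumes the FlatBufferBuilder is reachable as "builder"):
>>>
>>>     rr record ./your_program args   # record one failing run
>>>     rr replay                       # deterministic replay under gdb
>>>     (rr) continue                   # run forward until the assert fires
>>>     (rr) watch -l builder.buf_.scratch_
>>>     (rr) reverse-continue           # land on the write that corrupted it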
>>>
>>> Rob :)
>>>
>>
> -- http://www.spatialys.com
> My software is free, but my time generally not.