[gdal-dev] Assert due to stack corruption in FlatGeoBuf export

Even Rouault even.rouault at spatialys.com
Fri Feb 23 15:42:40 PST 2024


Simon,

did you try to update to the latest 
https://github.com/google/flatbuffers/releases to see if that would 
solve the issue ? If that worked, that would be the best way forward...

Otherwise if the issue persists with the latest flatbuffers release, a 
(admitedly rather tedious) option would be to do a git bisect on the 
flatbuffers code to identify the culprit commit. With some luck, the 
root cause might be obvious if a single culptrit commit can be exhibited 
(perhaps some subtle C++ undefined behaviour triggered? also it is a bit 
mysterious that it hits only for static builds), or otherwise raise to 
the upstream flatbuffers project to ask for their expertise

Even

Le 23/02/2024 à 23:54, Simon Eves via gdal-dev a écrit :
> I was able to create a fork of 3.7.3 with just the *flatbuffers* 
> replaced with the pre-3.6.x version (2.0.0).
>
> This seemed to only require changes to the version asserts and adding 
> an *align* parameter to *Table::VerifyField()* to match the newer API.
>
> https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0
>
> Our system works correctly and passes all GDAL I/O tests with that 
> version. Obviously this isn't an ideal solution, but this is otherwise 
> a release blocker for us.
>
> I would still very much like to discuss the original problem more 
> deeply, and hopefully come up with a better solution.
>
> Yours hopefully,
>
> Simon
>
>
>
> On Thu, Feb 22, 2024 at 10:22 PM Simon Eves <simon.eves at heavy.ai> wrote:
>
>     Thank you, Robert, for the RR tip. I shall try it.
>
>     I have new findings to report, however.
>
>     First of all, I confirmed that a build against GDAL 3.4.1 (the
>     version we were on before) still works. I also confirmed that
>     builds against 3.7.3 and 3.8.4 still failed even with no
>     additional library dependencies (just sqlite3 and proj), in case
>     it was a side-effect of us also adding more of those. I then tried
>     3.5.3, with the CMake build (same config as we use for 3.7.3) and
>     that worked. I then tried 3.6.4 (again, same CMake config) and
>     that failed. These were all from bundles.
>
>     I then started delving through the GDAL repo itself. I found the
>     common root commit of 3.5.3 and 3.6.4, and all the commits in the
>     *ogr/ogrsf_frmts/flatgeobuf* sub-project between that one and the
>     final of each. For 3.5.3, this was only two. I built and tested
>     both, and they were fine. I then tried the very first one that was
>     new in the 3.6.4 chain (not in the history of 3.5.3), which was
>     actually a bulk update to the *flatbuffers* sub-library,
>     committed by Bjorn Harrtell on May 8 2022 (SHA f7d8876). That one
>     had the issue. I then tried the immediately-preceding commit (an
>     unrelated docs change) and that one was fine.
>
>     My current hypothesis, therefore, is that the *flatbuffers* update
>     introduced the issue, or at least, the susceptibility of the issue.
>
>     I still cannot explain why it only occurs in an all-static build,
>     and even less able to explain why it only occurs in our full
>     system and not with the simple test program against the very same
>     static lib build that does the very same sequence of GDAL API
>     calls, but I repeated the build tests of the commits either side
>     and a few other random ones a bit further away in each direction,
>     and the results were consistent. Again, it happens with both GCC
>     11 and Clang 14 builds, Debug or Release.
>
>     I will continue tomorrow to look at the actual changes to
>     *flatbuffers* in that update, although they are quite significant.
>     Certainly, the *vector_downward* class, which is directly
>     involved, was a new file in that update (although on inspection of
>     that file's history in the *google/flatbuffers* repo, it seems it
>     was just split out of another header).
>
>     Bjorn, I don't mean to call you out directly, but I am CC'ing you
>     to ensure you see this, as you appear to be a significant
>     contributor to the *flatbuffers* project itself. Any insight you
>     may have would be very welcome. I am of course happy to describe
>     my debugging findings in more detail, privately if you wish,
>     rather than spamming the list.
>
>     Simon
>
>
>
>
>
>
>     On Tue, Feb 20, 2024 at 1:49 PM Robert Coup
>     <robert.coup at koordinates.com> wrote:
>
>         Hi,
>
>         On Tue, 20 Feb 2024 at 21:44, Robert Coup
>         <robert.coup at koordinates.com> wrote:
>
>             Hi Simon,
>
>             On Tue, 20 Feb 2024 at 21:11, Simon Eves
>             <simon.eves at heavy.ai> wrote:
>
>                 Here's the stack trace for the original assert.
>                 Something is stepping on scratch_ to make it
>                 0x1000000000 instead of null, which it starts out as
>                 when the flatbuffer object is created, but by the time
>                 it gets to allocating memory, it's broken.
>
>
>             What happens if you set a watchpoint in gdb when the
>             flatbuffer is created?
>
>             watch -l myfb->scratch
>             or watch *0x1234c0ffee
>
>
>         Or I've also had success with Mozilla's rr:
>         https://rr-project.org/ — you can run to a point where scratch
>         is wrong, set a watchpoint on it, and then run the program
>         backwards to find out what touched it.
>
>         Rob :)
>
>
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev

-- 
http://www.spatialys.com
My software is free, but my time generally not.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20240224/e377e58d/attachment-0001.htm>


More information about the gdal-dev mailing list