[gdal-dev] Assert due to stack corruption in FlatGeoBuf export
Even Rouault
even.rouault at spatialys.com
Fri Feb 23 15:42:40 PST 2024
Simon,
did you try updating to the latest flatbuffers release
(https://github.com/google/flatbuffers/releases) to see whether that
solves the issue? If that works, it would be the best way forward.
Otherwise, if the issue persists with the latest flatbuffers release, a
(admittedly rather tedious) option would be to do a git bisect on the
flatbuffers code to identify the culprit commit. With some luck, the
root cause might be obvious if a single culprit commit can be identified
(perhaps some subtle C++ undefined behaviour being triggered? It is also
a bit mysterious that it hits only static builds). Failing that, raise it
with the upstream flatbuffers project to ask for their expertise.
Even
On 23/02/2024 at 23:54, Simon Eves via gdal-dev wrote:
> I was able to create a fork of 3.7.3 with just the *flatbuffers*
> replaced with the pre-3.6.x version (2.0.0).
>
> This seemed to only require changes to the version asserts and adding
> an *align* parameter to *Table::VerifyField()* to match the newer API.
>
> https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0
>
> Our system works correctly and passes all GDAL I/O tests with that
> version. Obviously this isn't an ideal solution, but this is otherwise
> a release blocker for us.
>
> I would still very much like to discuss the original problem more
> deeply, and hopefully come up with a better solution.
>
> Yours hopefully,
>
> Simon
>
>
>
> On Thu, Feb 22, 2024 at 10:22 PM Simon Eves <simon.eves at heavy.ai> wrote:
>
> Thank you, Robert, for the RR tip. I shall try it.
>
> I have new findings to report, however.
>
> First of all, I confirmed that a build against GDAL 3.4.1 (the
> version we were on before) still works. I also confirmed that
> builds against 3.7.3 and 3.8.4 still failed even with no
> additional library dependencies (just sqlite3 and proj), in case
> it was a side-effect of us also adding more of those. I then tried
> 3.5.3, with the CMake build (same config as we use for 3.7.3) and
> that worked. I then tried 3.6.4 (again, same CMake config) and
> that failed. These were all from bundles.
>
> I then started delving through the GDAL repo itself. I found the
> common root commit of 3.5.3 and 3.6.4, and all the commits in the
> *ogr/ogrsf_frmts/flatgeobuf* sub-project between that one and the
> final of each. For 3.5.3, this was only two. I built and tested
> both, and they were fine. I then tried the very first one that was
> new in the 3.6.4 chain (not in the history of 3.5.3), which was
> actually a bulk update to the *flatbuffers* sub-library,
> committed by Bjorn Harrtell on May 8 2022 (SHA f7d8876). That one
> had the issue. I then tried the immediately-preceding commit (an
> unrelated docs change) and that one was fine.
>
> My current hypothesis, therefore, is that the *flatbuffers* update
> introduced the issue, or at least the susceptibility to it.
>
> I still cannot explain why it only occurs in an all-static build,
> and I am even less able to explain why it only occurs in our full
> system and not with the simple test program, built against the very
> same static lib, that makes the very same sequence of GDAL API
> calls. However, I repeated the build tests of the commits on either
> side and of a few other random ones a bit further away in each
> direction, and the results were consistent. Again, it happens with
> both GCC 11 and Clang 14 builds, Debug or Release.
>
> I will continue tomorrow to look at the actual changes to
> *flatbuffers* in that update, although they are quite significant.
> Certainly, the *vector_downward* class, which is directly
> involved, was a new file in that update (although on inspection of
> that file's history in the *google/flatbuffers* repo, it seems it
> was just split out of another header).
>
> Bjorn, I don't mean to call you out directly, but I am CC'ing you
> to ensure you see this, as you appear to be a significant
> contributor to the *flatbuffers* project itself. Any insight you
> may have would be very welcome. I am of course happy to describe
> my debugging findings in more detail, privately if you wish,
> rather than spamming the list.
>
> Simon
>
>
>
>
>
>
> On Tue, Feb 20, 2024 at 1:49 PM Robert Coup
> <robert.coup at koordinates.com> wrote:
>
> Hi,
>
> On Tue, 20 Feb 2024 at 21:44, Robert Coup
> <robert.coup at koordinates.com> wrote:
>
> Hi Simon,
>
> On Tue, 20 Feb 2024 at 21:11, Simon Eves
> <simon.eves at heavy.ai> wrote:
>
> Here's the stack trace for the original assert.
> Something is stepping on scratch_ to make it
> 0x1000000000 instead of null, which it starts out as
> when the flatbuffer object is created, but by the time
> it gets to allocating memory, it's broken.
>
>
> What happens if you set a watchpoint in gdb when the
> flatbuffer is created?
>
> watch -l myfb->scratch
> or watch *0x1234c0ffee
>
>
> Or I've also had success with Mozilla's rr:
> https://rr-project.org/ — you can run to a point where scratch
> is wrong, set a watchpoint on it, and then run the program
> backwards to find out what touched it.
>
> Rob :)
>
>
--
http://www.spatialys.com
My software is free, but my time generally not.