[gdal-dev] Assert due to stack corruption in FlatGeoBuf export

Even Rouault even.rouault at spatialys.com
Sun Feb 25 15:42:59 PST 2024


>
> Not obvious why that change would have broken anything, and certainly 
> still absolutely no idea why it only happens in a full static build.

At this point, I would bet on your whole application having another 
component that uses flatbuffers at a different version, one which 
wouldn't have the new vector_downward::size_ member. Although I would 
expect static linking to be in a better position than dynamic linking 
to detect duplicated symbols...

One thing we didn't do in GDAL is to add a GDAL specific namespace 
around its flatbuffers component (we did that in MapServer to avoid 
potential conflicts between MapServer's flatbuffers copy with the GDAL one)

An interesting experiment would be to revert 
https://github.com/google/flatbuffers/commit/9e4ca857b6dadf116703f612187e33b7d4bb6688 
but add an unused size_ member, to see if that's enough to break things.  
Or just scramble the order of the members of vector_downward a bit.

Or try replacing the "flatbuffers" namespace by something like 
"gdal_flatbuffers"


> Simon
>
>
>
> On Sat, Feb 24, 2024 at 5:27 PM Simon Eves <simon.eves at heavy.ai> wrote:
>
>     OK, so I tried a custom build of 3.7.3 with the latest
>     *flatbuffers* (23.5.26), which was a drop-in replacement for 2.0.6
>     other than the version asserts.
>
>     This does not exhibit the original problem either.
>
>     However, while it produces files that the stock static build, the
>     static build with the older *flatbuffers* (2.0.0), and the Ubuntu
>     dynamic build can all read just fine, it is itself unable to read
>     ANY files back in again (in the context of our server geo
>     importer, anyway).
>
>     GDAL throws a *CE_Failure* of *Header failed consistency
>     verification (1)*, which is from *OGRFlatGeobufLayer::Open()*, and
>     the dataset reports no layers (or at least, no vector layers).
>
>     This also appears to be a side-effect of it being a static build,
>     as *ogrinfo* built from the same source (with *flatbuffers*
>     2.0.0), but in regular shared libs mode, can read all three files
>     just fine. I have been unable to achieve a full-static tools
>     build, so I can't try that right now.
>
>     This either means that the problem is still there in some form in
>     the latest *flatbuffers*, but has moved, or that the higher-level
>     FGB file schema verification can be affected by the *flatbuffers*
>     version. Both are equally concerning.
>
>     Anyway, the build with the older *flatbuffers* 2.0.0 extracted
>     from the v3.5.3 tree (with the *Table::VerifyField* mod) seems to
>     work fine in all ways, so we're probably gonna go with that, in
>     the absence of anything else.
>
>     One other weirdness is that, of the three files, the two produced
>     by the static builds (*flatbuffers* 2.0.0 and *flatbuffers*
>     23.5.26) are 16 bytes longer than the one from the Ubuntu dynamic
>     build. All three read just fine with *ogrinfo* and our server geo
>     importer, and result in the same table. Here is a link to a bundle
>     with all three files plus the GeoJSON equivalent (*MULTIPOLYGON*
>     US states with some metadata).
>
>     https://drive.google.com/file/d/1ETRuV63gvUL4aNAT_4KvjrtK1uiCrFun/view?usp=sharing
>
>     As ever, happy to get into the weeds with more details of the
>     original problem, but pretty sure that 95% of the readers of this
>     list don't want this thread to get any longer! :)
>
>     Simon
>
>
>     On Fri, Feb 23, 2024 at 3:46 PM Simon Eves <simon.eves at heavy.ai>
>     wrote:
>
>         Our emails crossed. I am indeed testing with the latest
>         flatbuffers now too.
>
>         Agreed on the rest.
>
>         On Fri, Feb 23, 2024 at 3:42 PM Even Rouault
>         <even.rouault at spatialys.com> wrote:
>
>             Simon,
>
>             did you try to update to the latest
>             https://github.com/google/flatbuffers/releases to see if
>             that would solve the issue ? If that worked, that would be
>             the best way forward...
>
>             Otherwise, if the issue persists with the latest
>             flatbuffers release, an (admittedly rather tedious) option
>             would be to do a git bisect on the flatbuffers code to
>             identify the culprit commit. With some luck, the root
>             cause might be obvious if a single culprit commit can be
>             exhibited (perhaps some subtle C++ undefined behaviour
>             being triggered? it is also a bit mysterious that it hits
>             only for static builds); otherwise, raise it with the
>             upstream flatbuffers project to ask for their expertise
>
>             Even
>
>             On 23/02/2024 at 23:54, Simon Eves via gdal-dev wrote:
>>             I was able to create a fork of 3.7.3 with just the
>>             *flatbuffers* replaced with the pre-3.6.x version (2.0.0).
>>
>>             This seemed to only require changes to the
>>             version asserts and adding an *align* parameter to
>>             *Table::VerifyField()* to match the newer API.
>>
>>             https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0
>>
>>             Our system works correctly and passes all GDAL I/O tests
>>             with that version. Obviously this isn't an ideal
>>             solution, but this is otherwise a release blocker for us.
>>
>>             I would still very much like to discuss the original
>>             problem more deeply, and hopefully come up with a better
>>             solution.
>>
>>             Yours hopefully,
>>
>>             Simon
>>
>>
>>
>>             On Thu, Feb 22, 2024 at 10:22 PM Simon Eves
>>             <simon.eves at heavy.ai> wrote:
>>
>>                 Thank you, Robert, for the RR tip. I shall try it.
>>
>>                 I have new findings to report, however.
>>
>>                 First of all, I confirmed that a build against GDAL
>>                 3.4.1 (the version we were on before) still works. I
>>                 also confirmed that builds against 3.7.3 and 3.8.4
>>                 still failed even with no additional library
>>                 dependencies (just sqlite3 and proj), in case it was
>>                 a side-effect of us also adding more of those. I then
>>                 tried 3.5.3, with the CMake build (same config as we
>>                 use for 3.7.3) and that worked. I then tried 3.6.4
>>                 (again, same CMake config) and that failed. These
>>                 were all from bundles.
>>
>>                 I then started delving through the GDAL repo itself.
>>                 I found the common root commit of 3.5.3 and 3.6.4,
>>                 and all the commits in the
>>                 *ogr/ogrsf_frmts/flatgeobuf* sub-project between that
>>                 one and the final of each. For 3.5.3, this was only
>>                 two. I built and tested both, and they were fine. I
>>                 then tried the very first one that was new in the
>>                 3.6.4 chain (not in the history of 3.5.3), which was
>>                 actually a bulk update to the *flatbuffers*
>>                 sub-library, committed by Bjorn Harrtell on May 8
>>                 2022 (SHA f7d8876). That one had the issue. I then
>>                 tried the immediately-preceding commit (an unrelated
>>                 docs change) and that one was fine.
>>
>>                 My current hypothesis, therefore, is that the
>>                 *flatbuffers* update introduced the issue, or at
>>                 least, the susceptibility of the issue.
>>
>>                 I still cannot explain why it only occurs in an
>>                 all-static build, and am even less able to explain why
>>                 it only occurs in our full system and not with the
>>                 simple test program against the very same static lib
>>                 build that does the very same sequence of GDAL API
>>                 calls, but I repeated the build tests of the commits
>>                 either side and a few other random ones a bit further
>>                 away in each direction, and the results were
>>                 consistent. Again, it happens with both GCC 11 and
>>                 Clang 14 builds, Debug or Release.
>>
>>                 I will continue tomorrow to look at the actual
>>                 changes to *flatbuffers* in that update, although
>>                 they are quite significant. Certainly, the
>>                 *vector_downward* class, which is directly involved,
>>                 was a new file in that update (although on inspection
>>                 of that file's history in the *google/flatbuffers*
>>                 repo, it seems it was just split out of another header).
>>
>>                 Bjorn, I don't mean to call you out directly, but I
>>                 am CC'ing you to ensure you see this, as you appear
>>                 to be a significant contributor to the *flatbuffers*
>>                 project itself. Any insight you may have would be
>>                 very welcome. I am of course happy to describe my
>>                 debugging findings in more detail, privately if you
>>                 wish, rather than spamming the list.
>>
>>                 Simon
>>
>>
>>
>>
>>
>>
>>                 On Tue, Feb 20, 2024 at 1:49 PM Robert Coup
>>                 <robert.coup at koordinates.com> wrote:
>>
>>                     Hi,
>>
>>                     On Tue, 20 Feb 2024 at 21:44, Robert Coup
>>                     <robert.coup at koordinates.com> wrote:
>>
>>                         Hi Simon,
>>
>>                         On Tue, 20 Feb 2024 at 21:11, Simon Eves
>>                         <simon.eves at heavy.ai> wrote:
>>
>>                             Here's the stack trace for the original
>>                             assert. Something is stepping on scratch_
>>                             to make it 0x1000000000 instead of null,
>>                             which it starts out as when the
>>                             flatbuffer object is created, but by the
>>                             time it gets to allocating memory, it's
>>                             broken.
>>
>>
>>                         What happens if you set a watchpoint in gdb
>>                         when the flatbuffer is created?
>>
>>                         watch -l myfb->scratch
>>                         or watch *0x1234c0ffee
>>
>>
>>                     Or I've also had success with Mozilla's rr:
>>                     https://rr-project.org/ — you can run to a point
>>                     where scratch is wrong, set a watchpoint on it,
>>                     and then run the program backwards to find out
>>                     what touched it.
>>
>>                     Rob :)
>>
>>
>>             _______________________________________________
>>             gdal-dev mailing list
>>             gdal-dev at lists.osgeo.org
>>             https://lists.osgeo.org/mailman/listinfo/gdal-dev
>
>             -- 
>             http://www.spatialys.com
>             My software is free, but my time generally not.
>
-- 
http://www.spatialys.com
My software is free, but my time generally not.