[gdal-dev] Assert due to stack corruption in FlatGeoBuf export
Simon Eves
simon.eves at heavy.ai
Sun Feb 25 17:26:01 PST 2024
Ooh, good call!
That also corresponds with what I just tried, which was to leave the change
in, but have the *size()* method return a value derived the old way instead
of just returning *size_*, and also compare the two and log any mismatch.
This also fails, which would seem to discount my thought that perhaps the
math wasn't equivalent, and something else was getting confused by a
different value returned from size() and then trampling on memory. However,
no value mismatch is reported before it fails.
(pause for search)
So I scanned all the static libs in our dependency bundle with *nm*, and
whaddya know... Apache Arrow (9.0.0) also uses *flatbuffers* and also with
no namespace! I pulled the source, and it's v1.12.0... the *vector_downward*
class has the same data members as the v2.0.0 in GDAL, without *size_*,
which was inserted in the middle.
The latest Arrow 15.0 uses the latest *flatbuffers* 23.5.26, but with a
custom namespace. I'll look through to see when they did that. 9.0.0 is
only 18 months old, but we could probably stand to upgrade that too.
namespace arrow_vendored_private::flatbuffers {}
namespace flatbuffers = arrow_vendored_private::flatbuffers;
This also, of course, explains why we only hit the problem in the full
server build, and I was unable to reproduce it with the simple test
program, because that only linked GDAL and not Arrow too.
OK, so I guess we might be able to avoid it by upgrading Arrow, as long as
that doesn't break something else. I guess you need to do the custom
namespace thing too, though.
I hate computers.
Simon
On Sun, Feb 25, 2024 at 3:43 PM Even Rouault <even.rouault at spatialys.com>
wrote:
>
>
> Not obvious why that change would have broken anything, and certainly
> still absolutely no idea why it only happens in a full static build.
>
> At that point, I would slightly bet on the fact that your whole
> application would have another component using flatbuffers at a different
> version, which wouldn't have the new vector_downward::size_ member.
> Although I would expect that static linking would be in a better position
> to detect duplicated symbols than dynamic linking...
>
> One thing we didn't do in GDAL is to add a GDAL specific namespace around
> its flatbuffers component (we did that in MapServer to avoid potential
> conflicts between MapServer's flatbuffers copy with the GDAL one)
>
> An interesting experiment would be to revert
> https://github.com/google/flatbuffers/commit/9e4ca857b6dadf116703f612187e33b7d4bb6688
> but add a unused size_ member to see if that's enough to break things. Or
> just scrumble a bit the order of members of vector_downward.
>
> Or try replacing the "flatbuffers" namespace by something like
> "gdal_flatbuffers"
>
> Simon
>
>
>
> On Sat, Feb 24, 2024 at 5:27 PM Simon Eves <simon.eves at heavy.ai> wrote:
>
>> OK, so I tried a custom build of 3.7.3 with the latest *flatbuffers*
>> (23.5.26), which was a drop-in replacement for 2.0.6 other than the version
>> asserts.
>>
>> This does not exhibit the original problem either.
>>
>> However, while it produces files which the stock static build, the static
>> build with the older *flatbuffers* (2.0.0), and the Ubuntu dynamic
>> build, can all read just fine, it is unable to read ANY files back in again
>> (in the context of our server geo importer, anyway).
>>
>> GDAL throws a *CE_Failure* of *Header failed consistency verification
>> (1), *which is from *OGRFlatGeobufLayer::Open(),* and the dataset
>> reports no layers (or at least, no vector layers).
>>
>> This also appears to be a side-effect of it being a static build, as
>> *ogrinfo* built from the same source (with *flatbuffers* 2.0.0), but in
>> regular shared libs mode, can read all three files just fine. I have been
>> unable to achieve a full-static tools build, so I can't try that right now.
>>
>> This either means that the problem is still there in some form in the
>> latest *flatbuffers*, but has moved, or that the higher-level FGB file
>> schema verification can be affected by the *flatbuffers* version. Both
>> are equally concerning.
>>
>> Anyway, the build with the older *flatbuffers* 2.0.0 extracted from the
>> v3.5.3 tree (with the *Table::VerifyField* mod) seems to work fine in
>> all ways, so we're probably gonna go with that, in the absence of anything
>> else.
>>
>> One other weirdness is that, of the three files, the two produced by the
>> static builds (*flatbuffers* 2.0.0 and *flatbuffers* 23.5.26) are 16
>> bytes longer than the one from the Ubuntu dynamic build. All three read
>> just fine with *ogrinfo* and our server geo importer, and result in the
>> same table. Here is a link to a bundle with all three files plus the
>> GeoJSON equivalent (*MULTIPOLYGON* US states with some metadata).
>>
>>
>> https://drive.google.com/file/d/1ETRuV63gvUL4aNAT_4KvjrtK1uiCrFun/view?usp=sharing
>>
>> As ever, happy to get into the weeds with more details of the original
>> problem, but pretty sure that 95% of the readers of this list don't want
>> this thread to get any longer! :)
>>
>> Simon
>>
>>
>> On Fri, Feb 23, 2024 at 3:46 PM Simon Eves <simon.eves at heavy.ai> wrote:
>>
>>> Our emails crossed. I am indeed testing with the latest flatbuffers now
>>> too.
>>>
>>> Agreed on the rest.
>>>
>>> On Fri, Feb 23, 2024 at 3:42 PM Even Rouault <even.rouault at spatialys.com>
>>> wrote:
>>>
>>>> Simon,
>>>>
>>>> did you try to update to the latest
>>>> https://github.com/google/flatbuffers/releases to see if that would
>>>> solve the issue ? If that worked, that would be the best way forward...
>>>>
>>>> Otherwise if the issue persists with the latest flatbuffers release, a
>>>> (admitedly rather tedious) option would be to do a git bisect on the
>>>> flatbuffers code to identify the culprit commit. With some luck, the root
>>>> cause might be obvious if a single culptrit commit can be exhibited
>>>> (perhaps some subtle C++ undefined behaviour triggered? also it is a bit
>>>> mysterious that it hits only for static builds), or otherwise raise to the
>>>> upstream flatbuffers project to ask for their expertise
>>>>
>>>> Even
>>>> Le 23/02/2024 à 23:54, Simon Eves via gdal-dev a écrit :
>>>>
>>>> I was able to create a fork of 3.7.3 with just the *flatbuffers*
>>>> replaced with the pre-3.6.x version (2.0.0).
>>>>
>>>> This seemed to only require changes to the version asserts and adding
>>>> an *align* parameter to *Table::VerifyField()* to match the newer API.
>>>>
>>>>
>>>> https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0
>>>>
>>>> Our system works correctly and passes all GDAL I/O tests with that
>>>> version. Obviously this isn't an ideal solution, but this is otherwise a
>>>> release blocker for us.
>>>>
>>>> I would still very much like to discuss the original problem more
>>>> deeply, and hopefully come up with a better solution.
>>>>
>>>> Yours hopefully,
>>>>
>>>> Simon
>>>>
>>>>
>>>>
>>>> On Thu, Feb 22, 2024 at 10:22 PM Simon Eves <simon.eves at heavy.ai>
>>>> wrote:
>>>>
>>>>> Thank you, Robert, for the RR tip. I shall try it.
>>>>>
>>>>> I have new findings to report, however.
>>>>>
>>>>> First of all, I confirmed that a build against GDAL 3.4.1 (the version
>>>>> we were on before) still works. I also confirmed that builds against 3.7.3
>>>>> and 3.8.4 still failed even with no additional library dependencies (just
>>>>> sqlite3 and proj), in case it was a side-effect of us also adding more of
>>>>> those. I then tried 3.5.3, with the CMake build (same config as we use for
>>>>> 3.7.3) and that worked. I then tried 3.6.4 (again, same CMake config) and
>>>>> that failed. These were all from bundles.
>>>>>
>>>>> I then started delving through the GDAL repo itself. I found the
>>>>> common root commit of 3.5.3 and 3.6.4, and all the commits in the
>>>>> *ogr/ogrsf_frmts/flatgeobuf* sub-project between that one and the
>>>>> final of each. For 3.5.3, this was only two. I built and tested both, and
>>>>> they were fine. I then tried the very first one that was new in the 3.6.4
>>>>> chain (not in the history of 3.5.3), which was actually a bulk update to
>>>>> the *flatbuffers* sub-library, committed by Bjorn Harrtell on May 8
>>>>> 2022 (SHA f7d8876). That one had the issue. I then tried the
>>>>> immediately-preceding commit (an unrelated docs change) and that one was
>>>>> fine.
>>>>>
>>>>> My current hypothesis, therefore, is that the *flatbuffers* update
>>>>> introduced the issue, or at least, the susceptibility of the issue.
>>>>>
>>>>> I still cannot explain why it only occurs in an all-static build, and
>>>>> even less able to explain why it only occurs in our full system and not
>>>>> with the simple test program against the very same static lib build that
>>>>> does the very same sequence of GDAL API calls, but I repeated the build
>>>>> tests of the commits either side and a few other random ones a bit further
>>>>> away in each direction, and the results were consistent. Again, it happens
>>>>> with both GCC 11 and Clang 14 builds, Debug or Release.
>>>>>
>>>>> I will continue tomorrow to look at the actual changes to
>>>>> *flatbuffers* in that update, although they are quite significant.
>>>>> Certainly, the *vector_downward* class, which is directly involved,
>>>>> was a new file in that update (although on inspection of that file's
>>>>> history in the *google/flatbuffers* repo, it seems it was just split
>>>>> out of another header).
>>>>>
>>>>> Bjorn, I don't mean to call you out directly, but I am CC'ing you to
>>>>> ensure you see this, as you appear to be a significant contributor to the
>>>>> *flatbuffers* project itself. Any insight you may have would be very
>>>>> welcome. I am of course happy to describe my debugging findings in more
>>>>> detail, privately if you wish, rather than spamming the list.
>>>>>
>>>>> Simon
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 20, 2024 at 1:49 PM Robert Coup <
>>>>> robert.coup at koordinates.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On Tue, 20 Feb 2024 at 21:44, Robert Coup <
>>>>>> robert.coup at koordinates.com> wrote:
>>>>>>
>>>>>>> Hi Simon,
>>>>>>>
>>>>>>> On Tue, 20 Feb 2024 at 21:11, Simon Eves <simon.eves at heavy.ai>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Here's the stack trace for the original assert. Something is
>>>>>>>> stepping on scratch_ to make it 0x1000000000 instead of null, which it
>>>>>>>> starts out as when the flatbuffer object is created, but by the time it
>>>>>>>> gets to allocating memory, it's broken.
>>>>>>>>
>>>>>>>
>>>>>>> What happens if you set a watchpoint in gdb when the flatbuffer is
>>>>>>> created?
>>>>>>>
>>>>>>> watch -l myfb->scratch
>>>>>>> or watch *0x1234c0ffee
>>>>>>>
>>>>>>
>>>>>> Or I've also had success with Mozilla's rr: https://rr-project.org/
>>>>>> — you can run to a point where scratch is wrong, set a watchpoint on it,
>>>>>> and then run the program backwards to find out what touched it.
>>>>>>
>>>>>> Rob :)
>>>>>>
>>>>>
>>>> _______________________________________________
>>>> gdal-dev mailing listgdal-dev at lists.osgeo.orghttps://lists.osgeo.org/mailman/listinfo/gdal-dev
>>>>
>>>> -- http://www.spatialys.com
>>>> My software is free, but my time generally not.
>>>>
>>>> -- http://www.spatialys.com
> My software is free, but my time generally not.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/gdal-dev/attachments/20240225/d7f493d0/attachment-0001.htm>
More information about the gdal-dev
mailing list