[gdal-dev] Assert due to stack corruption in FlatGeoBuf export
Simon Eves
simon.eves at heavy.ai
Sun Feb 25 17:39:40 PST 2024
Arrow moved to using a custom namespace for its vendored flatbuffers in v10.0.0.
On Sun, Feb 25, 2024 at 5:26 PM Simon Eves <simon.eves at heavy.ai> wrote:
> Ooh, good call!
>
> That also corresponds with what I just tried, which was to leave the
> change in, but have the *size()* method return a value computed the old
> way instead of just returning *size_*, and also compare the two and log
> any mismatch. This build also fails, which would seem to discount my
> theory that the math wasn't equivalent and that something else was
> getting confused by a different value returned from *size()* and then
> trampling on memory. Notably, no value mismatch is reported before it fails.
>
> (pause for search)
>
> So I scanned all the static libs in our dependency bundle with *nm*, and
> whaddya know... Apache Arrow (9.0.0) also uses *flatbuffers*, and also
> in the default *flatbuffers* namespace! I pulled the source, and it's
> v1.12.0... the *vector_downward* class has the same data members as the
> v2.0.0 copy in GDAL, minus *size_*, which v2.0.0 inserted in the middle.
>
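> (For the record, here's a minimal single-file sketch of why that layout
> difference is fatal. The members are hypothetical stand-ins, not the
> real *flatbuffers* code, but the mechanism is the one we're chasing:
> with two copies of the class in one binary, the linker keeps only one
> copy of each inline member function, so code compiled against one
> layout can end up writing through the offsets of the other.)
>
> #include <cstddef>
> #include <cstdint>
> #include <cstdio>
>
> // Hypothetical old layout (roughly what Arrow 9.0.0 was built against).
> namespace fb_old {
> struct vector_downward {
>   std::uint8_t *buf_;
>   std::uint8_t *cur_;
>   std::uint8_t *scratch_;
> };
> }  // namespace fb_old
>
> // Hypothetical new layout, with size_ inserted in the middle
> // (roughly what GDAL 3.6+ was built against).
> namespace fb_new {
> struct vector_downward {
>   std::uint8_t *buf_;
>   std::size_t size_;
>   std::uint8_t *cur_;
>   std::uint8_t *scratch_;
> };
> }  // namespace fb_new
>
> int main() {
>   // Same member name, different offset: a store compiled against one
>   // layout lands on top of scratch_ (or worse) in the other.
>   std::printf("scratch_ offset, old layout: %zu\n",
>               offsetof(fb_old::vector_downward, scratch_));
>   std::printf("scratch_ offset, new layout: %zu\n",
>               offsetof(fb_new::vector_downward, scratch_));
>   return 0;
> }
>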
> The latest Arrow 15.0 uses the latest *flatbuffers* 23.5.26, but with a
> custom namespace. I'll look through to see when they did that. 9.0.0 is
> only 18 months old, but we could probably stand to upgrade that too.
>
> namespace arrow_vendored_private::flatbuffers {}
> namespace flatbuffers = arrow_vendored_private::flatbuffers;
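> (That alias is the clever bit: Arrow's own sources keep using the plain
> *flatbuffers::* spelling, while every symbol actually mangles under
> *arrow_vendored_private*, so nothing can collide with another copy of
> the library.)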
>
> This also, of course, explains why we only hit the problem in the full
> server build, and I was unable to reproduce it with the simple test
> program, because that only linked GDAL and not Arrow too.
>
> OK, so I guess we might be able to avoid it by upgrading Arrow, as long as
> that doesn't break something else. You'd need to do the custom namespace
> thing in GDAL too, though.
>
> I hate computers.
>
> Simon
>
>
> On Sun, Feb 25, 2024 at 3:43 PM Even Rouault <even.rouault at spatialys.com>
> wrote:
>
>>
>>
>>> Not obvious why that change would have broken anything, and certainly
>>> still absolutely no idea why it only happens in a full static build.
>>
>> At that point, I would slightly bet on the fact that your whole
>> application would have another component using flatbuffers at a different
>> version, which wouldn't have the new vector_downward::size_ member.
>> Although I would expect that static linking would be in a better position
>> to detect duplicated symbols than dynamic linking...
>>
>> One thing we didn't do in GDAL is to add a GDAL specific namespace around
>> its flatbuffers component (we did that in MapServer to avoid potential
>> conflicts between MapServer's flatbuffers copy with the GDAL one)
>>
>> An interesting experiment would be to revert
>> https://github.com/google/flatbuffers/commit/9e4ca857b6dadf116703f612187e33b7d4bb6688
>> but add an unused size_ member to see if that's enough to break things. Or
>> just scramble the order of the members of vector_downward a bit.
>>
>> Or try replacing the "flatbuffers" namespace by something like
>> "gdal_flatbuffers"
>>
>>> Simon
>>
>>
>>
>> On Sat, Feb 24, 2024 at 5:27 PM Simon Eves <simon.eves at heavy.ai> wrote:
>>
>>> OK, so I tried a custom build of 3.7.3 with the latest *flatbuffers*
>>> (23.5.26), which was a drop-in replacement for 2.0.6 other than the version
>>> asserts.
>>>
>>> This does not exhibit the original problem either.
>>>
>>> However, while it produces files that the stock static build, the
>>> static build with the older *flatbuffers* (2.0.0), and the Ubuntu
>>> dynamic build can all read just fine, it is unable to read ANY files
>>> back in itself (in the context of our server geo importer, anyway).
>>>
>>> GDAL throws a *CE_Failure* of *Header failed consistency verification
>>> (1)*, which comes from *OGRFlatGeobufLayer::Open()*, and the dataset
>>> reports no layers (or at least, no vector layers).
>>>
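>>> (A sketch of the kind of call sequence involved, for anyone following
>>> along; "states.fgb" is just a placeholder file name, and this is not a
>>> confirmed minimal repro:)
>>>
>>> #include <cstdio>
>>> #include "gdal_priv.h"
>>>
>>> int main() {
>>>   GDALAllRegister();
>>>   GDALDataset *ds = GDALDataset::Open(
>>>       "states.fgb", GDAL_OF_VECTOR | GDAL_OF_READONLY);
>>>   if (ds == nullptr) { std::fprintf(stderr, "open failed\n"); return 1; }
>>>   // In the failing static build, OGRFlatGeobufLayer::Open() emits the
>>>   // CE_Failure above and the dataset then reports zero layers.
>>>   std::printf("layers: %d\n", ds->GetLayerCount());
>>>   GDALClose(ds);
>>>   return 0;
>>> }
>>>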
>>> This also appears to be a side-effect of it being a static build, as
>>> *ogrinfo* built from the same source (with *flatbuffers* 2.0.0), but in
>>> regular shared libs mode, can read all three files just fine. I have been
>>> unable to achieve a full-static tools build, so I can't try that right now.
>>>
>>> This either means that the problem is still there in some form in the
>>> latest *flatbuffers*, but has moved, or that the higher-level FGB file
>>> schema verification can be affected by the *flatbuffers* version. Both
>>> are equally concerning.
>>>
>>> Anyway, the build with the older *flatbuffers* 2.0.0 extracted from the
>>> v3.5.3 tree (with the *Table::VerifyField* mod) seems to work fine in
>>> all ways, so we're probably gonna go with that, in the absence of anything
>>> else.
>>>
>>> One other weirdness is that, of the three files, the two produced by the
>>> static builds (*flatbuffers* 2.0.0 and *flatbuffers* 23.5.26) are 16
>>> bytes longer than the one from the Ubuntu dynamic build. All three read
>>> just fine with *ogrinfo* and our server geo importer, and result in the
>>> same table. Here is a link to a bundle with all three files plus the
>>> GeoJSON equivalent (*MULTIPOLYGON* US states with some metadata).
>>>
>>>
>>> https://drive.google.com/file/d/1ETRuV63gvUL4aNAT_4KvjrtK1uiCrFun/view?usp=sharing
>>>
>>> As ever, happy to get into the weeds with more details of the original
>>> problem, but pretty sure that 95% of the readers of this list don't want
>>> this thread to get any longer! :)
>>>
>>> Simon
>>>
>>>
>>> On Fri, Feb 23, 2024 at 3:46 PM Simon Eves <simon.eves at heavy.ai> wrote:
>>>
>>>> Our emails crossed. I am indeed testing with the latest flatbuffers now
>>>> too.
>>>>
>>>> Agreed on the rest.
>>>>
>>>> On Fri, Feb 23, 2024 at 3:42 PM Even Rouault <
>>>> even.rouault at spatialys.com> wrote:
>>>>
>>>>> Simon,
>>>>>
>>>>> did you try to update to the latest
>>>>> https://github.com/google/flatbuffers/releases to see if that would
>>>>> solve the issue? If that worked, that would be the best way forward...
>>>>>
>>>>> Otherwise, if the issue persists with the latest flatbuffers release, an
>>>>> (admittedly rather tedious) option would be to do a git bisect on the
>>>>> flatbuffers code to identify the culprit commit. With some luck, the root
>>>>> cause might be obvious if a single culprit commit can be exhibited
>>>>> (perhaps some subtle C++ undefined behaviour is triggered? also, it is a
>>>>> bit mysterious that it hits only for static builds); otherwise, raise it
>>>>> with the upstream flatbuffers project to ask for their expertise.
>>>>>
>>>>> Even
>>>>> On 23/02/2024 at 23:54, Simon Eves via gdal-dev wrote:
>>>>>
>>>>> I was able to create a fork of 3.7.3 with just the *flatbuffers*
>>>>> replaced with the pre-3.6.x version (2.0.0).
>>>>>
>>>>> This seemed to only require changes to the version asserts and adding
>>>>> an *align* parameter to *Table::VerifyField()* to match the newer
>>>>> API.
>>>>>
>>>>>
>>>>> https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0
>>>>>
>>>>> Our system works correctly and passes all GDAL I/O tests with that
>>>>> version. Obviously this isn't an ideal solution, but this is otherwise a
>>>>> release blocker for us.
>>>>>
>>>>> I would still very much like to discuss the original problem more
>>>>> deeply, and hopefully come up with a better solution.
>>>>>
>>>>> Yours hopefully,
>>>>>
>>>>> Simon
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 22, 2024 at 10:22 PM Simon Eves <simon.eves at heavy.ai>
>>>>> wrote:
>>>>>
>>>>>> Thank you, Robert, for the RR tip. I shall try it.
>>>>>>
>>>>>> I have new findings to report, however.
>>>>>>
>>>>>> First of all, I confirmed that a build against GDAL 3.4.1 (the
>>>>>> version we were on before) still works. I also confirmed that builds
>>>>>> against 3.7.3 and 3.8.4 still failed even with no additional library
>>>>>> dependencies (just sqlite3 and proj), in case it was a side-effect of us
>>>>>> also adding more of those. I then tried 3.5.3, with the CMake build (same
>>>>>> config as we use for 3.7.3) and that worked. I then tried 3.6.4 (again,
>>>>>> same CMake config) and that failed. These were all from bundles.
>>>>>>
>>>>>> I then started delving through the GDAL repo itself. I found the
>>>>>> common root commit of 3.5.3 and 3.6.4, and all the commits in the
>>>>>> *ogr/ogrsf_frmts/flatgeobuf* sub-project between that one and the
>>>>>> final of each. For 3.5.3, this was only two commits. I built and tested both, and
>>>>>> they were fine. I then tried the very first one that was new in the 3.6.4
>>>>>> chain (not in the history of 3.5.3), which was actually a bulk update to
>>>>>> the *flatbuffers* sub-library, committed by Bjorn Harrtell on May 8
>>>>>> 2022 (SHA f7d8876). That one had the issue. I then tried the
>>>>>> immediately-preceding commit (an unrelated docs change) and that one was
>>>>>> fine.
>>>>>>
>>>>>> My current hypothesis, therefore, is that the *flatbuffers* update
>>>>>> introduced the issue, or at least the susceptibility to it.
>>>>>>
>>>>>> I still cannot explain why it only occurs in an all-static build, and
>>>>>> I am even less able to explain why it only occurs in our full system and
>>>>>> not with the simple test program, run against the very same static lib
>>>>>> build and making the very same sequence of GDAL API calls. But I
>>>>>> repeated the build tests of the commits on either side, and of a few
>>>>>> other random ones a bit further away in each direction, and the results
>>>>>> were consistent. Again, it happens with both GCC 11 and Clang 14 builds,
>>>>>> Debug or Release.
>>>>>>
>>>>>> I will continue tomorrow to look at the actual changes to
>>>>>> *flatbuffers* in that update, although they are quite significant.
>>>>>> Certainly the header containing the *vector_downward* class, which is
>>>>>> directly involved, was a new file in that update (although on inspection
>>>>>> of that file's history in the *google/flatbuffers* repo, it seems it was
>>>>>> just split out of another header).
>>>>>>
>>>>>> Bjorn, I don't mean to call you out directly, but I am CC'ing you to
>>>>>> ensure you see this, as you appear to be a significant contributor to the
>>>>>> *flatbuffers* project itself. Any insight you may have would be very
>>>>>> welcome. I am of course happy to describe my debugging findings in more
>>>>>> detail, privately if you wish, rather than spamming the list.
>>>>>>
>>>>>> Simon
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 20, 2024 at 1:49 PM Robert Coup <
>>>>>> robert.coup at koordinates.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Tue, 20 Feb 2024 at 21:44, Robert Coup <
>>>>>>> robert.coup at koordinates.com> wrote:
>>>>>>>
>>>>>>>> Hi Simon,
>>>>>>>>
>>>>>>>> On Tue, 20 Feb 2024 at 21:11, Simon Eves <simon.eves at heavy.ai>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Here's the stack trace for the original assert. Something is
>>>>>>>>> stepping on *scratch_*, making it 0x1000000000 instead of the null it
>>>>>>>>> starts out as when the flatbuffer object is created; by the time it
>>>>>>>>> gets to allocating memory, it's broken.
>>>>>>>>>
>>>>>>>>
>>>>>>>> What happens if you set a watchpoint in gdb when the flatbuffer is
>>>>>>>> created?
>>>>>>>>
>>>>>>>> watch -l myfb->scratch_
>>>>>>>> or watch *0x1234c0ffee
>>>>>>>>
>>>>>>>
>>>>>>> Or I've also had success with Mozilla's rr: https://rr-project.org/
>>>>>>> — you can run to a point where scratch is wrong, set a watchpoint on it,
>>>>>>> and then run the program backwards to find out what touched it.
>>>>>>>
>>>>>>> Rob :)
>>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> gdal-dev mailing list
>>>>> gdal-dev at lists.osgeo.org
>>>>> https://lists.osgeo.org/mailman/listinfo/gdal-dev
>>>>>
>>>>> -- http://www.spatialys.com
>>>>> My software is free, but my time generally not.
>>>>>
>> -- http://www.spatialys.com
>> My software is free, but my time generally not.
>>
>>