<div dir="ltr"><div>Ooh, good call!</div><div><br></div><div>That also fits with what I just tried, which was to leave the change in, but have the <b>size()</b> method return a value derived the old way instead of just returning <b>size_</b>, and also compare the two and log any mismatch. This also fails, and no value mismatch is reported before it fails, which would seem to rule out my theory that the math wasn't equivalent and that something else was getting confused by a different value returned from size() and then trampling on memory.<div><br></div></div><div>(pause for search)</div><div><br></div><div>So I scanned all the static libs in our dependency bundle with <b>nm</b>, and whaddya know... Apache Arrow (9.0.0) also uses <b>flatbuffers</b>, and also with no custom namespace! I pulled the source, and it's v1.12.0... the <b>vector_downward</b> class has the same data members as v2.0.0 (the pre-3.6 GDAL copy), that is, without <b>size_</b>, which the newer version inserts in the middle. </div><div><br></div><div>The latest Arrow 15.0 uses the latest <b>flatbuffers</b> 23.5.26, but with a custom namespace. I'll look through to see when they did that. 9.0.0 is only 18 months old, but we could probably stand to upgrade that too.</div><div><br></div><div><font face="monospace">namespace arrow_vendored_private::flatbuffers {}<br>namespace flatbuffers = arrow_vendored_private::flatbuffers;<br></font></div><div><br></div><div>This also, of course, explains why we only hit the problem in the full server build, and I was unable to reproduce it with the simple test program, because that only linked GDAL and not Arrow too.</div><div><br></div><div>OK, so I guess we might be able to avoid it by upgrading Arrow, as long as that doesn't break something else. 
I guess you need to do the custom namespace thing too, though.</div><div><br></div><div>I hate computers.</div><div><br></div><div>Simon</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Feb 25, 2024 at 3:43 PM Even Rouault <<a href="mailto:even.rouault@spatialys.com" target="_blank">even.rouault@spatialys.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>
<div>
<br>
<blockquote type="cite">
<div dir="ltr">
<div>
<div><br>
</div>
<div>Not obvious why that change would have broken anything,
and certainly still absolutely no idea why it only happens
in a full static build.</div>
</div>
</div>
</blockquote>
    <p>At this point, my tentative bet is that another component of your
      application uses flatbuffers at a different version, one that
      doesn't have the new vector_downward::size_ member. That said, I
      would have expected static linking to be in a better position than
      dynamic linking to detect duplicated symbols...</p>
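    <p>To illustrate why such a mismatch would matter, here is a minimal
      sketch. It assumes two simplified, hypothetical stand-ins for the
      class (the namespaces <b>old_layout</b> and <b>new_layout</b> and the
      helper functions are invented for illustration; the real
      vector_downward has more members). It only shows that inserting a
      member in the middle moves every member after it, so code compiled
      against one version would look for scratch_ at the wrong offset in an
      object laid out by the other.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-ins for two versions of the same class.
// The real vector_downward has more members; only the shape matters here.
namespace old_layout {
struct vector_downward {
  std::size_t reserved_;
  std::uint8_t *buf_;
  std::uint8_t *cur_;
  std::uint8_t *scratch_;
};
}  // namespace old_layout

namespace new_layout {
struct vector_downward {
  std::size_t reserved_;
  std::uint32_t size_;  // new member inserted in the middle
  std::uint8_t *buf_;
  std::uint8_t *cur_;
  std::uint8_t *scratch_;
};
}  // namespace new_layout

// Byte offset of scratch_ as seen by code compiled against each version.
constexpr std::size_t old_scratch_offset() {
  return offsetof(old_layout::vector_downward, scratch_);
}
constexpr std::size_t new_scratch_offset() {
  return offsetof(new_layout::vector_downward, scratch_);
}

// True only if both versions agree on the object size and on where
// scratch_ lives, i.e. if they could safely be mixed.
constexpr bool layouts_agree() {
  return old_scratch_offset() == new_scratch_offset() &&
         sizeof(old_layout::vector_downward) ==
             sizeof(new_layout::vector_downward);
}

// Checked at compile time: the two layouts are not interchangeable.
static_assert(!layouts_agree(), "the two layouts are not interchangeable");
```

    <p>In a real build both classes would carry the same qualified name,
      which is what lets the linker silently keep one copy of each inline
      function; the sketch uses two namespaces only so that it compiles as
      a single file.</p>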
    <p>One thing we didn't do in GDAL is add a GDAL-specific
      namespace around its flatbuffers component (we did that in
      MapServer to avoid potential conflicts between MapServer's
      flatbuffers copy and the GDAL one).<br>
      <br>
    </p>
    <p>An interesting experiment would be to revert
<a href="https://github.com/google/flatbuffers/commit/9e4ca857b6dadf116703f612187e33b7d4bb6688" target="_blank">https://github.com/google/flatbuffers/commit/9e4ca857b6dadf116703f612187e33b7d4bb6688</a>
      but add an unused size_ member, to see if that's enough to break
      things. Or just shuffle the order of the members of
      vector_downward a bit.</p>
    <p>Or try replacing the "flatbuffers" namespace with something like
      "gdal_flatbuffers".<br>
    </p>
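    <p>A minimal sketch of that idea, assuming two hypothetical vendored
      copies with different layouts (the namespace
      <b>other_flatbuffers</b>, the trimmed-down classes and the two helper
      functions are invented for illustration). Because each copy lives in
      its own namespace, the two classes are distinct types with distinct
      symbol names, and a namespace alias lets each component keep writing
      <b>flatbuffers::</b> in its own code.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical vendored copy used by one component (newer layout).
namespace gdal_flatbuffers {
struct vector_downward {
  std::size_t reserved_ = 0;
  std::uint32_t size_ = 0;
  std::uint8_t *scratch_ = nullptr;
};
}  // namespace gdal_flatbuffers

// Hypothetical vendored copy used by another component (older layout).
namespace other_flatbuffers {
struct vector_downward {
  std::size_t reserved_ = 0;
  std::uint8_t *scratch_ = nullptr;
};
}  // namespace other_flatbuffers

// Each component aliases "flatbuffers" to its own copy. The alias is
// local to the function here; in a real project it would sit in the
// vendored headers of that component.
std::size_t size_seen_by_gdal() {
  namespace flatbuffers = gdal_flatbuffers;
  return sizeof(flatbuffers::vector_downward);
}

std::size_t size_seen_by_other() {
  namespace flatbuffers = other_flatbuffers;
  return sizeof(flatbuffers::vector_downward);
}

// The two classes are unrelated types, so neither the compiler nor the
// linker can confuse one for the other.
static_assert(!std::is_same<gdal_flatbuffers::vector_downward,
                            other_flatbuffers::vector_downward>::value,
              "the two vendored copies must be distinct types");
```

    <p>Calling size_seen_by_gdal() and size_seen_by_other() from the same
      program returns two different sizes, each consistent with the copy
      that component was compiled against.</p>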
<br>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>Simon<br>
</div>
<div><br>
</div>
<div><br>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Sat, Feb 24, 2024 at
5:27 PM Simon Eves <<a href="mailto:simon.eves@heavy.ai" target="_blank">simon.eves@heavy.ai</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">OK, so I tried a custom build of 3.7.3 with the
latest <b>flatbuffers</b> (23.5.26), which was a drop-in
replacement for 2.0.6 other than the version asserts.
<div><br>
</div>
<div>This does not exhibit the original problem either.</div>
<div><br>
</div>
<div>However, while it produces files which the stock static
build, the static build with the older <b>flatbuffers</b>
(2.0.0), and the Ubuntu dynamic build, can all read just
fine, it is unable to read ANY files back in again (in the
context of our server geo importer, anyway).</div>
<div><br>
</div>
<div>GDAL throws a <b>CE_Failure</b> of <b>Header failed
consistency verification (1), </b>which is from <b>OGRFlatGeobufLayer::Open(),</b> and
the dataset reports no layers (or at least, no
vector layers).</div>
<div><br>
</div>
<div>This also appears to be a side-effect of it being a
static build, as <b>ogrinfo</b> built from the same
source (with <b>flatbuffers</b> 2.0.0), but in regular
shared libs mode, can read all three files just fine. I
have been unable to achieve a full-static tools build, so
I can't try that right now.</div>
<div><br>
</div>
<div>This either means that the problem is still there in
some form in the latest <b>flatbuffers</b>, but has
moved, or that the higher-level FGB file schema
verification can be affected by the <b>flatbuffers</b>
version. Both are equally concerning.</div>
<div><br>
</div>
<div>Anyway, the build with the older <b>flatbuffers</b>
2.0.0 extracted from the v3.5.3 tree (with the <b>Table::VerifyField</b>
mod) seems to work fine in all ways, so we're probably
gonna go with that, in the absence of anything else.</div>
<div><br>
</div>
<div>One other weirdness is that, of the three files, the
two produced by the static builds (<b>flatbuffers</b>
2.0.0 and <b>flatbuffers</b> 23.5.26) are 16 bytes longer
than the one from the Ubuntu dynamic build. All three read
just fine with <b>ogrinfo</b> and our server geo importer,
and result in the same table. Here is a link to a bundle
with all three files plus the GeoJSON equivalent (<b>MULTIPOLYGON</b>
US states with some metadata).</div>
<div><br>
</div>
<div><a href="https://drive.google.com/file/d/1ETRuV63gvUL4aNAT_4KvjrtK1uiCrFun/view?usp=sharing" target="_blank">https://drive.google.com/file/d/1ETRuV63gvUL4aNAT_4KvjrtK1uiCrFun/view?usp=sharing</a><br>
</div>
<div><br>
</div>
<div>As ever, happy to get into the weeds with more details
of the original problem, but pretty sure that 95% of the
readers of this list don't want this thread to get any
longer! :)</div>
<div><br>
</div>
<div>Simon</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Feb 23, 2024 at
3:46 PM Simon Eves <<a href="mailto:simon.eves@heavy.ai" target="_blank">simon.eves@heavy.ai</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">Our emails crossed. I am indeed testing
with the latest flatbuffers now too.
<div><br>
</div>
<div>Agreed on the rest.</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Feb 23, 2024
at 3:42 PM Even Rouault <<a href="mailto:even.rouault@spatialys.com" target="_blank">even.rouault@spatialys.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Simon,</p>
                    <p>did you try updating to the latest <a href="https://github.com/google/flatbuffers/releases" target="_blank">https://github.com/google/flatbuffers/releases</a>
                      to see if that solves the issue? If that
                      worked, it would be the best way forward...</p>
                    <p>Otherwise, if the issue persists with the latest
                      flatbuffers release, an (admittedly rather tedious)
                      option would be to do a git bisect on the
                      flatbuffers code to identify the culprit commit.
                      With some luck, the root cause might be obvious if
                      a single culprit commit can be isolated (perhaps
                      some subtle C++ undefined behaviour is triggered?
                      It is also a bit mysterious that it only hits
                      static builds). Failing that, raise it with the upstream
                      flatbuffers project to ask for their expertise.</p>
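                    <p>For readers unfamiliar with bisecting, the search
                      idea can be sketched as below. This is not how git
                      itself is implemented, just a minimal illustration
                      that assumes a linear history ordered oldest to
                      newest, a known-good first commit, a known-bad last
                      commit, and a hypothetical <b>is_bad</b> callback
                      that builds and tests one commit.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns the index of the first bad commit.
// Assumptions: commits are ordered oldest to newest, commits.front() is
// good, commits.back() is bad, and every commit after the first bad one
// is also bad. is_bad is a hypothetical "build and test" callback.
std::size_t first_bad_commit(
    const std::vector<std::string> &commits,
    const std::function<bool(const std::string &)> &is_bad) {
  assert(commits.size() >= 2);
  std::size_t good = 0;                  // newest commit known to be good
  std::size_t bad = commits.size() - 1;  // oldest commit known to be bad
  while (bad - good > 1) {
    const std::size_t mid = good + (bad - good) / 2;
    if (is_bad(commits[mid])) {
      bad = mid;   // culprit is at mid or earlier
    } else {
      good = mid;  // culprit is after mid
    }
  }
  return bad;
}
```

                    <p>Each step halves the remaining range, so the
                      number of builds needed grows only with the
                      logarithm of the number of commits.</p>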
<p>Even<br>
</p>
                    <div>On 23/02/2024 at 23:54, Simon Eves via gdal-dev
                      wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">I was able to create a fork of
3.7.3 with just the <b>flatbuffers</b> replaced
with the pre-3.6.x version (2.0.0).
<div><br>
</div>
<div>This seemed to only require changes to the
version asserts and adding an <b>align</b>
parameter to <b>Table::VerifyField()</b> to
match the newer API.
<div><br>
</div>
<div><a href="https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0" target="_blank">https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0</a><br>
</div>
<div><br>
</div>
<div>Our system works correctly and passes all
GDAL I/O tests with that version. Obviously
this isn't an ideal solution, but this is
otherwise a release blocker for us.</div>
<div><br>
</div>
<div>I would still very much like to discuss
the original problem more deeply, and
hopefully come up with a better solution.</div>
<div><br>
</div>
<div>Yours hopefully,</div>
<div><br>
</div>
<div>Simon</div>
<div><br>
</div>
<div><br>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Feb
22, 2024 at 10:22 PM Simon Eves <<a href="mailto:simon.eves@heavy.ai" target="_blank">simon.eves@heavy.ai</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">Thank you, Robert, for the RR
tip. I shall try it.
<div><br>
</div>
<div>I have new findings to report, however.</div>
<div><br>
</div>
<div>First of all, I confirmed that a build
against GDAL 3.4.1 (the version we were on
before) still works. I also confirmed that
builds against 3.7.3 and 3.8.4 still
failed even with no additional library
dependencies (just sqlite3 and proj), in
case it was a side-effect of us also
adding more of those. I then tried 3.5.3,
with the CMake build (same config as we
use for 3.7.3) and that worked. I then
tried 3.6.4 (again, same CMake config) and
that failed. These were all from bundles.</div>
<div><br>
</div>
<div>I then started delving through the GDAL
repo itself. I found the common root
commit of 3.5.3 and 3.6.4, and all the
commits in the <b>ogr/ogrsf_frmts/flatgeobuf</b> sub-project
between that one and the final of each.
For 3.5.3, this was only two. I built and
tested both, and they were fine. I then
tried the very first one that was new in
the 3.6.4 chain (not in the history of
3.5.3), which was actually a bulk update
to the <b>flatbuffers</b> sub-library,
committed by Bjorn Harrtell on May 8 2022
(SHA f7d8876). That one had the issue. I
then tried the immediately-preceding
commit (an unrelated docs change) and that
one was fine.</div>
<div><br>
</div>
                            <div>My current hypothesis, therefore, is
                              that the <b>flatbuffers</b> update
                              introduced the issue, or at least the
                              susceptibility to it.</div>
<div><br>
</div>
                            <div>I still cannot explain why it only
                              occurs in an all-static build, and I am even
                              less able to explain why it only occurs in
                              our full system and not with the simple
                              test program, which links the very same
                              static lib build and makes the very same
                              sequence of GDAL API calls. However, I
                              repeated the build tests of the commits
                              either side, and of a few other random ones
                              a bit further away in each direction, and
                              the results were consistent. Again, it
                              happens with both GCC 11 and Clang 14
                              builds, Debug or Release.<br>
                            </div>
<div><br>
</div>
<div>I will continue tomorrow to look at the
actual changes to <b>flatbuffers</b> in
that update, although they are quite
significant. Certainly, the <b>vector_downward</b>
class, which is directly involved, was a
new file in that update (although on
inspection of that file's history in the <b>google/flatbuffers</b>
repo, it seems it was just split out of
another header).</div>
<div><br>
</div>
<div>Bjorn, I don't mean to call you out
directly, but I am CC'ing you to ensure
you see this, as you appear to be a
significant contributor to the <b>flatbuffers</b>
project itself. Any insight you may have
would be very welcome. I am of course
happy to describe my debugging findings in
more detail, privately if you wish, rather
than spamming the list.</div>
<div><br>
</div>
<div>Simon</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue,
Feb 20, 2024 at 1:49 PM Robert Coup <<a href="mailto:robert.coup@koordinates.com" target="_blank">robert.coup@koordinates.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div dir="ltr">Hi,</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">On Tue, 20 Feb 2024 at
21:44, Robert Coup <<a href="mailto:robert.coup@koordinates.com" target="_blank">robert.coup@koordinates.com</a>>
wrote:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div>Hi Simon,</div>
<div><br>
</div>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, 20
Feb 2024 at 21:11, Simon Eves
<<a href="mailto:simon.eves@heavy.ai" target="_blank">simon.eves@heavy.ai</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                      <div dir="ltr">Here's the
                                        stack trace for the original
                                        assert. Something is
                                        stepping on scratch_: it
                                        starts out as null when the
                                        flatbuffer object is created,
                                        but by the time it gets to
                                        allocating memory, it has
                                        become 0x1000000000.</div>
</blockquote>
<div><br>
</div>
What happens if you set a
watchpoint in gdb when the
flatbuffer is created?
<div><br>
</div>
<div><span style="color:rgb(0,0,0)"><font face="monospace">watch -l
myfb->scratch</font></span></div>
<div><span style="color:rgb(0,0,0)">or </span><span style="color:rgb(0,0,0);font-family:monospace">watch *0x1234c0ffee</span></div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div dir="ltr">Or I've also had
success with Mozilla's rr: <a href="https://rr-project.org/" target="_blank">https://rr-project.org/</a>
— you can run to a point where
scratch is wrong, set a watchpoint
on it, and then run the program
backwards to find out what touched
it.</div>
<div dir="ltr"><br>
</div>
<div>Rob :)</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br>
<fieldset></fieldset>
<pre>_______________________________________________
gdal-dev mailing list
<a href="mailto:gdal-dev@lists.osgeo.org" target="_blank">gdal-dev@lists.osgeo.org</a>
<a href="https://lists.osgeo.org/mailman/listinfo/gdal-dev" target="_blank">https://lists.osgeo.org/mailman/listinfo/gdal-dev</a>
</pre>
</blockquote>
<pre cols="72">--
<a href="http://www.spatialys.com" target="_blank">http://www.spatialys.com</a>
My software is free, but my time generally not.</pre>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote></div>