<div dir="ltr">Arrow moved to using a custom namespace in v10.0.0</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Feb 25, 2024 at 5:26 PM Simon Eves <<a href="mailto:simon.eves@heavy.ai">simon.eves@heavy.ai</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Ooh, good call!</div><div><br></div><div>That also corresponds with what I just tried, which was to leave the change in, but have the <b>size()</b> method return a value derived the old way instead of just returning <b>size_</b>, and also compare the two and log any mismatch. This also fails, which would seem to discount my thought that perhaps the math wasn't equivalent, and something else was getting confused by a different value returned from size() and then trampling on memory. However, no value mismatch is reported before it fails.<div><br></div></div><div>(pause for search)</div><div><br></div><div>So I scanned all the static libs in our dependency bundle with <b>nm</b>, and whaddya know... Apache Arrow (9.0.0) also uses <b>flatbuffers</b> and also with no namespace! I pulled the source, and it's v1.12.0... the <b>vector_downward</b> class has the same data members as the v2.0.0 in GDAL, without <b>size_</b>, which was inserted in the middle. </div><div><br></div><div>The latest Arrow 15.0 uses the latest <b>flatbuffers</b> 23.5.26, but with a custom namespace. I'll look through to see when they did that. 
9.0.0 is only 18 months old, but we could probably stand to upgrade that too.</div><div><br></div><div><font face="monospace">namespace arrow_vendored_private::flatbuffers {}<br>namespace flatbuffers = arrow_vendored_private::flatbuffers;<br></font></div><div><br></div><div>This also, of course, explains why we only hit the problem in the full server build, and I was unable to reproduce it with the simple test program, because that only linked GDAL and not Arrow too.</div><div><br></div><div>OK, so I guess we might be able to avoid it by upgrading Arrow, as long as that doesn't break something else. I guess you need to do the custom namespace thing too, though.</div><div><br></div><div>I hate computers.</div><div><br></div><div>Simon</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Feb 25, 2024 at 3:43 PM Even Rouault <<a href="mailto:even.rouault@spatialys.com" target="_blank">even.rouault@spatialys.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>

  
    
  
  <div>
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div>
          <div><br>
          </div>
          <div>Not obvious why that change would have broken anything,
            and certainly still absolutely no idea why it only happens
            in a full static build.</div>
        </div>
      </div>
    </blockquote>
    <p>At this point, my tentative bet is that your whole application
      has another component using flatbuffers at a different version,
      one that doesn't have the new vector_downward::size_ member.
      Although I would expect static linking to be in a better
      position than dynamic linking to detect duplicated
      symbols...</p>
    <p>One thing we didn't do in GDAL is to add a GDAL-specific
      namespace around its flatbuffers component (we did that in
      MapServer to avoid potential conflicts between MapServer's
      flatbuffers copy and the GDAL one).<br>
      <br>
    </p>
    <p>An interesting experiment would be to revert
<a href="https://github.com/google/flatbuffers/commit/9e4ca857b6dadf116703f612187e33b7d4bb6688" target="_blank">https://github.com/google/flatbuffers/commit/9e4ca857b6dadf116703f612187e33b7d4bb6688</a>
      but add an unused size_ member to see if that's enough to break
      things.  Or just scramble the order of the members of
      vector_downward a bit.</p>
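    <p>To illustrate why a second flatbuffers copy with a different class layout would matter, here is a minimal C++ sketch. The two structs are hypothetical stand-ins, not the real vector_downward definitions: only scratch_ and size_ are names taken from this thread, and the other members and their order are assumptions made for illustration. If code compiled against one layout were handed an object built with the other, it would read and write scratch_ at the wrong offset.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for an older copy of the class (no size_ member).
// The member list is illustrative only, not the real flatbuffers layout.
namespace old_copy {
struct vector_downward {
  std::size_t reserved_;
  std::uint8_t *buf_;
  std::uint8_t *cur_;
  std::uint8_t *scratch_;
};
}  // namespace old_copy

// Hypothetical stand-in for a newer copy: same members, with size_ inserted
// in the middle, which pushes every later member to a higher offset.
namespace new_copy {
struct vector_downward {
  std::size_t reserved_;
  std::size_t size_;
  std::uint8_t *buf_;
  std::uint8_t *cur_;
  std::uint8_t *scratch_;
};
}  // namespace new_copy

// Number of bytes by which scratch_ moves between the two layouts.
inline std::size_t scratch_offset_shift() {
  return offsetof(new_copy::vector_downward, scratch_) -
         offsetof(old_copy::vector_downward, scratch_);
}

// Print where each layout places scratch_ and how large each struct is.
inline void print_layouts() {
  std::printf("old: scratch_ at offset %zu, sizeof %zu\n",
              offsetof(old_copy::vector_downward, scratch_),
              sizeof(old_copy::vector_downward));
  std::printf("new: scratch_ at offset %zu, sizeof %zu\n",
              offsetof(new_copy::vector_downward, scratch_),
              sizeof(new_copy::vector_downward));
}
```

    <p>In this sketch the two classes live in different namespaces so that they can coexist in one file. In the scenario suspected here they would share the same qualified name, so the linker could silently keep only one definition of each inline function while callers still assume their own layout.</p>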
    <p>Or try replacing the "flatbuffers" namespace with something like
      "gdal_flatbuffers".<br>
    </p>
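    <p>A rough sketch of what the namespace experiment could look like in C++ follows. Everything here is a simplified stand-in: gdal_flatbuffers is the name suggested above, while other_vendored_flatbuffers, gdal_driver_code and fresh_builder_size are hypothetical names, and the struct bodies are not the real flatbuffers code. The point is only that two same-named classes in distinct namespaces get distinct mangled symbols, and a namespace alias lets existing flatbuffers:: spellings keep compiling.</p>

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the project's vendored copy, wrapped in a project-specific
// namespace so its symbols cannot collide with any other copy.
namespace gdal_flatbuffers {
struct vector_downward {
  std::size_t reserved_ = 0;
  std::size_t size_ = 0;
  std::size_t size() const { return size_; }
};
}  // namespace gdal_flatbuffers

// Stand-in for an older copy vendored by some other dependency
// (hypothetical name); it has no size_ member.
namespace other_vendored_flatbuffers {
struct vector_downward {
  std::size_t reserved_ = 0;
  std::size_t size() const { return 0; }
};
}  // namespace other_vendored_flatbuffers

namespace gdal_driver_code {
// Local alias: code written against flatbuffers:: still compiles, but it
// now resolves to the project-specific namespace.
namespace flatbuffers = gdal_flatbuffers;

inline std::size_t fresh_builder_size() {
  flatbuffers::vector_downward v;  // really gdal_flatbuffers::vector_downward
  return v.size();
}
}  // namespace gdal_driver_code
```

    <p>Because the alias is only visible where it is declared, the other copy is unaffected and both can be linked into the same static binary.</p>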
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div>
          <div>Simon<br>
          </div>
          <div><br>
          </div>
          <div><br>
          </div>
        </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Sat, Feb 24, 2024 at
          5:27 PM Simon Eves <<a href="mailto:simon.eves@heavy.ai" target="_blank">simon.eves@heavy.ai</a>>
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div dir="ltr">OK, so I tried a custom build of 3.7.3 with the
            latest <b>flatbuffers</b> (23.5.26), which was a drop-in
            replacement for 2.0.6 other than the version asserts.
            <div><br>
            </div>
            <div>This does not exhibit the original problem either.</div>
            <div><br>
            </div>
            <div>However, while it produces files which the stock static
              build, the static build with the older <b>flatbuffers</b>
              (2.0.0), and the Ubuntu dynamic build, can all read just
              fine, it is unable to read ANY files back in again (in the
              context of our server geo importer, anyway).</div>
            <div><br>
            </div>
            <div>GDAL throws a <b>CE_Failure</b> of <b>Header failed
                consistency verification (1)</b>, which is from <b>OGRFlatGeobufLayer::Open()</b>, and
              the dataset reports no layers (or at least, no
              vector layers).</div>
            <div><br>
            </div>
            <div>This also appears to be a side-effect of it being a
              static build, as <b>ogrinfo</b> built from the same
              source (with <b>flatbuffers</b> 2.0.0), but in regular
              shared libs mode, can read all three files just fine. I
              have been unable to achieve a full-static tools build, so
              I can't try that right now.</div>
            <div><br>
            </div>
            <div>This either means that the problem is still there in
              some form in the latest <b>flatbuffers</b>, but has
              moved, or that the higher-level FGB file schema
              verification can be affected by the <b>flatbuffers</b>
              version. Both are equally concerning.</div>
            <div><br>
            </div>
            <div>Anyway, the build with the older <b>flatbuffers</b>
              2.0.0 extracted from the v3.5.3 tree (with the <b>Table::VerifyField</b>
              mod) seems to work fine in all ways, so we're probably
              gonna go with that, in the absence of anything else.</div>
            <div><br>
            </div>
            <div>One other weirdness is that, of the three files, the
              two produced by the static builds (<b>flatbuffers</b>
              2.0.0 and <b>flatbuffers</b> 23.5.26) are 16 bytes longer
              than the one from the Ubuntu dynamic build. All three read
              just fine with <b>ogrinfo</b> and our server geo importer,
              and result in the same table. Here is a link to a bundle
              with all three files plus the GeoJSON equivalent (<b>MULTIPOLYGON</b>
              US states with some metadata).</div>
            <div><br>
            </div>
            <div><a href="https://drive.google.com/file/d/1ETRuV63gvUL4aNAT_4KvjrtK1uiCrFun/view?usp=sharing" target="_blank">https://drive.google.com/file/d/1ETRuV63gvUL4aNAT_4KvjrtK1uiCrFun/view?usp=sharing</a><br>
            </div>
            <div><br>
            </div>
            <div>As ever, happy to get into the weeds with more details
              of the original problem, but pretty sure that 95% of the
              readers of this list don't want this thread to get any
              longer! :)</div>
            <div><br>
            </div>
            <div>Simon</div>
            <div><br>
            </div>
          </div>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Fri, Feb 23, 2024 at
              3:46 PM Simon Eves <<a href="mailto:simon.eves@heavy.ai" target="_blank">simon.eves@heavy.ai</a>>
              wrote:<br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div dir="ltr">Our emails crossed. I am indeed testing
                with the latest flatbuffers now too.
                <div><br>
                </div>
                <div>Agreed on the rest.</div>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Fri, Feb 23, 2024
                  at 3:42 PM Even Rouault <<a href="mailto:even.rouault@spatialys.com" target="_blank">even.rouault@spatialys.com</a>>
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                  <div>
                    <p>Simon,</p>
                    <p>Did you try updating to the latest <a href="https://github.com/google/flatbuffers/releases" target="_blank">https://github.com/google/flatbuffers/releases</a>
                      to see if that would solve the issue? If that
                      worked, that would be the best way forward...</p>
                    <p>Otherwise, if the issue persists with the latest
                      flatbuffers release, an (admittedly rather tedious)
                      option would be to do a git bisect on the
                      flatbuffers code to identify the culprit commit.
                      With some luck, the root cause might be obvious if
                      a single culprit commit can be identified (perhaps
                      some subtle C++ undefined behaviour is triggered?
                      It is also a bit mysterious that it only hits
                      static builds). Otherwise, raise it with the upstream
                      flatbuffers project to ask for their expertise.</p>
                    <p>Even<br>
                    </p>
                    <div>On 23/02/2024 at 23:54, Simon Eves via gdal-dev
                      wrote:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">I was able to create a fork of
                        3.7.3 with just the <b>flatbuffers</b> replaced
                        with the pre-3.6.x version (2.0.0).
                        <div><br>
                        </div>
                        <div>This seemed to only require changes to the
                          version asserts and adding an <b>align</b>
                          parameter to <b>Table::VerifyField()</b> to
                          match the newer API.
                          <div><br>
                          </div>
                          <div><a href="https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0" target="_blank">https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0</a><br>
                          </div>
                          <div><br>
                          </div>
                          <div>Our system works correctly and passes all
                            GDAL I/O tests with that version. Obviously
                            this isn't an ideal solution, but this is
                            otherwise a release blocker for us.</div>
                          <div><br>
                          </div>
                          <div>I would still very much like to discuss
                            the original problem more deeply, and
                            hopefully come up with a better solution.</div>
                          <div><br>
                          </div>
                          <div>Yours hopefully,</div>
                          <div><br>
                          </div>
                          <div>Simon</div>
                          <div><br>
                          </div>
                          <div><br>
                          </div>
                        </div>
                      </div>
                      <br>
                      <div class="gmail_quote">
                        <div dir="ltr" class="gmail_attr">On Thu, Feb
                          22, 2024 at 10:22 PM Simon Eves <<a href="mailto:simon.eves@heavy.ai" target="_blank">simon.eves@heavy.ai</a>>
                          wrote:<br>
                        </div>
                        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                          <div dir="ltr">Thank you, Robert, for the RR
                            tip. I shall try it.
                            <div><br>
                            </div>
                            <div>I have new findings to report, however.</div>
                            <div><br>
                            </div>
                            <div>First of all, I confirmed that a build
                              against GDAL 3.4.1 (the version we were on
                              before) still works. I also confirmed that
                              builds against 3.7.3 and 3.8.4 still
                              failed even with no additional library
                              dependencies (just sqlite3 and proj), in
                              case it was a side-effect of us also
                              adding more of those. I then tried 3.5.3,
                              with the CMake build (same config as we
                              use for 3.7.3) and that worked. I then
                              tried 3.6.4 (again, same CMake config) and
                              that failed. These were all from bundles.</div>
                            <div><br>
                            </div>
                            <div>I then started delving through the GDAL
                              repo itself. I found the common root
                              commit of 3.5.3 and 3.6.4, and all the
                              commits in the <b>ogr/ogrsf_frmts/flatgeobuf</b> sub-project
                              between that one and the final of each.
                              For 3.5.3, this was only two. I built and
                              tested both, and they were fine. I then
                              tried the very first one that was new in
                              the 3.6.4 chain (not in the history of
                              3.5.3), which was actually a bulk update
                              to the <b>flatbuffers</b> sub-library,
                              committed by Bjorn Harrtell on May 8 2022
                              (SHA f7d8876). That one had the issue. I
                              then tried the immediately-preceding
                              commit (an unrelated docs change) and that
                              one was fine.</div>
                            <div><br>
                            </div>
                            <div>My current hypothesis, therefore, is
                              that the <b>flatbuffers</b> update
                              introduced the issue, or at least the
                              susceptibility to it.</div>
                            <div><br>
                            </div>
                            <div>I still cannot explain why it only
                              occurs in an all-static build, and I am even
                              less able to explain why it only occurs in
                              our full system and not with the simple
                              test program, which links the very same static
                              lib build and makes the very same sequence
                              of GDAL API calls. However, I repeated the
                              build tests of the commits on either side, plus
                              a few other random ones a bit further away
                              in each direction, and the results were
                              consistent. Again, it happens with both
                              GCC 11 and Clang 14 builds, Debug or
                              Release.<br>
                            </div>
                            <div><br>
                            </div>
                            <div>I will continue tomorrow to look at the
                              actual changes to <b>flatbuffers</b> in
                              that update, although they are quite
                              significant. Certainly, the <b>vector_downward</b>
                              class, which is directly involved, was a
                              new file in that update (although on
                              inspection of that file's history in the <b>google/flatbuffers</b>
                              repo, it seems it was just split out of
                              another header).</div>
                            <div><br>
                            </div>
                            <div>Bjorn, I don't mean to call you out
                              directly, but I am CC'ing you to ensure
                              you see this, as you appear to be a
                              significant contributor to the <b>flatbuffers</b>
                              project itself. Any insight you may have
                              would be very welcome. I am of course
                              happy to describe my debugging findings in
                              more detail, privately if you wish, rather
                              than spamming the list.</div>
                            <div><br>
                            </div>
                            <div>Simon</div>
                            <div><br>
                            </div>
                          </div>
                          <br>
                          <div class="gmail_quote">
                            <div dir="ltr" class="gmail_attr">On Tue,
                              Feb 20, 2024 at 1:49 PM Robert Coup <<a href="mailto:robert.coup@koordinates.com" target="_blank">robert.coup@koordinates.com</a>>
                              wrote:<br>
                            </div>
                            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                              <div dir="ltr">
                                <div dir="ltr">Hi,</div>
                                <div dir="ltr"><br>
                                </div>
                                <div dir="ltr">On Tue, 20 Feb 2024 at
                                  21:44, Robert Coup <<a href="mailto:robert.coup@koordinates.com" target="_blank">robert.coup@koordinates.com</a>>
                                  wrote:<br>
                                </div>
                                <div class="gmail_quote">
                                  <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                    <div dir="ltr">
                                      <div>Hi Simon,</div>
                                      <div><br>
                                      </div>
                                      <div class="gmail_quote">
                                        <div dir="ltr" class="gmail_attr">On Tue, 20
                                          Feb 2024 at 21:11, Simon Eves
                                          <<a href="mailto:simon.eves@heavy.ai" target="_blank">simon.eves@heavy.ai</a>> wrote:<br>
                                        </div>
                                        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                          <div dir="ltr">Here's the
                                            stack trace for the original
                                            assert. Something is
                                            stepping on scratch_ to make
                                            it 0x1000000000 instead of
                                            null, which it starts out as
                                            when the flatbuffer object
                                            is created, but by the time
                                            it gets to allocating
                                            memory, it's broken.</div>
                                        </blockquote>
                                        <div><br>
                                        </div>
                                        What happens if you set a
                                        watchpoint in gdb when the
                                        flatbuffer is created?
                                        <div><br>
                                        </div>
                                        <div><span style="color:rgb(0,0,0)"><font face="monospace">watch -l
                                              myfb->scratch</font></span></div>
                                        <div><span style="color:rgb(0,0,0)">or </span><span style="color:rgb(0,0,0);font-family:monospace">watch *0x1234c0ffee</span></div>
                                      </div>
                                    </div>
                                  </blockquote>
                                  <div><br>
                                  </div>
                                  <div dir="ltr">Or I've also had
                                    success with Mozilla's rr: <a href="https://rr-project.org/" target="_blank">https://rr-project.org/</a>
                                    — you can run to a point where
                                    scratch is wrong, set a watchpoint
                                    on it, and then run the program
                                    backwards to find out what touched
                                    it.</div>
                                  <div dir="ltr"><br>
                                  </div>
                                  <div>Rob :)</div>
                                </div>
                              </div>
                            </blockquote>
                          </div>
                        </blockquote>
                      </div>
                      <br>
                      <fieldset></fieldset>
                      <pre>_______________________________________________
gdal-dev mailing list
<a href="mailto:gdal-dev@lists.osgeo.org" target="_blank">gdal-dev@lists.osgeo.org</a>
<a href="https://lists.osgeo.org/mailman/listinfo/gdal-dev" target="_blank">https://lists.osgeo.org/mailman/listinfo/gdal-dev</a>
</pre>
                    </blockquote>
                    <pre cols="72">-- 
<a href="http://www.spatialys.com" target="_blank">http://www.spatialys.com</a>
My software is free, but my time generally not.</pre>
                  </div>
                </blockquote>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </div>

</blockquote></div>
</blockquote></div>