<div dir="ltr">Here's the stack trace for the original assert. Something is stepping on scratch_: it starts out null when the flatbuffer object is created, but by the time the code gets to allocating memory it has become 0x1000000000.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Feb 20, 2024 at 1:05 PM Simon Eves &lt;<a href="mailto:simon.eves@heavy.ai">simon.eves@heavy.ai</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">(starting a new thread to avoid derailing the static-build one any further)<br><br>Totally agreed on the mismatch idea, but the code in question is all self-contained down in <b>ogr/ogrsf_frmts/flatgeobuf</b> and the <b>flatbuffers</b> sub-project (which is a snapshot of a Google OSS project), so I'm struggling to see how there could be a mismatch.<div><br></div><div>Also, although we're building on CentOS 7, we're using relatively new compilers (GCC 11.4 and Clang 14.0.6), and we bundle the matching newer runtimes.</div><div><br></div><div>We don't have a full static build stack on our normal dev platform (Ubuntu 22.04), so I haven't been able to repro the problem there.</div><div><br></div><div>I should have mentioned the first time that we have tried using ASAN, and it definitely catches something wrong, but the behavior is different, and varies if you add more debug printfs. 
For example:</div><div><br></div><div>DEBUG: vector_downward::push() num = 16<br>DEBUG: about to reallocate, buf_ = 0, cur_ = 0, scratch = 0<br>DEBUG: reallocated, buf_ = 0x61900062d380, cur_ = 0x61900062cf80, scratch = 0<br>DEBUG: vector_downward::push() ptr = 0x61900062cf70, about to do memcpy<br>=================================================================<br>==25459==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61900062cf70 at pc 0x7f8933eb87f6 bp 0x7fffa7aa0e70 sp 0x7fffa7aa0620<br>WRITE of size 16 at 0x61900062cf70 thread T0<br></div><div><br></div><div>...but it's still not obvious what exactly is going wrong. The code and data flow make perfect sense when you step through them in a dynamic build that doesn't fail.</div><div><br></div><div>Like I said, the frustrating part is that a simple test program (attached) compiled against the same set of static libs works fine.</div><div><br></div><div>S<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Feb 20, 2024 at 12:33 PM Robert Coup &lt;<a href="mailto:robert.coup@koordinates.com" target="_blank">robert.coup@koordinates.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi Simon,</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 20 Feb 2024 at 18:58, Simon Eves via gdal-dev &lt;<a href="mailto:gdal-dev@lists.osgeo.org" target="_blank">gdal-dev@lists.osgeo.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>We still have one VERY strange issue whereby FlatGeoBuf export fails in a very consistent and reproducible form down in the flatbuffer code, but only in the static build, and only in the full system. 
I have written a simple test harness that links the very same static libgdal and does a simple GDAL startup and FGB export of a single feature, and that works fine. It's some kind of data/stack corruption when it first tries to write to the flatbuffer on the first feature, which results in a pointer member of the buffer class becoming 0x100000000000 (always) instead of null, and then it stops on an assert. There is also one private function in the vector_downward class which the debugger won't even step into in that build. I can even put printfs in that function and they don't come out. I've tried it on CentOS and on Ubuntu, with GCC and Clang, and it's always the same. Everything else in GDAL works just fine (we have LOTS of import/export unit tests). This makes zero sense as all the FGB code is internal to GDAL and compiled together. I've been poking at it for over a week and it's doing my head in.</div></div></blockquote><div><br></div><div>One cause of this sort of crash is a header/library mismatch somewhere, where a function expects different parameters/types than the caller is actually providing. Otherwise, maybe a bug in glibc/libstdc++/gcc/something that's been fixed in the intervening ten years since CentOS 7 was released?</div><div><br></div><div>If you run your <u>build</u> on a modern distro/libc/gcc/etc., does it change things? If it's the same, that maybe hints more towards the former. </div><div><br></div><div>ASAN (<a href="https://github.com/google/sanitizers/wiki/AddressSanitizer" target="_blank">https://github.com/google/sanitizers/wiki/AddressSanitizer</a>) might help track down stack/heap corruption.</div><div><br></div><div>Rob :)</div><div><br></div></div></div>
</blockquote></div></div>
</blockquote></div>