[gdal-dev] Errors when reading large xlsx files

Even Rouault even.rouault at spatialys.com
Tue Mar 29 12:25:07 PDT 2022


On 29/03/2022 at 20:29, Dirk Vanden Boer wrote:
> > The effect will at least be to ignore any rows for which this
> > message was raised - the function is unconditionally exited after the
> > error is raised, before a new feature is added to the current layer.
>
> So do I understand correctly that for files containing roughly more 
> than 100000 lines, rows that contain more columns of data than the 
> detected headers are not readable?
> Because if that is the case I will be required to patch my gdal 
> version to not skip these lines.
Please file an issue about that at https://github.com/OSGeo/gdal/issues
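
For what it's worth, the arithmetic behind that threshold can be sketched as
below. This is a standalone illustration of the condition quoted further down
in this thread, not the driver code itself: the function name and parameters
are made up for the example, and it assumes the check is only reached when a
row holds more values than the layer has fields.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper mirroring the condition quoted in this thread.
// Returns true when the row would be rejected with
// "Adding too many columns to too many existing features".
bool exceedsColumnGrowthLimit(std::int64_t featureCount,
                              std::size_t valuesInRow,
                              std::size_t fieldCount)
{
    // Assumption: nothing to check when no features exist yet or when
    // the row does not add any column.
    if (featureCount <= 0 || valuesInRow <= fieldCount)
        return false;

    const std::size_t addedColumns = valuesInRow - fieldCount;
    // Integer division: once featureCount exceeds 100000 this is 0,
    // so even a single added column is over the limit.
    const std::size_t allowed =
        static_cast<std::size_t>(100000 / featureCount);
    return addedColumns > allowed;
}
```

With the numbers reported in this thread (128741 features, 8 values, 7
fields), 100000 / 128741 is 0 in integer division, so a single extra column
already exceeds the limit and the row is dropped. At exactly 100000 features
the quotient is 1 and one extra column is still accepted.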
>
> Regards,
> Dirk
>
> On Tue, Mar 29, 2022 at 8:09 PM Daniel Evans 
> <daniel.fred.evans at gmail.com> wrote:
>
>     > does the error impact the returned data?
>
>     The effect will at least be to ignore any rows for which this
>     message was raised - the function is unconditionally exited after
>     the error is raised, before a new feature is added to the current
>     layer.
>
>     > Is there a way to suppress this error without disabling the gdal
>     log handling? My logs are flooded with these messages; modifying
>     the xlsx files is not an option because there are many and they
>     are supplied by clients and regularly updated.
>
>     I suspect the only way is by providing GDAL with a custom error
>     handler, which ignores this specific message and otherwise
>     delegates back to CPLDefaultErrorHandler() (or prints to stderr
>     itself).
>
>     Regards,
>     Daniel
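
A minimal sketch of such a filtering handler follows. To keep it compilable
without GDAL, the error class and error number are plain int stand-ins and
the fallback is a function pointer; all names here (filteringHandler,
g_fallback, isColumnGrowthMessage) are made up for the example. With GDAL
itself, one would presumably use CPLErr and CPLErrorNum from cpl_error.h,
set the fallback to CPLDefaultErrorHandler, and install the handler with
CPLPushErrorHandler().

```cpp
#include <cstring>

// Stand-ins for GDAL's CPLErr / CPLErrorNum so the sketch builds on its
// own. Assumption: with GDAL, use the real types from "cpl_error.h"
// (the real handler type also carries the CPL_STDCALL convention).
using ErrClass = int;
using ErrNum = int;
using ErrorHandler = void (*)(ErrClass, ErrNum, const char*);

// Handler that receives everything we do not filter out,
// e.g. CPLDefaultErrorHandler in a real GDAL build.
static ErrorHandler g_fallback = nullptr;

// True for the specific XLSX driver message discussed in this thread.
bool isColumnGrowthMessage(const char* msg)
{
    return msg != nullptr &&
           std::strstr(msg,
                       "Adding too many columns to too many "
                       "existing features") != nullptr;
}

// Drops the one noisy message and forwards all others unchanged.
void filteringHandler(ErrClass cls, ErrNum num, const char* msg)
{
    if (isColumnGrowthMessage(msg))
        return;
    if (g_fallback != nullptr)
        g_fallback(cls, num, msg);
}
```

Note that this only quiets the log: as described above, the affected rows
are still skipped by the driver.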
>
>     On Tue, 29 Mar 2022 at 09:20, Dirk Vanden Boer
>     <dirk.vdb at gmail.com> wrote:
>
>         Scanning through the file, it turns out 2 lines actually have
>         a value in the eighth column, which is why the column is
>         present; it doesn't have a header, however.
>
>         So I have 2 questions:
>         - does the error impact the returned data?
>         - Is there a way to suppress this error without disabling the
>         gdal log handling? My logs are flooded with these messages;
>         modifying the xlsx files is not an option because there are
>         many and they are supplied by clients and regularly updated.
>
>         Regards,
>         Dirk
>
>         On Tue, Mar 29, 2022 at 10:06 AM Daniel Evans
>         <daniel.fred.evans at gmail.com> wrote:
>
>             Hi Dirk,
>
>             > I do notice when I open the file in Excel and select
>             everything, the eighth column in the file is empty but also
>             gets selected.
>
>             It looks like that's the key here.
>
>             The code you identified gets hit if GDAL encounters a row
>             with more populated columns than the layer has fields so
>             far. If the product of (number of rows already read) x
>             (number of columns to be added) is too high (>100,000),
>             GDAL gives the error you're getting. That functionality was
>             added in commit 4f3f1fa [1], in response to an OSSFuzz
>             vulnerability report noting that GDAL becomes very slow if
>             an Excel file adds many extra columns after reading many
>             rows already (presumably as it has to modify every feature
>             already read). I think this is where Even would start
>             pointing out that there are downsides to such automated
>             security scanners, as the distinction between "it's just
>             slow for large files" (>25s in the report) and "an actual
>             DoS attack" is awkward when dealing with typical GIS data
>             volumes.
>
>             Are you sure the 8th column contains no data at all? Even
>             if it is empty, my experience is that Excel can be pretty
>             stubborn about saving empty columns that have contained
>             data at some point in the file's history. From memory,
>             selecting the whole column, deleting it, and saving again
>             usually convinces Excel to no longer save it.
>
>             Regards,
>             Daniel
>
>             [1]
>             https://github.com/OSGeo/gdal/commit/4f3f1facc5da0eeac71f6b1ba946b7618386ee7d
>
>             On Tue, 29 Mar 2022 at 08:41, Dirk Vanden Boer
>             <dirk.vdb at gmail.com> wrote:
>
>                 Hi,
>
>                 When reading xlsx files that contain a lot of lines,
>                 gdal reports the following error multiple times:
>                 | Adding too many columns to too many existing features
>
>                 It comes from the xlsx driver:
>                 GIntBig nFeatureCount = poCurLayer->GetFeatureCount(false);
>                 if( nFeatureCount > 0 &&
>                     static_cast<size_t>(apoCurLineValues.size() -
>                         poCurLayer->GetLayerDefn()->GetFieldCount()) >
>                         static_cast<size_t>(100000 / nFeatureCount) )
>                 {
>                     CPLError(CE_Failure, CPLE_NotSupported,
>                              "Adding too many columns to too many "
>                              "existing features");
>                     return;
>                 }
>
>                 The featureCount in my case is 128741
>                 apoCurLineValues.size() = 8
>                 fieldCount = 7
>
>                 Why is this error reported? Does it impact the actual
>                 read data?
>                 I do notice when I open the file in Excel and select
>                 everything, the eighth column in the file is empty but
>                 also gets selected.
>
>                 Kind regards,
>                 Dirk
>                 _______________________________________________
>                 gdal-dev mailing list
>                 gdal-dev at lists.osgeo.org
>                 https://lists.osgeo.org/mailman/listinfo/gdal-dev

-- 
http://www.spatialys.com
My software is free, but my time generally not.