[gdal-dev] Errors when reading large xlsx files
Even Rouault
even.rouault at spatialys.com
Tue Mar 29 12:25:07 PDT 2022
On 29/03/2022 at 20:29, Dirk Vanden Boer wrote:
> > The effect will at least be to ignore any rows for which this
> message was raised - the function is unconditionally exited after the
> error is raised, before a new feature is added to the current layer.
>
> So do I understand correctly that for files containing roughly more
> than 100000 lines, rows that contain more columns of data than the
> detected headers are not readable?
> Because if that is the case I will be required to patch my gdal
> version to not skip these lines.
Please file an issue about that at https://github.com/OSGeo/gdal/issues
>
> Regards,
> Dirk
>
> On Tue, Mar 29, 2022 at 8:09 PM Daniel Evans
> <daniel.fred.evans at gmail.com> wrote:
>
> > does the error impact the returned data?
>
> The effect will at least be to ignore any rows for which this
> message was raised - the function is unconditionally exited after
> the error is raised, before a new feature is added to the current
> layer.
>
> > Is there a way to suppress this error without disabling the gdal
> log handling? My logs are flooded with these messages. Modifying
> the xlsx files is not an option because there are many of them and
> they are supplied by clients and regularly updated.
>
> I suspect the only way is by providing GDAL with a custom error
> handler, which ignores this specific message and otherwise
> delegates back to CPLDefaultErrorHandler() (or prints to stderr
> itself).
>
> Regards,
> Daniel
>
> On Tue, 29 Mar 2022 at 09:20, Dirk Vanden Boer
> <dirk.vdb at gmail.com> wrote:
>
> Scanning through the file, it turns out 2 lines actually have
> a value in the eighth column. That's why the column is present,
> although it has no header.
>
> So I have 2 questions:
> - Does the error impact the returned data?
> - Is there a way to suppress this error without disabling the
> gdal log handling? My logs are flooded with these messages.
> Modifying the xlsx files is not an option because there are
> many of them and they are supplied by clients and regularly updated.
>
> Regards,
> Dirk
>
> On Tue, Mar 29, 2022 at 10:06 AM Daniel Evans
> <daniel.fred.evans at gmail.com> wrote:
>
> Hi Dirk,
>
> > I do notice when I open the file in Excel and select
> everything, the eighth column in the file is empty but also
> gets selected.
>
> It looks like that's the key here.
>
> The code you identified gets hit if GDAL encounters a row
> with more populated columns than the previous one. If
> the product of (number of rows already read) x (number
> of columns to be added) is too high (>100,000), GDAL gives
> the error you're getting. That functionality was added in
> commit 4f3f1fa [1], in response to an OSSFuzz
> vulnerability report noting that GDAL becomes very slow if
> an Excel file adds many extra columns after reading many
> rows already (presumably as it has to modify every feature
> already read). I think this is where Even would start
> pointing out that there are downsides to such automated
> security scanners, as the distinction between "it's just
> slow for large files" (>25s in the report) and "an actual
> DOS attack" is awkward when dealing with typical GIS data
> volumes.
>
> Are you sure the 8th column contains no data at all? Even
> if it is empty, my experience is that Excel can be pretty
> stubborn about saving empty columns that have contained
> data at some point in the file's history. From memory,
> selecting the whole column, deleting it, and saving again
> usually convinces Excel to no longer save it.
>
> Regards,
> Daniel
>
> [1]
> https://github.com/OSGeo/gdal/commit/4f3f1facc5da0eeac71f6b1ba946b7618386ee7d
>
> On Tue, 29 Mar 2022 at 08:41, Dirk Vanden Boer
> <dirk.vdb at gmail.com> wrote:
>
> Hi,
>
> When reading xlsx files that contain a lot of lines,
> gdal reports the following error multiple times:
> | Adding too many columns to too many existing features
>
> It comes from the xlsx driver:
>     GIntBig nFeatureCount =
>         poCurLayer->GetFeatureCount(false);
>     if( nFeatureCount > 0 &&
>         static_cast<size_t>(apoCurLineValues.size() -
>             poCurLayer->GetLayerDefn()->GetFieldCount()) >
>                 static_cast<size_t>(100000 / nFeatureCount) )
>     {
>         CPLError(CE_Failure, CPLE_NotSupported,
>                  "Adding too many columns to too many "
>                  "existing features");
>         return;
>     }
>
> The featureCount in my case is 128741
> apoCurLineValues.size() = 8
> fieldCount = 7
>
> Why is this error reported? Does it impact the data that is
> actually read?
> I do notice when I open the file in Excel and select
> everything, the eighth column in the file is empty but
> also gets selected.
>
> Kind regards,
> Dirk
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/gdal-dev
--
http://www.spatialys.com
My software is free, but my time generally not.