[gdal-dev] Errors when reading large xlsx files

Daniel Evans daniel.fred.evans at gmail.com
Tue Mar 29 01:06:17 PDT 2022


Hi Dirk,

> I do notice when I open the file in Excel and select everything, the
> eighth column in the file is empty but also gets selected.

It looks like that's the key here.

The code you identified is hit when GDAL encounters a row with more
populated columns than the previous ones. If the product of (number of
rows already read) x (number of columns to be added) exceeds 100,000,
GDAL reports the error you're getting. In your case that is 128741 rows
x 1 new column, which is over the limit. That check was added in commit
4f3f1fa [1], in response to an OSS-Fuzz vulnerability report noting that
GDAL becomes very slow if an Excel file adds many extra columns after
many rows have already been read (presumably because it has to modify
every feature already read). I think this is where Even would start
pointing out that there are downsides to such automated security
scanners, as the distinction between "it's just slow for large files"
(>25s in the report) and "an actual DoS attack" is awkward when dealing
with typical GIS data volumes.
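
To make the arithmetic concrete, here is a minimal stand-alone sketch of
that guard. It is not the driver code itself: the function name
tooManyNewColumns and the constant name kMaxCellsToBackfill are made up
for illustration, and the "row is wider than the layer" condition is my
assumption about the surrounding code. Only the 100000 limit and the
integer division come from the snippet you quoted.

```cpp
#include <cstddef>
#include <cstdint>

// Mirrors the literal 100000 in the quoted driver snippet (name is hypothetical).
constexpr std::int64_t kMaxCellsToBackfill = 100000;

// Returns true when adding the extra columns of the current row would be
// rejected: (new columns) > 100000 / (features already read), using integer
// division as in the quoted check. For positive integers this is the same as
// (new columns) * (features already read) > 100000.
constexpr bool tooManyNewColumns(std::int64_t featureCount,
                                 std::size_t fieldCount,
                                 std::size_t valuesInRow)
{
    return featureCount > 0 &&
           valuesInRow > fieldCount &&  // assumed: only wider rows reach the check
           (valuesInRow - fieldCount) >
               static_cast<std::size_t>(kMaxCellsToBackfill / featureCount);
}

// Numbers from the original report: 128741 features, 7 fields, 8 values.
// 100000 / 128741 == 0 in integer arithmetic, and 1 > 0, so it is rejected.
static_assert(tooManyNewColumns(128741, 7, 8),
              "one extra column after 128741 rows exceeds the limit");

// At exactly 100000 features a single extra column is still accepted.
static_assert(!tooManyNewColumns(100000, 7, 8),
              "one extra column after 100000 rows is within the limit");
```

So once more than 100,000 rows have been read, even a single additional
column trips the check.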

Are you sure the 8th column contains no data at all? Even if it is empty,
my experience is that Excel can be pretty stubborn about saving empty
columns that have contained data at some point in the file's history. From
memory, selecting the whole column, deleting it, and saving again usually
convinces Excel to no longer save it.

Regards,
Daniel

[1]
https://github.com/OSGeo/gdal/commit/4f3f1facc5da0eeac71f6b1ba946b7618386ee7d

On Tue, 29 Mar 2022 at 08:41, Dirk Vanden Boer <dirk.vdb at gmail.com> wrote:

> Hi,
>
> When reading xlsx files that contain a lot of lines, GDAL reports the
> following error multiple times:
> | Adding too many columns to too many existing features
>
> It comes from the xlsx driver:
> GIntBig nFeatureCount = poCurLayer->GetFeatureCount(false);
> if( nFeatureCount > 0 &&
>     static_cast<size_t>(apoCurLineValues.size() -
>         poCurLayer->GetLayerDefn()->GetFieldCount()) >
>             static_cast<size_t>(100000 / nFeatureCount) )
> {
>     CPLError(CE_Failure, CPLE_NotSupported,
>                 "Adding too many columns to too many "
>                 "existing features");
>     return;
> }
>
> The featureCount in my case is 128741
> apoCurLineValues.size() = 8
> fieldCount = 7
>
> Why is this error reported? Does it impact the actual read data?
> I do notice when I open the file in Excel and select everything, the eighth
> column in the file is empty but also gets selected.
>
> Kind regards,
> Dirk