[gdal-dev] Improving GDAL safety with respect to hostile/corrupt
files
Even Rouault
even.rouault at mines-paris.org
Mon Feb 9 19:46:41 EST 2009
Hi all,
This email follows a discussion started on the GDAL IRC channel and an effort I
have begun in my own tree to review and fix GDAL drivers to be more robust
when facing invalid data, whether corrupt files (random accidental errors) or
hostile files (crafted after analysis of flaws in the source code, or with
automated tools for that task).
This effort is somewhat a continuation of
http://trac.osgeo.org/gdal/wiki/rfc19_safememalloc
First disclaimer: this topic only tries to address GDAL drivers, as opposed
to OGR ones (but there would certainly be similar issues with OGR drivers!),
and I focus mainly on the read part of GDAL drivers.
Second disclaimer: I'm not personally a specialist in software safety, just
trying to improve GDAL's state in that area.
First observation: we currently have 96 GDAL drivers advertised in
http://gdal.org/formats_list.html (and a 97th committed just a few hours ago),
representing about 368 000 lines of C/C++ code, so an exhaustive review would
be a HUGE effort.
I'm mainly interested in fixing code used by GDALOpen() that could blow up
when fed with invalid data. After successfully opening a dataset, the
IReadBlock() method is generally the second place to check for issues. But as
GDALOpen() is the first step the user needs to run to decide whether to go on
using the dataset, the Open() method of the driver is the top priority place
to look at.
From what I have seen, security-related bugs can be classified as:
- segmentation faults = application crashes.
- denial of service = code that doesn't return to the caller in a reasonable
amount of time: endless loops, loops over a huge number of iterations, or huge
memory allocations (ones comparable to the amount of RAM, so that they can
be satisfied, but at the expense of massive disk swapping)...
- buffer overflows, in particular stack buffer overflows, which could be
exploited to make hostile code embedded in a dataset run as the user of the
process linking with GDAL. These last ones are generally the most dangerous.
Depending on how GDAL is used and deployed, this can have different
consequences.
My review of quite a few drivers has revealed several common mistakes, which
generally come down to a lack of validation of input data for a small set of
critical items. They can usually be avoided by simple tests. For example, the
nRasterXSize field of the GDALDataset object is a (signed) int, but it is
invalid to set a null or negative value into it. Such a check is currently
done in only a minority of drivers. The same applies to block sizes in
GDALRasterBand objects. The idea is to check that the dimensions declared in
the input data are valid, and if not, to return properly with an error value.
(We could also imagine changing the type of nRasterXSize to unsigned, but I
expect this would be considered too disruptive an API change --> probably
hundreds of compiler warnings to fix, and user code not ready to deal with
4 294 967 295 instead of 2 147 483 647.)
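As a sketch, such a validation helper could look like the following (the
function name GDALCheckDatasetDimensions is hypothetical here, not an
existing GDAL API):

```cpp
#include <cstdio>

/* Hypothetical helper: validate the raster dimensions declared in a file
 * before storing them into nRasterXSize / nRasterYSize. */
bool GDALCheckDatasetDimensions(int nXSize, int nYSize)
{
    if (nXSize < 1 || nYSize < 1)
    {
        fprintf(stderr, "Invalid dataset dimensions : %d x %d\n",
                nXSize, nYSize);
        return false;
    }
    return true;
}
```

A driver's Open() method would call this right after decoding the header and
return NULL if the check fails.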
For uncompressed data, we could also check that the file size is consistent
with the expected file size (roughly header size + width * height * nbands *
GDALGetDataSize(eDataType)). But I haven't really encountered a driver that
would behave badly from the security point of view because of that: we would
get I/O errors, which is fine I think. I also want to remind people that the
autotest suite currently has tests that use deliberately truncated datasets to
check that the header can be successfully decoded. This is desirable to keep
the test suite data small enough to be downloaded...
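Such a size check could be sketched as follows; HasPlausibleFileSize is a
hypothetical helper, and note that each multiplication has to be guarded
against overflow:

```cpp
#include <cstdint>

/* Hypothetical helper: check that the actual file size is at least
 * header size + width * height * bands * bytes-per-sample,
 * guarding every step against 64-bit overflow. */
bool HasPlausibleFileSize(uint64_t nActualSize, uint64_t nHeaderSize,
                          int nXSize, int nYSize, int nBands, int nDTSize)
{
    if (nXSize <= 0 || nYSize <= 0 || nBands <= 0 || nDTSize <= 0)
        return false;
    /* two positive ints multiplied fit in 64 bits (< 2^62) */
    uint64_t nExpected = (uint64_t)nXSize * (uint64_t)nYSize;
    if (nExpected > UINT64_MAX / (uint64_t)nBands)
        return false;
    nExpected *= (uint64_t)nBands;
    if (nExpected > UINT64_MAX / (uint64_t)nDTSize)
        return false;
    nExpected *= (uint64_t)nDTSize;
    if (nExpected > UINT64_MAX - nHeaderSize)
        return false;
    return nActualSize >= nHeaderSize + nExpected;
}
```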
Another common issue, which is really the main point on which I wanted to get
feedback at first, is dealing with the number of bands. Currently, when a
GDAL driver opens a dataset and has detected that it is valid, it loops over
the number of bands to instantiate them, like in:
for( int iBand = 1; iBand <= nBands; iBand++ )
{
    poDS->SetBand( iBand, new FooRasterBand(poDS, iBand) );
}
and it returns to GDALOpen() only after that. So, what happens if nBands is
really, really huge, like 2 147 483 647? It will spend a lot of time looping
and will eventually crash in an unsuccessful memory allocation after having
exhausted the memory. Many drivers read the number of bands from an int32, so
this can happen.
To solve that particular case, I'd like to introduce a GDAL_MAX_BAND_COUNT
#define (defined in some GDAL core header file, gdal_priv.h for example)
against which we would compare the number of bands read from the file, and
error out if necessary. I'd suggest GDAL_MAX_BAND_COUNT = 65536. TIFF has a
limit of 65535 for example, and after a bit of discussion on IRC, it seems to
be a reasonable limit even for hyperspectral data, where GDAL still behaves
well. Do people agree? Or do you have real-world use cases where this would
turn out to be too low a limit? Alternatively, we could safely define it at
2147483647, and people wanting to restrict it for their own needs could just
change it and recompile GDAL with the value they feel appropriate. Another
option would be to make that value a configuration option that could be
changed at runtime.
Phil Vachon suggested making the dataset dynamically allocate the band
objects. This would require changes to GDAL core, which should be designed as
additions rather than as compulsory changes to existing drivers. It would in
any case require each driver wanting to be more secure to be changed to use
that new capability, and I think checking the number of bands is the
easiest way of achieving the desired result.
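A minimal sketch of the proposed check (GDAL_MAX_BAND_COUNT and the helper
name are the suggestions above, not existing GDAL code):

```cpp
#include <cstdio>

#define GDAL_MAX_BAND_COUNT 65536  /* value suggested above */

/* Hypothetical helper: reject absurd band counts before the
 * SetBand() instantiation loop is entered. */
bool GDALCheckBandCount(int nBands)
{
    if (nBands < 1 || nBands > GDAL_MAX_BAND_COUNT)
    {
        fprintf(stderr, "Invalid band count : %d\n", nBands);
        return false;
    }
    return true;
}
```

A driver would call it right after reading the band count from the header and
fail the Open() if it returns false, instead of entering the loop above.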
Other issues can be related to integer overflow in additions, multiplications,
etc. For example, for "nBlocksPerRow = (nRasterXSize+nBlockXSize-1) /
nBlockXSize;", after having checked that nRasterXSize > 0 and nBlockXSize >
0, we should also check that nRasterXSize <= INT_MAX - (nBlockXSize-1). I've
recently added that kind of check in gdalrasterband.cpp, but the same pattern
can be found at the driver level too.
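Put together, the guarded computation could be sketched like this (the helper
name is illustrative):

```cpp
#include <climits>

/* Sketch: compute the number of blocks per row with an overflow guard.
 * Returns -1 on invalid input instead of silently overflowing. */
int ComputeBlocksPerRow(int nRasterXSize, int nBlockXSize)
{
    if (nRasterXSize <= 0 || nBlockXSize <= 0)
        return -1;
    /* the rounded-up division would overflow if
       nRasterXSize + (nBlockXSize - 1) > INT_MAX */
    if (nRasterXSize > INT_MAX - (nBlockXSize - 1))
        return -1;
    return (nRasterXSize + nBlockXSize - 1) / nBlockXSize;
}
```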
A difficult issue to solve is the case of big memory allocations, big
enough to exhaust your physical RAM but not all your virtual memory. In most
cases, those only occur in the GDAL block cache when you first begin to read
data in a band, so it can be analysed at the application level whether it is
reasonable to deal with a dataset whose block size is 100 000 x 100 000. But
in a few drivers, this allocation is also done in the constructor of the Band
object. This could probably be restructured by moving the buffer allocation to
the first call of IReadBlock()/IWriteBlock().
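The restructuring could look something like this (LazyBand and its methods are
purely illustrative, not real GDAL classes):

```cpp
#include <cstdlib>
#include <cstddef>

/* Illustrative sketch: defer the block-buffer allocation from the band
 * constructor to the first read, so that merely opening a hostile dataset
 * does not by itself trigger a huge allocation. */
class LazyBand
{
    unsigned char* pabyBlockBuf;
    size_t nBlockBufSize;

public:
    explicit LazyBand(size_t nSize)
        : pabyBlockBuf(NULL), nBlockBufSize(nSize) {}
    ~LazyBand() { free(pabyBlockBuf); }

    bool IsBufferAllocated() const { return pabyBlockBuf != NULL; }

    /* Returns false on allocation failure, reported as a read error
     * at read time rather than a crash at open time. */
    bool ReadBlock()
    {
        if (pabyBlockBuf == NULL)
        {
            pabyBlockBuf = (unsigned char*)malloc(nBlockBufSize);
            if (pabyBlockBuf == NULL)
                return false;
        }
        /* ... decode the block into pabyBlockBuf ... */
        return true;
    }
};
```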
A complementary idea is that we could define, via a configuration option, an
optional maximum size for a single allocation, and make VSIMalloc check first
that we don't ask for more. For example, we might want all single memory
allocations above 500 MB to fail: it is unlikely that a valid use of GDAL
would require allocating 500 MB at once. For example, FrankW lately wrote an
alternate GIF driver to deal specifically with GIF files whose uncompressed
data size is above 100 MB, without allocating all that memory.
(Some people could object that similar mechanisms exist for that, like ulimit
on Linux, but AFAIK they apply to the whole process, not at the library
level.)
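A capped allocator could be sketched as follows; VSIMallocCapped and the
500 MB default are assumptions for the sake of the example, not an existing
GDAL API:

```cpp
#include <cstdlib>
#include <cstddef>

/* Hypothetical cap on any single allocation; in practice this would be
 * read from a configuration option at startup. */
static size_t nMaxSingleAllocSize = (size_t)500 * 1024 * 1024;

/* Illustrative wrapper: refuse oversized requests up front, otherwise
 * behave like an ordinary allocation that may fail with NULL. */
void* VSIMallocCapped(size_t nSize)
{
    if (nSize > nMaxSingleAllocSize)
        return NULL;
    return malloc(nSize);
}
```

Callers already testing the return value of VSIMalloc against NULL would need
no change to benefit from such a cap.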
Anyway, I think all substantial memory allocations should be done with
VSIMalloc/VSICalloc, with the return value tested against NULL, instead of
using CPLMalloc/CPLCalloc, which just abort() in case of failure.
Well, I'm going to stop this long and non-exhaustive list of recommendations
here. This could be material for a dedicated page on Trac.
FrankW suggested that we could add a metadata item at the driver level to
indicate that the driver is considered safe (similarly to the metadata item
indicating that a driver makes use of VSI*L). Quoting him: "So it would be
easy for folks to only register drivers that claim to be vetted. The
reference documentation would provide all appropriate disclaimers."
Although I'm not opposed to it, I personally feel a bit shy about that
proposal and think the following points would have to be solved:
1) What would be the criteria to declare a driver vetted? Checking that some
precise, to-be-defined rules have been followed? Or just a general code
review? By one or several people? And/or a PSC decision?
2) How do we handle cases where third-party libraries are used? For example,
let's imagine that we declare the GDAL-specific code of the GeoTIFF driver to
be safe. But as it makes use of libgeotiff and libtiff, its safety also
depends on those two libraries...
3) Are we in a position to judge our own work? Doesn't it put too much
responsibility on our shoulders, even if we write all the appropriate
disclaimers? Would people really read and understand the scope of what
we "guarantee" and what we don't?
I think many drivers can easily be fixed to be safer against random data
corruption, but being robust against a targeted attack is far more
complicated. In some cases, we can even foresee the path an attacker would
follow, but fixing the driver might be a lot of work that the maintainer of
the driver is not ready to do. And as not everyone in the GDAL developer
community is necessarily interested in that area, we cannot reasonably set
strict rules for contributors.
I believe it could be interesting to have a Trac page with:
1) recommendations and common errors to avoid;
2) a summary of which drivers have been reviewed, on which aspects, at which
revision of the file, by whom... (this is a bit redundant with a careful
analysis of Subversion commit logs and diffs).
Then interested users could make their own enlightened choice of which
drivers to compile/register based on those indications and their own review.
Would we also want to indicate unresolved issues? As GDAL is open source, it
is already fairly easy for hostile people to craft malicious files, and they
probably wouldn't learn much from that information, except maybe some script
kiddies discovering a new playground...
Or should we just go on as we currently do, that is to say, fix things
incrementally without too much formalism or publicity?
For those who have managed to reach the end of this lengthy email, I'd be
happy to hear your feedback...
Best regards,
Even