[gdal-dev] "RFC 75: Multidimensional arrays" available for preliminary review

Jason Roberts jason.roberts at duke.edu
Wed May 29 11:37:45 PDT 2019


Dear Even,

Thank you for exploring the possibility of making GDAL more multidimensional. In the 10+ years I've lurked on this list I recall this coming up periodically. It's a hard problem; perhaps now will be the time to actually do it!

Over that period, I have worked with a lot of 3D (x,y,z and x,y,time) and 4D (x,y,z,time) datasets from the ocean remote sensing and modeling communities. I'm a marine spatial ecologist; most of my work involves trying to explain or predict distributions of marine species from observations or models of oceanographic variables such as sea surface temperature (SST) or chlorophyll concentration. I want to describe two of the most common scenarios involving multidimensional data in the hope that it can inform your thinking about ways to enhance GDAL's API to be multidimensional. Hopefully you'll find this useful; apologies if not!

Scenario 1: Sampling a 3D or 4D raster dataset with a vector dataset:

In this scenario, we have vector features, usually points, for which we want to obtain values from a 3D or 4D raster dataset. For example, the vector features might be points representing the locations of satellite-tracked animals, of survey vessels, or hauls of fishing gear. The vectors usually have a datetime attribute, and possibly a depth attribute. The raster dataset might be a time series of satellite SST images with dimensions lon, lat, and time, or a time+depth series of water temperature from a 4D ocean model with dimensions lon, lat, depth, and time. We need to match up each point to the closest 2D slice based on time and/or depth, and then index into the image with lon and lat and extract the value of the cell to an attribute of the point. From there we can perform statistical analysis on the sampled values.

Usually the vectors are points, but occasionally we may have lines or polygons over which we'd like to summarize the raster data, e.g. find the mean value of the raster inside each polygon or along a line. This is much more complicated and, in my experience, is much less commonly done than points. Also, just as "bilinear interpolation" might be used to interpolate a value from a point on a 2D raster, we occasionally want to extend this to do "trilinear interpolation" on, say, an x,y,t image, so that the interpolated value is based on the 8 surrounding cells (or 16 surrounding cells with x,y,z,t).

Scenario 2: Summarizing 3D or 4D raster datasets into 2D summary images:

In this scenario, we have a time series of images, either x,y,t or x,y,z,t, that occur at some high temporal frequency, e.g. one image per day, and we want to statistically summarize them at a lower frequency. For example, we might want to take daily SST images and create monthly averages (365 images/year --> 12 images/year) or even monthly climatological averages (365 images/year * n years --> 12 rasters, one for each month averaged across all years).

Implications for GDAL's API:

These scenarios are very high level, and I would not expect GDAL's API to expose methods for performing them. These are geoprocessing operations that should be implemented at some higher layer that itself would make use of GDAL. So the question is: what should GDAL's API provide, if anything, to help facilitate implementation of the scenarios by that higher layer? Here are some things to consider.

1. Aggregating a series of low dimension datasets into a single virtual higher dimension dataset (e.g. 2D to 3D):

In the communities I work with, it is fairly rare for someone to release a single multidimensional file that spans more than one time slice. For example, it is rare for someone to release a single netCDF of SST with dimensions x,y,t with the length of t being 365, representing one year of daily images. Instead they will release 365 netCDFs, each containing a 2D dataset, or perhaps a 3D dataset with a single time slice. As new data are collected, they issue new files.

To use such data for scenarios 1 and 2 (and others), it is very convenient for these series of low dimension datasets to be aggregated into a single virtual high dimension dataset that can then be operated over. netCDF has long had special facilities for this, such as thredds/opendap and NcML <aggregation>. The GDAL API could expose a similar capability that would allow aggregation of arbitrary raster data in this way (rather than just netCDFs).

This capability would be a challenge to implement, no doubt, but it would provide a lot of value to users of the GDAL API. There could be some synergy with the .vrt driver. Particularly valuable would be the ability to point GDAL to a directory tree of datasets and have it aggregate them.

2. Special handling for aggregations of periodic datasets:

Many data providers produce datasets that have a regular periodicity. For example, they might release one image per day from a polar orbiting satellite. For these datasets, when performing calculations on the time dimension, it is convenient to assume that there will always be 365 time slices per year (or 366 on leap years). But, sadly, data providers occasionally experience problems and sometimes do not release time slices. Now, suddenly, there are not 365 images with the year 2017, but only 362 because 3 were not produced. Users regularly get tripped up by this. For example, in the popular HYCOM ocean model, inquiries about this happen on their mailing list every few months, prompting the HYCOM team to prepare a multi-page FAQ just on this issue.

There are also datasets that could described as semi-regular. In the oceanographic community, it is common to have rasters that span 8 days. But these datasets often "start over" on 1 January every year, such that the first dataset spans 1-8 January, the second 9-16 January, etc. The 46th one usually spans days 361-365 (or 366) or maybe includes 2-3 days of the new year, but then the next year starts again at 1 January.  

So, assuming GDAL were capable of performing aggregations, it would be useful to describe to GDAL (or have it otherwise deduce) the periodicity of datasets and then itself generate missing time slices virtually. These slices would just be filled with the "no data" value. 

It would also be useful for the API provide an abstraction for the concept of periodicity, so that a caller can get/set it. I fully recognize this is difficult, in that there are so many ways that datasets can be periodic, and that it might not be possible to invent a scheme for representing them all. But there are many datasets that fall under common schemes (e.g. one image every day, 8-days, month, year, etc.) and this might be a situation in which a simple solution can cover 95% of the datasets out there and provide good value.

A lot of these issues about time have been considered before by others, e.g. by those developing netCDF metadata standards. This raises the question: should GDAL even consider getting involved in this area? Isn't it really tricky? Yes, probably. But if GDAL treats time completely generically, the new API might not provide that much value of what can already be done, e.g. with subdatasets. (But I do recognize that this is a potential minefield.)

3. Lazy evaluation:

If an aggregate dataset is spread across many files, it would be good if GDAL could avoid iterating over all of those files unless absolutely necessary. For example, it would be unfortunate if GDAL could only determine the time or depth coordinate of a file by opening it and reading a netCDF attribute. Consider a 3D dataset of daily SST images composed of 10,000 netCDF files. It would be very slow if GDAL had to open each one in order to return an array of time dimension values. In some cases this might be unavoidable, but GDAL should provide the means to avoid it, if possible, e.g. by telling GDAL how to parse the times from the file names. You are probably well aware of these issues in developing the .vrt driver and similar projects.

I hope this helps. Thanks for your time and consideration.

Jason

> -----Original Message-----
> From: gdal-dev <gdal-dev-bounces at lists.osgeo.org> On Behalf Of Even
> Rouault
> Sent: Friday, May 24, 2019 11:01 AM
> To: gdal-dev at lists.osgeo.org
> Subject: [gdal-dev] "RFC 75: Multidimensional arrays" available for
> preliminary review
> 
> Hi,
> 
> I've prepared a preliminary version of a new RFC to add support for
> multidimensional arrays
> 
> https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__github.com_rouault_gdal_blob_rfc75-
> 5Ftext_gdal_doc_source_development_rfc_rfc75-5Fmultidimensional-
> 5Farrays.rst&d=DwIGaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1i6YHLR0Sj_gZ4adc&r=cJfJ
> 4ejc1xbg_qb47Pg1OoRq1GfFGvWbDD2PT7-
> fBKk&m=z4DIC0ByQfvEPOoQULwLWJUCYSJ_rD9wnK56bSk-088&s=QoB-
> _CuLOE_hxDlvaGk9pf_1E6LVycCraOPSTwANhCk&e=
> 
> If you want to comment it inline, it is also available as a pull request
> at https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__github.com_OSGeo_gdal_pull_1580&d=DwIGaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1
> i6YHLR0Sj_gZ4adc&r=cJfJ4ejc1xbg_qb47Pg1OoRq1GfFGvWbDD2PT7-
> fBKk&m=z4DIC0ByQfvEPOoQULwLWJUCYSJ_rD9wnK56bSk-
> 088&s=5ZZMNYskn0v4Jzk4QvfTKYSArMmQZEZTTiH3uxtCVAI&e=
> 
> This is preliminary content to give an idea of the general directions I
> have in mind for now.
> Not backed by any implementation yet (should start soon hopefully), so API
> details will likely change, but early feedback is welcome.
> 
> Even
> 
> --
> Spatialys - Geospatial professional services
> https://urldefense.proofpoint.com/v2/url?u=http-
> 3A__www.spatialys.com&d=DwIGaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1i6YHLR0Sj_gZ4a
> dc&r=cJfJ4ejc1xbg_qb47Pg1OoRq1GfFGvWbDD2PT7-
> fBKk&m=z4DIC0ByQfvEPOoQULwLWJUCYSJ_rD9wnK56bSk-088&s=vEHFxJLIKrMqTFFYqD_x-
> oa1BtTCt8ppUhHU1HakMoA&e=
> _______________________________________________
> gdal-dev mailing list
> gdal-dev at lists.osgeo.org
> https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__lists.osgeo.org_mailman_listinfo_gdal-
> 2Ddev&d=DwIGaQ&c=imBPVzF25OnBgGmVOlcsiEgHoG1i6YHLR0Sj_gZ4adc&r=cJfJ4ejc1xb
> g_qb47Pg1OoRq1GfFGvWbDD2PT7-
> fBKk&m=z4DIC0ByQfvEPOoQULwLWJUCYSJ_rD9wnK56bSk-
> 088&s=tUoKHatLSi1xtfDf1EYAGTrYwf99aV9Ky_Sa95QU54w&e=


More information about the gdal-dev mailing list