[GRASS-user] RE: [GRASSLIST:1174] Working with very large data sets

Thu Aug 24 02:50:21 EDT 2006

David Finlayson wrote:
> I am working with an interferometric sidescan SONAR system that
> produces about 2 Gb of elevation and amplitude data per hour. Our raw
> data density could support resolutions up to 0.1 m, but we currently
> can't handle the data volume at that resolution so we decimate down to
> 1 m via a variety of filters. Still, even at 1 m resolution, our
> datasets run into the hundreds of Mb and most current software just
> doesn't handle the data volumes well. 
>
> Any thoughts on processing and working with these data volumes (LIDAR
> folks)? I have struggled to provide a good product to our researchers
> using both proprietary (Fledermaus, ArcGIS) and non-proprietary (GMT,
> GRASS, my own scripts) post-processing software. Nothing is working
> very well. The proprietary stuff seems easier at first, but becomes
> difficult to automate. The non-proprietary stuff is easy to automate,
> but often can't handle the data volumes without first down sampling
> the data density (GMT does pretty well if you stick to line-by-line
> processing, but that doesn't always work).
>
> Just curious what work flows/software others are using. In particular,
> I'd love to keep the whole process FOSS if possible. I don't trust
> black boxes.

Have you tried r.in.xyz much? If it does not meet your needs, what is
the problem? Maybe we can fix it.

A general method to fill gaps could be:

r.in.xyz
r.to.vect -b
v.surf.rst

see  http://hamish.bowman.googlepages.com/grassfiles#xyz
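
As a rough illustration of why a binning import copes with very large
point files, here is a minimal Python sketch of streaming x,y,z records
into a grid of per-cell means. It is not the r.in.xyz source code; the
function name, the comma separator and the tiny region are assumptions
made for the example. Memory use depends on the grid size, not on the
number of input points, and the empty cells are the gaps that the
interpolation step would then fill.

```python
import io


def bin_mean(lines, west, north, res, nrows, ncols, sep=","):
    """Stream x,y,z records into per-cell sums and counts; return cell means.

    Hypothetical helper for illustration only. Cells that receive no
    points are returned as None (the "gaps").
    """
    sums = [[0.0] * ncols for _ in range(nrows)]
    counts = [[0] * ncols for _ in range(nrows)]
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        x, y, z = (float(v) for v in line.split(sep)[:3])
        col = int((x - west) // res)
        row = int((north - y) // res)  # row 0 is the northern edge
        if 0 <= row < nrows and 0 <= col < ncols:
            sums[row][col] += z
            counts[row][col] += 1
    return [
        [s / n if n else None for s, n in zip(sum_row, count_row)]
        for sum_row, count_row in zip(sums, counts)
    ]


# Three made-up soundings on a 2 x 2 grid of 1 m cells.
demo = io.StringIO("0.2,1.8,10.0\n0.7,1.6,12.0\n1.5,0.5,7.0\n")
grid = bin_mean(demo, west=0.0, north=2.0, res=1.0, nrows=2, ncols=2)
```

Only one input line is held at a time, so the same loop works on a file
of any length as long as the output grid fits in memory.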

Real-time display of data as it is collected is another matter.

Jonathan Greenberg wrote:
> We've worked with a ~40gb pansharpened 1m image of the Lake
> Tahoe Basin using RSI ENVI - ENVI will support essentially unlimited
> file sizes on Windows and many unix boxes, and the next version (4.3)
> will have LFS on MacOS X as well.  I honestly don't know how GRASS
> handles big datasets (I'm sure someone will respond), but a "good"
> algorithm basically just performs the processing using subsets of the
> data - ENVI processes images on a per-line basis, so you never really
> have much of a memory hit, although it clearly takes a long time to
> process a 40gb file.  ESRI products are completely useless for large
> files, in fact I'm pretty sure they simply can't deal with any
> file > 2gb, and their routines are VERY inefficient. The other issues are
> whether an OS can actually open a large file (e.g. MacOS X pre-tiger
> could not), and how easy it is to use an MP system (e.g. ENVI will
> just use a MP system out of the box, but I don't think GRASS can).
>
> If you have some $$$ for hardware, I/O with image processing is
> pretty important as well - for a small system, we found a RAID0
> "scratch" drive was a good addition - the extremely high I/O really
> helps processing low CPU algorithms (e.g. basic raster calculations). 
> It's also very unstable (one drive failure will cause data failure
> across all drives in the RAID), so you do have to be careful to backup
> your system.
>
> Anyway, my two cents!  Along these lines, how DOES GRASS do raster
> processing?  I feel like it does use tiled processing like I described
> for mapcalc and most of the other algorithms - I never see any major
> forks in memory usage.

* The vast majority of GRASS modules work with data on a row by row
basis. Thus only one row of data is in memory at a time and memory use
isn't an issue. Raster modules which need to have many rows in memory at
the same time will generally have a percent= or lines= parameter and
process the data in a series of slices.

* Since about GRASS 6.0 there has been Large File Support (LFS), i.e.
file sizes >2 GB, assuming your hardware and OS support it.
Configure GRASS with "./configure --enable-largefile" to use it, and
make sure your GDAL libraries were built with it as well. For 64bit and
LFS support you really should be using GRASS 6.1.0, as the process of
updating the code has been "fix problems as they arise" as opposed to a
full audit of GRASS's nearly 1 million lines of code.

* There are ideas/plans to rewrite the raster format for GRASS 7 using a
tiled model. 
http://grass.gdf-hannover.de/wiki/Replacement_raster_format

* Vector operations when dealing with several million features can be
quite memory intensive. Several work-arounds have been put in place to
deal with this though. The only datasets I know of where this is an
issue are vector/point data from LIDAR or swath sonar. By skipping the
creation of a database and leaving topology unbuilt, massive datasets
can be pulled in. I don't think we've found the cap to that yet. Maybe
Helena & Andrew have found it? Core modules for dealing with such
data have been modified to deal with the topology-free case (e.g.
v.surf.rst). LFS is supported by vectors AFAIK (well, as much as
anywhere else).
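
The row-by-row model from the first point above can be sketched in a
few lines of Python. This is only an illustration under stated
assumptions: the raster is taken to be a raw file of little-endian
32-bit floats, which is a stand-in and not the GRASS raster format, and
process_rows is a made-up name. The point is that only one row is ever
held in memory, whatever the size of the file.

```python
import io
import struct


def process_rows(src, dst, ncols, func):
    """Apply func to every cell of a raw float32 raster, one row at a time.

    src and dst are binary file-like objects. Returns the number of rows
    processed. Hypothetical helper for illustration only.
    """
    row_bytes = ncols * 4  # 4 bytes per float32 cell
    fmt = "<%df" % ncols
    nrows = 0
    while True:
        buf = src.read(row_bytes)
        if not buf:
            break  # end of file
        if len(buf) != row_bytes:
            raise ValueError("truncated row in input")
        row = struct.unpack(fmt, buf)
        dst.write(struct.pack(fmt, *[func(v) for v in row]))
        nrows += 1
    return nrows


# A made-up 2-row by 3-column raster held in memory for the demo.
src = io.BytesIO(struct.pack("<6f", 1, 2, 3, 4, 5, 6))
dst = io.BytesIO()
rows_done = process_rows(src, dst, ncols=3, func=lambda v: v * 2.0)
result = struct.unpack("<6f", dst.getvalue())
```

With real files opened in "rb" and "wb" mode the loop is unchanged, and
peak memory stays at one row regardless of the number of rows.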

In summary, with GRASS you will usually be limited by hard drive space
or computational time (or at least that is the goal).

> Has anyone gotten GRASS working with an MP setup for things like
> mapcalc?

GRASS is a group of modular little programs, so it is often possible to
spawn off processes, but it is not threaded, i.e. a single module will
not use both processors at once. There have been some efforts to add
threading support to some GRASS modules, e.g. parallelized s.surf.idw:
  http://grass.itc.it/download/addons.php
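
A minimal sketch of the "spawn off processes" approach, assuming the
jobs are fully independent of each other (separate inputs and outputs).
The run_parallel name is made up, and the demo uses the Python
interpreter itself as a stand-in for real module command lines, so
nothing here is specific to GRASS.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_parallel(commands, max_workers=2):
    """Run independent command lines concurrently.

    Each command is a list of arguments. Returns the exit codes in the
    same order as the input. The threads only wait on child processes;
    the actual parallelism comes from the separate OS processes.
    """
    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


# Stand-in jobs: each child process simply exits with the given code.
jobs = [
    [sys.executable, "-c", "import sys; sys.exit(%d)" % code]
    for code in (0, 0, 3)
]
codes = run_parallel(jobs, max_workers=2)
```

Setting max_workers to the number of processors keeps each one busy
with its own job, and a non-zero exit code flags the job that failed.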

Hamish