[GRASS-user] Large vector files

Jonathan Greenberg jgreenberg at arc.nasa.gov
Sun Oct 8 15:22:59 EDT 2006

Hamish, this was a great post, thanks!  I want to give some examples of what
I'd like to do with this data, to make it clearer why I think a vector
environment that can handle massive vectors is a requirement (rather than
trying a raster analog)...

Remember that the base dataset is a set of points with a radius parameter,
representing the positions and sizes of tree crowns, e.g. X,Y,crown radius.
We often work with "management polygons" for US Forest Service applications,
which are the units of management and the base data layer to be analyzed (on
the scale of many hectares, so it's a much smaller coverage to work with) --
so we want to create summary stats based on our tree points at the scale of
the management polygons:

1) What management polygon does each tree belong to (spatial join b/t
massive points and management polygon layer).  What is the tree count
per polygon?  What is the distribution of sizes of trees in each polygon?
2) What is the tree cover within a polygon -- at a first glance you'd think
I'd just convert the radius to area, and sum all areas from the previous
step for a given management polygon -- but tree crowns can overlap and the
overlapping area does NOT get counted twice -- so we need to do a spatial
dissolve on a BUFFERED set of tree POLYGONS (we can't work with points), and
then a spatial clip based on the management polygon layer so if any trees
are partially in one poly and partially in the other, we deal with that.
3) What is the distance from every tree to the nearest tree and, at a
management polygon level, what is the distribution of these minimum-tree
distances (this is relevant for fire ecology work)?
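
To make question 1 concrete, here is a minimal sketch of the point-in-polygon
join in plain Python. It is not GRASS or PostGIS code. The function names, the
dict-of-rings polygon layout, and the sample coordinates are all assumptions
made up for illustration, and it handles only simple polygons without holes.

```python
from collections import defaultdict
from statistics import mean, median


def bounding_box(ring):
    """Return (xmin, ymin, xmax, ymax) for a list of (x, y) vertices."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_polygon(x, y, ring):
    """Ray-casting test. ring is a list of (x, y) vertices (not closed).

    Points lying exactly on an edge may fall on either side.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross
        # the ray cast from (x, y) toward +x.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def join_trees_to_polygons(trees, polygons):
    """Assign each tree to the first polygon that contains it.

    trees: iterable of (x, y, crown_radius)
    polygons: dict mapping polygon id -> ring (hypothetical layout)
    Returns a dict mapping polygon id -> list of crown radii.
    """
    boxes = {pid: bounding_box(ring) for pid, ring in polygons.items()}
    radii = defaultdict(list)
    for x, y, r in trees:
        for pid, ring in polygons.items():
            xmin, ymin, xmax, ymax = boxes[pid]
            # Cheap bounding-box rejection before the exact test.
            if xmin <= x <= xmax and ymin <= y <= ymax:
                if point_in_polygon(x, y, ring):
                    radii[pid].append(r)
                    break
    return dict(radii)


def summarize(radii_by_polygon):
    """Tree count and simple crown-radius statistics per polygon."""
    return {
        pid: {
            "count": len(radii),
            "mean_radius": mean(radii),
            "median_radius": median(radii),
        }
        for pid, radii in radii_by_polygon.items()
    }


if __name__ == "__main__":
    # Made-up sample data: two adjacent square management polygons.
    polygons = {
        "A": [(0, 0), (10, 0), (10, 10), (0, 10)],
        "B": [(10, 0), (20, 0), (20, 10), (10, 10)],
    }
    trees = [(1, 1, 2.0), (5, 5, 4.0), (15, 5, 3.0), (25, 5, 1.0)]
    print(summarize(join_trees_to_polygons(trees, polygons)))
```

Because trees can be read one at a time from any iterable, only the (much
smaller) polygon layer has to sit in memory. The loop over every polygon for
every tree is the naive part; a spatial index on the polygons would be the
obvious next step.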

These are all classic vector problems, with the added issue that I'm dealing
with > 7 million trees.
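
As a rough illustration of why the point count matters for the
nearest-tree question: comparing every tree against every other tree is
quadratic, so some kind of spatial index is needed. Below is a minimal sketch
using a uniform grid as the index, in plain Python. The function name and the
cell_size parameter are assumptions for illustration, not part of any GRASS
module, and I have not tried it at the 7 million point scale.

```python
import math
from collections import defaultdict


def nearest_neighbor_distances(points, cell_size):
    """Distance from each point to its nearest other point.

    points: list of (x, y) tuples
    cell_size: edge length of the grid cells used as a spatial index
               (hypothetical tuning parameter; something near the typical
               tree spacing is a reasonable starting guess)
    Returns a list of distances, math.inf where no other point exists.
    """
    # Bucket point indices by grid cell.
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / cell_size), math.floor(y / cell_size))].append(idx)

    if not grid:
        return []

    # Largest ring that could ever be needed to cover all occupied cells.
    gxs = [k[0] for k in grid]
    gys = [k[1] for k in grid]
    max_ring = max(max(gxs) - min(gxs), max(gys) - min(gys))

    result = []
    for idx, (x, y) in enumerate(points):
        cx, cy = math.floor(x / cell_size), math.floor(y / cell_size)
        best = math.inf
        ring = 0
        while ring <= max_ring:
            # Visit only the cells at Chebyshev distance `ring` from the home cell.
            for gx in range(cx - ring, cx + ring + 1):
                for gy in range(cy - ring, cy + ring + 1):
                    if max(abs(gx - cx), abs(gy - cy)) != ring:
                        continue
                    for j in grid.get((gx, gy), ()):
                        if j != idx:
                            d = math.hypot(points[j][0] - x, points[j][1] - y)
                            best = min(best, d)
            # Every point in a cell beyond this ring is at least
            # ring * cell_size away, so it cannot beat `best`.
            if best <= ring * cell_size:
                break
            ring += 1
        result.append(best)
    return result


if __name__ == "__main__":
    # Made-up sample coordinates.
    pts = [(0, 0), (3, 4), (10, 0), (10, 1)]
    print(nearest_neighbor_distances(pts, cell_size=2.0))
```

Grouping the resulting distances by the management polygon each tree falls in
would then give the per-polygon distribution. Note the grid itself still lives
in memory here, which is exactly the kind of limit I am worried about.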


On 10/7/06 10:51 PM, "Hamish" <hamish_nospam at yahoo.com> wrote:

> Eric Patton wrote:
>> I would _highly_recommend trying r.in.xyz if you have not already done
>> so. Especially with LIDAR and other forms of remotely-sensed data.
>> I've had good success with it. Note there is also a parameter in
>> r.in.xyz to control how much of the input map to keep in memory,
>> allowing you to run the data import in multiple passes.
> note the r.in.xyz memory parameter is to help with massive RASTER
> regions, nothing directly to do with the size of the input file.
> The input file is not kept in memory! The output raster grid is.
> I am always looking for feedback on how r.in.xyz goes with massive input
> data. (>2gb? >4gb?)
> Jonathan Greenberg wrote:
>> -But maybe the most important conclusion I've come to for working
>> with really large data sets is that files are not the way to go and
>> that a database serving the application manageable chunks of data is
>> a better option.
> I have not yet met a dataset that .csv + awk couldn't handle in an
> efficient way. Simple, fast, no bells and whistles to cause problems.
> JG:
>> Unfortunately, I was hoping to work in a vector environment with the
>> data -- I'm sure I could think up raster analogs to the analyses I'm
>> trying to do right now,
> I am interested to learn of a form of processing that couldn't be handled by
> SQL query+expression, awk pre-processing, or by raster analog.
> Is your need something that the GRASS 5 sites format could handle or
> something more sophisticated? (grass 5 sites format is just a text file,
> as big as UNIX can handle)
> JG:
>> I am hearing a lot of suggestions about using things like PostGIS and
>> PostgreSQL here and elsewhere,
> PostGIS seems to be widely recommended for massive datasets..
> (tip: PostGIS is just PostgreSQL with a plugin)
> JG:
>> is there a "dummy's guide" to working with these DB instead of
>> shapefiles?
> there is this help page, but no tutorial I know of:
>  http://grass.ibiblio.org/grass63/manuals/html63_user/databaseintro.html
> Probably lots of generic DB tutorials out there. Maybe this would make
> for a nice OSGeo doc project as this is not a GRASS specific need.
> Dylan Beaudette wrote:
>> contact me if you would like some tips on PostGIS, I use it all the
>> time for massive soil survey based analysis.
> A "from scratch" tutorial on setting this up would make a /very/ nice
> help page in the GRASS wiki or a GRASSNews article.
> JG:
>> Is it possible to simply substitute some postgres driven vector DB for
>> a GRASS vector in the GRASS algorithms, or do the v.[whatever]
>> algorithms need to be reworked to support this?
> GRASS vector coordinate info needs to be in GRASS vector format (or
> live-translated with v.external). GRASS vector attributes are stored in
> the DB of your choice. v.* modules don't care what DB you are using
> (they do pass through limitations of the selected DB though [e.g. DBF
> column name length]).
> If your (large) data is just x,y,z (or x,y,[value]) it is probably best
> to skip creating an empty attribute table, there's no need for it.
> If you want to access a large dataset without importing to GRASS, use
> v.external. I notice it can use these drivers: "..,CSV,Memory,..".
> (What's "memory"?) Restating something Moritz has mentioned, I suspect
> you'll encounter problems when GRASS tries to build a map which is a
> derivative of your external data (creating a new massive GRASS vector).
> in summary,
> The best method I can suggest for more than 3 million points is PostGIS
> or .csv+awk for storage and extraction; v.external for simple "GIS"
> cartography of the existing dataset; and r.in.xyz when you want to
> really use the dataset as a whole (upon transition from raw to
> generalized data needs). And skip using an attribute table if you don't
> need one.
> I don't know the best way to pass the data to GRASS's R-stats interface.
> (GRASS 5 sites format? directly (skipping GRASS)?)
> Hamish

Jonathan A. Greenberg, PhD
NRC Research Associate
NASA Ames Research Center
MS 242-4
Moffett Field, CA 94035-1000
Office: 650-604-5896
Cell: 415-794-5043
AIM: jgrn307
MSN: jgrn307 at hotmail.com
