[postgis-users] Massive Lidar Dataset Datatype Suggestions?

Paul Ramsey pramsey at refractions.net
Sat Nov 13 16:45:14 PST 2004


Nice Hardware...

On 13-Nov-04, at 3:09 PM, collin wrote:

> The server we will be using for production is a Sun Fire V440: two 
> 1.7 GHz UltraSPARC processors (SMP), 16 GB RAM, and a 1 terabyte 
> SCSI-320 RAID system.  My test machine was just what I had available 
> to play with.
>
> Some responses below.
>
> Paul Ramsey wrote:
>> On 13-Nov-04, at 12:34 PM, collin wrote:
>>> I am trying to figure out the best setup for storing, extracting, 
>>> and processing this dataset.  btw, it is a smallish dataset; we 
>>> will be processing 2 billion+ point projects in the near future.
>> The key here is "the best setup for storing, extracting and 
>> processing". You are talking about non-trivial amounts of data and 
>> processing tasks, so the decisions you make about storage will have 
>> large downstream effects. The "right decisions" are dictated by what 
>> you are actually going to *do* with the data.  How are you going to 
>> be querying it?  What variables and variable combinations will you be 
>> using?
>
> The primary use will be for internal processing, i.e. point 
> classification.  I am hoping to do this by running windowing queries 
> over the dataset from plpgsql.  Queries of this form are mostly 
> bounding boxes, and I'll mostly be processing Z values and intensity.
>
> For the occasional extraction, we will extract by watershed polygon.  
> This system will not be open to the public, nor will it have multiple 
> transactions occurring simultaneously (much).
>
> I am uncertain what to do with the first and last returns, since they 
> are both 3D points.  Can you have two separate geometry columns in one 
> row?

Yes, no problems there.
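Something along these lines would do it. A rough, untested sketch 
(table and column names here are just placeholders, and -1 is the 
"unknown SRID" convention):

    CREATE TABLE lidar_returns (
        id          integer PRIMARY KEY,
        flight_line integer,
        intensity   integer
    );
    -- one 3D point column per return
    SELECT AddGeometryColumn('lidar_returns', 'first_return', -1, 'POINT', 3);
    SELECT AddGeometryColumn('lidar_returns', 'last_return',  -1, 'POINT', 3);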

>> Do you really need to store every point as a separate row, for 
>> example?  One "easy" way to cut down your storage and index size 
>> would be to store your LIDAR points as MULTIPOINT patches. Simply cut 
>> up your working plane with an arbitrary grid system and patch the 
>> points together based on their x/y values and what grid cell they 
>> fall into.  Depending on the importance of the extra point attributes 
>> for your downstream processing plans, this simplification might be a 
>> very smart one.
>
> This is an interesting idea, but elevation and intensity are the 
> primary information we use, along with which flight line each point 
> came from and whether it is a first or last return.  So I can't see 
> multipoint helping much, unless I'm misunderstanding you.

Well, you could build patches based on homogeneity within the variables 
you are likely to subset with. But then building the patches themselves 
could be a significant effort.
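If you did want to experiment with it, the patching itself can be done 
in plain SQL. Something like the following untested sketch, which 
assumes a point table with a geometry column called the_geom, an 
arbitrary 100-unit grid, and that the collect() aggregate is available 
in your PostGIS build:

    -- group points by grid cell and collapse each cell into a single
    -- MULTIPOINT patch
    CREATE TABLE lidar_patches AS
      SELECT floor(X(the_geom)/100) AS cell_x,
             floor(Y(the_geom)/100) AS cell_y,
             collect(the_geom) AS patch
        FROM lidar_points
       GROUP BY floor(X(the_geom)/100), floor(Y(the_geom)/100);

The per-point attributes get lost inside the patch, of course, which is 
exactly the trade-off you are pointing at.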

> Would some form of physical indexing help? i.e. creating a new table 
> where the points are inserted by proximity?  Or will the indexes work 
> well enough to make this unnecessary?  I ask this because I am not 
> yet convinced that placing the points into a database is necessarily 
> the right way to go.

Well, it will help once the data is loaded; I am mainly concerned about 
the logistics of the index size for a dataset this large.
Remember, you can use the CLUSTER command on your spatial index, which 
will organize your data on disk according to the index structure and 
should speed up reads somewhat. It will take a while to run, though.
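For reference, the steps look roughly like this. Untested sketch; table 
and column names are placeholders, and the GIST_GEOMETRY_OPS opclass 
and the CLUSTER syntax may differ depending on your PostgreSQL/PostGIS 
versions:

    -- build the spatial (GiST) index on the point geometry
    CREATE INDEX lidar_points_gix
        ON lidar_points USING GIST (the_geom GIST_GEOMETRY_OPS);

    -- rewrite the table on disk in index order (slow, and it locks the
    -- table while it runs)
    CLUSTER lidar_points_gix ON lidar_points;

    -- a typical window query: Z and intensity inside a bounding box
    SELECT Z(the_geom) AS z, intensity
      FROM lidar_points
     WHERE the_geom && 'BOX3D(1000 1000, 1100 1100)'::box3d;
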
I am not necessarily convinced either. It will certainly be easier in 
the short run, but if you plan on doing a great deal of this kind of 
processing, a custom access system might well be faster (less flexible, 
but that is the usual trade-off).  Your initial tests on LWGEOM should 
give you an idea about how much to expect from the database.
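As for the watershed extraction, that is just a polygon overlay query. 
Roughly, and again untested (table and column names are placeholders, 
and intersects() may be spelled differently in your PostGIS version):

    -- pull every point that falls inside one watershed polygon
    SELECT p.*
      FROM lidar_points p, watersheds w
     WHERE w.name = 'some_watershed'
       AND p.the_geom && w.the_geom          -- index-assisted bbox filter
       AND intersects(p.the_geom, w.the_geom);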

Paul

>> Really, the key is what your downstream processing regime will be. 
>> Regardless, get LWGEOM working; HWGEOM is really inappropriate for 
>> point data.
>> Paul
>
> I agree completely.  Getting LWGEOM working is my goal for the weekend 
> :-)



