[GRASS5] Multiple attribute support in GRASS 5.1: some considerations (long)

Sun May 13 18:17:47 EDT 2001

aaime wrote:
> 
> Over the last few days I've spent some time thinking about new vector
> capabilities in GRASS 5.1, in particular multiple attribute support and
> DBMS integration. I would like to share my thoughts with you. Please
> forgive my English, I'm not used to writing such a long text and I don't
> have enough time to consult a dictionary and a grammar :-)
> 
> In my opinion we should think first about what kind of functionality we want
> to include in GRASS 5.1 before thinking about what kind of data structure to
> adopt. Here is a possible list of interesting features that one can hope to
> find in a vector GIS:
> A) ability to store multiple attributes and to have them shown by clicking
> on a map, ability to choose which attribute to use when performing
> computations on the map;
> B) ability to support overlay operations on vector data (which also means
> joining attribute tables) (overlay: intersection, erase, identity and so on,
> if you're familiar with ARC/INFO);
> C) ability to query maps both by spatial criteria and by attribute values,
> just like in an SQL query;
> D) ability to relate attribute tables to other non-spatial information
> (a cadastral map with ids referring to a table describing the owners of each
> parcel, and so on);
> E) ability to let concurrent users modify the same data.
> There may be other requests, of course, and I haven't considered 3D vector
> data such as TIN, but I think these are enough to explain my proposals.
> 
> I think that A) and B) are essential requirements if we want to claim that
> GRASS is a vector GIS.

Andrea

Some thoughts.

A) Multi-att support is largely underway in GRASS5.1. The question of
displaying the data and querying through a graphical interface is a
separate question. There is an idea to have a geometric representation
of the map generated on the fly and stored in memory while a map is
being displayed and queried. Also, this could be cached so that an
unchanged, recently accessed map can be quickly redrawn without having
to re-render the data from the topo data (in dig and dig_plus). This
would be optional and the user would determine how much disk space to
allocate to the cache.

Getting a decent (screen) display in the first case isn't something
we've really addressed yet. Development on the GRASS monitor will
continue for now, but maybe we should be thinking about a new GUI
(GTK+/KDE-based, Java, win32, openstep) or perhaps drivers for existing
3rd-party systems to access, display, query, maybe even edit GRASS
databases. This is for beyond 5.1.

B) I think this requires a wholly new approach to get such operations
working correctly and efficiently. Segmented processing of vector maps
and network capabilities are needed here. This is closely related to the
question of improving the build process, so that it can allow updating
of maps without having to do monolithic builds each time. It also seems
to me to be very dependent on C). . .

> C) is so common among vector GIS that I would look with suspicion at a GIS
> that doesn't perform such an operation. D) and E) are usually offered by high
> end systems in conjunction with a DBMS that sports spatial data extensions,
> and may be offered by GRASS if OSVectDB turns into a real system (I think
> that for now it is at the specification level).

dig_plus stores topo data, but not spatial locational data, at least not
in a form that would allow efficient and spatially confined querying. So
the plan is to have a third representation of the map, which would be
essentially an R-tree (or similar) whose data is the type and index of
the entities in the map, and is keyed spatially. A new element of the
GRASS database would be created to store this - dig_spatial or something
similar. For this we need to establish some kind of generic spatial tree
with an API that allows easy access.
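
To make the idea a little more concrete, here is a rough sketch of a
spatially keyed lookup whose payload is just the type and index of each
entity. It is Python rather than GRASS C, a uniform grid stands in for the
R-tree to keep it short, and GridIndex and its methods are invented names
for illustration, not a proposed API:

```python
from collections import defaultdict


class GridIndex:
    """Toy spatial index: grid cells map to (type, index, bbox) entries."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)

    def _cells_for(self, xmin, ymin, xmax, ymax):
        # Enumerate every grid cell touched by the bounding box.
        cs = self.cell_size
        for i in range(int(xmin // cs), int(xmax // cs) + 1):
            for j in range(int(ymin // cs), int(ymax // cs) + 1):
                yield (i, j)

    def insert(self, entity_type, entity_index, bbox):
        # bbox is (xmin, ymin, xmax, ymax) of the entity.
        for cell in self._cells_for(*bbox):
            self.cells[cell].add((entity_type, entity_index, bbox))

    def query(self, window):
        # Return (type, index) of entities whose bbox overlaps the window.
        wx0, wy0, wx1, wy1 = window
        hits = set()
        for cell in self._cells_for(*window):
            for etype, idx, (x0, y0, x1, y1) in self.cells.get(cell, ()):
                if x0 <= wx1 and x1 >= wx0 and y0 <= wy1 and y1 >= wy0:
                    hits.add((etype, idx))
        return sorted(hits)


index = GridIndex(cell_size=100.0)
index.insert("LINE", 1, (10.0, 10.0, 250.0, 40.0))
index.insert("AREA", 1, (300.0, 300.0, 420.0, 380.0))
index.insert("SITE", 7, (35.0, 20.0, 35.0, 20.0))
result = index.query((0.0, 0.0, 50.0, 50.0))
```

The query only returns references (type and index); the geometry itself
would still be read from the existing dig files.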

> Now, let's see what kind of data structure we can use in order to support A),
> B), and C) functions.
> To support only A) and B), plain files are a good solution as long as the
> amount of data involved is not large. There are many possibilities, but I
> think that DBF files are a good solution. Why?
> * they are binary files, so access is faster than with ASCII files;
> * they are quite standard, almost any spreadsheet can read them and most DBMS
>   have some way to import them (well, at least as far as commercial DBMS are
>   concerned);
> * they don't require any growth in our software base, since we already have
>   a library to access them: shapelib.
> Although shapelib has limited capabilities when it comes to managing DBF
> files, I think that it does what is needed. So we could store only one index
> together with the geometric data and have all attributes stored in the DBF
> file. That's a simple solution, but it also seems effective when only the A)
> and B) requirements are considered.

I agree we should move away from text file representations as they are
too slow - except maybe for single-stage operations like import and
export.
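
One reason fixed-width binary records help is that record n can be reached
with a single seek instead of parsing everything before it. A minimal
sketch, in Python for brevity; the record layout (category, x, y) is an
assumption made up for this example and is not the DBF or GRASS format:

```python
import io
import struct

# Hypothetical fixed-width record: category (int32), x and y (float64).
RECORD = struct.Struct("<idd")


def write_records(stream, rows):
    # Append each (cat, x, y) row as one packed record.
    for cat, x, y in rows:
        stream.write(RECORD.pack(cat, x, y))


def read_record(stream, n):
    # Jump straight to record n; no earlier record is parsed.
    stream.seek(n * RECORD.size)
    return RECORD.unpack(stream.read(RECORD.size))


rows = [(1, 10.5, 20.25), (2, 11.0, 21.0), (3, 12.75, 22.5)]
binary_file = io.BytesIO()  # stands in for a file on disk
write_records(binary_file, rows)
third = read_record(binary_file, 2)
```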

> If you also consider the C) requirements, DBF files are not the best choice,
> since they don't support access through SQL. I think that here a DBMS is
> necessary, since we get the power of SQL queries for free. Berkeley DB is not
> a solution because it doesn't support SQL. PostgreSQL is, and through its
> referential integrity capabilities it would allow us to support the E)
> requirement as well. If we want to stick with DBF files we have to choose
> whether to build minimal SQL support into GRASS by hand or not let the user
> perform queries unless a real DBMS is used. SQL support based on DBF files
> would be slow anyway, because one has to do a sequential scan of the
> attribute files, whereas a DBMS can use indexes and a built-in query
> optimizer.
> 
> A solution based on storing topological information in our classic
> files and attributes in a database (DBF or Postgres) seems to me a good
> choice. But it's not enough.
> When it comes to giving good support to overlay and spatial queries you also
> have to think of a fast way to perform them: spatial indexes are
> the solution, and there are some ready-made libraries that can build
> R-trees... the spatial index would be stored in a separate file. So, one file
> for the topology, one file for attributes and one (optional) file for the
> spatial index. Since performance is optional, we could add spatial index
> support later (say in GRASS 6) and do sequential scans in the meantime.
> 

As in my remarks above. . . but it might be possible (through server
side programming perhaps) to allow the tables to access the dig_spatial
file. This has been mentioned on the list before, at least for
PostgreSQL.

An example:- a table could contain a field that has information about
the size and offset of a data block in dig_spatial containing the
necessary data. Then a query could look up some records by index or
querying on another field, extract this information, perform a spatial
query on the R-tree (with its internal API), and then return the entity.
This may then be part of another map. Or - a direct spatial query could
look up the R-tree and return the records which are then queried with
SQL, returning a result. And of course it is all 'behind the scenes',
giving the appearance of integrated spatial and data querying.
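
A rough sketch of both directions, under loud assumptions: SQLite (from the
Python standard library) stands in for PostgreSQL, a bytes buffer stands in
for the dig_spatial file, each block is just one packed bounding box, and a
linear scan sits where the R-tree lookup would go. The table, column and
function names are all invented for the example:

```python
import sqlite3
import struct

# Hypothetical dig_spatial stand-in: packed (xmin, ymin, xmax, ymax) blocks.
RECORD = struct.Struct("<4d")
boxes = [(0.0, 0.0, 10.0, 10.0), (50.0, 50.0, 80.0, 90.0), (5.0, 5.0, 20.0, 20.0)]
dig_spatial = b"".join(RECORD.pack(*b) for b in boxes)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE parcels (cat INTEGER PRIMARY KEY, owner TEXT, "
           "blk_offset INTEGER, blk_size INTEGER)")
db.executemany(
    "INSERT INTO parcels VALUES (?, ?, ?, ?)",
    [(cat, owner, i * RECORD.size, RECORD.size)
     for i, (cat, owner) in enumerate([(1, "Rossi"), (2, "Bianchi"), (3, "Rossi")])],
)


def read_block(offset, size):
    # Fetch one spatial block using the offset and size stored in the table.
    return RECORD.unpack(dig_spatial[offset:offset + size])


def overlaps(a, b):
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


def parcels_of(owner):
    # Attribute query first, then look up the spatial block of each record.
    rows = db.execute("SELECT cat, blk_offset, blk_size FROM parcels "
                      "WHERE owner = ? ORDER BY cat", (owner,))
    return [(cat, read_block(off, size)) for cat, off, size in rows]


def owners_in(window):
    # Spatial filter first (linear scan here, an R-tree in the real thing),
    # then an SQL query restricted to the records that were hit.
    cats = [cat for cat, off, size in
            db.execute("SELECT cat, blk_offset, blk_size FROM parcels")
            if overlaps(read_block(off, size), window)]
    marks = ",".join("?" * len(cats))
    rows = db.execute(f"SELECT DISTINCT owner FROM parcels WHERE cat IN ({marks}) "
                      "ORDER BY owner", cats)
    return [r[0] for r in rows]
```

For example, parcels_of("Rossi") goes from attributes to geometry, while
owners_in((40.0, 40.0, 60.0, 60.0)) goes from a spatial window to
attributes.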

> Now, I would also like to offer some criticism of site data:
> * access is slow, mainly because sites are kept in ASCII format and because
>   the data structure can vary from record to record (-> the site format is
>   now too flexible);
> * the site API is not the best part of the GIS library, in my humble opinion,
>   but that is mainly due to the poor file structure.
> Why treat line, polygon and point data in a different way? Wouldn't it be
> possible, and more efficient, to store coordinates and an index into a binary
> file and put all the attributes into a DBF file? Or in a table inside a DBMS?
> 

Currently (in 5.0) :

Point data are stored as lines of two points with the same start and end
point. Weird! In 5.1 we have two point types - a SITE and a CENTROID.
This is in the vector map, not the site lists.

Lines, really polylines, are the atomic units of most GRASS maps, linear
and area. Two types exist in 5.0 and will be the same (essentially) in
5.1: LINE and BOUNDARY. These represent the arc segments of the networks
or two-dimensional manifolds that make up GRASS vector maps. LINE maps
should conform to the definition of a NETWORK, and area maps (with
polygons) to a 2-dimensional manifold (2DM). I think this is how it is
just now, but we should make sure it is followed strictly. 

Areas or polygons are a kind of composite entity, as are islands, which
are composed during the build process from indexed references to their
sub-components. I think one of the shortcomings of the topological
format at the moment is that it allows only these derivative types,
while I think many others would be useful, e.g. (as a minimum) linear
aggregates, to allow entities that reference branched structures as a
single unit, like tributary systems. This would be _in addition to_, not
instead of, providing categories for the individual arcs. Multipoint
sets are also required, e.g. a set of relevee locations may constitute a
vegetation community description, where the individual sites might have
their own quite distinct sets of attributes. It works the other way
round too: why can't area boundaries have their own attribute sets separate
from those of the area itself? There is occasionally a need for this,
though it's rarer than the other examples above.
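
As a sketch of what I mean by a linear aggregate (Python, illustrative only;
the class names, the kind strings other than LINE and BOUNDARY, and the
attribute values are all made up): the composite refers to its arcs by
index, and each arc keeps its own category.

```python
from dataclasses import dataclass, field


@dataclass
class Arc:
    kind: str            # "LINE" or "BOUNDARY"
    coords: list         # [(x, y), ...]
    cat: int = 0         # category of the individual arc


@dataclass
class Composite:
    kind: str            # e.g. "AREA", "LINEAR_AGGREGATE", "MULTIPOINT"
    members: list        # indices into the list of arcs
    attrs: dict = field(default_factory=dict)  # attributes of the whole unit


arcs = [
    Arc("LINE", [(0, 0), (5, 5)], cat=11),    # main stem
    Arc("LINE", [(5, 5), (9, 6)], cat=12),    # tributary
    Arc("LINE", [(5, 5), (4, 10)], cat=13),   # tributary
]
river = Composite("LINEAR_AGGREGATE", [0, 1, 2], {"name": "example river system"})


def member_cats(composite, arcs):
    # The aggregate has its own attributes; the arcs keep their categories.
    return [arcs[i].cat for i in composite.members]


cats = member_cats(river, arcs)
```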

> Using binary files would give us a huge performance improvement, and
> smaller files. I saw this at the GRASS Day 2001 in Trento, Italy: someone had
> an implementation of a site API and format that stores all data in a binary
> file that also happens to be a quadtree (a fast way to store and index point
> data -> they performed spatial queries really fast, it was
> impressive). I think that he's willing to donate that API to GRASS, he
> seemed only concerned about stability and code quality.
> 
> Using DBMS tables or DBF files, every record would get the same attributes,
> and we would have attribute names too -> this would also lead to a cleaner
> site API.
> You should also consider that this way line, polygon and site management
> would share some code, leading to a smaller gis library (that also means
> less to maintain, a nice feature in the long run). This would also lead to
> easier attribute management when it comes to using polygon and site
> data at the same time (I'm thinking about Voronoi diagrams, but also
> of overlays between polygon and site data).
> 

The hierarchy of entities in the dig_plus structure is a good candidate
for OOP treatment, and polymorphism through inheritance. That would
reduce code, but not necessarily binary size. It would also make a more
versatile and manageable data structure: but it seems we don't do that!
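
Roughly the kind of thing I have in mind, sketched in Python only because
it is short to write; the class layout is hypothetical and not how dig_plus
is organised today. Shared behaviour lives once in the base class and the
subtypes add only what differs:

```python
class Entity:
    """Base class: code shared by every vector entity lives here."""

    type_name = "ENTITY"

    def __init__(self, coords, cat=0):
        self.coords = list(coords)
        self.cat = cat

    def bbox(self):
        # One implementation serves sites, lines and boundaries alike.
        xs = [x for x, _ in self.coords]
        ys = [y for _, y in self.coords]
        return (min(xs), min(ys), max(xs), max(ys))

    def describe(self):
        return f"{self.type_name} cat={self.cat} with {len(self.coords)} vertices"


class Site(Entity):
    type_name = "SITE"

    def __init__(self, x, y, cat=0):
        super().__init__([(x, y)], cat)


class Line(Entity):
    type_name = "LINE"


class Boundary(Line):
    type_name = "BOUNDARY"

    def is_closed(self):
        # True when the boundary starts and ends at the same vertex.
        return self.coords[0] == self.coords[-1]


entities = [Site(3.0, 4.0, cat=1), Line([(0, 0), (2, 5)], cat=2),
            Boundary([(0, 0), (4, 0), (4, 4), (0, 0)], cat=3)]
summaries = [e.describe() for e in entities]
```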

Regards

David

> What are your opinions?
> Regards
> Andrea Aime
> 
> ----------------------------------------
> If you want to unsubscribe from GRASS Development Team mailing list write to:
> minordomo at geog.uni-hannover.de with
> subject 'unsubscribe grass5'
