[GRASS-dev] 6.x Vector's wording commented

Thu Nov 30 18:43:01 EST 2006

Hello,

In the following, I have tried to comment on the _wording_ (the glossary)
of the GRASS GPL documentation of the new vector engine. I would greatly
appreciate it if someday someone commented on the KerGIS documentation
too, since both projects are based on CERL GRASS and both provide
something particular: topological GIS.

Some may wish that open source/free software provided for "free" the
equivalent of the commercial products. But I, personally, did not
choose open source software for "free" as in "talking about freedom
after the twelfth gratis beer". Nor do I want to make clones of
standard commercial GIS software. I am not that impressed by the bulk
of them: I tried to do something hard with them and never managed to
(with a reasonable investment in time and money), while it was achieved
with GRASS GPL 5.0, in its time and on time.

Topology is the singularity of CERL GRASS based GIS. And it's a good
one. Libre software has been blamed for neither improving things nor
introducing outstanding new schemes, but simply mimicking, or even
"faking". This is not true if one considers CERL GRASS, for example, or
the CSRG BSD improvements, and so on. It is more true of the bulk of the
recent production of so-called free software, which is poorly
innovative altogether.

But with GRASS GPL and KerGIS this can/could change, precisely
because we are on our own track, provided we keep improving our own
scheme. Even if I have suffered from the spite of some French GRASS
users (and never from developers... perhaps because people _working_
may disagree, but respect other people _working_...), I wholeheartedly
wish the best to GRASS GPL, not because I'm a nice guy, but because
this is in my interest: GRASS GPL losing users will not increase the
KerGIS base; and GRASS GPL providing great things will give me, and
others, an incentive to improve KerGIS too. And if I make momentary
mistakes, there will be a sibling project to rely upon. I think the
converse holds too. Not competition: emulation; another point of view
to react with or against.

In what follows, I comment on the wording of the documentation, _not_
on the engineering decisions or the implementation of your new vector
engine, since that is none of my business. My comments are
based on my knowledge of the legacy GRASS and I hope that people will
find some explanations "obvious". It will mean that the explanation is
the simplest one, that is the shortest path to the truth of its object
(this definition is not mine; it was given by Charles de Gaulle). I
don't think that what is below is perfect, since I have limited time,
and since it is in approximate English. A French version would,
obviously, better match what I have in mind.

If things seem obvious now that they are explained, they were not,
at least in my mind, some years ago. I explicitly state things that
were implicit or embedded in the code, or I introduce supplementary
notions that, IMO, give sense to the whole thing.

---quote
Background

   Generally,  the  vector  data  model  is  used  to describe geographic
   phenomena  which may be represented by geometric entities (primitives)
   like  points,  lines,  and areas. The GRASS vector data model includes
   the  description of topology, where besides the coordinates describing
   the  location  of  the  points, lines, boundaries and centroids, their
   spatial relations are also stored. In general, topological GIS require
   a  data structure where the common boundary between two adjacent areas
   is stored as a single line, simplifying the map maintenance.
---endquote

The place of the vector description of geometrical data within the GIS
as a whole could be stated like this:

GRASS/KerGIS is a _system_, that is a coherent set of programs meant to
work together in the same environment, used to add value to geometrical
descriptions, either by deducing topological or logical properties from
the geometrical descriptions, or by adding non-geometrical attributes
to geometrical data.

The system accepts and treats geometrical data in 3 main flavors:

	1) By erudition: no rules are known, only a succession of
	point-like facts. This is the raster description. Since,
	mathematically, there may be an infinite number of points lying in
	a two-dimensional region, the single values are not points, but
	represent a definite rectangular area with a fixed width (x) and
	height (y) named a CELL. The main utilities of the GIS take
	these raw facts and deduce logical relationships, for example
	deducing watersheds etc.

	2) By describing exhaustively the organization of the space in the
	region of interest, giving rules to group points into
	one-dimensional geometrical elements. The element of the vector is
	an arc, that is an oriented set of vertices. The vector is an
	exhaustive description of a finite number of arcs with almost
	infinite precision.

	3) By describing a set of singularities (SITES) that may be either
	singular sites of interest, or an approximation of some aspect of
	a whole region given by a finite number of weighted
	singularities. SITES are used as themselves, or as a
	transformation description.

The IMAGERY is _not_ a fundamental geometrical description. It is a set of
utilities to put an incorrect raster description into a canonical state,
that is, mainly, to ensure that the CELLs are homogeneous, having a fixed
width, height and comparison plane.

Note about your description: the first line says that:
   points,  lines,  and areas are the primitives.

This is true but, since you have introduced 3D, you should add: volumes.
But after that, you name _centroids_: centroids are _not_ geometrical
elements; they are not primitives. This is inconsistent. The distinction
should be introduced later.

The next section is the most problematic one, since distinct notions are
merged and blurred, and since some definitions are circular. Some word
choices are problematic too, when it comes to the "category".

---quote
Introduction

[skipping what is the description of the features and that are
engineering decisions]

   The following vector objects are available:

     * point: a point
     * line:  a  sequence  of  vertices  connected  by  line(s)  with two
       endpoints called nodes
     * boundary: the border line of an area
     * centroid: the label point of an area
     * area: the topological composition of centroid and boundary
     * face: a 3D area
     * kernel: a 3D centroid in a volume (not yet implemented)
     * volume: a 3D corpus (not yet implemented)

---endquote

Some features in the GRASS 6.x vector engine will be hard to describe,
because (I think this is due to the support of external formats) you are
mixing things that are not in the same league.

Taking a mathematical standpoint for such a package is always a good
start. The aim is to gain an orthogonal base, that is a set of
independent primitive elements.

Is a centroid independent from the area? No. Is the boundary independent
from an area? No. So area, centroid and boundary do not make an
orthogonal base.

Furthermore, it will be clearer to describe first, and separately, the
_geometrical_ elements, and to introduce attributes afterwards, that is
the assignment of non-geometrical properties.

I will restate my initial description:

1) the vector element is an ARC, that is an oriented vector of vertices.

2) the vertices can be seen as control points. The nature of the
one-dimensional element drawn according to these control points can be
called the functional type (in KerGIS V_FTYPE_*). There was (is) only
one functional type supported: V_FTYPE_LINE (polylines, or segments).

3) the topological type of the ARC tells what _geometrical_ element to
deduce from the ARCS. Topological types are : V_TTYPE_DOT, V_TTYPE_PATH,
V_TTYPE_EDGE (were respectively DOT, LINE and AREA in legacy code).

4) the geometrical figures are deduced from the previous information.
The geometrical figures are V_GTYPE_POINT, V_GTYPE_LINE, V_GTYPE_AREA.

5) the legacy categories are associated with _geometrical figures_ by
_topological_ means: a point "on" an element (nearer to this element
than to any other one) [in KerGIS, categories are now group numbers,
and labels are group names, à la DNS].
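
The five points above can be sketched as a small data model. This is
only an illustration in Python, not KerGIS or GRASS code: the class and
function names (FType, TType, GType, Arc, figure_for) are hypothetical,
and only the V_FTYPE_*, V_TTYPE_* and V_GTYPE_* names in the comments
come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class FType(Enum):
    """Functional type: how the curve is drawn between vertices (V_FTYPE_*)."""
    LINE = "line"  # straight segments, the only type mentioned as supported


class TType(Enum):
    """Topological type of an ARC (V_TTYPE_*)."""
    DOT = "dot"
    PATH = "path"
    EDGE = "edge"


class GType(Enum):
    """Geometrical figure deduced from ARCs (V_GTYPE_*)."""
    POINT = "point"
    LINE = "line"
    AREA = "area"


@dataclass(frozen=True)
class Arc:
    """An oriented sequence of vertices, stored as (x, y) tuples."""
    vertices: tuple
    ttype: TType
    ftype: FType = FType.LINE


def figure_for(ttype):
    """Return the kind of figure that ARCs of this topological type yield.

    A DOT or PATH arc yields its figure on its own; EDGE arcs only yield
    an area as a closed set, so this tells the kind, not the instance.
    """
    return {
        TType.DOT: GType.POINT,
        TType.PATH: GType.LINE,
        TType.EDGE: GType.AREA,
    }[ttype]


road = Arc(vertices=((0.0, 0.0), (1.0, 0.0), (2.0, 1.0)), ttype=TType.PATH)
```

With these definitions, figure_for(road.ttype) gives GType.LINE: the
stored element is the ARC, and the geometrical figure is deduced from it.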

COMMENT: so a "line" is not a "sequence of vertices connected by lines"
since this definition is circular. The ARC is the way the information is
stored and the topology deduced. The geometrical elements are [I use
KerGIS macro definitions, you should adapt them to your terminology]:

1) a point: made by an ARC of type V_TTYPE_DOT;
2) a path: made by an ARC of type V_TTYPE_PATH. The only
functional type supported at the moment is the straight line between
two vertices, hence a path appears as a polyline. There could be more
one day (note: a curve, or Bézier curve, is given between two points
with the addition of two control points; so the definition will hold
in this case too, but the implementation of the ARC structures will
not suffice; the simplest definition I have ever found is in the
METAFONTbook by Donald E. Knuth).

3) an area: made by a closed set of connected ARCs of type V_TTYPE_EDGE,
with no connected ARC not belonging to the set lying inside the hull
described by the set [this is true for legacy GRASS and KerGIS; is it
still true for GRASS GPL or do you allow non topologically clean data?].
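
The "closed set of connected ARCs" condition can be sketched as below.
This is an illustrative check only, under the assumption that arcs are
given as lists of (x, y) vertices and that nodes match exactly; the
function name forms_closed_ring is hypothetical. It does not test the
second condition (no connected ARC lying inside the hull).

```python
from collections import Counter


def forms_closed_ring(arcs):
    """Return True if the arcs chain end to end into one closed ring.

    Each arc is a sequence of (x, y) vertices; only its two end nodes
    matter here. Every node must be shared by exactly two arc ends, and
    walking from node to node must use every arc exactly once.
    """
    if not arcs:
        return False
    ends = Counter()
    for arc in arcs:
        ends[arc[0]] += 1
        ends[arc[-1]] += 1
    if any(count != 2 for count in ends.values()):
        return False

    remaining = list(arcs)
    first = remaining.pop()
    start, node = first[0], first[-1]
    while node != start:
        for i, arc in enumerate(remaining):
            if arc[0] == node:
                node = arc[-1]
                break
            if arc[-1] == node:
                node = arc[0]
                break
        else:
            return False  # no arc continues from this node
        remaining.pop(i)
    return not remaining  # leftover arcs would mean a second, separate ring


south_east = [(0, 0), (1, 0), (1, 1)]
north_west = [(1, 1), (0, 1), (0, 0)]
```

Here forms_closed_ring([south_east, north_west]) holds, while either
arc alone is open and does not delimit an area.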

There is a subtlety here: an area is an element topologically deduced
from one-dimensional elements (ARCs). In the definition given above, the
vector description is exhaustive. But it is exhaustive about the
elements that are the ARCs. Each ARC belonging to the vector is either
taken into account or not, depending on whether it is DEAD or ALIVE. A
dead arc (in the legacy CERL GRASS and in KerGIS; I don't know
for GRASS GPL) is not taken into account when building the support
or plus file.

But among the geometrical elements deduced from the ARCs that make up
the description, some may _not_ belong to the vector set: the
elements of dimension 2 (areas) or 3 (volumes).

Indeed, if one takes a vector with an isle, there are two legitimate
areas: the inner isle, or the complement (the "infinite" space minus the
isle, that is the "outside" area).

Hence a means must be given to describe what belongs to the set and,
by deduction, what does _not_ belong to the set. This is why there
is a need to give a supplementary piece of information (see theorem 9
of David Hilbert's Grundlagen der Geometrie) to tell what areas are
alive: a point "inside". This point is not unique, and is not a
geometrical primitive: it is a geometrical attribute.
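
The role of the "inside" point can be illustrated with a standard
ray-casting test, chosen here for illustration only (it is not claimed
to be what either code base does). The names point_in_ring and
side_selected are hypothetical, and the ring is assumed to be a simple
closed polygon given as a list of (x, y) vertices.

```python
def point_in_ring(point, ring):
    """Ray casting: count crossings of a horizontal ray going right."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only segments that straddle the ray's y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def side_selected(label_point, ring):
    """Tell which of the two legitimate areas the label point designates."""
    return "isle" if point_in_ring(label_point, ring) else "outside"


isle = [(0, 0), (4, 0), (4, 4), (0, 4)]
```

Both (2, 2) and (1, 3) designate the isle, which shows that the point
is not unique; (5, 2) designates the complement.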

CONSEQUENCE: there is an apparent lack of symmetry between dimensions 0
and 1, and dimensions 2 and 3, since a group [KerGIS terminology, legacy
cat] of 0 is valid for a point or a path, but not for an area or a
volume.

The "centroid" or the "boundary" are not primitive elements, they are
geometrical properties of the area primitive.

NOTE: from the discussion about v.category, it was clear that there
is perhaps, sometimes, a need to obtain the contour of an area. But
the boundary is _not_ a primitive element and does not belong here,
even if the information sometimes has to be retrieved (that is mainly
what the v.out.* utilities do). This is made even clearer by
the fact that this definition does not scale.

Indeed, what is an area contour? The outer boundary, ok. But this is
simply the particular case of the outer boundary of a _group_ of
elementary areas, where the group is reduced to a sole element.
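
A small sketch of this point, under the assumption that each elementary
area is described by the set of ARC identifiers bounding it (the names
areas, group_contour and the identifiers are all hypothetical): the
contour of a group is made of the ARCs used by exactly one area of the
group, and a single area is just a group of one. Note that this gives
the whole contour of the group, inner rings included if the group has
holes.

```python
from collections import Counter


def group_contour(areas, group):
    """Return the ARC ids bounding a group of elementary areas.

    An ARC shared by two areas of the group is interior to the group
    and drops out; an ARC used by only one of them is on the contour.
    """
    counts = Counter(arc for name in group for arc in areas[name])
    return {arc for arc, count in counts.items() if count == 1}


# Two adjacent parcels sharing one ARC.
areas = {
    "A": {"a_south", "a_west", "a_north", "shared"},
    "B": {"b_south", "b_east", "b_north", "shared"},
}
```

group_contour(areas, ["A"]) returns all four ARCs of A, while
group_contour(areas, ["A", "B"]) returns the six outer ARCs and drops
"shared".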

Once you have introduced and explained how everything is constructed,
from (level 1) ARCs to (level 2) geometrical elements, you can explain
how to link non-geometrical attributes to groups of geometrical elements.

That's why KerGIS has replaced "category" by "group" since it is the
fundamental idea that was underneath, and since grouping is not only
useful with external attributes: it is also useful for grouping
geometries.

---quote
   For historical reasons, there are two internal libraries for vector:

     * diglib, dig_*(), DIGLIB, libdig.a, digit library, grass3.x, 4.x
       and
     * Vlib, Vect_*(), VECTLIB_REAL, libvect.a, vector library, grass4.x

   The  Vlib  Vector  library was introduced in grass4.0 to hide internal
   vector  files'  formats  and  structures.  In  GRASS  6  everything is
   accessed via Vect_*() functions, for example:

   Old 4.x code:

    xx = Map.Att[Map.Area[area_num].att].x;

   New 6.x functions:

    Vect_get_area_centroid()
    Vect_get_centroid_coor()
---endquote

Actually, there are 3 levels in the vector module:

1) V0: the handling of arrays of points, used for purely geometrical
manipulation and, for example, used when importing from other formats;

2) V1: related to the way the geometrical information is described,
that is the handling of the ARCs;

3) V2: used when building or accessing information deduced from the
ARCs.

The Vect_* functions, in the legacy code and in KerGIS, are related to
the higher level GIS database handling of a vectorial element (which is
composed, indeed, of several distinct files spread in dedicated places
in the GIS database). The Vect_* functions are related to the vector map
as a whole (an atom): opening and initializing it. Roughly speaking,
these functions do not handle the inner organization of the files but
build the higher level abstraction, gathering the information spread in
the database.

The V1, V2 and Vect_ levels were introduced by Dave Gerdes for 4.0. For
completeness, I have added V0. Rule of thumb: Dave Gerdes was the main
father of the vectorial engine. If he decided to introduce these
distinctions, it is probably good to consider them. They are sound. May I
suggest that, in the future (with no obligation to change the
implementation), the GRASS GPL team reintroduce this naming scheme?

Historical note: the dig_* functions were mainly associated with the V0
and V1 levels.

---quote
Vector library categories and layers

   Note: "layer" was called "field" in earlier version.

   In  GRASS  a  "category"  is  a  feature ID used to link geometry with
   attributes  stored  in  one or many (external) database table(s). Each
   vector   feature   inside   a   vector  map  has  zero,  one  or  more
   <layer,category>  tuple(s). A user can (but not must) create attribute
   tables  which  are  referenced  by  the  layer,  and  rows  which  are
   essentially referenced by the <layer,category> pair.
---endquote

Traditionally (in CAD), a layer is a set of elements grouped by logical
functionality. A French translation is sometimes "calque" (translucent
paper, tracing paper) because, before computers, the tracings of
elements were made on translucent papers, and when one wanted to see how
the distinct levels fit together, one stacked the translucent papers one
on another to see the combination (this matches the meaning of layer
when displaying).

IMHO, since you introduce "layer" for non-geometrical attributes with
RDBMS-style management, why don't you simply say: table instead of
layer or field, and column_value instead of "category"?

I may be wrong about my interpretation but I must say that from your
description I understand absolutely nothing about the way attributes are
handled :) And the wording does not help.

---quote
Vector library and Attributes

   Note: "layer" was called "field" in earlier version.

   The  old GRASS 4.x 'dig_cats' files are not used any more and vectors'
   attributes  are  stored  in  external  database.  Connection  with the
   database  is  done  through  drivers based on DBMI library (odbc, dbf,
   PostgreSQL and MySQL drivers are available at this time). Records in a
   table  are linked to vector entities by field and category number. The
   field  identifies  table and the category identifies record. I.e., for
   any  unique  combination  map+mapset+field+category,  there exists one
   unique combination driver+database+table+row.
---endquote

I do not understand: do you link by rows? That is, if I have, say, an
area with category number 123, that has an equivalent text (for KerGIS
this is group and group name) "74001A2574" ("74001" being the INSEE code
of a French town in département 74, "A" the section, and "2574" the
number of the parcel in this section), and I have several records (rows)
to link to the area, do I need to duplicate the "categories"?

Here "field" is still used instead of "layer". But is the "category" a
number, a text, the value of a column or what?
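
To make the question concrete, here is one literal reading of the quoted
sentence, sketched in Python. It is only an interpretation, not a
description of how GRASS GPL actually works: the map name, mapset,
database path, table name, row numbers and the second category 124 are
all invented for the example.

```python
# One entry per unique map+mapset+field+category, each giving one
# unique driver+database+table+row, as the quoted sentence states.
links = {
    ("parcels", "PERMANENT", "1", 123): ("dbf", "/path/to/db", "owners", 0),
    ("parcels", "PERMANENT", "1", 124): ("dbf", "/path/to/db", "owners", 1),
}

# Under this reading, an area that must reach two rows of the same
# table has to carry two <field, category> tuples.
parcel_cats = [("1", 123), ("1", 124)]


def rows_for(cats, map_name, mapset):
    """Collect the rows reachable from a feature's <field, category> tuples."""
    found = []
    for field, category in cats:
        key = (map_name, mapset, field, category)
        if key in links:
            found.append(links[key])
    return found
```

If this reading is right, rows_for(parcel_cats, "parcels", "PERMANENT")
yields two rows only because the parcel has two categories, which is
exactly the duplication I am asking about.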

---quote
map[@mapset] field table [key [database [driver]]]

   If key, database or driver are omitted (on second and higher row only)
   the last definition is used. Definitions from DB file in other mapsets
   may  be overwritten by a definition in the current mapset if mapset is
   specified with map name.

   Wild cards * and ? may be used in map and mapset names.

   Variables  $GISDBASE,  $LOCATION, $MAPSET, $MAP, $FIELD may be used in
   table,  key,  database  and driver names. Note that $MAPSET is not the
   current mapset but mapset of the map the rule is defined for.

   Note  that  features in GRASS vectors may have attributes in different
   tables  or may be without attributes. Boundaries form areas but it may
   happen  that some boundaries are not closed (such boundaries would not
   appear  in  polygon  layer). Boundaries may have attributes. All types
   may be mixed in one vector.

   The  link to the table is permanent and it is stored in 'dbln' file in
   vector directory. Tables are considered to be a part of the vector and
   g.remove, for example, deletes linked tables of the vector. Attributes
   must be joined with geometry.

   Examples:  Examples  are  written  mostly  for  the  dbf driver, where
   database  is  full path to the directory with dbf files and table name
   is the name of dbf file without .dbf extension.

* 1 tbl id $GISDBASE/$LOCATION/$MAPSET/vector/$MAP dbf

   This definition says that entities with category of field 1 are linked
   to  dbf  tables with names tbl.dbf saved in vector directories of each
   map.
---endquote

Here "field" is still used in place of "layer".
Why don't you use a BNF description, making the distinction between
productions and terminals? Why don't you use meaningful examples?

Is "tbl" a reserved word? A production? A value? If it is the name of a
table, why don't you use "name_of_my_table" for example? Is "id" a
reserved word, a production? Why use such a generic term and not, if
it is a column name, "column_name" for example?
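
For what it is worth, here is how the example line reads if one takes
the quoted synopsis "map[@mapset] field table [key [database [driver]]]"
purely positionally. This is a sketch of that reading only, not the
actual GRASS parser: the function name parse_link_rule is hypothetical,
and neither wild cards nor $VARIABLE expansion nor the "last definition
is used" rule are handled.

```python
def parse_link_rule(line):
    """Split one rule line into named parts, by position only."""
    tokens = line.split()
    if not 3 <= len(tokens) <= 6:
        raise ValueError("expected between 3 and 6 whitespace-separated parts")
    map_part, field, table = tokens[:3]
    # key, database and driver are optional, in that order.
    optional = tokens[3:] + [None] * (6 - len(tokens))
    map_name, _, mapset = map_part.partition("@")
    return {
        "map": map_name,
        "mapset": mapset or None,
        "field": field,
        "table": table,
        "key": optional[0],
        "database": optional[1],
        "driver": optional[2],
    }


rule = parse_link_rule("* 1 tbl id $GISDBASE/$LOCATION/$MAPSET/vector/$MAP dbf")
```

Read this way, "tbl" lands in the table position and "id" in the key
position, so both would be plain values (a table name and a column
name) rather than reserved words; a BNF would have said so directly.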

I may not be a genius, but I think I am not below average. And when
"parsing" your description with my C mind, it dumps core! And I will
not give you the result, since it contains private information ;)

Hope this helps,
-- 
Thierry Laronde (Alceste) <tlaronde +AT+ polynum +dot+ com>
                 http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C