[GRASS-dev] v.univar question: Why not lines and areas?

Wed Jan 30 01:38:52 EST 2008

On 30/01/08 02:43, Michael Barton wrote:
> 
> On Jan 29, 2008, at 5:12 PM, Moritz Lennert wrote:
> 
>> On 28/01/08 16:22, Michael Barton wrote:
>>> On Jan 28, 2008, at 5:50 AM, Moritz Lennert wrote:
>>>> On 27/01/08 20:30, Michael Barton wrote:
>>>>> v.univar only works with points. But since it is calculating
>>>>> stats on a field in the attributes table, it should work the same
>>>>> for all vector objects. Can we get rid of the limitation that it
>>>>> only works with points?
>>>> There was some debate [1] about the statistical validity of working
>>>> with the other types, as the way it was programmed, the statistics
>>>> were calculated with weights which corresponded to line length /
>>>> area surface.
>>>> I guess we might want to distinguish between a v.univar which works
>>>> on the actual vector objects from a v.db.univar which works on any
>>>> arbitrary attribute (or combination of attributes). We could write
>>>> a C replacement of the current v.db.univar script on the basis of
>>>> the code I have for the classification algorithms used in v.class.
>>> AFAICT, v.univar does not calculate anything from vector topology,
>>> only from an attribute column.
>> [...]
>>> An attribute is the same whether it's linked to a point, line, or
>>> area.
>>
>> v.univar currently calculates as follows for lines and areas, even 
>> though the results are never printed (main.c):
>>
>> [lines:]
>>     l = Vect_line_length ( Points );
>>     sum += l*val;
>>     sumsq += l*val*val;
>>     sum_abs += l * fabs (val);
>>     total_size += l;
>>
>> [areas:]
>>     a = Vect_get_area_area ( &Map, area );
>>     sum += a*val;
>>     sumsq += a*val*val;
>>     sum_abs += a * fabs (val);
>>     total_size += a;
>>
>> [then, for both:]
>>     if ( (otype & GV_LINES) || (otype & GV_AREA) ) {
>>         mean = sum / total_size;
>>         mean_abs = sum_abs / total_size;
>>
>> So the mean is actually a weighted mean with the line length or area
>> as weight. I don't really know why Radim coded it like this at the
>> time, and I think we should change this so that it just uses
>> unweighted feature counts, just as Roger suggested at the time. Try
>> the attached (untested) patch.
>>
>> One thing that does potentially matter, though, is whether to use the
>> features or the attribute columns as a base. If you have several
>> features with the same cat value, this can make a difference: in
>> the former case they will all be counted individually, whereas in the
>> latter case they will only be counted once. If each of the features
>> has an individual meaning, then the former case seems more correct,
>> but if not (e.g. each island of the Philippines counted separately in
>> a table which lists population by country), the latter does. Obviously
>> we could just say that it is up to the user to make sure that the map
>> data is correct, i.e. if we take the above example, there should only
>> be one centroid linked to data per country.
>>
>> The way the routines are written in v.class, they take an arbitrary 
>> array of floats, so it is up to the individual modules to decide how 
>> to create this array.
>>
> 
> This is all very interesting. It is a bit worrisome too. I don't want a 
> mean of an attribute column weighted by area unless I specifically ask 
> for it. This suggests that people using v.univar may not be getting what 
> they think they are getting. I think it is an excellent option, but 
> should not be a silent default.

Well, since the results are not printed, the problem doesn't really 
exist. The patch I sent doesn't weight at all, just counts features.
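
To make the difference concrete, here is a minimal sketch in C. The
function names weighted_mean and unweighted_mean are made up for
illustration; this is neither the code from main.c nor the patch
itself, just the two ways of computing the mean side by side:

```c
#include <stddef.h>

/* Hypothetical helper: mean weighted by feature size (line length or
 * area), following the sum / total_size computation quoted above. */
double weighted_mean(const double *val, const double *size, size_t n)
{
    double sum = 0.0, total_size = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        sum += size[i] * val[i];
        total_size += size[i];
    }
    /* Return 0 for an empty or zero-size input rather than dividing by 0 */
    return total_size > 0.0 ? sum / total_size : 0.0;
}

/* Hypothetical helper: plain mean, every feature counted exactly once. */
double unweighted_mean(const double *val, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += val[i];
    return n > 0 ? sum / (double)n : 0.0;
}
```

With two areas carrying the values 10 and 20 and having sizes 1 and 3,
the weighted version gives 17.5, whereas the plain feature count gives
15.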

> 
> How to count the features is a bit of an issue, but couldn't this be 
> left up to the user too--summarize by cat or by individual feature as an 
> option?

That's why I think we should have a library function which calculates 
stats (i.e. extend what is in the v.class code), and let the modules 
deal with such issues.
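
As a rough sketch of that split (all names here, i.e. struct
univar_stats, array_univar and unique_by_cat, are hypothetical and not
taken from v.class or from any existing GRASS library): the statistics
routine only sees an array of doubles, and the calling module decides
whether that array holds one value per feature or one value per cat.

```c
#include <stddef.h>

/* Hypothetical result structure for basic univariate statistics. */
struct univar_stats {
    size_t n;
    double min, max, mean, variance; /* population variance */
};

/* Hypothetical library routine: statistics on an arbitrary array.
 * Returns 0 on success, -1 if the array is empty. */
int array_univar(const double *data, size_t n, struct univar_stats *out)
{
    double sum = 0.0, sq = 0.0;
    size_t i;

    if (n == 0)
        return -1;

    out->n = n;
    out->min = out->max = data[0];
    for (i = 0; i < n; i++) {
        if (data[i] < out->min) out->min = data[i];
        if (data[i] > out->max) out->max = data[i];
        sum += data[i];
    }
    out->mean = sum / (double)n;

    /* Second pass: squared deviations from the mean */
    for (i = 0; i < n; i++)
        sq += (data[i] - out->mean) * (data[i] - out->mean);
    out->variance = sq / (double)n;
    return 0;
}

/* Hypothetical module-side helper: keep the value of the first feature
 * seen for each cat, so every cat is counted only once. Writes to out
 * (which must hold at least n elements) and returns the number kept. */
size_t unique_by_cat(const int *cat, const double *val, size_t n,
                     double *out)
{
    size_t i, j, m = 0;

    for (i = 0; i < n; i++) {
        int seen = 0;

        for (j = 0; j < i; j++) {
            if (cat[j] == cat[i]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            out[m++] = val[i];
    }
    return m;
}
```

A module that wants per-feature statistics passes its values straight
to array_univar(); one that wants per-cat statistics runs them through
unique_by_cat() first. For cats {1, 1, 2} with values {100, 100, 40},
the first gives a mean of 80 over 3 features, the second a mean of 70
over 2 cats.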

Moritz