[GRASSLIST:5979] Re: speeding up v.in.ogr

Wed Mar 2 21:33:06 EST 2005

Chris:
>>> I have been trying to import a detailed shapefile of the Florida
>>> coast (about 24M) into grass6.0b2, using v.in.ogr. Unfortunately,
>>> having started 3 days ago, the import has not yet completed. It is
>>> making progress, as the number of intersections is incrementing,
>>> however this surely should not be taking quite this long. Are there
>>> ways of speeding up the process?
>>> 
>>> I am on a 1GHz OS X Powerbook with 1GB RAM.

Radim:
>> Is it one or a few big shapes, or many small ones?

Chris:
> It is the Florida coastline, so it should be one large shape, with
> plenty of small islands scattered alongside.

Radim:
>> Then the problem is that the bounding box of that long boundary
>> intersects all the islands, and v.in.ogr will try to break those
>> lines, which takes a long time. This was already discussed here. Try
>> modifying v.in.ogr so that it writes long boundaries as several
>> shorter parts.

It was discussed off list; the correspondence (including a patch) follows.

I only come up against this myself every couple of months, so I have just
let it be slow, but I've got a detailed coastline vector map with the
occasional offshore island. It is topologically clean, but has the same
long processing times.

Hamish

-------------------------------------------------------------------------

From: Radim Blazek <blazek at itc.it>
Subject: Re: polygon cleaning
Date: Mon, 9 Feb 2004 12:27:15 +0100
To: Hamish <hamish_nospam at yahoo.com>

On Wednesday 04 February 2004 23:42, you wrote:
> Hi Radim,
>
>
> I was just wondering if I was getting the expected behaviour out of
> v.in.ogr or if something was going wrong.
>
> Some feedback for you about how well things scale for very big files
> anyway..
>
>
> I've got a big shapefile (100mb shp, 12mb dbf) of all the forested areas in
> my country. The import went smoothly, but during the "Break boundaries"
> stage things seemed to get exponentially slower after about 12,000 lines.
>
> On a new 2.8GHz Pentium4 it took about 36 hours to get through this single
> step; the rest of the import went pretty quickly.
> [It used 735mb RAM, but I have 2gig RAM, so no swapping]
>
>
> Is "Break boundaries" inherently exponential, or can the algorithm be
> improved?
>
>
> An import of the 622mb shapefile (topographic contour lines) went pretty
> quickly, by the way (minutes).

36 hours is very bad, and it seems strange: importing
my shapefile of circa 140 MB (shp), 267117 boundaries in GRASS,
with v.in.ogr takes 91 minutes on my 1.5GHz machine.
"Break boundaries" means Vect_break_lines(). It is of course possible
to improve everything, but the most time-consuming problem is already solved, I think.
Vect_break_lines() uses a spatial index twice: first to find all lines within the bounding box
of the processed line (line A) that could intersect it, then a second spatial index is built for
all segments in line B. This way, there should be no exponential dependency on input.

Could you try to localise the problem somehow? Perhaps select only the lines above 12,000.
Does v.clean tool=break also take such a long time?

Radim
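
[Editor's sketch, not part of the original mail.] The first filtering
stage described above, selecting candidate lines by bounding box, can be
illustrated roughly as follows. This is a minimal standalone sketch, not
the GRASS implementation: the Box type and the functions box_of(),
boxes_overlap() and count_candidates() are hypothetical names, and a
plain linear scan stands in for the spatial index.

```c
#include <assert.h>
#include <stddef.h>

/* Axis-aligned bounding box (hypothetical type, not the GRASS one). */
typedef struct { double w, s, e, n; } Box;

/* Bounding box of a polyline given as parallel x/y arrays. */
Box box_of(const double *x, const double *y, size_t n)
{
    assert(n >= 1);                 /* an empty line has no box */
    Box b = { x[0], y[0], x[0], y[0] };
    for (size_t i = 1; i < n; i++) {
        if (x[i] < b.w) b.w = x[i];
        if (x[i] > b.e) b.e = x[i];
        if (y[i] < b.s) b.s = y[i];
        if (y[i] > b.n) b.n = y[i];
    }
    return b;
}

/* Two boxes overlap unless one lies entirely to one side of the other. */
int boxes_overlap(Box a, Box b)
{
    return a.w <= b.e && b.w <= a.e && a.s <= b.n && b.s <= a.n;
}

/* Count the boxes that would be passed on to the exact intersection
 * test for query box q. A linear scan replaces the spatial index here. */
size_t count_candidates(Box q, const Box *boxes, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (boxes_overlap(q, boxes[i]))
            hits++;
    return hits;
}
```

A line that spans the whole map has a box that overlaps every other
box, so for that line the filter rejects nothing; a short line only
picks up its neighbours.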

>
>
> ?,
> cheers,
> Hamish
>
>
>
> Here's the output:
>
> G:topo4_nztm > v.in.ogr dsn=. layer=native_poly out=native_poly
> WARNING: Datum 'NZGD_2000' not recognised by GRASS and no parameters found.
>          Datum transformation will not be possible using this projection
>          information.
> Layer: native_poly
> WARNING: Area size 1.2e-09, area not imported.
> WARNING: 129 features without geometry.
> -----------------------------------------------------
> 142948 primitives registered
> 142906 areas built
> 142868 isles built
> Number of nodes     :   142890
> Number of primitives:   142948
> Number of points    :   0
> Number of lines     :   0
> Number of boundaries:   142948
> Number of centroids :   0
> Number of areas     :   142906
> Number of isles     :   142868
> Number of incorrect boundaries   :   44
> Number of areas without centroid :   142906
> -----------------------------------------------------
> WARNING: Cleaning polygons, result is not guaranteed!
> Building topology ...
> Number of nodes     :   142890
> Number of primitives:   142948
> Number of points    :   0
> Number of lines     :   0
> Number of boundaries:   142948
> Number of centroids :   0
> Number of areas     :   -
> Number of isles     :   -
> -----------------------------------------------------
> Snap boundaries (threshold = 1.000e-03):
> All vertices: 5815998
> Registered points (unique coordinates): 5671966
> Nodes marked as anchor     : 5671439
> Nodes marked to be snapped :   527
> Snapped vertices :   554
> New vertices     :   113
> -----------------------------------------------------
> Break polygons:
> Registering points ... 5671439
> All points (vertices): 5815678
> Registered points (unique coordinates): 5671439
> Points marked for break: 143327
> Breaks:  1930
> -----------------------------------------------------
> Remove duplicates:
> Duplicates:   396
> -----------------------------------------------------
> Break boundaries:
> Intersections:     4
> -----------------------------------------------------
> Remove duplicates:
> Duplicates:     4
> -----------------------------------------------------
> Change dangles to lines:
> Removed dangles:     5  removed lines:     5
> -----------------------------------------------------
> Remove bridges:
> Removed bridges:     0  removed lines:     0
> -----------------------------------------------------
> Building topology ...
> 142973 areas built
> 142496 isles built
> Number of nodes     :   143782
> Number of primitives:   145340
> Number of points    :   0
> Number of lines     :   0
> Number of boundaries:   145340
> Number of centroids :   0
> Number of areas     :   142973
> Number of isles     :   142496
> Number of areas without centroid :   142973
> Layer: native_poly
> -----------------------------------------------------
> Building topology ...
> -----------------------------------------------------
> 257057 primitives registered
> 142973 areas built
> 142496 isles built
> Number of nodes     :   256580
> Number of primitives:   257057
> Number of points    :   0
> Number of lines     :   0
> Number of boundaries:   143803
> Number of centroids :   113254
> Number of areas     :   142973
> Number of isles     :   142496
> Number of areas without centroid :   29719
> -----------------------------------------------------
> WARNING: 3 areas represet more (overlapping) features, because polygons
>          overlap in input layer(s). Such areas are linked to more than 1
> row in attribute table. The number of features for those areas is stored as
> category in field 2.
> 113255 input polygons
> total area: 7.250188e+10 (142973 areas)
> overlapping area: 3.488045e+04 (3 areas)
> area without category: 4.978083e+09 (29719 areas)
>
>
> [Finish]
>
> all 4 intersections were somewhere in the last 5000 or so lines
> processed.

From: Radim Blazek <blazek at itc.it>
Subject: Re: polygon cleaning
Date: Tue, 10 Feb 2004 10:47:32 +0100
To: Hamish <hamish_nospam at yahoo.com>

On Monday 09 February 2004 14:23, you wrote:
> > Could you try to localise the problem somehow?
>
> is it useful to run with DEBUG=2?

I don't think so.

> > Does v.clean tool=break also take such a long time?
>
> running now, let you know tomorrow..

> I just broke out of the 'v.clean tool=break' after 90 minutes and 25147
> of ~143000 lines processed. I think it would go the full 36 hours if I
> left it.

> BTW, it took 5 minutes just to copy(!):

It is because it copies line by line and builds topology. BTW,
which db driver? Postgres should be faster than DBF.

> Number of boundaries:   143803
> Number of centroids :   113254
> Number of areas     :   142973
> Number of isles     :   142496

There must be some mystery in your data. I have tried 
v.clean tool=break on my 140MB shape, 1.5GHz CPU
Number of boundaries:   267117
Number of centroids:     97314
Number of areas:         98124  
Number of islands:       21197   
and it takes 
real    17m0.085s
user    15m56.800s
sys     0m46.530s

What is strange is that the number of boundaries in your map is almost equal to
the number of areas and to the number of isles. That means that the
areas are isolated and do not share common boundaries, is that right?
But I don't see that as a reason for the cleaning to be so slow.

Could you try to extract just a smaller part of your map
(v.select, v.extract) and run v.clean on it, to see whether it still takes
a proportional share of those 36 hours?
Try to find out somehow which type of data causes the problem.
For now I have just one idea: because the areas are probably isolated,
there could be one BIG boundary around the whole map
(I think that something like that exists in ArcInfo).
In that case, when this boundary is processed, it selects the
lines which could intersect it by bounding box, which means ALL
lines, and checking the intersection of ALL segments of that BIG line
with ALL segments of ALL other lines can take a very long time.
Is that a clear explanation?

Radim
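
[Editor's sketch, not part of the original mail.] To put rough numbers
on the worst case described above, the function below simply multiplies
out the segment-against-segment tests. It deliberately ignores the
per-segment spatial index mentioned earlier in the thread, and the name
pair_tests() and all figures in the note that follows are made up for
illustration; they are not measurements from either map.

```c
#include <assert.h>
#include <limits.h>

/* Number of segment-pair tests when one line of big_segs segments is
 * compared with n_lines candidate lines of segs_each segments each,
 * assuming every pair is tested (no per-segment index). */
unsigned long long pair_tests(unsigned long long big_segs,
                              unsigned long long n_lines,
                              unsigned long long segs_each)
{
    unsigned long long per_big_seg = n_lines * segs_each;

    /* Guard against unsigned overflow in either multiplication. */
    assert(n_lines == 0 || per_big_seg / n_lines == segs_each);
    assert(per_big_seg == 0 || big_segs <= ULLONG_MAX / per_big_seg);

    return big_segs * per_big_seg;
}
```

With an invented boundary of 100000 segments whose box covers 1000
other lines of 40 segments each, pair_tests(100000, 1000, 40) gives
4000000000 tests. If the same boundary were cut into 1000 parts of 100
segments, and each part's box covered only 2 of those lines, the total
would be 1000 * pair_tests(100, 2, 40) = 8000000.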

From: Radim Blazek <blazek at itc.it>
Subject: Re: polygon cleaning
Date: Wed, 11 Feb 2004 15:40:07 +0100
To: Hamish <hamish_nospam at yahoo.com>

On Wednesday 11 February 2004 10:25, you wrote:
> I'll try extracting the ones around that green & yellow PNG as there are
> only a few there.
>
> > Try somehow find the type of data causing the problem.
> > Now I have just one idea, because the areas are probably isolated,
> > it could be that exist one BIG boundary around all the map
> > (I think that something like that exists in ArcInfo),
> > in that case, when this boundary is processed, it selects
> > lines which could intersect it by bounding box, in that case ALL
> > lines, and to check intersection of ALL segments of that BIG line
> > with ALL segments of ALL other lines can take a very long time.
> > Is it clear explanation?
>
> Yes, clear explanation, but not the case.

Why not? I think it is. Not just one BIG boundary, but many big ones. The problem is
that the areas are big and do not share boundaries, so the bounding box of
one such big area is big and selects many other boundaries.

I have got an idea: split the boundaries into smaller parts, something like

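/* Rewrite every line as parts of at most maxvertex vertices (maxvertex >= 2). */
for (line = 1; line <= Vect_get_num_lines(In); line++) {
    type = Vect_read_line(In, Points, Cats, line);
    v = 0;
    while (v < Points->n_points) {
        Vect_reset_line(OPoints);
        /* Copy up to maxvertex vertices, starting at vertex v. */
        for (i = 0; i < maxvertex && v < Points->n_points; i++, v++)
            Vect_append_point(OPoints, Points->x[v], Points->y[v], 0);
        Vect_write_line(Out, type, OPoints, Cats);
        /* Start the next part on the last vertex written, so parts stay connected. */
        if (v < Points->n_points)
            v--;
    }
}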

Could you try it? If it helps, it would be reasonable to add this splitting 
to v.in.ogr / v.clean.

Radim
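
[Editor's sketch, not part of the original mail.] The splitting idea
above can be tried without the GRASS library by computing only the
vertex index ranges of the parts. This is a minimal sketch under stated
assumptions: the Part type and split_line() are hypothetical names, and
consecutive parts are assumed to share one vertex so that the split
line stays connected.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive vertex index range of one part (hypothetical helper type). */
typedef struct { size_t first, last; } Part;

/* Split a line of n_points vertices into parts of at most maxvertex
 * vertices. Consecutive parts share one vertex. Up to max_parts ranges
 * are stored in out; the return value is the number of parts needed. */
size_t split_line(size_t n_points, size_t maxvertex,
                  Part *out, size_t max_parts)
{
    size_t count = 0, v = 0;

    assert(maxvertex >= 2);     /* a part needs room to advance */
    if (n_points == 0)
        return 0;

    do {
        size_t last = v + maxvertex - 1;
        if (last > n_points - 1)
            last = n_points - 1;        /* clamp the final part */
        if (count < max_parts) {
            out[count].first = v;
            out[count].last = last;
        }
        count++;
        v = last;                       /* next part starts here */
    } while (v < n_points - 1);

    return count;
}
```

For a line of 7 vertices and maxvertex = 3 this yields the ranges 0-2,
2-4 and 4-6. Each part then has its own, smaller bounding box.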

From: Radim Blazek <blazek at itc.it>
Subject: Re: polygon cleaning
Date: Wed, 11 Feb 2004 15:41:44 +0100
To: Hamish <hamish_nospam at yahoo.com>

On Wednesday 11 February 2004 10:43, you wrote:
> ok, figured out v.select+v.in.region.
>
> extracted the small island (gray/pink "d.vect -c" PNG from prev. email)
>
> Still seems a little slow, 3 min for 500 areas, but FYI:

Yes, v.overlay is slow because it breaks all lines. Also, if you need
v.overlay between a big and a small vector map (in terms of area size),
it is better to run v.select first.

Radim