[GRASS-user] speeding up v.clean for large datasets

Markus Metz markus.metz.giswork at gmail.com
Mon Apr 22 03:20:01 PDT 2013


On Mon, Apr 22, 2013 at 11:03 AM, Mark Wynter <mark at dimensionaledge.com> wrote:
> Thanks Markus.
> Tried the sqlite backend suggestion - no improvement - then read that sqlite is the default backend for GRASS 7.
> I suspect the complexity of the input dataset may be the contributing factor. For example, I ran v.clean over the already cleaned OSM dataset (2.6M lines), and it took only a few minutes since there were no intersections and no duplicates to remove.

I tested with an OSM road vector with 2.6M lines; the output has 5.3M
lines: lots of intersections and duplicates, all cleaned in less than
15 minutes.

I am surprised that you experience slow removal of duplicates;
breaking lines should take much longer.

As to why removing duplicates slows down towards the end: when you
have 5 lines that could be duplicates, you could check

1 with 2, 3, 4, 5
2 with 1, 3, 4, 5
3 with 1, 2, 4, 5
4 with 1, 2, 3, 5
5 with 1, 2, 3, 4

or you could check each combination only once:

1 with 2, 3, 4, 5
2 with 3, 4, 5
3 with 4, 5
4 with 5

or, equivalently, checking each line against all earlier lines:

2 with 1
3 with 1, 2
4 with 1, 2, 3
5 with 1, 2, 3, 4

The current implementation uses the last scheme: each line is checked
against all lines processed before it, so the number of checks per
line grows as cleaning progresses, and progress appears to slow down
towards the end.
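
To illustrate, here is a minimal C sketch of that scheme. It is not
the actual v.clean code; lines_equal() is a hypothetical stand-in for
the real geometry comparison, and the integers stand in for line
geometries:

#include <stdio.h>

/* placeholder: in v.clean this would compare two line geometries */
static int lines_equal(int a, int b)
{
    return a == b;
}

int main(void)
{
    int lines[] = { 10, 20, 10, 30, 20 };   /* dummy "geometries" */
    int n = (int)(sizeof(lines) / sizeof(lines[0]));
    int i, j, checks = 0;

    for (i = 1; i < n; i++) {        /* each line i ... */
        for (j = 0; j < i; j++) {    /* ... against all earlier lines */
            checks++;
            if (lines_equal(lines[i], lines[j]))
                printf("line %d duplicates line %d\n", i + 1, j + 1);
        }
    }
    printf("%d checks for %d lines (n*(n-1)/2 = %d)\n",
           checks, n, n * (n - 1) / 2);
    return 0;
}

Every pair is still tested exactly once, n*(n-1)/2 checks in total,
but the inner loop gets longer for every new line, which is why the
last lines take the longest.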

Markus M

>
>
>> Something is wrong there. Your dataset has 971074 roads; I tested with
>> an OSM dataset with 2645287 roads, 2.7 times as many as in your
>> dataset. Cleaning these 2645287 lines took me less than 15 minutes. I
>> suspect a slow database backend (dbf). Try to use sqlite as database
>> backend:
>>
>> db.connect driver=sqlite \
>>     database=$GISDBASE/$LOCATION_NAME/$MAPSET/sqlite/sqlite.db
>>
>> Do not substitute the variables; GRASS expands them at runtime for
>> the current mapset.
>>
>> HTH,
>>
>> Markus M
>

