[GRASS-user] speeding up v.clean for large datasets
Markus Metz
markus.metz.giswork at gmail.com
Fri Apr 26 01:16:57 PDT 2013
On Fri, Apr 26, 2013 at 8:33 AM, Mark Wynter <mark at dimensionaledge.com> wrote:
> Thanks Markus for the explanation. I've set PostGIS as my backend. Will revert as I get more into v.net
Oops. Direct PostGIS access is 1) experimental and 2) slow. For vector
operations it is strongly recommended to use the native GRASS vector
format and to import vectors first with v.in.ogr. v.external and
v.external.out should not be used, i.e. v.external.out -g should
report format=native.
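For example, a minimal session sketch (the shapefile and map names below are hypothetical):

```shell
# Verify that the current mapset writes native GRASS vector format:
v.external.out -g          # should report: format: native
# Import (rather than link) the data into native format;
# input/output names here are just examples:
v.in.ogr input=roads.shp output=roads
```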
Markus M
>
>
>
> On 22/04/2013, at 8:20 PM, Markus Metz wrote:
>
>> On Mon, Apr 22, 2013 at 11:03 AM, Mark Wynter <mark at dimensionaledge.com> wrote:
>>> Thanks Markus.
>>> Tried the sqlite backend suggestion - no improvement - then read that sqlite is the default backend for GRASS 7.
>>> I suspect the complexity of the input dataset may be the contributing factor. For example, I ran v.clean over the already cleaned OSM dataset (2.6M lines), and it took only a few minutes since there were no intersections and no duplicates to remove.
>>
>> I tested with an OSM road vector with 2.6M lines; the output has 5.3M
>> lines: lots of intersections and duplicates, which were cleaned in less
>> than 15 minutes.
>>
>> I am surprised that you experience slow removal of duplicates;
>> breaking lines should take much longer.
>>
>> About why removing duplicates takes longer at the end: when you have 5
>> lines that could be duplicates, you could check
>>
>> 1 with 2, 3, 4, 5
>> 2 with 1, 3, 4, 5
>> 3 with 1, 2, 4, 5
>> 4 with 1, 2, 3, 5
>> 5 with 1, 2, 3, 4
>>
>> or checking each combination only once:
>>
>> 1 with 2, 3, 4, 5
>> 2 with 3, 4, 5
>> 3 with 4, 5
>> 4 with 5
>>
>> or, alternatively:
>>
>> 2 with 1
>> 3 with 1, 2
>> 4 with 1, 2, 3
>> 5 with 1, 2, 3, 4
>>
>> The current implementation uses the latter.
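The scheme above can be sketched in Python (a toy illustration using simple equality tests; the real v.clean duplicate check compares line geometries in C):

```python
# A minimal sketch (not the actual GRASS C code) of the last scheme
# quoted above: each line i is compared only against the lines
# already processed (0 .. i-1), so every pair is checked exactly
# once, and the later a line comes, the more comparisons it needs --
# which is why duplicate removal appears to slow down near the end.
def find_duplicates(lines):
    duplicates = []
    for i in range(1, len(lines)):
        for j in range(i):  # compare line i with all earlier lines
            if lines[i] == lines[j]:
                duplicates.append((j, i))
                break  # first match is enough to flag a duplicate
    return duplicates

# Worst-case total comparisons for n lines: n*(n-1)/2 -- the same
# as the forward scheme, but no pair is ever checked twice.
print(find_duplicates(["a", "b", "a", "c", "b"]))  # [(0, 2), (1, 4)]
```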
>>
>> Markus M
>>
>>>
>>>
>>>> Something is wrong there. Your dataset has 971074 roads; I tested with
>>>> an OSM dataset with 2645287 roads, 2.7 times as many as in your
>>>> dataset. Cleaning these 2645287 lines took me less than 15 minutes. I
>>>> suspect a slow database backend (dbf). Try to use sqlite as the
>>>> database backend:
>>>>
>>>> db.connect driver=sqlite database=$GISDBASE/$LOCATION_NAME/$MAPSET/sqlite/sqlite.db
>>>>
>>>> Do not substitute the variables.
>>>>
>>>> HTH,
>>>>
>>>> Markus M
>>>
>