[GRASS-user] speeding up v.clean for large datasets

Markus Metz markus.metz.giswork at gmail.com
Sun Apr 21 01:24:11 PDT 2013


On Sat, Apr 20, 2013 at 3:02 AM, Mark Wynter <mark at dimensionaledge.com> wrote:
> Thanks Markus.
> Upgraded to GRASS 7 and re-ran v.clean on the same OSM Australia dataset.
> Substantially faster.  The bulk of the time was spent removing duplicates, and it became exponentially slower as the process approached 100%.  Overall it took 12 hours, but I'm wondering how it would perform if we were to repeat v.clean for even larger road networks, e.g. the USA or Europe?

Something is wrong there. Your dataset has 971074 roads; I tested with
an OSM dataset with 2645287 roads, 2.7 times as many as in your
dataset. Cleaning these 2645287 lines took me less than 15 minutes. I
suspect a slow database backend (dbf). Try using sqlite as the
database backend:

db.connect driver=sqlite database=$GISDBASE/$LOCATION_NAME/$MAPSET/sqlite/sqlite.db

Do not substitute the variables; GRASS expands them itself at runtime.
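
To confirm that the new backend is active, you can print the current
connection settings:

db.connect -p

Note that db.connect only sets the default for newly created maps;
maps that already exist keep their previous attribute connection.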

HTH,

Markus M

>
> I'm tempted to try dividing the input dataset into, say, 4 smaller subregions (i.e. vector tiles), and then patching them back together.
> I read that we will still need to run v.clean over the patched dataset to remove duplicates.
> Since the only duplicates should be nodes along the common tile edges, is there a way to, in effect, constrain the v.clean process to slivers containing the common edges?
> I've had a quick go with g.region, but to no avail.
>
> Thanks
>
> GRASS 7.0.svn (PERMANENT):/data/grassdata > v.clean input=osm_roads_split output=osm_roads_split_cleaned tool=break type=line -c
> --------------------------------------------------
> Tool: Threshold
> Break: 0
> --------------------------------------------------
> Copying vector features...
> Copying features...
>  100%
> Rebuilding parts of topology...
> Building topology for vector map <osm_roads_split_cleaned@PERMANENT>...
> Registering primitives...
> 971074 primitives registered
> 13142529 vertices registered
> Number of nodes: 1458192
> Number of primitives: 971074
> Number of points: 0
> Number of lines: 971074
> Number of boundaries: 0
> Number of centroids: 0
> Number of areas: -
> Number of isles: -
> --------------------------------------------------
> Tool: Break lines at intersections
>  100%
> Tool: Remove duplicates
>  100%
> --------------------------------------------------
> Rebuilding topology for output vector map...
> Building topology for vector map <osm_roads_split_cleaned@PERMANENT>...
> Registering primitives...
> 2462829 primitives registered
> 13322052 vertices registered
> Building areas...
>  100%
> 0 areas built
> 0 isles built
> Attaching islands...
> Attaching centroids...
>  100%
> Number of nodes: 1819237
> Number of primitives: 2462829
> Number of points: 0
> Number of lines: 2462829
> Number of boundaries: 0
> Number of centroids: 0
> Number of areas: 0
> Number of isles: 0
>
>
>
>
>
> On 19/04/2013, at 6:07 PM, Markus Metz wrote:
>
>> On Fri, Apr 19, 2013 at 9:06 AM, Mark Wynter <mark at dimensionaledge.com> wrote:
>>> Hi All, we're looking for ways to speed up the cleaning of a large OSM road network (covering Australia).  We're running on a large Amazon AWS EC2 instance.
>>>
>>> What we've observed is exponential growth in the time taken as the number of linestrings increases.
>>>
>>> This means it's taking about 3 days to clean the entire network.
>>>
>>> We were wondering: if we were to split the dataset into, say, 4 subregions and clean each separately, is it then possible to patch them back together at the end without having to run v.clean afterwards?  We want to be able to run v.net over the entire network spanning the subregions.
>>>
>>> Alternatively, has anyone found a way to speed up v.clean for large network datasets?
>>
>> Yes, implemented in GRASS 7 ;-)
>>
>> Also, when breaking lines it is recommended to first split the lines
>> into smaller segments with v.split using the vertices option. Then run
>> v.clean tool=break. After that, use v.build.polylines to merge the
>> lines again. Or, in GRASS 7, use the -c flag with v.clean tool=break
>> type=line: the rmdupl tool is then added automatically, and the
>> splitting and merging are done internally. A sketch of both workflows
>> follows below.
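>>
>> A minimal sketch of both workflows (the map names here are
>> placeholders):
>>
>> # GRASS 6: split long lines, break at intersections, merge back
>> v.split input=roads output=roads_split vertices=10
>> v.clean input=roads_split output=roads_break tool=break
>> v.build.polylines input=roads_break output=roads_clean
>>
>> # GRASS 7: one step; splitting and merging are handled internally
>> v.clean input=roads output=roads_clean tool=break type=line -c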
>>
>> Markus M
>

