[GRASSLIST:1032] openstreetmap data import into GRASS

Hamish hamish_nospam at yahoo.com
Mon May 8 06:52:27 EDT 2006


> I'm looking at how to import the OpenStreetMap project data into
> GRASS. The data is stored as XML;

[ http://www.openstreetmap.org ]

..
> This is a huge dataset,
..
> It is likely that they'll get up to 10 million points+ though - is
> PostGIS more appropriate to use with such a large dataset?
> 
> The idea is to use GRASS vector modules to clean up the data, do
> shortest path routing, etc.
..
> > The id links 
> >  from nodes to segments
> >  from segments to ways
> e.g.
> 
> <?xml version='1.0' encoding='UTF-8'?>
> <osm version='0.3' generator='JOSM'>
>   <node id='2222' lat='47.5118' lon='11.100' />
>   <node id='3333' lat='48.4014' lon='10.988' />
>   <node id='3334' lat='48.4014' lon='10.988'>
>     <tag k='name' v='optional name of node' />
>   </node>
> ...
>   <segment id='44444' from='2222' to='3333'>
>     <tag k='name' v='optional name of segment' />
>   </segment>
> ...
>   <way id='299353' timestamp='06-5-5 13:56:42'>
>     <seg id='44444' />
>     <seg id='44445' />
>     <seg id='44446' />
> ...
>     <tag k='name' v='A99' />
>     <tag k='motorway' v='highway' />
> ...
>   </way>
> </osm>


Hi,

I have now written a bash script prototype (attached). It is rather slow
and inefficient, and will scale poorly; using it on the whole OSM
dataset including the US TIGER data is impractical (it would take about
a week to run). Nonetheless it works* if you use it on the remaining
non-US data.

[*] see bash scripting problem at end of msg.

It is intended to demonstrate the method more clearly than a complex awk
script would, and to be understood by more readers than a C or Matlab
script. At this stage I am primarily interested in getting the method
right. For Mark II, maybe Python or Perl would be best? Or queries
against a PostGIS database? (I leave that to others who know those tools.)

> there is a rather dense osm2shp converter via awk:
>  http://article.gmane.org/gmane.comp.gis.openstreetmap/3005
[newlines have to be fixed before it will run]

I have doubts as to whether the above-linked osm2shp awk script attaches
the correct cats to the correct segments; I haven't finished processing
enough data to check, though. If the segments shapefile is loaded into
GRASS with v.in.ogr, v.build.polylines still needs to be run afterwards.
That awk script also discards the attributes.


how my script works:  (the name will change)

OSM data is split into three parts, each with its own ID code (the IDs
are not unique between the three tables):

1) nodes: x,y[,z,t,creator,..]
2) segments: made up of two (and only two) nodes (i.e. not polylines)
3) "ways" (i.e. routes): a series of segments defining a road. These
usually contain attribute data, but the field names and values appear to
be somewhat user defined* from route to route, so the only common field
useful for route planning is "name". I find "way" too close to
"waypoint", so I will try to call these "routes" to lessen confusion.

[*] e.g. multiple "Road type"s, some "one way" (but which way?), etc.
    Thus mostly useless for SQL queries/parsing without lots of cleanup.



processing:

The first step is to create three lookup tables for nodes, segs, and
routes.
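
Using the values from the example XML above, the node and segment
lookup tables end up with lines like this (note the node coordinates
are stored as x|y, i.e. lon|lat):

  planet_pt.dat:   2222|11.100|47.5118    (node|x|y)
  planet_seg.dat:  44444|2222|3333        (segment|from_node|to_node)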

Step two is to populate a "segment_id|x1|y1|x2|y2" table from the
"seg_id|from_node|to_node" and node tables. These lines are then
reformatted into GRASS vector ASCII format and loaded with v.in.ascii.
I construct the polylines from these line segments within GRASS, as I
don't trust that all segments will be ordered end-to-end in the routes.
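
Each segment becomes one GRASS ASCII "standard" format entry: a line
primitive with two vertices and one category, the category being the
segment id. For the example segment above that is:

  L 2 1
   11.100 47.5118
   10.988 48.4014
   1 44444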

Step three is to create a routes table linking each route to a
comma-separated list of its segment ids, plus attributes:
$RTE_ID|$SEGS|$NAME
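
e.g., for the example way above:

  299353|44444,44445,44446|A99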

Finally, the GRASS loop: v.extract the segments of each route defined
by the table from the last step, build a polyline from them with
v.build.polylines, and assign the route_id as the category (v.category).
Patch this into an aggregate vector. While we are at it, compose an SQL
command to connect the "name" attribute to the correct route_id cat.
Once the loop finishes, the attribute table is linked and db.execute is
run to load the values into the DB.
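
Stripped of the error checking and temp-map cleanup, each pass of the
loop is roughly this (see the attached script for the full version):

  v.extract in=planet_seg list=$SEGS out=tmp_segs_${RTE_ID}
  v.build.polylines in=tmp_segs_${RTE_ID} out=tmp_rte1_${RTE_ID}
  v.category in=tmp_rte1_${RTE_ID} out=tmp_rte_${RTE_ID} cat=$RTE_ID step=0
  g.rename vect=planet_rte,planet_rte_PREV
  v.patch in=planet_rte_PREV,tmp_rte_${RTE_ID} out=planet_rte
  echo "UPDATE planet_rte SET name='$NAME' WHERE cat=${RTE_ID};" >> planet_sql.txt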


v.clean, v.net.* etc can then be attempted.
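
For example, something along these lines might be a reasonable first
cleaning pass (untested; the tool list and thresholds will need some
experimenting, and the thresholds are in map units, i.e. degrees in a
lat/lon location):

  v.clean in=planet_rte out=planet_rte_clean tool=snap,break,rmdupl thresh=0.00001,0,0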

There is much room for improvement here, but I think it's a good start.


problem:

ERRORCODE=0
while [ $ERRORCODE -eq 0 ] ; do
  read LINE
  ERRORCODE=$?
  if [ $ERRORCODE -ne 0 ] || [ -z "$LINE" ] ; then
    continue
  fi
  if [ -n "`echo "$LINE" | grep '<way id='`" ] ; then
    do_minimal_stuff
  fi
done < 100mb_file.txt

bash takes 800MB of RAM (that's OK, I've got plenty, no swapping) but
runs *incredibly* slowly. Like 80486-SX slowly.

Why is that? What's a better way of working through lines containing
spaces? Set the file up as a file descriptor to pass to `read` instead
of via the redirect?
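
One idea (untested) is to avoid forking echo+grep for every line and
let the shell's own `case` pattern matching do the work, e.g.:

  while IFS='' read -r LINE ; do
     case "$LINE" in
        *'<way id='*) do_minimal_stuff ;;
     esac
  done < 100mb_file.txt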


Hamish
-------------- next part --------------
#!/bin/sh
# Open Street Map to GRASS GIS 6 vector import script
#   by Hamish Bowman, Dunedin, New Zealand, 8 May 2006
#   (c) 2006 Hamish Bowman, and the GRASS Development Team
#   Licensed under the GPL v>=2, see GPL.txt in the GRASS source code
#
#  http://www.openstreetmap.org
#  http://grass.itc.it
#
# This script is highly inefficient and scales poorly.
# It is intended to demonstrate the method more clearly than a complex awk script
#  and to be understood by more readers than a C or Matlab script.
#  Maybe Python or Perl would be best?
#  Or via queries to a PostGIS database?

#### create node and segment tables
#node|x|y
cat planet.osm | grep "node id=" | sed \
  -e 's/  <node id="//' -e 's/" lat=//'  -e 's/" lon=//' \
  -e 's/">$//' -e 's+"/>$++'  | tr '"' '|' | \
  awk -F'|' '{print $1 "|" $3 "|" $2}'> planet_pt.dat

#segment|node_from|node_to
cat planet.osm | grep "segment id=" | sed \
  -e 's/  <segment id="//' -e 's/" from=//'  -e 's/" to=//' \
  -e 's/">$//' -e 's+"/>$++'  | tr '"' '|' > planet_seg.dat


#### create segment table fully populated with coordinates
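# NB: grep'ing the full node table once for every segment is the main
#  bottleneck here -- roughly O(number_of_segments * number_of_nodes)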
num_segs=`wc -l planet_seg.dat | cut -f1 -d' '`
i=1
echo "segment|x1|y1|x2|y2" > planet_seg_lines.dat
for LINE in `cat planet_seg.dat` ; do
   SEG_ID=`echo "$LINE" | cut -f1 -d'|'`
   FROM_NODE=`echo "$LINE" | cut -f2 -d'|'`
   TO_NODE=`echo "$LINE" | cut -f3 -d'|'`
#   echo "seg $SEG_ID from $FROM_NODE   to $TO_NODE"
   FROM_COORD=`grep "^${FROM_NODE}|" planet_pt.dat | cut -f2,3 -d'|'`
   TO_COORD=`grep "^${TO_NODE}|" planet_pt.dat | cut -f2,3 -d'|'`

   if [ 0 -eq `echo $i | awk '{print $1 % 1000}'` ] ; then
      echo "seg $i of $num_segs"
   fi
   echo "$SEG_ID|$FROM_COORD|$TO_COORD" >> planet_seg_lines.dat

   i=`expr $i + 1`
done


#### create GRASS vector ascii file containing line segments
echo "# Open Street Map segments" > planet_seg.vasc
i=1
for LINE in `cat planet_seg_lines.dat | tail -n+2` ; do
   SEG_ID=`echo "$LINE" | cut -f1 -d'|'`
   FROM_COORD=`echo "$LINE" | cut -f2,3 -d'|' | tr '|' ' '`
   TO_COORD=`  echo "$LINE" | cut -f4,5 -d'|' | tr '|' ' '`
   echo "L 2 1" >> planet_seg.vasc
   echo " $FROM_COORD" >> planet_seg.vasc
   echo " $TO_COORD" >> planet_seg.vasc
   echo " 1 $SEG_ID" >> planet_seg.vasc
   i=`expr $i + 1`
done

v.in.ascii -t in=planet_seg.vasc out=planet_seg format=standard

#### create routes table linking to segment ids
num_rte=`grep '<way id=' planet.osm | wc -l` #4475
num_inlines=`wc -l planet.osm | awk '{print $1}'`
echo "# Open Street Map routes (aka ways)" > planet_rte.dat
echo "# way_id|segments|name" >> planet_rte.dat
i=0
j=1
in_rte=0
EXITCODE=0
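# NB: 'read -n' below is a bash extension, so this assumes /bin/sh is bash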
while [ $EXITCODE -eq 0 ] ; do
   unset LINE
   read -n 256 LINE
   EXITCODE=$?
   if [ $EXITCODE -ne 0 ] || [ -z "$LINE" ] ; then
      continue
   fi

   if [ 0 -eq `echo $j | awk '{print $1 % 500}'` ] ; then
      echo "processing line $j of $num_inlines"
   fi
   j=`expr $j + 1`

   if [ -n "`echo "$LINE" | grep '<way id='`" ] ; then
      i=`expr $i + 1`
      RTE_ID=`echo "$LINE" | cut -f2 -d'"'`
      in_rte=1
      unset SEGS
      unset seg
      unset NAME
      continue
   fi
   if [ $in_rte -eq 1 ] ; then
      if [ -n "`echo "$LINE" | grep '<seg id='`" ] ; then
         seg=`echo "$LINE" | cut -f2 -d'"'`
         SEGS="${SEGS},$seg"
         continue
      fi
      if [ -n "`echo "$LINE" | grep -i 'k="name"'`" ] ; then
         NAME=`echo "$LINE" | sed -e 's/.* v=//' | cut -f2 -d'"'`
         continue
      fi
      if [ -n "`echo "$LINE" | grep '</way>'`" ] ; then
         SEGS=`echo $SEGS | sed -e 's/^,//'`
         echo "$RTE_ID|$SEGS|$NAME" >> planet_rte.dat
         in_rte=0
         unset RTE_ID
         if [ 0 -eq `echo $i | awk '{print $1 % 100}'` ] ; then
            echo "route $i of $num_rte"
         fi
      fi
   fi
done < planet.osm

#### create SQL command list to populate route Database
max_name=`cat planet_rte.dat | cut -f3 -d'|' | wc --max-line-length`
start=1
rm -f planet_sql.txt
EXITCODE=0
while [ $EXITCODE -eq 0 ] ; do
   read LINE
   EXITCODE=$?
   if [ $EXITCODE -eq 1 ] || [ -z "$LINE" ] ; then
      continue
   fi
   if [ "`echo "$LINE" | cut -c1`" = "#" ] ; then
         continue
   fi

   RTE_ID=`echo "$LINE" | cut -f1 -d'|'`
   SEGS=`echo "$LINE" | cut -f2 -d'|'`
   NAME=`echo "$LINE" | cut -f3 -d'|'`

   num_segs="`echo "$SEGS" | sed -e 's/[0-9]//g'`"
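   # (the digits were stripped above; the leftover commas, plus the leading
   #  comma printed below, give a quick visual count of segments in the route)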
   echo
   echo "Route $RTE_ID  [$NAME]  [,$num_segs]"

   v.extract in=planet_seg list=$SEGS out=tmp_segs_${RTE_ID}

   NUM_VALID_LINES=`v.info tmp_segs_${RTE_ID} | grep "Number of lines" | cut -f2 -d: | awk '{print $1}'`
   if [ "$NUM_VALID_LINES" -eq 0 ] ; then
      echo " no data"
      g.remove vect=tmp_segs_${RTE_ID}
      continue
   fi

   v.build.polylines -q in=tmp_segs_${RTE_ID} out=tmp_rte1_${RTE_ID}
   v.category in=tmp_rte1_${RTE_ID} out=tmp_rte_${RTE_ID} cat=$RTE_ID step=0
   if [ $start -eq 1 ] ; then
      g.rename vect=tmp_rte_${RTE_ID},planet_rte
      start=0
   else
      g.rename vect=planet_rte,planet_rte_PREV
      v.patch in=planet_rte_PREV,tmp_rte_${RTE_ID} out=planet_rte
   fi
   # record the name attribute for every route, escaping single quotes for SQL
   if [ -n "$NAME" ] ; then
      NAME=`echo "$NAME" | sed -e "s/'/''/g"`
      echo "UPDATE planet_rte SET name='$NAME' WHERE cat=${RTE_ID};" >> planet_sql.txt
   fi
   g.remove vect=tmp_segs_${RTE_ID},tmp_rte1_${RTE_ID}
   g.remove vect=tmp_rte_${RTE_ID}
   g.remove vect=planet_rte_PREV

done < planet_rte.dat

# seg list may be >80,200 chars so skip adding it to DBF table
v.db.addtable map=planet_rte column="name varchar($max_name)"
db.execute input=planet_sql.txt

