[GRASS-dev] GRASS usage on a cluster: thread safety

Markus Neteler neteler at osgeo.org
Sun Jan 9 06:25:39 EST 2011


Hi,

I am using GRASS 6.4 on a cluster to process maps in parallel.
Each job runs as a batch job in its own mapset. The jobs
are launched via 'qsub' of Grid Engine, which
sends them from the frontend to the various blades of
the cluster. The grassdata directory is shared via
NFS on all blades; the underlying filesystem is XFS.

Unfortunately, when the tasks within the batch job are fast,
various mysterious errors occur at random:

...
./launch_SGE_grassjob_MODIS_filt2.sh.e255664:ERROR: /usr/local/grass-6.4.1svn/etc/lock:
./launch_SGE_grassjob_MODIS_filt2.sh.e254776:ERROR: /usr/local/grass-6.4.1svn/etc/lock:
./launch_SGE_grassjob_MODIS_filt2.sh.e256639:ERROR: G_getenv(): Variable LOCATION_NAME not set
./launch_SGE_grassjob_MODIS_filt2.sh.e257016:ERROR: Unable to make mapset element .tmp/blade08
./launch_SGE_grassjob_MODIS_filt2.sh.e255264:ERROR: MAPSET terra_lst1km20010420.LST_Night_1km.filt.255184 not found
./launch_SGE_grassjob_MODIS_filt2.sh.e256528:ERROR: MAPSET terra_lst1km20021203.LST_Night_1km.filt.256430 not found
./launch_SGE_grassjob_MODIS_filt2.sh.e256254:ERROR: MAPSET terra_lst1km20021207.LST_Night_1km.filt.256434 not found
./launch_SGE_grassjob_MODIS_filt2.sh.e256415:ERROR: /usr/local/grass-6.4.1svn/etc/lock:
./launch_SGE_grassjob_MODIS_filt2.sh.e254717:ERROR: Unable to make mapset element .tmp/blade02
./launch_SGE_grassjob_MODIS_filt2.sh.e256033:ERROR: Unable to make mapset element .tmp/blade07
./launch_SGE_grassjob_MODIS_filt2.sh.e256722:ERROR: Unable to make mapset element .tmp/blade07
./launch_SGE_grassjob_MODIS_filt2.sh.e257642:ERROR: Unable to make mapset element .tmp/blade11
./launch_SGE_grassjob_MODIS_filt2.sh.e255185:ERROR: Unable to make mapset element .tmp/blade03
./launch_SGE_grassjob_MODIS_filt2.sh.e254745:ERROR: Unable to make mapset element .tmp/blade02
./launch_SGE_grassjob_MODIS_filt2.sh.e255088:ERROR: G_getenv(): Variable LOCATION_NAME not set
./launch_SGE_grassjob_MODIS_filt2.sh.e256473:ERROR: Unable to make mapset element .tmp/blade08
./launch_SGE_grassjob_MODIS_filt2.sh.e257003:ERROR: Unable to make mapset element .tmp/blade12
./launch_SGE_grassjob_MODIS_filt2.sh.e257257:ERROR: MAPSET aqua_lst1km20031222.LST_Night_1km.filt.257168 not found
./launch_SGE_grassjob_MODIS_filt2.sh.e257696:ERROR: MAPSET terra_lst1km20030310.LST_Night_1km.filt.257607 not found
...

I wonder how this can happen if the jobs are launched
independently.
About 10-20% of the jobs are affected (n=13600). My
current approach is to relaunch the erroneous
jobs until all are done, but that's rather annoying...
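
For illustration, the failed jobs could be collected for relaunching roughly as follows. This is only a minimal sketch: the function name failed_job_ids is hypothetical, and it assumes that the Grid Engine stderr files are named <script>.e<jobid> as in the listing above and that every failed job leaves an "ERROR:" line in its stderr file.

```shell
#!/bin/sh
# Hypothetical helper: print the IDs of jobs whose stderr file contains "ERROR:".
# Assumption: stderr files are named <script>.e<jobid>, as in the listing above.
failed_job_ids() {
    logdir=${1:-.}
    # grep -l prints only the names of files with at least one matching line
    grep -l 'ERROR:' "$logdir"/*.e[0-9]* 2>/dev/null |
        # keep only the numeric job ID that follows the final ".e"
        sed 's/.*\.e\([0-9][0-9]*\)$/\1/' |
        sort -n -u
}
```

Calling failed_job_ids with the directory that holds the stderr files prints one job ID per line. How an ID maps back to the map that has to be reprocessed depends on the launch script and is not covered here.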

How can I track this down?

Markus

