[GRASS-stats] Loading a point-vector table with 466 columns

Wed May 27 04:51:34 EDT 2009

On Wed, 27 May 2009, Roger Bivand wrote:

> On Tue, 26 May 2009, Nikos Alexandris wrote:
>
>> (Cc to Even Rouault; apologies to Even, since he is not subscribed to
>> the list)
>> 
>> Roger:
>>>>> Three minutes instead of thirty+ suggests that the OGR
>>>>> plugin has trouble with SQLite as the DB format. So maybe
>>>>> the default for plugin= should be FALSE, not NULL and automatic
>>>>> use if present?
>> 
>> --%<--
>>> Could you, Nikos,
>>> make a script generating a similar table in spearfish, and two small
>>> scripts exercising the problem (export to R with the plugin, and with
>>> the temporary shapefile)?
>> 
>> * The "problem" also exists with the default DBF as a back-end. I
>> created 1000 random points, filled less than half of the records with
>> random numbers, and readVECT6("x", plugin=TRUE) again takes too long. I
>> interrupted the process after it had been running for more than 20 mins.
>
> OK. With 250 rows and 250 columns, I see an order of magnitude saving with 
> plugin=FALSE. In plugin=FALSE, the times are split equally between writing 
> the temporary file from GRASS with v.out.ogr, and reading it into R with 
> readOGR(), as one might expect (that is all readVECT6(..., plugin=FALSE) is 
> doing). Even on a small vector (bugsites, 90 points, 2 attribute columns), 
> plugin=FALSE is faster than plugin=TRUE by about 0.75 : 1.35, not quite 
> twice. Which way does the problem scale, in numbers of features, numbers of 
> attribute columns, or both?
>
> Next script in R generating increasing NR and NC cases through writeVECT6() 
> to test plugin=FALSE/plugin=TRUE ratios?

And we also need to check whether the same applies to use of the plugin in 
other settings - I'm pretty certain this isn't just differential behaviour 
in readOGR() between the OGR shapefile driver and the OGR GRASS vector 
driver. Could someone test v.out.ogr against ogr2ogr using the plugin?
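
A minimal sketch of that comparison, to be run inside a GRASS session. It
assumes the map is called random_points_100 (as in the script below), that
ogr2ogr was built with the GRASS plugin, and that whole-second timing is
good enough. time_cmd is a hypothetical helper, not part of GRASS or GDAL,
and the block is skipped when GRASS or ogr2ogr is not available.

```shell
#!/bin/sh
# time_cmd LABEL COMMAND [ARGS...]
# Runs COMMAND, discards its output, and prints "LABEL: N s (exit status S)".
time_cmd() {
    label=$1
    shift
    start=$(date +%s)
    if "$@" >/dev/null 2>&1; then status=0; else status=$?; fi
    end=$(date +%s)
    echo "${label}: $((end - start)) s (exit status ${status})"
}

MAP=random_points_100   # assumption: map created by the quoted script

# Only attempt the comparison inside GRASS with both tools on the PATH.
if [ -n "${GISBASE:-}" ] && command -v v.out.ogr >/dev/null 2>&1 \
   && command -v ogr2ogr >/dev/null 2>&1; then
    GISDBASE=$(g.gisenv get=GISDBASE)
    LOCATION=$(g.gisenv get=LOCATION_NAME)
    MAPSET=$(g.gisenv get=MAPSET)
    # The OGR GRASS driver takes the vector's "head" file as its data source.
    HEAD="${GISDBASE}/${LOCATION}/${MAPSET}/vector/${MAP}/head"
    OUT=$(mktemp -d)    # scratch directory for both exports

    # 1. GRASS writes the shapefile itself (no plugin involved).
    time_cmd "v.out.ogr" v.out.ogr input="${MAP}" type=point \
        dsn="${OUT}/native" format=ESRI_Shapefile

    # 2. OGR reads the GRASS vector through the plugin and writes the shapefile.
    time_cmd "ogr2ogr (GRASS plugin)" ogr2ogr -f "ESRI Shapefile" \
        "${OUT}/plugin" "${HEAD}"
fi
```

If the ogr2ogr line is much slower than the v.out.ogr line, that would point
at the plugin itself rather than at readOGR().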

Roger

>
> Roger
>
>> 
>> * A script is pasted on the bottom which has a small "bug" (details
>> below) :-)
>> 
>> 
>> First some results for 1000 rows by 500 columns:
>> 
>>> system.time(random_points <- readVECT6("random_points_1000",
>> plugin=TRUE))
>> OGR data source with driver: GRASS
>> Source: "/geo/grassdb/spearfish60/user1/vector/random_points_1000/head",
>> layer: "1"
>> with  1000  rows and  501  columns
>> ^C
>> ### This was running for more than 10 hours !!! ###
>> 
>> 
>>> system.time(random_points <- readVECT6("random_points_1000",
>> plugin=FALSE))
>> Exporting 1000 points/lines...
>> 100%
>> 1000 features written
>> OGR data source with driver: ESRI Shapefile
>> Source: "/geo/grassdb/spearfish60/user1/.tmp/vertical", layer:
>> "random_p"
>> with  1000  rows and  501  columns
>> Feature type: wkbPoint with 2 dimensions
>>   user  system elapsed
>> 62.515   9.256  74.013
>> 
>> 
>>> system.time(random_points <- read.csv("random_points_1000_table.csv"))
>>   user  system elapsed
>>  0.192   0.000   0.192
>> 
>> 
>> 
>> * A script to generate "some" random points, add columns and some R-code
>> to load with readVECT6( plugin = TRUE ), readVECT6( plugin = FALSE ) and
>> read.csv.
>> 
>> * The "bug" is that while the variable NUMBER="$[ ( $RANDOM % 100 ) +
>> 1 ]" runs ok under the CLI, it doesn't work from within the bash
>> script!? So I've commented the respective line and use a fixed number
>> instead.
>> 
>> --%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<-
>> #!/bin/bash
>> 
>> # example that  readVECT6 ( x , plugin = TRUE )  is too slow
>> # (also) using the default DBF driver
>> # first enter in spearfish60/user1
>> 
>> # try with a different back-end?
>> # db.connect driver=sqlite database=
>> 
>> # set numbers here:
>> RANDOM_POINTS=100 ; RANDOM_POINTS_CATS=100 ; NUMBER=111
>> 
>> # create RANDOM_POINTS random points
>> v.random --o output=random_points_${RANDOM_POINTS} n=${RANDOM_POINTS}
>> 
>> # add in database
>> v.db.addtable random_points_${RANDOM_POINTS}
>> 
>> 
>> # add ${RANDOM_POINTS} columns
>> printf '\n* Adding %s columns\n' "${RANDOM_POINTS}"
>> for x in `seq 1 ${RANDOM_POINTS}` ; do
>>   v.db.addcol random_points_${RANDOM_POINTS} column="col_${x} integer"
>> done
>> printf '\n* %s columns added\n' "${RANDOM_POINTS}"
>> 
>> 
>> # check if columns are added
>> v.info -c random_points_${RANDOM_POINTS}
>> 
>> 
>> 
>> ## WARNING: double loop below takes too long!
>> # --%<--
>> # It is simpler and faster to use a single loop with a fixed value
>> # instead, e.g.:
>>  #for COL in `seq 1 5 ${RANDOM_POINTS}` ; do
>>  # v.db.update random_points_${RANDOM_POINTS} column="col_${COL}" value=222
>>  #done
>> # --%<--
>> 
>> 
>> # fill some columns/cats with random numbers between 1 and 100
>> # alter sequence as desired ; more numbers = more time to load in R
>> for COL in `seq 1 10 ${RANDOM_POINTS}` ; do
>>  for CAT in `seq 1 10 ${RANDOM_POINTS_CATS}` ; do
>>   # this is ok in the command line but NOT when running the script?
>>   #NUMBER="$[ ( $RANDOM % 100 ) + 1 ]"
>>   v.db.update random_points_${RANDOM_POINTS} column="col_${COL}" \
>>     value=${NUMBER} where="cat=${CAT}"
>>  done
>> done
>> 
>> 
>> # [optional] fill in some "-999" values to use as NAs in R?
>> #NAN=-999
>> #for COL in `seq 1 5 ${RANDOM_POINTS}` ; do
>> # for CAT in `seq 1 5 ${RANDOM_POINTS_CATS}` ; do
>> #  v.db.update random_points_${RANDOM_POINTS} column="col_${COL}" \
>> #    value=${NAN} where="cat=${CAT}"
>> # done
>> #done
>> 
>> # check with v.db.select
>> # v.db.select random_points_${RANDOM_POINTS} | head -25
>> 
>> # export table as .csv file
>> db.out.ogr in=random_points_${RANDOM_POINTS} format=CSV \
>>   dsn=/geo/grassdb/spearfish60/random_points_csv_files \
>>   db_table=random_points_${RANDOM_POINTS}.csv
>> 
>> ### end of bash script ###
>> 
>> 
>> ## launch R
>> R
>> ### R code
>> 
>> # load in R with:
>> library(spgrass6) ; G <- gmeta6()
>> 
>> #a. readVECT6()
>> system.time ( random_points <- readVECT6 ( "random_points_100" , plugin
>> = FALSE ) )
>> 
>> #b. plugin=TRUE
>> system.time ( random_points <- readVECT6 ( "random_points_100" , plugin
>> = TRUE ) )
>> 
>> #c. as a csv table
>> # adjust as required
>> setwd("/geo/grassdb/spearfish60/random_points_csv_files")
>> table_to_read <- dir ( pattern = "^random.*\\.csv$" )
>> system.time ( random_points <- read.csv ( table_to_read ) )
>> str(random_points)
>> --%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<-
>> 
>> 
>
>
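
On the commented-out NUMBER line in the quoted script: one possible cause
(a guess, not verified here) is that the script was started with
"sh script.sh", which bypasses the #!/bin/bash line. $RANDOM and the old
$[ ... ] arithmetic form are bash features and are not available in a plain
POSIX sh such as dash. A portable sketch that avoids both is given below;
rand_1_100 is a hypothetical helper name, and seeding from the loop counters
is only an illustrative choice for test data.

```shell
#!/bin/sh
# rand_1_100 SEED
# Prints a pseudo-random integer in 1..100 using awk instead of $RANDOM.
# (hypothetical helper, not part of GRASS or of the quoted script)
rand_1_100() {
    awk -v seed="$1" 'BEGIN { srand(seed); print int(rand() * 100) + 1 }'
}

# Example: one value per (COL, CAT) pair, seeded from the loop counters.
# In the real script the echo would be replaced by the v.db.update call.
for COL in 1 11 21; do
    for CAT in 1 11 21; do
        NUMBER=$(rand_1_100 "$((COL * 1000 + CAT))")
        echo "col_${COL} cat=${CAT} value=${NUMBER}"
    done
done
```

Under bash itself, NUMBER=$(( (RANDOM % 100) + 1 )) is the current spelling
of the same expression.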

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no