[GRASS-stats] Loading a point-vector table with 466 columns
Roger Bivand
Roger.Bivand at nhh.no
Wed May 27 04:14:23 EDT 2009
On Tue, 26 May 2009, Nikos Alexandris wrote:
> (Cc to Even Roualt ; Apologies to Even since he is not subscribed in the
> list)
>
> Roger:
>>>> Three minutes instead of thirty+ suggests that the OGR
>>>> plugin has trouble with SQLite as the DB format. So maybe
>>>> the default for plugin= should be FALSE, not NULL and automatic
>>>> use if present?
>
> --%<--
>> Could you, Nikos,
>> make a script generating a similar table in spearfish, and two small
>> scripts exercising the problem (export to R with the plugin, and with
>> the temporary shapefile.
>
> * The "problem" exists also with the default DBF as a back-end. I
> created 1000 random points, filled less than half of the records with
> random numbers and readVECT6("x", plugin=TRUE) takes again too much. I
> broke the process since it was running for more than 20 mins.
OK. With 250 rows and 250 columns, I see an order of magnitude saving with
plugin=FALSE. In plugin=FALSE, the times are split equally between writing
the temporary file from GRASS with v.out.ogr, and reading it into R with
readOGR(), as one might expect (that is all readVECT6(..., plugin=FALSE)
is doing). Even on a small vector (bugsites, 90 points, 2 attribute
columns), plugin=FALSE is faster than plugin=TRUE by about 0.75 : 1.35,
not quite twice. Which way does the problem scale, in numbers of features,
numbers of attribute columns, or both?
Next script in R generating increasing NR and NC cases through
writeVECT6() to test plugin=FALSE/plugin=TRUE ratios?
Roger
>
> * A script is pasted on the bottom which has a small "bug" (details
> below) :-)
>
>
> First some results for 1000 rows by 500 columns:
>
>> system.time(random_points <- readVECT6("random_points_1000",
> plugin=TRUE))
> OGR data source with driver: GRASS
> Source: "/geo/grassdb/spearfish60/user1/vector/random_points_1000/head",
> layer: "1"
> with 1000 rows and 501 columns
> ^C
> ### This was running for more than 10 hours !!! ###
>
>
>> system.time(random_points <- readVECT6("random_points_1000",
> plugin=FALSE))
> Exporting 1000 points/lines...
> 100%
> 1000 features written
> OGR data source with driver: ESRI Shapefile
> Source: "/geo/grassdb/spearfish60/user1/.tmp/vertical", layer:
> "random_p"
> with 1000 rows and 501 columns
> Feature type: wkbPoint with 2 dimensions
> user system elapsed
> 62.515 9.256 74.013
>
>
>> system.time(random_points <- read.csv("random_points_1000_table.csv"))
> user system elapsed
> 0.192 0.000 0.192
>
>
>
> * A script to generate "some" random points, add columns and some R-code
> to load with readVECT6( plugin = TRUE ), readVECT6( plugin = FALSE ) and
> read.csv.
>
> * The "bug" is that while the variable NUMBER="$[ ( $RANDOM % 100 ) +
> 1 ]" runs ok under the CLI, it doesn't work from within the bash
> script!? So I've commented the respective line and use a fixed number
> instead.
>
> --%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<-
> #!/bin/bash
>
> # example that readVECT6 ( x , plugin = TRUE ) is too slow
> # (also) using the default DBF driver
> # first enter in spearfish60/user1
>
> # try with a different back-end?
> # db.connect driver=sqlite database=
>
> # set numbers here:
> RANDOM_POINTS=100 ; RANDOM_POINTS_CATS=100 ; NUMBER=111
>
> # create RANDOM_POINTS random points
> v.random --o output=random_points_`echo ${RANDOM_POINTS}` n=`echo
> ${RANDOM_POINTS}`
>
> # add in database
> v.db.addtable random_points_`echo ${RANDOM_POINTS}`
>
>
> # add $"{RANDOM_POINTS}" columns
> echo "\n* Adding ${RANDOM_POINTS} columns"
> for x in `seq 1 ${RANDOM_POINTS}` ; do
> v.db.addcol random_points_`echo ${RANDOM_POINTS}` column="col_"${x}"
> integer"
> done ; echo "\n* ${RANDOM_POINTS} columns added"
>
>
> # check if columns are added
> v.info -c random_points_${RANDOM_POINTS}
>
>
>
> ## WARNING: double loop below takes too long!
> # --%<--
> # It is simpler and faster to use a single loop with a fixed value
> instead, e.g.:
> #for COL in `seq 1 5 ${RANDOM_POINTS}` ; do
> # v.db.update random_points_${RANDOM_POINTS} column="col_"${COL}""
> value=222
> #done
> # --%<--
>
>
> # fill some columns/cats with random numbers between 1 and 100
> # alter sequence as desired ; more numbers = more time to load in R
> for COL in `seq 1 10 ${RANDOM_POINTS}` ; do
> for CAT in `seq 1 10 ${RANDOM_POINTS_CATS}` ; do
> # this is ok in the command line but NOT when running the script?
> #NUMBER="$[ ( $RANDOM % 100 ) + 1 ]"
> v.db.update random_points_${RANDOM_POINTS} column="col_"${COL}""
> value=${NUMBER} where="cat="${CAT}""
> done
> done
>
>
> # [optional] fill in some "-999" values to use as NAs in R?
> #NAN=-999
> #for COL in `seq 1 5 $"{RANDOM_POINTS}"` ; do
> # for CAT in `seq 1 5 $"{RANDOM_POINTS_CATS}"` ; do
> # v.db.update random_points_$"{RANDOM_POINTS}" column="col_"${COL}""
> value=$"{NAN}" where="cat="${CAT}""
> # done
> #done
>
> # check with v.db.select
> # v.db.select random_points_${RANDOM_POINTS} | head -25
>
> # export table as .csv file
> db.out.ogr in=random_points_${RANDOM_POINTS} format=CSV
> dsn=/geo/grassdb/spearfish60/random_points_csv_files
> db_table=random_points_${RANDOM_POINTS}.csv
>
> ### end of bash script ###
>
>
> ## launch R
> R
> ### R code
>
> # load in R with:
> library(spgrass6) ; G <- gmeta6()
>
> #a. readVECT6()
> system.time ( random_points <- readVECT6 ( "random_points_100" , plugin
> = FALSE ) )
>
> #b. plugin=TRUE
> system.time ( random_points <- readVECT6 ( "random_points_100" , plugin
> = TRUE ) )
>
> #c. as a csv table
> # adjust as required
> setwd("/geo/grassdb/spearfish60/random_points_csv_files")
> table_to_read <- dir ( pattern = "^random.*.csv$" )
> system.time ( random_points <- read.csv ( table_to_read ) )
> str(random_points)
> --%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<-
>
>
--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no
More information about the grass-stats
mailing list