[GRASS-stats] Re: [R-sig-Geo] writing shapefiles / DBF files when input data contains NA

Dylan Beaudette dylan.beaudette at gmail.com
Tue Oct 7 11:04:39 EDT 2008


On Tue, Oct 7, 2008 at 12:10 AM, Roger Bivand <Roger.Bivand at nhh.no> wrote:
> On Mon, 6 Oct 2008, Dylan Beaudette wrote:
>
>> Hi,
>>
>> I have noticed that saving data to files that include a DBF results in
>> bogus data where there were NAs. Using the write.dbf() function from
>> the foreign package seems to work a little better, but I still get odd
>> results in numeric columns. Writing to GRASS with the methods in the
>> spgrass6 package results in something that looks like this:
>>
>
> Dylan,
>
> I'm afraid that there is no good solution for this at all. DBF does not seem
> to have a clear and uniform NA treatment (or even !is.finite() treatment).
> The only work-around is to preprocess the data.frame in the output object to
> insert known NODATA values, and to replace those flags manually on the GRASS
> side. This could possibly be written as a wrapper around writeVECT6(). The
> help page does say:
>
>    "Please note that the OGR drivers used may not handle missing data
>     gracefully, and be prepared to have to correct for this manually.
>     For example use of the 'readOGR' PostGIS driver directly may
>     perform better than moving the data through the DBF driver used in
>     this function - or a PostgreSQL driver used directly or through
>     ODBC may be a solution. Do not rely on missing values of vector
>     data moving smoothly across the interface."


Hi Roger, thanks for the ideas. I had a sneaking suspicion that there
wasn't going to be a quick solution for this deep-seated problem with
the DBF driver. In this situation I was stuck with using the DBF
driver; however, I am now considering alternative routes. I wonder if
there are other vector formats that can be used as intermediate files.
I will post to the GDAL mailing list to ask about this.
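
In the meantime, something like the following might serve as the
pre-processing wrapper you describe above -- an untested sketch, where
flag.na() is just a hypothetical helper and -9999 an arbitrary flag
value that would still have to be set back to NULL on the GRASS side:

# untested sketch: replace NA in numeric columns with a known flag value
# before writeVECT6(), then reset that flag to NULL on the GRASS side
library(spgrass6)

flag.na <- function(d, flag=-9999) {
  num.cols <- sapply(d, is.numeric)
  d[num.cols] <- lapply(d[num.cols], function(v) ifelse(is.na(v), flag, v))
  d
}

spatial.data@data <- flag.na(spatial.data@data)
writeVECT6(SDF=spatial.data, vname='pedons_grouped')

# then on the GRASS side, for each flagged column (the column name below is
# a placeholder, and this assumes the driver accepts SQL NULL):
# echo "UPDATE pedons_grouped SET some_column = NULL WHERE some_column = -9999" | db.execute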


> I did try to look at the SQLite driver on the GRASS side, which might be
> more robust, but did not see how to proceed.

Yes. I should really switch the back-end for all GRASS data that I am
using to SQLite / PostgreSQL. However, does writeVECT6() build an
intermediate shapefile, and then import it into GRASS? If that is the
case, we may be able to find a saner intermediate format. The column
name truncation is also a pain in the neck. I am almost positive that
I have seen nodata handled properly *within* GRASS, but then
defaulting to something like 0 when an NA-coded numeric column is
exported to shapefile.

Spearfish example:

# make a copy of one of the sample vector files
g.copy vect=vegcover,veg_temp

# add a new double precision column
# should be filled with NULL
v.db.addcol veg_temp column="x double"

# check within GRASS:
db.select veg_temp
cat|label|x
1|irrigated agriculture|
2|rangeland|
3|coniferous forest|
4|deciduous forest|
5|mixed forest|
6|disturbed|

# check outside of GRASS:
dbfxtrct -iveg_temp.dbf
1,irrigated agriculture,********************
2,rangeland,********************
3,coniferous forest,********************
4,deciduous forest,********************
5,mixed forest,********************
6,disturbed,********************

# OK - it looks like there *is* some way to encode nodata into a DBF
# file for numeric data
# looking at the GRASS description of the columns:

column:x
description:
type:DOUBLE PRECISION
len:20
scale:6
precision:18
default:
nullok:yes
select:yes
update:yes

# cleanup
g.remove vect=veg_temp


## another example:
g.copy vect=archsites,a_temp
v.db.addcol a_temp column="xyz double, abc integer"

# check:
db.select a_temp
cat|str1|xyz|abc
1|Signature Rock||
2|No Name||
3|Canyon Station||
[...]

# check outside of GRASS:
dbfxtrct -ia_temp.dbf
1,Signature Rock,********************,***********
2,No Name,********************,***********
3,Canyon Station,********************,***********
4,Spearfish Creek,********************,***********
[...]

# looks like encoding nodata for 'double' is different than for 'integer'

# try exporting through GDAL/OGR:
# first as shapefile:
v.out.ogr in=a_temp dsn=. type=point

# check:
dbfxtrct -ia_temp.dbf
1,Signature Rock,0.000000000000000,0
2,No Name,0.000000000000000,0
3,Canyon Station,0.000000000000000,0
4,Spearfish Creek,0.000000000000000,0

## not as bad as junk data, but 0 is not NA...

# try another vector driver: MapInfo:
v.out.ogr in=a_temp dsn=. type=point format=MapInfo_File

OGRFeature(a_temp):1
  cat (Integer) = 1
  str1 (String) = Signature Rock
  xyz (Real) = 0
  abc (Integer) = 0

## still wrong, trying GML
v.out.ogr in=a_temp dsn=output_dir layer=a_temp type=point format=GML

## null columns aren't exported..., trying SQLite
v.out.ogr in=a_temp dsn=a_temp.db layer=a_temp type=point format=SQLite

ERROR 1: sqlite3_step() failed:
  SQL logic error or missing database
ERROR 1: sqlite3_step() failed:
  SQL logic error or missing database
[...]

## results in a broken sqlite DB, with no data, and an incomplete schema

# one more try: save a dbf file from R using the foreign package's write.dbf()
library(foreign)
x <- data.frame(a=1:10, b=1:10 + runif(10))
x[10,] <- c(NA,NA)
write.dbf(x, file='x.dbf')

# looks ok:
dbfxtrct -ix.dbf
1,1.967017974471673
2,2.275017686421052
3,3.295665678335354
4,4.817330794176087
5,5.371476975502446
6,6.743488555075601
7,7.345701594371349
8,8.887322778813541
9,9.570328279398382
****,*******************
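
# quick sanity check back in R -- the last row should come back as NA
# (read.dbf() is also in the foreign package):
str(read.dbf('x.dbf'))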

# try reading into GRASS:
db.in.ogr dsn=x.dbf out=x

# check in GRASS:
db.select x
a|b
1|1.967018
2|2.275018
3|3.295666
4|4.817331
5|5.371477
6|6.743489
7|7.345702
8|8.887323
9|9.570328
|                   <------- correct use of NA


## some success!
It looks like writing the attribute data with the foreign package's
write.dbf() may be a safer mechanism for exporting the data frame
attached to sp objects. Obviously it would be best to avoid DBF at all
costs, but for the time being, it may be possible to fix the
writeVECT6() and writeOGR() methods by splitting the writing of the
spatial / attribute data. Not sure how robust this is, but...

# load 'a_temp' from prev example into R
library(spgrass6)
library(foreign)

x <- readVECT6('a_temp')
x@data
   cat                 str1 xyz abc
1    1       Signature Rock   0   0
2    2              No Name   0   0
3    3       Canyon Station   0   0
[...]

# note that NA is not preserved -- probably due to the intermediate
# export to shapefile
# re-add some NA
x@data[20:25, c('str1', 'xyz', 'abc')] <- c(NA, NA, NA)

# write shapefile, and then re-make dbf
# note that DBF written by OGR driver has bogus data where there was NA
writeOGR(x, driver='ESRI Shapefile', dsn='.', layer='a_temp2')

# replace dbf with one written by write.dbf...
write.dbf(x@data, file='a_temp2.dbf')

# dbf file looks correct... import into GRASS and make sure the spatial
# data can "see" the re-made DBF:
v.in.ogr dsn=a_temp2.shp out=a_temp2

db.select a_temp2
cat|cat_|str1|xyz|abc
1|1|Signature Rock|0|0
2|2|No Name|0|0
3|3|Canyon Station|0|0
4|4|Spearfish Creek|0|0
5|5|No Name|0|0
6|6|Prairie Site|0|0
7|7|Jensen Pass|0|0
8|8|No Name|0|0
9|9|No Name|0|0
10|10|Slaughterhouse Gulch|0|0
11|11|No Name|0|0
12|12|Elkhorn Peak|0|0
13|13|No Name|0|0
14|14|No Name|0|0
15|15|No Name|0|0
16|16|Boulder Creek Cabin|0|0
17|17|Ridge|0|0
18|18|Cole Creek Mine|0|0
19|19|No Name|0|0
20|20|||
21|21|||
22|22|||
23|23|||
24|24|||
25|25|||

# done with examples.


OK. It looks like the internal ordering of the DBF file remains true
to the original ordering of the source data frame, and thus matches
the original ordering of records in GRASS. Do you think that it would
be wise, after more testing, to make a second pass on shapefiles that
contain NA, replacing the DBF with a corrected one? That is, in
writeOGR() / writeVECT6(), check for any columns that are not
text/factors and that contain NA -- if any are found, re-write the DBF
file with the write.dbf() function.
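
As an untested sketch of that check-and-rewrite step (fix_dbf_na() is
just a hypothetical name, the dsn/layer handling is simplified, and it
assumes the DBF written by write.dbf() stays compatible with the .shp,
as the test above suggests):

# untested sketch of the proposed second pass: after writeOGR() has
# written layer.shp/.dbf, overwrite the DBF with one from write.dbf()
# whenever a non-text/non-factor column contains NA
library(rgdal)
library(foreign)

fix_dbf_na <- function(obj, dsn, layer) {
  writeOGR(obj, driver='ESRI Shapefile', dsn=dsn, layer=layer)
  d <- obj@data
  needs.fix <- sapply(d, function(v)
    !(is.character(v) || is.factor(v)) && any(is.na(v)))
  if (any(needs.fix))
    write.dbf(d, file=file.path(dsn, paste(layer, '.dbf', sep='')))
  invisible(needs.fix)
}

# e.g. with the object from the example above:
fix_dbf_na(x, dsn='.', layer='a_temp2')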

I would be happy to test any updates. Of note, I am using these versions:

GRASS: 6.4-svn r32904
R: 2.7.1 (2008-06-23)
GDAL: trunk r15235
spgrass6: 0.5-13
sp: 0.9-26
rgdal: 0.5-26

Hopefully this testing will help iron out a solution.

Cheers,

Dylan

> One possibility is not to recode, but to build an NA mask on the R side, and
> then loop over fields on the GRASS side for the chosen driver inserting NAs
> in the correct rows (whatever the syntax for that might be). Would this be
> db.execute with an insertion of SQL NULL?
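
A rough sketch of what that might look like, assuming the driver in
question accepts "SET <column> = NULL" (which I have not verified for
the DBF driver):

# untested sketch: emit SQL that sets NULL wherever the R object has NA,
# keyed on the 'cat' column, for db.execute on the GRASS side
d <- x@data
na.cols <- names(d)[sapply(d, function(v) any(is.na(v)))]
sql <- unlist(lapply(na.cols, function(col) {
  cats <- d$cat[is.na(d[[col]])]
  sprintf("UPDATE a_temp2 SET %s = NULL WHERE cat = %s", col, cats)
}))
writeLines(sql, 'set_null.sql')
# then in GRASS: db.execute input=set_null.sql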
>
> Can we redirect this discussion to the statgrass list, because GRASS
> developers follow that list?
>
> Best wishes,
>
> Roger
>
>> ### code snippet:
>> writeVECT6(SDF=spatial.data, vname='pedons_grouped')
>>
>>
>> ### errors:
>> Projection of input dataset and current location appear to match
>> Layer: pedons_g
>> WARNING: Column name changed: 'describer.' -> 'describer_'
>> WARNING: Column name changed: 'cat' -> 'cat_'
>> Importing map 103 features...
>> DBMI-DBF driver error:
>> SQL parser error: syntax error, unexpected NAME processing 'nan'
>> in statement:
>> insert into pedons_grouped values ( 1, 'd2g1', 'alex',
>> 32.311427999999999,      252.434875000000005,     7227.804688000000169,
>> -0.000162000000000,           3,                      nan, 'NA',
>> -2147483648, 'NA', 'NA', -2147483648, -2147483648, 'NA',
>> nan, '1', 'NA' )
>> Error in db_execute_immediate()
>>
>> ERROR: Cannot insert new row: insert into pedons_grouped values ( 1,
>>      'd2g1', 'alex', 32.311427999999999, 252.434875000000005,
>>      7227.804688000000169, -0.000162000000000, 3, nan, 'NA',
>> -2147483648,
>>      'NA', 'NA', -2147483648, -2147483648, 'NA', nan, '1', 'NA' )
>>
>>
>> ### another self-contained example:
>>
>>
>> # load libs
>> library(sp)
>> library(rgdal)
>> library(foreign)
>>
>> # read in xy data and promote to sp object
>> e <-
>> read.csv(url('http://casoilresource.lawr.ucdavis.edu/drupal/files/elev.csv_.txt'))
>> coordinates(e) <- ~ x+y
>>
>> # add a factor column
>> e@data$f <- factor(rep(letters[1:10], each=30))
>>
>> # add some NA
>> e@data$elev[288:300] <- NA
>> e@data$f[288:300] <- NA
>>
>> # save sp object to shapefile
>> writeOGR(e, driver='ESRI Shapefile', dsn='.', layer='pts')
>>
>>
>> # the results from dumping the DBF:
>> [...]
>> 285,1543,j
>> 286,1518,j
>> 287,1656,j
>> 288,-2147483648,NA
>> 289,-2147483648,NA
>> [...]
>>
>>
>> # one more try with the foreign package's write.dbf()
>> write.dbf(e@data, file='second_try.dbf')
>>
>> # results: look better, although the '******' isn't a legal int!
>> [...]
>> 285,1543,j
>> 286,1518,j
>> 287,1656,j
>> 288,*******,
>> 289,*******,
>> [...]
>>
>>
>> Any ideas on how to work with missing data in numeric columns when
>> the dreaded DBF file is involved? This is a real show-stopper when
>> sending vector data back to GRASS, as it seems to rely on intermediate
>> files. Maybe it would be a good idea to send the geometry first, and
>> then the attribute data. There would still be a problem if the DBF
>> back-end is in use...
>>
>> Cheers,
>>
>> Dylan
>>
>> _______________________________________________
>> R-sig-Geo mailing list
>> R-sig-Geo at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>
>
> --
> Roger Bivand
> Economic Geography Section, Department of Economics, Norwegian School of
> Economics and Business Administration, Helleveien 30, N-5045 Bergen,
> Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
> e-mail: Roger.Bivand at nhh.no
>
>

