[gdal-dev] Simple schema support for GeoJSON

Even Rouault even.rouault at spatialys.com
Fri Nov 21 07:02:55 PST 2014


On Friday 21 November 2014 15:35:43, Rahkonen Jukka (Tike) wrote:
> Hi,
> 
> I have no use for this feature myself, but from reading various mailing lists
> and forums I have learned that many people consider it always a good
> idea to read data, for example from WFS services, as GeoJSON instead of GML.

Because it consumes less bandwidth?

For the record, if you try the following, OGR will use the GML schema for the layer
exposed to the user and will do an on-the-fly transformation from the hidden GeoJSON
layer schema to the GML schema, similar to what you could do with CAST or a VRT.

$ ogrinfo "WFS:http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=getfeature&typename=topp:states&outputformat=json" -ro -al -where "STATE_NAME = 'California'"

Layer name: topp:states
Geometry: Multi Polygon
Feature Count: 1
Extent: (-124.391472, 32.535725) - (-114.124451, 42.002346)
Layer SRS WKT:
GEOGCS["WGS 84",
    DATUM["WGS_1984",
        SPHEROID["WGS 84",6378137,298.257223563,
            AUTHORITY["EPSG","7030"]],
        AUTHORITY["EPSG","6326"]],
    PRIMEM["Greenwich",0,
        AUTHORITY["EPSG","8901"]],
    UNIT["degree",0.0174532925199433,
        AUTHORITY["EPSG","9122"]],
    AUTHORITY["EPSG","4326"]]
gml_id: String (0.0)
STATE_NAME: String (0.0)
STATE_FIPS: String (0.0)
SUB_REGION: String (0.0)
STATE_ABBR: String (0.0)
LAND_KM: Real (0.0)
WATER_KM: Real (0.0)
PERSONS: Real (0.0)
FAMILIES: Real (0.0)
HOUSHOLD: Real (0.0)
MALE: Real (0.0)
FEMALE: Real (0.0)
WORKERS: Real (0.0)
DRVALONE: Real (0.0)
CARPOOL: Real (0.0)
PUBTRANS: Real (0.0)
EMPLOYED: Real (0.0)
UNEMPLOY: Real (0.0)
SERVICE: Real (0.0)
MANUAL: Real (0.0)
P_MALE: Real (0.0)
P_FEMALE: Real (0.0)
SAMP_POP: Real (0.0)
OGRFeature(topp:states):0
  gml_id (String) = (null)
  STATE_NAME (String) = California
  STATE_FIPS (String) = 06
  SUB_REGION (String) = Pacific
  STATE_ABBR (String) = CA
  LAND_KM (Real) = 403970.143
  WATER_KM (Real) = 20023.368
  PERSONS (Real) = 29760021
  FAMILIES (Real) = 7139394
  HOUSHOLD (Real) = 10381206
  MALE (Real) = 14897627
  FEMALE (Real) = 14862394
  WORKERS (Real) = 11306576
  DRVALONE (Real) = 9982242
  CARPOOL (Real) = 2036025
  PUBTRANS (Real) = 685797
  EMPLOYED (Real) = 13996309
  UNEMPLOY (Real) = 996502
  SERVICE (Real) = 3664771
  MANUAL (Real) = 1798201
  P_MALE (Real) = 0.501
  P_FEMALE (Real) = 0.499
  SAMP_POP (Real) = 3792553
  MULTIPOLYGON (((....)))

> I can easily imagine that the guess-by-data method will cause trouble
> if they are making subsequent requests to the service. For example,
> strings that are all digits but may contain leading zeroes could be
> saved as either integers or strings, if the leading zeroes are interpreted
> correctly at all.

In JSON, "00123" and 00123 are different objects. So a string with leading zeros
should be serialized as "00123" and not 00123. If it is serialized as "00123", the
GeoJSON driver will interpret it as a string.
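
A quick way to see the difference is with Python's standard json module. This
only illustrates JSON parsing in general, not the GeoJSON driver's own code; the
STATE_FIPS property is borrowed from the layer above.

```python
import json

# A quoted value is a JSON string, so the leading zero survives parsing.
feature_props = json.loads('{"STATE_FIPS": "06"}')
fips = feature_props["STATE_FIPS"]
fips_is_string = isinstance(fips, str)

# An unquoted 06 is rejected by a strict parser: the JSON number grammar
# does not allow leading zeros.
try:
    json.loads('{"STATE_FIPS": 06}')
    unquoted_accepted = True
except json.JSONDecodeError:
    unquoted_accepted = False

print(fips, fips_is_string, unquoted_accepted)
```

So a server that wants the value kept as text has to quote it.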

> Or floats which do not always contain decimals, or list
> attributes which sometimes have only zero or one member.

Yes, those cases could cause issues.
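
To make the problem concrete, here is a small sketch in plain Python (no GDAL
involved) that compares the JSON value types seen in two responses. The two
"pages" and their values are made up for illustration.

```python
import json

def python_types(features):
    """Map each property name to the set of Python type names seen."""
    seen = {}
    for feature in features:
        for name, value in feature["properties"].items():
            seen.setdefault(name, set()).add(type(value).__name__)
    return seen

# Two hypothetical responses from the same service (made-up values).
page_1 = json.loads('[{"properties": {"area": 12, "tags": "a"}}]')
page_2 = json.loads('[{"properties": {"area": 12.5, "tags": ["a", "b"]}}]')

types_1 = python_types(page_1)
types_2 = python_types(page_2)

# Attributes whose observed types differ between the two requests.
unstable = sorted(n for n in types_1 if types_1[n] != types_2.get(n))
print(unstable)
```

Any guess made from page_1 alone would disagree with a guess made from page_2
for both attributes.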

> 
> Embedded schema feels optimal because then it would always travel together
> with the data and we all have probably lost .tfw or .prj files sometimes.
> 
> -Jukka-
> 
> Even Rouault wrote:
> > Jukka,
> > 
> > The data type guessing implemented in the OGR GeoJSON driver is hopefully
> > quite natural.
> > A whole scan of the GeoJSON file is made and the following rules are
> > applied:
> > - if an attribute has integer-only content --> Integer
> > - if an attribute has an array of integer-only content --> IntegerList
> > - if an attribute has integer or floating point content --> Real
> > - if an attribute has an array of integer or floating point content --> RealList
> > - if an attribute has an array of anything else content --> StringList
> > - otherwise --> String
> > 
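
The rules listed above can be sketched in a few lines of Python. This is a
reading of the list, not the driver's actual implementation; the function name
guess_field_type, the handling of nulls and the handling of mixed scalar/array
content are assumptions.

```python
def guess_field_type(values):
    """Guess an OGR-like type name from all values of one attribute."""
    def is_int(v):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(v, int) and not isinstance(v, bool)

    def is_number(v):
        return is_int(v) or isinstance(v, float)

    # Assumption: null values do not influence the guess.
    values = [v for v in values if v is not None]
    if not values:
        return "String"
    if all(isinstance(v, list) for v in values):
        items = [item for v in values for item in v]
        if items and all(is_int(i) for i in items):
            return "IntegerList"
        if items and all(is_number(i) for i in items):
            return "RealList"
        return "StringList"
    if all(is_int(v) for v in values):
        return "Integer"
    if all(is_number(v) for v in values):
        return "Real"
    # Assumption: anything else, including mixed scalars and arrays.
    return "String"

print(guess_field_type([7139394, 996502]))
print(guess_field_type([29760021, 403970.143]))
print(guess_field_type([[-124.39, 32.53], [-114.12, 42.0]]))
```

The whole column has to be scanned before the type is known, which is why a
single feature with a decimal value is enough to turn Integer into Real.
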
> > With RFC 50 and other pending improvements in the driver:
> > - if an attribute has boolean-only content --> Integer(Boolean)
> > - if an attribute has an array of boolean-only content --> IntegerList(Boolean)
> > - if an attribute has date-only content --> Date
> > - if an attribute has time-only content --> Time
> > - if an attribute has datetime or date content --> DateTime
> > 
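
The date/time rules could look roughly like this in Python. The sketch assumes
ISO 8601 strings such as 2014-11-21 or 2014-11-21T15:35:43; which patterns the
driver really recognizes is not stated here, and classify and
guess_temporal_type are hypothetical names.

```python
from datetime import date, datetime, time

def classify(text):
    """Return 'Date', 'DateTime', 'Time' or None for one string value."""
    # Date is tried first because a bare date is also a valid datetime string.
    for label, parser in (("Date", date.fromisoformat),
                          ("DateTime", datetime.fromisoformat),
                          ("Time", time.fromisoformat)):
        try:
            parser(text)
            return label
        except ValueError:
            continue
    return None

def guess_temporal_type(values):
    """Combine the per-value kinds into one column type."""
    kinds = {classify(v) for v in values}
    if kinds == {"Date"}:
        return "Date"
    if kinds == {"Time"}:
        return "Time"
    if kinds and kinds <= {"Date", "DateTime"}:
        return "DateTime"
    return "String"

print(guess_temporal_type(["2014-11-21", "2014-11-21T15:35:43"]))
```

A column mixing dates and datetimes widens to DateTime, and anything that does
not fit falls back to String.
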
> > I'm not sure we want to invent a .jsont format, but if you download
> > http://svn.osgeo.org/gdal/trunk/gdal/swig/python/samples/ogr2vrt.py
> > 
> > and run:
> > 
> > python ogr2vrt.py "http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=getfeature&typename=topp:states&outputformat=json" test.vrt
> > 
> > This will create a VRT with the default schema, which you can easily
> > edit. Note: as with OGR SQL CAST, this is post-processing. So if the
> > guess made by the GeoJSON driver leads to a loss of information, you
> > cannot recover it. Hopefully the implemented rules will not lead to
> > information loss.
> > 
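
For readers who prefer to write such a VRT from scratch, here is a minimal
sketch that builds one with explicit field types using only the Python standard
library. The source file name states.json and the layer name OGRGeoJSON are
assumptions, and build_vrt is a hypothetical helper, not part of ogr2vrt.py.

```python
import xml.etree.ElementTree as ET

def build_vrt(source, layer, field_types):
    """Return OGR VRT XML that re-types the given fields of one layer."""
    root = ET.Element("OGRVRTDataSource")
    vrt_layer = ET.SubElement(root, "OGRVRTLayer", name=layer)
    ET.SubElement(vrt_layer, "SrcDataSource").text = source
    ET.SubElement(vrt_layer, "SrcLayer").text = layer
    for name, ogr_type in field_types.items():
        # src names the source attribute, type is the type to expose.
        ET.SubElement(vrt_layer, "Field", name=name, type=ogr_type, src=name)
    return ET.tostring(root, encoding="unicode")

vrt_xml = build_vrt("states.json", "OGRGeoJSON",
                    {"STATE_FIPS": "String", "EMPLOYED": "Real"})
print(vrt_xml)
```

The same caveat applies: the VRT only re-types what the GeoJSON driver already
read, so it cannot restore information lost during guessing.
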
> > A better approach would be to have the schema embedded in a JSON way in
> > the GeoJSON file itself.
> > That could be an evolution of the format, but I'm not sure this would be
> > really popular, given JSON/GeoJSON is heavily used by NoSQL
> > approaches...
> > 
> > Hmm, doing a quick search, I just found http://json-schema.org/, which
> > appears to be an IETF draft.
> > It doesn't look like the schema is embedded in the data file itself.
> > 
> > There's also GeoJSON-LD, which might be somewhat related:
> > https://github.com/geojson/geojson-ld
> > 
> > CC'ing Sean in case he has thoughts on this.
> > 
> > Even
> > 
> > > Hi,
> > > 
> > > I wonder if GDAL could have some simple and relatively user friendly
> > > way of defining a schema for GeoJSON data. The GeoJSON driver seems
> > > to guess the data types of attributes in some undocumented way, but
> > > users may have better knowledge of the desired schema.
> > > 
> > > I know I can control the data type by using OGR SQL and CAST, as in
> > > ogrinfo -sql "select cast(EMPLOYED as float) from OGRGeojson" states.json -so
> > > 
> > > However, perhaps GeoJSON is popular enough to deserve an easier way
> > > of writing a schema. At first I thought it would be enough to copy
> > > the "csvt" text file mechanism from the GDAL CSV driver
> > > http://www.gdal.org/drv_csv.html. However, the csvt file is a plain
> > > list of types which are applied to the attributes in the same
> > > order as they appear in the text file:
> > > "Integer(5)","Real(10.7)","String(15)"
> > > 
> > > For GeoJSON it would feel more user friendly to include the attribute
> > > names in the list somehow like
> > > "population;Integer(5)","area;Real(10.7)","name;String(15)".
> > > 
> > > This would make it easier for users to write a valid "jsont" file. A
> > > list with attribute names could perhaps help GDAL as well, because
> > > the features in a GeoJSON file do not necessarily have the same attributes.
> > > 
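
To show that the proposed "name;Type(width.precision)" entries are easy to
parse, here is a minimal sketch. The "jsont" format is only a proposal in this
thread, so the grammar below and the parse_jsont name are hypothetical.

```python
import csv
import re

# name;Type with optional (width) or (width.precision), as in the proposal.
ENTRY = re.compile(r"^(?P<name>[^;]+);(?P<type>[A-Za-z]+)"
                   r"(?:\((?P<width>\d+)(?:\.(?P<precision>\d+))?\))?$")

def parse_jsont(line):
    """Parse one line of the hypothetical 'jsont' format into field specs."""
    fields = []
    # csv.reader handles the quoted, comma-separated entries.
    for entry in next(csv.reader([line])):
        match = ENTRY.match(entry.strip())
        if match is None:
            raise ValueError(f"bad jsont entry: {entry!r}")
        fields.append({
            "name": match["name"],
            "type": match["type"],
            "width": int(match["width"]) if match["width"] else None,
            "precision": int(match["precision"]) if match["precision"] else None,
        })
    return fields

schema = parse_jsont('"population;Integer(5)","area;Real(10.7)","name;String(15)"')
for field in schema:
    print(field)
```

Because each entry carries its attribute name, the order of entries no longer
matters, unlike with csvt.
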
> > > As an example, this is the right schema for a WFS feature type, as
> > > captured from
> > > http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=describefeaturetype&typename=topp:states
> > > 
> > > 
> > > name="the_geom" type="gml:MultiPolygonPropertyType"/>
> > > name="STATE_NAME" type="xsd:string"/>
> > > name="STATE_FIPS" type="xsd:string"/>
> > > name="SUB_REGION" type="xsd:string"/>
> > > name="STATE_ABBR" type="xsd:string"/>
> > > name="LAND_KM" type="xsd:double"/>
> > > name="WATER_KM" type="xsd:double"/>
> > > name="PERSONS" type="xsd:double"/>
> > > name="FAMILIES" type="xsd:double"/>
> > > name="HOUSHOLD" type="xsd:double"/>
> > > name="MALE" type="xsd:double"/>
> > > name="FEMALE" type="xsd:double"/>
> > > name="WORKERS" type="xsd:double"/>
> > > name="DRVALONE" type="xsd:double"/>
> > > name="CARPOOL" type="xsd:double"/>
> > > name="PUBTRANS" type="xsd:double"/>
> > > name="EMPLOYED" type="xsd:double"/>
> > > name="UNEMPLOY" type="xsd:double"/>
> > > name="SERVICE" type="xsd:double"/>
> > > name="MANUAL" type="xsd:double"/>
> > > name="P_MALE" type="xsd:double"/>
> > > name="P_FEMALE" type="xsd:double"/>
> > > name="SAMP_POP" type="xsd:double"/>
> > > 
> > > 
> > > This is what GDAL is guessing:
> > > STATE_NAME: String (0.0)
> > > STATE_FIPS: String (0.0)
> > > SUB_REGION: String (0.0)
> > > STATE_ABBR: String (0.0)
> > > LAND_KM: Real (0.0)
> > > WATER_KM: Real (0.0)
> > > PERSONS: Real (0.0)
> > > FAMILIES: Integer (0.0)
> > > HOUSHOLD: Real (0.0)
> > > MALE: Real (0.0)
> > > FEMALE: Real (0.0)
> > > WORKERS: Real (0.0)
> > > DRVALONE: Integer (0.0)
> > > CARPOOL: Integer (0.0)
> > > PUBTRANS: Integer (0.0)
> > > EMPLOYED: Real (0.0)
> > > UNEMPLOY: Integer (0.0)
> > > SERVICE: Integer (0.0)
> > > MANUAL: Integer (0.0)
> > > P_MALE: Real (0.0)
> > > P_FEMALE: Real (0.0)
> > > SAMP_POP: Integer (0.0)
> > > bbox: RealList (0.0)
> > > 
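
The two listings above can be compared mechanically. The sketch below uses
only a handful of the fields from those listings and maps xsd:string to String
and xsd:double to Real, which matches what the WFS/GML layer reports; the
schema_diff helper is hypothetical.

```python
XSD_TO_OGR = {"xsd:string": "String", "xsd:double": "Real"}

# Subset of the declared (DescribeFeatureType) and guessed schemas.
declared = {"STATE_FIPS": "xsd:string", "PERSONS": "xsd:double",
            "FAMILIES": "xsd:double", "SAMP_POP": "xsd:double"}
guessed = {"STATE_FIPS": "String", "PERSONS": "Real",
           "FAMILIES": "Integer", "SAMP_POP": "Integer",
           "bbox": "RealList"}

def schema_diff(declared, guessed):
    """Return fields whose guessed type differs, and fields not declared."""
    mismatched = {}
    for name, xsd_type in declared.items():
        expected = XSD_TO_OGR[xsd_type]
        if name in guessed and guessed[name] != expected:
            mismatched[name] = (expected, guessed[name])
    extra = sorted(set(guessed) - set(declared))
    return mismatched, extra

mismatched, extra = schema_diff(declared, guessed)
print(mismatched)
print(extra)
```

For this subset the guess differs only by narrowing Real to Integer, plus the
extra bbox attribute that the declared schema does not mention.
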
> > > -Jukka Rahkonen-
> > > 
> > > _______________________________________________
> > > gdal-dev mailing list
> > > gdal-dev at lists.osgeo.org
> > > http://lists.osgeo.org/mailman/listinfo/gdal-dev
> > 
> > --
> > Spatialys - Geospatial professional services http://www.spatialys.com
> 

-- 
Spatialys - Geospatial professional services
http://www.spatialys.com

