[gdal-dev] Simple schema support for GeoJSON

Andreas Oxenstierna ao at t-kartor.se
Fri Nov 21 07:17:49 PST 2014


Hi

The usual reason to select GeoJSON for geoweb applications is that JSON 
is parsed directly by the web browser, i.e. you get JavaScript objects
directly digestible by your JavaScript code. This may also be 
considerably faster than parsing XML.
Bandwidth is more or less irrelevant in comparison.

> On Friday 21 November 2014 15:35:43, Rahkonen Jukka (Tike) wrote:
>> Hi,
>>
>> I have no use for this feature myself but by reading various mailing lists
>> and forums I have learned that many people consider it always a good
>> idea to read data for example from WFS services as GeoJSON instead of GML.
> Because it consumes less bandwidth?
>
> For the record, if you try the following, it will use the GML schema for the user
> exposed layer and will do an on-the-fly transform from the hidden GeoJSON layer schema
> to the GML schema, similarly to the one you could do with a CAST/VRT.
>
> $ ogrinfo "WFS:http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=getfeature&typename=topp:states&outputformat=json" -ro -al -where "STATE_NAME = 'California'"
>
> Layer name: topp:states
> Geometry: Multi Polygon
> Feature Count: 1
> Extent: (-124.391472, 32.535725) - (-114.124451, 42.002346)
> Layer SRS WKT:
> GEOGCS["WGS 84",
>      DATUM["WGS_1984",
>          SPHEROID["WGS 84",6378137,298.257223563,
>              AUTHORITY["EPSG","7030"]],
>          AUTHORITY["EPSG","6326"]],
>      PRIMEM["Greenwich",0,
>          AUTHORITY["EPSG","8901"]],
>      UNIT["degree",0.0174532925199433,
>          AUTHORITY["EPSG","9122"]],
>      AUTHORITY["EPSG","4326"]]
> gml_id: String (0.0)
> STATE_NAME: String (0.0)
> STATE_FIPS: String (0.0)
> SUB_REGION: String (0.0)
> STATE_ABBR: String (0.0)
> LAND_KM: Real (0.0)
> WATER_KM: Real (0.0)
> PERSONS: Real (0.0)
> FAMILIES: Real (0.0)
> HOUSHOLD: Real (0.0)
> MALE: Real (0.0)
> FEMALE: Real (0.0)
> WORKERS: Real (0.0)
> DRVALONE: Real (0.0)
> CARPOOL: Real (0.0)
> PUBTRANS: Real (0.0)
> EMPLOYED: Real (0.0)
> UNEMPLOY: Real (0.0)
> SERVICE: Real (0.0)
> MANUAL: Real (0.0)
> P_MALE: Real (0.0)
> P_FEMALE: Real (0.0)
> SAMP_POP: Real (0.0)
> OGRFeature(topp:states):0
>    gml_id (String) = (null)
>    STATE_NAME (String) = California
>    STATE_FIPS (String) = 06
>    SUB_REGION (String) = Pacific
>    STATE_ABBR (String) = CA
>    LAND_KM (Real) = 403970.143
>    WATER_KM (Real) = 20023.368
>    PERSONS (Real) = 29760021
>    FAMILIES (Real) = 7139394
>    HOUSHOLD (Real) = 10381206
>    MALE (Real) = 14897627
>    FEMALE (Real) = 14862394
>    WORKERS (Real) = 11306576
>    DRVALONE (Real) = 9982242
>    CARPOOL (Real) = 2036025
>    PUBTRANS (Real) = 685797
>    EMPLOYED (Real) = 13996309
>    UNEMPLOY (Real) = 996502
>    SERVICE (Real) = 3664771
>    MANUAL (Real) = 1798201
>    P_MALE (Real) = 0.501
>    P_FEMALE (Real) = 0.499
>    SAMP_POP (Real) = 3792553
>    MULTIPOLYGON (((....)))
>
>> I can easily imagine that there will be trouble with the guess-by-data
>> method if they are making subsequent requests to the service. For example,
>> strings which are all digits but which may contain leading zeroes end up
>> as either integers or strings, if the leading zeroes are interpreted
>> correctly at all.
> In JSON, "00123" and 00123 are different objects. So a string with leading
> zeros should be serialized as "00123" and not 00123. If it is serialized as
> "00123", the GeoJSON driver will interpret it as a string.
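
To illustrate the quoted-versus-unquoted point, here is a minimal sketch using Python's standard json module, not the GeoJSON driver itself. The STATE_FIPS value is taken from the ogrinfo output above. Note that Python's parser rejects an unquoted number with a leading zero outright; other parsers may behave differently.

```python
import json

# A quoted value stays a string, so the leading zero survives.
quoted = json.loads('{"STATE_FIPS": "06"}')
print(type(quoted["STATE_FIPS"]).__name__, quoted["STATE_FIPS"])

# An unquoted number with a leading zero is rejected by Python's
# json module rather than being silently turned into an integer.
try:
    json.loads('{"STATE_FIPS": 06}')
    rejected = False
except json.JSONDecodeError:
    rejected = True
print("unquoted 06 rejected:", rejected)
```
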
>
>> Or floats which do not always contain decimals, or list
>> attributes which sometimes have only zero or one member.
> Yes, those cases could cause issues.
>
>> Embedded schema feels optimal because then it would always travel together
>> with the data and we all have probably lost .tfw or .prj files sometimes.
>>
>> -Jukka-
>>
>> Even Rouault wrote:
>>> Jukka,
>>>
>>> Data type guessing implemented in the OGR GeoJSON driver is quite natural
>>> hopefully.
>>> A whole scan of the GeoJSON file is made and the following rules are
>>> applied:
>>> - if an attribute has integer-only content --> Integer
>>> - if an attribute has an array of integer-only content --> IntegerList
>>> - if an attribute has integer or floating point content --> Real
>>> - if an attribute has an array of integer or floating point content --> RealList
>>> - if an attribute has an array of anything else content --> StringList
>>> - otherwise --> String
>>>
>>> With RFC 50 and other pending improvements in the driver:
>>> - if an attribute has boolean-only content --> Integer(Boolean)
>>> - if an attribute has an array of boolean-only content --> IntegerList(Boolean)
>>> - if an attribute has date-only content --> Date
>>> - if an attribute has time-only content --> Time
>>> - if an attribute has datetime or date content --> DateTime
>>>
>>> I'm not sure we want to invent a .jsont format, but if you download
>>> http://svn.osgeo.org/gdal/trunk/gdal/swig/python/samples/ogr2vrt.py
>>>
>>> and run:
>>>
>>> python ogr2vrt.py "http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=getfeature&typename=topp:states&outputformat=json" test.vrt
>>>
>>> This will create a VRT for you with the default schema, which you can easily
>>> edit. Note: as with OGR SQL CAST, this is post processing. So if the
>>> guess done by the GeoJSON driver leads to a loss of information, you
>>> cannot recover it. Hopefully the implemented rules will not lead to
>>> information loss.
>>>
>>> A better approach would be to have the schema embedded in a JSON way in
>>> the GeoJSON file itself.
>>> That could be an evolution of the format, but I'm not sure this would be
>>> really popular, given JSON/GeoJSON is heavily used by NoSQL
>>> approaches...
>>>
>>> Hmm, doing a quick search, I just found http://json-schema.org/ that
>>> appears to be an IETF draft.
>>> It doesn't look like the schema is embedded in the data file itself.
>>>
>>> There's also GeoJSON-LD that might be a bit related :
>>> https://github.com/geojson/geojson-ld
>>>
>>> CC'ing Sean in case he has thoughts on this.
>>>
>>> Even
>>>
>>>> Hi,
>>>>
>>>> I wonder if GDAL could have some simple and relatively user friendly
>>>> way of defining a schema for GeoJSON data. The GeoJSON driver seems
>>>> to guess the data types of attributes in some undocumented way, but
>>>> users could have better knowledge about the desired schema.
>>>>
>>>> I know I can control the data type by using OGR SQL and CAST as in
>>>> ogrinfo -sql "select cast(EMPLOYED as float) from OGRGeojson" states.json -so
>>>>
>>>> However, perhaps GeoJSON is popular enough to deserve an easier way
>>>> of writing a schema. At first I thought that it would be enough to copy
>>>> the "csvt" text file mechanism from the GDAL CSV driver
>>>> http://www.gdal.org/drv_csv.html. However, the csvt file is a plain
>>>> list of types which are applied to the attributes in the same
>>>> order as they appear in the text file
>>>> "Integer(5)","Real(10.7)","String(15)"
>>>>
>>>> For GeoJSON it would feel more user friendly to include the attribute
>>>> names in the list somehow like
>>>> "population;Integer(5)","area;Real(10.7)","name;String(15)".
>>>>
>>>> This would make it easier for users to write a valid "jsont" file. A
>>>> list with attribute names could perhaps also help GDAL, because
>>>> the features in a GeoJSON file do not necessarily have the same attributes.
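
As a rough sketch of how such a "jsont" line could be read: the format below is only the proposal from this message, not something GDAL supports, and the function name parse_jsont is hypothetical. It splits the quoted entries with the csv module and separates each attribute name from its type, width and precision.

```python
import csv
import re

# Type name, optionally followed by (width) or (width.precision).
_TYPE_RE = re.compile(r"^(\w+)(?:\((\d+)(?:\.(\d+))?\))?$")

def parse_jsont(line):
    """Parse '"name;Type(width.precision)",...' into {name: (type, width, precision)}."""
    schema = {}
    for item in next(csv.reader([line])):
        name, _, type_spec = item.partition(";")
        match = _TYPE_RE.match(type_spec.strip())
        if not name.strip() or match is None:
            raise ValueError(f"bad jsont entry: {item!r}")
        type_name, width, precision = match.groups()
        # Missing width or precision is stored as 0 (an assumption of this sketch).
        schema[name.strip()] = (type_name, int(width or 0), int(precision or 0))
    return schema

schema = parse_jsont('"population;Integer(5)","area;Real(10.7)","name;String(15)"')
print(schema)
```

Because entries are keyed by attribute name rather than by position, features that lack some attributes would not shift the types of the others.
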
>>>>
>>>> As an example this is the right schema for a WFS feature type which is
>>>> captured from
>>>> http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=describefeaturetype&typename=topp:states
>>>>
>>>>
>>>> name="the_geom" type="gml:MultiPolygonPropertyType"/>
>>>> name="STATE_NAME" type="xsd:string"/>
>>>> name="STATE_FIPS" type="xsd:string"/>
>>>> name="SUB_REGION" type="xsd:string"/>
>>>> name="STATE_ABBR" type="xsd:string"/>
>>>> name="LAND_KM" type="xsd:double"/>
>>>> name="WATER_KM" type="xsd:double"/>
>>>> name="PERSONS" type="xsd:double"/>
>>>> name="FAMILIES" type="xsd:double"/>
>>>> name="HOUSHOLD" type="xsd:double"/>
>>>> name="MALE" type="xsd:double"/>
>>>> name="FEMALE" type="xsd:double"/>
>>>> name="WORKERS" type="xsd:double"/>
>>>> name="DRVALONE" type="xsd:double"/>
>>>> name="CARPOOL" type="xsd:double"/>
>>>> name="PUBTRANS" type="xsd:double"/>
>>>> name="EMPLOYED" type="xsd:double"/>
>>>> name="UNEMPLOY" type="xsd:double"/>
>>>> name="SERVICE" type="xsd:double"/>
>>>> name="MANUAL" type="xsd:double"/>
>>>> name="P_MALE" type="xsd:double"/>
>>>> name="P_FEMALE" type="xsd:double"/>
>>>> name="SAMP_POP" type="xsd:double"/>
>>>>
>>>>
>>>> This is what GDAL is guessing:
>>>> STATE_NAME: String (0.0)
>>>> STATE_FIPS: String (0.0)
>>>> SUB_REGION: String (0.0)
>>>> STATE_ABBR: String (0.0)
>>>> LAND_KM: Real (0.0)
>>>> WATER_KM: Real (0.0)
>>>> PERSONS: Real (0.0)
>>>> FAMILIES: Integer (0.0)
>>>> HOUSHOLD: Real (0.0)
>>>> MALE: Real (0.0)
>>>> FEMALE: Real (0.0)
>>>> WORKERS: Real (0.0)
>>>> DRVALONE: Integer (0.0)
>>>> CARPOOL: Integer (0.0)
>>>> PUBTRANS: Integer (0.0)
>>>> EMPLOYED: Real (0.0)
>>>> UNEMPLOY: Integer (0.0)
>>>> SERVICE: Integer (0.0)
>>>> MANUAL: Integer (0.0)
>>>> P_MALE: Real (0.0)
>>>> P_FEMALE: Real (0.0)
>>>> SAMP_POP: Integer (0.0)
>>>> bbox: RealList (0.0)
>>>>
>>>> -Jukka Rahkonen-
>>>>
>>>> _______________________________________________
>>>> gdal-dev mailing list
>>>> gdal-dev at lists.osgeo.org
>>>> http://lists.osgeo.org/mailman/listinfo/gdal-dev
>>> --
>>> Spatialys - Geospatial professional services http://www.spatialys.com


-- 
Regards

Andreas Oxenstierna
T-Kartan Produkt AB
mobile: +46 733 206831
mailto: ao at t-kartor.se
http://www.t-kartor.com


