[gdal-dev] Streaming Parser for OGR GeoJSON Driver
Even Rouault
even.rouault at spatialys.com
Wed Jan 27 02:57:15 PST 2016
Hi,
>
> I’m curious if anyone has ideas or advice on how to use a streaming parser
> in the OGR GeoJSON driver.
>
A streaming parser, or at least something that does not require ingesting a
full GeoJSON file into memory, is indeed something that would solve the issues
people run into with the current driver on big files (let's say several
hundred megabytes or more).
>
> My use-case is that I need to convert arbitrarily-sized streams of GeoJSON
> into other formats (e.g. CSV, shapefile, KML, etc.).
>
>
> My current strategy is to first partition the GeoJSON into a VRT file and
> then call OGR. This works for arbitrarily-sized streams, but it’s
> inefficient because the process is blocked until the entire VRT is ready.
> You can see my implementation here: https://github.com/koopjs/GeoXForm.
I'm not a JS programmer, but I nevertheless tried to understand
https://github.com/koopjs/GeoXForm/blob/master/src/lib/vrt.js, and you seem to
group GeoJSON Feature objects in batches of 5000 (*), put each batch in a
temporary .json file, and assemble all the JSON files in a VRT that looks like
the following, right?
<OGRVRTDataSource>
    <OGRVRTLayer name="OGRGeoJSON">
        <SrcDataSource>tmp1.json</SrcDataSource>
    </OGRVRTLayer>
    <OGRVRTLayer name="OGRGeoJSON">
        <SrcDataSource>tmp2.json</SrcDataSource>
    </OGRVRTLayer>
</OGRVRTDataSource>
Just wanted to warn that creating layers of the same name is more or less
undefined behaviour, and the way ogr2ogr handles it is also unspecified.
You're quite lucky this works. Actually, from what I see, it will use only the
layer definition of the first tmp1.json and ignore any additional fields
present in the following files.
A cleaner solution would be to use an <OGRVRTUnionLayer> to wrap all the
<OGRVRTLayer> elements (see http://www.gdal.org/drv_vrt.html), but this would
perhaps perform badly due to the first pass that is needed to establish the
union'ed layer definition from the individual sources.
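For illustration, the union'ed variant of the VRT above would look something
like this (the sub-layers keep the OGRGeoJSON name since that is the layer
name the GeoJSON driver exposes for each .json file):

<OGRVRTDataSource>
    <OGRVRTUnionLayer name="OGRGeoJSON">
        <OGRVRTLayer name="OGRGeoJSON">
            <SrcDataSource>tmp1.json</SrcDataSource>
        </OGRVRTLayer>
        <OGRVRTLayer name="OGRGeoJSON">
            <SrcDataSource>tmp2.json</SrcDataSource>
        </OGRVRTLayer>
    </OGRVRTUnionLayer>
</OGRVRTDataSource>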
>
>
> I noticed that there exists at least one C library for parsing JSON streams:
> https://github.com/lloyd/yajl, but I do not know enough C++ (or C for that
> matter) to integrate it.
>
>
> Has anyone considered this approach before? Any advice on how to implement
> it?
One tricky point is establishing the layer definition (i.e. identifying the
fields/properties). Currently the driver does a first pass to build the schema
by examining the properties of each Feature object and unioning them, and then
a second pass to build the OGRFeature objects.
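For reference, once the first pass has collected the property names (and, in
real code, inferred their types), building the layer definition on the OGR
side is mechanical. A minimal sketch, with field types hardcoded to OFTString
for brevity (BuildLayerDefn is a hypothetical helper of mine, not actual
driver code):

#include "ogr_feature.h"
#include <set>
#include <string>

/* Minimal sketch: turn the set of property names collected during the first
 * pass into an OGR layer definition. The real driver would also infer field
 * types (integer, real, string, ...) while scanning. */
static OGRFeatureDefn* BuildLayerDefn(const std::set<std::string>& fieldNames)
{
    OGRFeatureDefn* poDefn = new OGRFeatureDefn("OGRGeoJSON");
    poDefn->Reference();
    for (const std::string& osName : fieldNames)
    {
        OGRFieldDefn oField(osName.c_str(), OFTString);
        poDefn->AddFieldDefn(&oField);
    }
    return poDefn;
}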
With a JSON streaming parsing library, when operating on a file that can be
seeked arbitrarily, a similar strategy could be applied. From the point of
view of the user, nothing would change, except that there would no longer be
any limit on the size of the files that can be processed.
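To give an idea, here is a standalone sketch of what such a first pass could
look like with yajl (assuming yajl 2.x; this is not driver code, and it
ignores corner cases such as a nested "properties" key or a "properties"
value that is not an object):

#include <yajl/yajl_parse.h>
#include <cstdio>
#include <set>
#include <string>

struct SchemaCtx
{
    int depth = 0;                  /* current map nesting depth */
    int propertiesDepth = -1;       /* depth of the current "properties" map */
    bool pendingProperties = false; /* saw the "properties" key, awaiting map */
    std::set<std::string> fieldNames;
};

static int StartMap(void* ctx)
{
    SchemaCtx* s = static_cast<SchemaCtx*>(ctx);
    s->depth++;
    if (s->pendingProperties)
    {
        s->propertiesDepth = s->depth;
        s->pendingProperties = false;
    }
    return 1; /* non-zero tells yajl to continue parsing */
}

static int EndMap(void* ctx)
{
    SchemaCtx* s = static_cast<SchemaCtx*>(ctx);
    if (s->depth == s->propertiesDepth)
        s->propertiesDepth = -1;
    s->depth--;
    return 1;
}

static int MapKey(void* ctx, const unsigned char* key, size_t len)
{
    SchemaCtx* s = static_cast<SchemaCtx*>(ctx);
    std::string osKey(reinterpret_cast<const char*>(key), len);
    if (s->depth == s->propertiesDepth)
        s->fieldNames.insert(osKey); /* a property name: union it in */
    else if (osKey == "properties")
        s->pendingProperties = true; /* next map holds the properties */
    return 1;
}

int main(int argc, char** argv)
{
    if (argc != 2) return 1;
    SchemaCtx ctx;
    yajl_callbacks cb = {}; /* unused callbacks stay NULL */
    cb.yajl_start_map = StartMap;
    cb.yajl_end_map = EndMap;
    cb.yajl_map_key = MapKey;

    yajl_handle h = yajl_alloc(&cb, nullptr, &ctx);
    FILE* f = fopen(argv[1], "rb");
    if (!f) return 1;
    unsigned char buf[65536];
    size_t nRead;
    while ((nRead = fread(buf, 1, sizeof(buf), f)) > 0)
    {
        if (yajl_parse(h, buf, nRead) != yajl_status_ok)
            break;
    }
    yajl_complete_parse(h);
    yajl_free(h);
    fclose(f);

    for (const std::string& osName : ctx.fieldNames)
        printf("field: %s\n", osName.c_str());
    return 0;
}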
But when operating on an input stream that you cannot rewind, this two-pass
strategy becomes a problem. A potential solution would be to buffer, let's
say, the first megabyte of features and build the layer definition from it,
assuming that subsequent features follow the same schema (and, if not, ignore
their extra attributes); see the sketch below. Or introduce the concept of a
non-fixed schema in OGR (i.e. the schema would evolve as you iterate over the
features), but this would have broader implications.
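Roughly, the buffering idea could look like this (again only a sketch;
BuildSchemaFromBuffer() and ParseFeatures() are hypothetical stand-ins for
the two yajl-driven passes above):

#include <cstdio>
#include <vector>

/* Hypothetical stand-ins for the yajl-driven passes sketched above. */
static void BuildSchemaFromBuffer(const unsigned char* /*data*/, size_t /*len*/)
{
    /* ... run the schema-collecting callbacks over the buffered bytes ... */
}

static void ParseFeatures(const unsigned char* /*data*/, size_t /*len*/)
{
    /* ... translate parser events into OGRFeature objects ... */
}

static void ProcessStream(FILE* stream)
{
    /* 1) Buffer the first megabyte (or less, if the stream is shorter). */
    std::vector<unsigned char> prefix(1024 * 1024);
    prefix.resize(fread(prefix.data(), 1, prefix.size(), stream));

    /* 2) Build the layer definition from the buffered features only;
     *    later features are assumed to follow the same schema. */
    BuildSchemaFromBuffer(prefix.data(), prefix.size());

    /* 3) Replay the buffered prefix through the feature parser, then
     *    continue with the rest of the stream, which is never rewound. */
    ParseFeatures(prefix.data(), prefix.size());
    unsigned char buf[65536];
    size_t nRead;
    while ((nRead = fread(buf, 1, sizeof(buf), stream)) > 0)
        ParseFeatures(buf, nRead);
}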
Even
(*) It looks like you manage to separate JSON Feature objects by just doing a
string split on the ',{' pattern? That looks extremely fragile to extra
whitespace, or to complex properties inside a Feature object, like
{ "type": "Feature", "properties": { "prop": [ {"foo":"bar"},{"bar":"baz"} ] }, "geometry": null }
--
Spatialys - Geospatial professional services
http://www.spatialys.com