[gdal-dev] Streaming Parser for OGR GeoJSON Driver
Even Rouault
even.rouault at spatialys.com
Wed Jan 27 02:57:15 PST 2016
Hi,
>
> I’m curious if anyone has ideas or advice on how to use a streaming parser
> in the OGR GeoJSON driver.
>
A streaming parser, or at least something that does not require ingesting a
full GeoJSON file into memory, is indeed something that would solve the issues
people run into with the current driver on big files (let's say several
hundred megabytes or more).
>
> My use-case is that I need to convert arbitrarily-sized streams of GeoJSON
> into other formats (e.g. CSV, shapefile, KML, etc.).
>
>
> My current strategy is to first partition the GeoJSON into a VRT file and
> then call OGR. This works for arbitrarily-sized streams, but it’s
> inefficient because the process is blocked until the entire VRT is ready.
> You can see my implementation here: https://github.com/koopjs/GeoXForm.
I'm not a JS programmer, but I nevertheless tried to understand
https://github.com/koopjs/GeoXForm/blob/master/src/lib/vrt.js, and you seem to
group GeoJSON Feature objects in batches of 5000 (*), put each batch in a
temporary .json file, and assemble all the JSON files in a VRT that looks like
the following, right?
<OGRVRTDataSource>
    <OGRVRTLayer name="OGRGeoJSON">
        <SrcDataSource>tmp1.json</SrcDataSource>
    </OGRVRTLayer>
    <OGRVRTLayer name="OGRGeoJSON">
        <SrcDataSource>tmp2.json</SrcDataSource>
    </OGRVRTLayer>
</OGRVRTDataSource>
Just wanted to warn that creating layers of the same name is more or less
undefined behaviour, and the way ogr2ogr handles it is also unspecified.
You're quite lucky this works. Actually, from what I see, it will use only the
layer definition of the first tmp1.json and ignore any additional fields
present in the following files.
A cleaner solution would be to use an <OGRVRTUnionLayer> to wrap all the
<OGRVRTLayer> elements (see http://www.gdal.org/drv_vrt.html), but this would
perhaps perform badly due to the first pass that is needed to establish the
union'ed layer definition from the individual sources.
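For illustration, the union'ed variant of the VRT above would look something
like this (the sub-layers keep the OGRGeoJSON name since that is the layer
name the GeoJSON driver exposes for each .json file):

<OGRVRTDataSource>
    <OGRVRTUnionLayer name="OGRGeoJSON">
        <OGRVRTLayer name="OGRGeoJSON">
            <SrcDataSource>tmp1.json</SrcDataSource>
        </OGRVRTLayer>
        <OGRVRTLayer name="OGRGeoJSON">
            <SrcDataSource>tmp2.json</SrcDataSource>
        </OGRVRTLayer>
    </OGRVRTUnionLayer>
</OGRVRTDataSource>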
>
>
> I noticed that there exists at least one C library for parsing JSON streams:
> https://github.com/lloyd/yajl, but I do not know enough C++ (or C for that
> matter) to integrate it.
>
>
> Has anyone considered this approach before? Any advice on how to implement
> it?
One tricky point is establishing the layer definition (i.e. identifying the
fields/properties). Currently the driver does a first pass to build the schema
by examining the properties of each Feature object and unioning them, and then
a second pass to build the OGRFeature objects.
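For reference, once the first pass has collected the property names (and, in
real code, inferred their types), building the layer definition on the OGR
side is mechanical. A minimal sketch, with field types hardcoded to OFTString
for brevity (BuildLayerDefn is a hypothetical helper of mine, not actual
driver code):

#include "ogr_feature.h"
#include <set>
#include <string>

/* Minimal sketch: turn the set of property names collected during the first
 * pass into an OGR layer definition. The real driver would also infer field
 * types (integer, real, string, ...) while scanning. */
static OGRFeatureDefn* BuildLayerDefn(const std::set<std::string>& fieldNames)
{
    OGRFeatureDefn* poDefn = new OGRFeatureDefn("OGRGeoJSON");
    poDefn->Reference();
    for (const std::string& osName : fieldNames)
    {
        OGRFieldDefn oField(osName.c_str(), OFTString);
        poDefn->AddFieldDefn(&oField);
    }
    return poDefn;
}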
With a JSON streaming parsing library, when operating on a file that can be
seeked arbitrarily, a similar strategy could be applied. From the point of
view of the user, nothing would change, except that there would no longer be
any limit on the size of the files that can be processed.
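To give an idea, here is a standalone sketch of what such a first pass could
look like with yajl (assuming yajl 2.x; this is not driver code, and it
ignores corner cases such as a nested "properties" key or a "properties"
value that is not an object):

#include <yajl/yajl_parse.h>
#include <cstdio>
#include <set>
#include <string>

struct SchemaCtx
{
    int depth = 0;                  /* current map nesting depth */
    int propertiesDepth = -1;       /* depth of the current "properties" map */
    bool pendingProperties = false; /* saw the "properties" key, awaiting map */
    std::set<std::string> fieldNames;
};

static int StartMap(void* ctx)
{
    SchemaCtx* s = static_cast<SchemaCtx*>(ctx);
    s->depth++;
    if (s->pendingProperties)
    {
        s->propertiesDepth = s->depth;
        s->pendingProperties = false;
    }
    return 1; /* non-zero tells yajl to continue parsing */
}

static int EndMap(void* ctx)
{
    SchemaCtx* s = static_cast<SchemaCtx*>(ctx);
    if (s->depth == s->propertiesDepth)
        s->propertiesDepth = -1;
    s->depth--;
    return 1;
}

static int MapKey(void* ctx, const unsigned char* key, size_t len)
{
    SchemaCtx* s = static_cast<SchemaCtx*>(ctx);
    std::string osKey(reinterpret_cast<const char*>(key), len);
    if (s->depth == s->propertiesDepth)
        s->fieldNames.insert(osKey); /* a property name: union it in */
    else if (osKey == "properties")
        s->pendingProperties = true; /* next map holds the properties */
    return 1;
}

int main(int argc, char** argv)
{
    if (argc != 2) return 1;
    SchemaCtx ctx;
    yajl_callbacks cb = {}; /* unused callbacks stay NULL */
    cb.yajl_start_map = StartMap;
    cb.yajl_end_map = EndMap;
    cb.yajl_map_key = MapKey;

    yajl_handle h = yajl_alloc(&cb, nullptr, &ctx);
    FILE* f = fopen(argv[1], "rb");
    if (!f) return 1;
    unsigned char buf[65536];
    size_t nRead;
    while ((nRead = fread(buf, 1, sizeof(buf), f)) > 0)
    {
        if (yajl_parse(h, buf, nRead) != yajl_status_ok)
            break;
    }
    yajl_complete_parse(h);
    yajl_free(h);
    fclose(f);

    for (const std::string& osName : ctx.fieldNames)
        printf("field: %s\n", osName.c_str());
    return 0;
}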
But when operating on an input stream that you cannot rewind, this two-pass
strategy becomes a problem. A potential solution would be to buffer, let's
say, the first megabyte of features and build the layer definition from it,
assuming that subsequent features follow the same schema (and, if not, ignore
their extra attributes); see the sketch below. Or introduce the concept of a
non-fixed schema in OGR (i.e. the schema would evolve as you iterate over the
features), but this would have broader implications.
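Roughly, the buffering idea could look like this (again only a sketch;
BuildSchemaFromBuffer() and ParseFeatures() are hypothetical stand-ins for
the two yajl-driven passes above):

#include <cstdio>
#include <vector>

/* Hypothetical stand-ins for the yajl-driven passes sketched above. */
static void BuildSchemaFromBuffer(const unsigned char* /*data*/, size_t /*len*/)
{
    /* ... run the schema-collecting callbacks over the buffered bytes ... */
}

static void ParseFeatures(const unsigned char* /*data*/, size_t /*len*/)
{
    /* ... translate parser events into OGRFeature objects ... */
}

static void ProcessStream(FILE* stream)
{
    /* 1) Buffer the first megabyte (or less, if the stream is shorter). */
    std::vector<unsigned char> prefix(1024 * 1024);
    prefix.resize(fread(prefix.data(), 1, prefix.size(), stream));

    /* 2) Build the layer definition from the buffered features only;
     *    later features are assumed to follow the same schema. */
    BuildSchemaFromBuffer(prefix.data(), prefix.size());

    /* 3) Replay the buffered prefix through the feature parser, then
     *    continue with the rest of the stream, which is never rewound. */
    ParseFeatures(prefix.data(), prefix.size());
    unsigned char buf[65536];
    size_t nRead;
    while ((nRead = fread(buf, 1, sizeof(buf), stream)) > 0)
        ParseFeatures(buf, nRead);
}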
Even
(*) It looks like you manage to separate JSON Feature objects by just doing a
string split on the ',{' pattern? That looks extremely fragile to extra
whitespace, or to complex properties inside a Feature object, like
{ "type": "Feature", "properties": { "prop": [ {"foo":"bar"},{"bar":"baz"} ] }, "geometry": null }
--
Spatialys - Geospatial professional services
http://www.spatialys.com