[gdal-dev] Streaming Parser for OGR GeoJSON Driver

Daniel Fenton dmfenton at gmail.com
Thu Jan 28 14:16:12 PST 2016


Thanks for your reply, Even. Very helpful!

> I'm not a JS programmer, nevertheless I tried to understand
> https://github.com/koopjs/GeoXForm/blob/master/src/lib/vrt.js, and you seem
> to group GeoJSON Feature objects by batch of 5000 (*), put them in a temp
> .json file, and assemble all the JSON files in a VRT that looks like the
> following, right?

That's correct.

> Just wanted to warn that creating layers of the same name is more or less
> undefined behaviour and the way ogr2ogr will handle that is also
> unspecified. You're quite lucky this works. Actually from what I see it
> will use only the layer definition of the first tmp1.json and ignore any
> potential additional fields of the following files.

Perhaps it works because I am only appending to a single layer that has the
same schema throughout?

My full command looks like:
`--config SHAPE_ENCODING UTF-8 -f "ESRI Shapefile" ./dummy layer.vrt -nlt POINT -fieldmap identity -append -lco ENCODING=UTF-8`

> A cleaner solution would be to use a <OGRVRTUnionLayer> to wrap all the
> <OGRVRTLayer> (see http://www.gdal.org/drv_vrt.html), but this would
> perhaps have bad performance due to a first pass being done to establish
> the union'ed layer definition from the individual sources.

I did test this, and the performance was indeed slower. Is there a way for
me to specify the schema beforehand and avoid a full first pass?

Perhaps for a driver implementation with a streaming parser, I could write
a VRT beforehand, then pipe in GeoJSON that matches that schema?
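
To make the question concrete, here is a minimal Python sketch (not the GeoXForm code) that writes a VRT wrapping the temp files in a single <OGRVRTUnionLayer>. It uses the <FieldStrategy> element with the value FirstLayer that the OGR VRT documentation describes. Whether that actually avoids the full first pass is untested here and is an assumption. The function name, the part1/part2 layer names and the "union" layer name are hypothetical; "OGRGeoJSON" is the source layer name from the VRT quoted in this thread.

```python
import xml.etree.ElementTree as ET


def build_union_vrt(json_paths, union_name="union", src_layer="OGRGeoJSON"):
    """Return VRT XML that wraps each temp file in one OGRVRTUnionLayer."""
    root = ET.Element("OGRVRTDataSource")
    union = ET.SubElement(root, "OGRVRTUnionLayer", name=union_name)
    # Ask for the schema of the first source layer rather than the union
    # of all sources (assumption: this is cheaper than the default).
    ET.SubElement(union, "FieldStrategy").text = "FirstLayer"
    for i, path in enumerate(json_paths, start=1):
        # Give each wrapped layer a unique name and point it at the
        # layer inside the GeoJSON file through SrcLayer.
        layer = ET.SubElement(union, "OGRVRTLayer", name=f"part{i}")
        ET.SubElement(layer, "SrcDataSource").text = path
        ET.SubElement(layer, "SrcLayer").text = src_layer
    return ET.tostring(root, encoding="unicode")


vrt_xml = build_union_vrt(["tmp1.json", "tmp2.json"])
print(vrt_xml)
```

Giving each wrapped layer its own name also sidesteps the duplicate-layer-name problem mentioned above.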

> A potential solution would be to buffer let's say the first MB of
> features and build the layer definition from it

That's pretty close to what I'm doing with this block:
https://github.com/koopjs/GeoXForm/blob/master/src/lib/vrt.js#L56-L65

I'm sampling the first batch of GeoJSON and using it to build up
parameters with this function:
https://github.com/koopjs/GeoXForm/blob/master/src/lib/ogr-cmd.js
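
As an illustration of the sampling idea (a Python sketch, not what ogr-cmd.js does), the field list can be built by unioning the property names seen in the first batch and widening the type when values disagree. The function names and the three type names (Integer, Real, String) are assumptions chosen for the example.

```python
def ogr_type(value):
    """Map a JSON value to a coarse field type name (illustrative only)."""
    if isinstance(value, bool):
        return "Integer"
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Real"
    return "String"


def infer_fields(features):
    """Union the property names of the sampled features, widening types."""
    fields = {}
    for feature in features:
        props = feature.get("properties") or {}
        for name, value in props.items():
            if value is None:
                fields.setdefault(name, None)  # type unknown so far
                continue
            new = ogr_type(value)
            old = fields.get(name)
            if old is None or old == new:
                fields[name] = new
            elif {old, new} == {"Integer", "Real"}:
                fields[name] = "Real"  # widen Integer to Real
            else:
                fields[name] = "String"  # fall back on any other conflict
    # Fields that were only ever null default to String.
    return {name: kind or "String" for name, kind in fields.items()}


sample = [
    {"type": "Feature", "properties": {"id": 1, "name": "a"}, "geometry": None},
    {"type": "Feature", "properties": {"id": 2.5, "height": None}, "geometry": None},
]
schema = infer_fields(sample)
print(schema)
```

Any property that first appears after the sampled batch is not in the schema, which matches the "ignore the extra attributes" behaviour discussed in this thread.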

> (*) It looks like you manage to separate JSON Feature objects with just
> string splitting on ',{' pattern?

It is indeed fragile. It works well for my use-case (and is faster than
parsing) because I'm creating all the GeoJSON myself. But it doesn't
generalize to input from other producers, so I've replaced it with a true
streaming parser. Thanks for the feedback.
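
For anyone reading along, here is a minimal Python sketch of pulling Feature objects out of a chunked stream without splitting on ',{'. It is not the parser GeoXForm uses and it is not event-based like yajl: it decodes incrementally with the standard library's json.JSONDecoder.raw_decode. It assumes the first occurrence of "features" in the stream is the FeatureCollection key. The function name iter_features is hypothetical.

```python
import json


def iter_features(chunks):
    """Yield Feature dicts from an iterable of text chunks of a FeatureCollection."""
    decoder = json.JSONDecoder()
    buffer = ""
    started = False
    for chunk in chunks:
        buffer += chunk
        if not started:
            # Skip everything up to the opening '[' of the features array.
            marker = buffer.find('"features"')
            if marker == -1:
                continue
            bracket = buffer.find("[", marker)
            if bracket == -1:
                continue
            buffer = buffer[bracket + 1:]
            started = True
        while True:
            buffer = buffer.lstrip(" \t\r\n,")
            if not buffer or buffer[0] != "{":
                break  # need more input, or reached the closing ']'
            try:
                feature, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # object is incomplete: wait for the next chunk
            yield feature
            buffer = buffer[end:]


# Even's example, which contains '},{' inside a property value.
text = (
    '{"type": "FeatureCollection", "features": ['
    '{ "type": "Feature", "properties": { "prop": [ {"foo":"bar"},{"bar":"baz"} ] },'
    ' "geometry": null },'
    '{ "type": "Feature", "properties": {}, "geometry": null }]}'
)
chunks = [text[i:i + 7] for i in range(0, len(text), 7)]
features = list(iter_features(chunks))
print(len(features))
```

Retrying the decode on every chunk is wasteful for very large features, so this only demonstrates the correctness point, not the performance one.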




On Wed, Jan 27, 2016 at 5:57 AM Even Rouault <even.rouault at spatialys.com>
wrote:

> Hi,
>
> >
> > I’m curious if anyone has ideas or advice on how to use a streaming
> > parser in the OGR GeoJSON driver.
> >
>
> A streaming parser, or at least something not requiring full ingestion in
> memory of a GeoJSON file, is something that would indeed solve issues that
> people run into with the current driver on big files (let's say several
> hundreds of megabytes or more).
>
> >
> > My use-case is that I need to convert arbitrarily-sized streams of
> > GeoJSON into other formats (e.g. CSV, shapefile, KML, etc).
> >
> >
> > My current strategy is to first partition the GeoJSON into a VRT file and
> > then call OGR. This works for arbitrarily sized streams, but it’s
> > inefficient because the process is blocked until the entire VRT is ready.
> > You can see my implementation here: https://github.com/koopjs/GeoXForm.
>
> I'm not a JS programmer, nevertheless I tried to understand
> https://github.com/koopjs/GeoXForm/blob/master/src/lib/vrt.js, and you
> seem to group GeoJSON Feature objects by batch of 5000 (*), put them in a
> temp .json file, and assemble all the JSON files in a VRT that looks like
> the following, right?
>
> <OGRVRTDataSource>
>     <OGRVRTLayer name="OGRGeoJSON">
>         <SrcDataSource>tmp1.json</SrcDataSource>
>     </OGRVRTLayer>
>     <OGRVRTLayer name="OGRGeoJSON">
>         <SrcDataSource>tmp2.json</SrcDataSource>
>     </OGRVRTLayer>
> </OGRVRTDataSource>
>
> Just wanted to warn that creating layers of the same name is more or less
> undefined behaviour and the way ogr2ogr will handle that is also
> unspecified. You're quite lucky this works. Actually from what I see it
> will use only the layer definition of the first tmp1.json and ignore any
> potential additional fields of the following files.
>
> A cleaner solution would be to use a <OGRVRTUnionLayer> to wrap all the
> <OGRVRTLayer> (see http://www.gdal.org/drv_vrt.html), but this would
> perhaps have bad performance due to a first pass being done to establish
> the union'ed layer definition from the individual sources.
>
> >
> >
> > I noticed that there exists at least one C library for parsing JSON
> > streams: https://github.com/lloyd/yajl, but I do not know enough C++ (or
> > C for that matter) to integrate it.
> >
> >
> > Has anyone considered this approach before? Any advice on how to
> > implement it?
>
> One tricky point is to establish the layer definition (i.e. identifying the
> fields/properties). Currently the driver does a first pass to build the
> schema by examining the properties of each Feature object and unioning
> them, and then a second one to build the OGRFeature objects.
>
> With a JSON streaming parsing library, when operating on a file in which
> you can seek arbitrarily, a similar strategy could be applied. From the
> point of view of the user, nothing would change except that there would no
> longer be any limit to the size of the files that can be processed.
> But when operating on an input stream that you cannot rewind, this two-pass
> strategy becomes a problem. A potential solution would be to buffer, let's
> say, the first MB of features and build the layer definition from it,
> assuming that the next features will follow the same schema (and if not,
> ignore the extra attributes). Or introduce the concept of a non-fixed schema
> (i.e. the schema would evolve as you iterate over the features) in OGR, but
> this would have broader implications.
>
> Even
>
> (*) It looks like you manage to separate JSON Feature objects with just
> string splitting on ',{' pattern? That looks extremely fragile to
> additional space characters, or complex properties inside a Feature
> object, like
>
> { "type": "Feature", "properties": { "prop": [ {"foo":"bar"},{"bar":"baz"} ] },
>   "geometry": null }
>
>
> --
> Spatialys - Geospatial professional services
> http://www.spatialys.com
>