<div dir="ltr"><div>Thanks for your reply, Even. Very helpful!</div><div><br></div>>I'm not a JS programmer, nevertheless I tried to understand<br><a href="https://github.com/koopjs/GeoXForm/blob/master/src/lib/vrt.js" rel="noreferrer" target="_blank">https://github.com/koopjs/GeoXForm/blob/master/src/lib/vrt.js</a>, and you seem to<br>group GeoJSON Feature objects in batches of 5000 (*), put them in a temp .json<br>file, and assemble all the JSON files in a VRT that looks like the following,<br>right?<div><br>That's correct.</div><div><br></div><div>> Just wanted to warn that creating layers of the same name is more or less<br>undefined behaviour and the way ogr2ogr will handle that is also unspecified.<br>You're quite lucky this works. Actually from what I see it will use only the<br>layer definition of the first tmp1.json and ignore any potential additional<br>fields of the following files.<br></div><div><br></div><div>Perhaps it works because I am only appending to a single layer that has the same schema throughout?</div><div><br></div><div>My full command looks like:</div><div><span style="color:rgb(24,54,145);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:16.8px;white-space:pre">`--config SHAPE_ENCODING UTF-8 -f "ESRI Shapefile" ./dummy layer.vrt -nlt POINT -fieldmap identity -append -lco ENCODING=UTF-8</span><span class="pl-pds" style="color:rgb(24,54,145);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:16.8px;white-space:pre">'`</span><br></div><div><span class="pl-pds" style="color:rgb(24,54,145);font-family:Consolas,'Liberation Mono',Menlo,Courier,monospace;font-size:12px;line-height:16.8px;white-space:pre"><br></span></div>> A cleaner solution would be to use a <OGRVRTUnionLayer> to wrap all the<br><OGRVRTLayer> (see <a href="http://www.gdal.org/drv_vrt.html" rel="noreferrer" target="_blank">http://www.gdal.org/drv_vrt.html</a>), but this would perhaps<br>have bad 
performance due to a first pass being done to establish the unioned<br>layer definition from the individual sources.<div><br></div><div>I did test this, and the performance was indeed slower. Is there a way for me to specify the schema beforehand and avoid a full first pass? </div><div><br></div><div>Perhaps for a driver implementation with a streaming parser, I could write a VRT beforehand, then pipe in GeoJSON that matches that schema?</div><div><br></div><div>> potential solution would be to buffer, let's say,<br>the first MB of features and build the layer definition from it</div><div><br></div><div>That's pretty close to what I'm doing with this block: <a href="https://github.com/koopjs/GeoXForm/blob/master/src/lib/vrt.js#L56-L65">https://github.com/koopjs/GeoXForm/blob/master/src/lib/vrt.js#L56-L65</a></div><div><br></div><div>I'm sampling the first batch of GeoJSON and am using it to build up parameters with this function: <a href="https://github.com/koopjs/GeoXForm/blob/master/src/lib/ogr-cmd.js">https://github.com/koopjs/GeoXForm/blob/master/src/lib/ogr-cmd.js</a>.</div><div><br></div><div>> (*) It looks like you manage to separate JSON Feature objects with just string<br>splitting on the ',{' pattern? </div><div><br></div><div>It is indeed fragile. It works well for my use case (and is faster than parsing) because I'm creating all the GeoJSON. But it doesn't extend well to others, so I've replaced it with a true streaming parser. Thanks for the feedback.</div><div><br></div><div><br><div><br><br><div class="gmail_quote"><div dir="ltr">On Wed, Jan 27, 2016 at 5:57 AM Even Rouault <<a href="mailto:even.rouault@spatialys.com">even.rouault@spatialys.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
<br>
><br>
> I’m curious if anyone has ideas or advice on how to use a streaming parser<br>
> in the OGR GeoJSON driver.<br>
><br>
<br>
A streaming parser, or at least something that does not require ingesting a<br>
whole GeoJSON file into memory, would indeed solve the issues that people<br>
run into with the current driver on big files (let's say several<br>
hundred megabytes or more).<br>
<br>
><br>
> My use case is that I need to convert arbitrarily sized streams of GeoJSON<br>
> into other formats (e.g. CSV, Shapefile, KML, etc.).<br>
><br>
><br>
> My current strategy is to first partition the GeoJSON into a VRT file and<br>
> then call OGR. This works for arbitrarily sized streams, but it’s<br>
> inefficient because the process is blocked until the entire VRT is ready.<br>
> You can see my implementation here: <a href="https://github.com/koopjs/GeoXForm" rel="noreferrer" target="_blank">https://github.com/koopjs/GeoXForm</a>.<br>
<br>
I'm not a JS programmer, nevertheless I tried to understand<br>
<a href="https://github.com/koopjs/GeoXForm/blob/master/src/lib/vrt.js" rel="noreferrer" target="_blank">https://github.com/koopjs/GeoXForm/blob/master/src/lib/vrt.js</a>, and you seem to<br>
group GeoJSON Feature objects in batches of 5000 (*), put them in a temp .json<br>
file, and assemble all the JSON files in a VRT that looks like the following,<br>
right?<br>
<br>
<OGRVRTDataSource><br>
<OGRVRTLayer name="OGRGeoJSON"><br>
<SrcDataSource>tmp1.json</SrcDataSource><br>
</OGRVRTLayer><br>
<OGRVRTLayer name="OGRGeoJSON"><br>
<SrcDataSource>tmp2.json</SrcDataSource><br>
</OGRVRTLayer><br>
</OGRVRTDataSource><br>
<br>
Just wanted to warn that creating layers of the same name is more or less<br>
undefined behaviour and the way ogr2ogr will handle that is also unspecified.<br>
You're quite lucky this works. Actually from what I see it will use only the<br>
layer definition of the first tmp1.json and ignore any potential additional<br>
fields of the following files.<br>
<br>
A cleaner solution would be to use a <OGRVRTUnionLayer> to wrap all the<br>
<OGRVRTLayer> (see <a href="http://www.gdal.org/drv_vrt.html" rel="noreferrer" target="_blank">http://www.gdal.org/drv_vrt.html</a>), but this would perhaps<br>
have bad performance due to a first pass being done to establish the unioned<br>
layer definition from the individual sources.<br>
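<br>
A minimal sketch of such a VRT, generated here with Python's standard library.<br>
The file names, the "partN" layer names and the "unioned" layer name are<br>
placeholders, and "OGRGeoJSON" is assumed to be the layer name the GeoJSON<br>
driver exposes for each temp file. The FieldStrategy element set to FirstLayer<br>
is described in the VRT driver documentation as taking the fields from the<br>
first source layer; whether it removes the cost of the first pass in practice<br>
is an assumption that should be measured.<br>
<pre>
```python
import xml.etree.ElementTree as ET


def build_union_vrt(json_paths, union_name="unioned", src_layer="OGRGeoJSON",
                    field_strategy="FirstLayer"):
    """Return VRT XML text: one OGRVRTLayer per file, wrapped in a union layer."""
    root = ET.Element("OGRVRTDataSource")
    union = ET.SubElement(root, "OGRVRTUnionLayer", name=union_name)
    for index, path in enumerate(json_paths, start=1):
        # Each source layer gets a distinct (hypothetical) name, which avoids
        # declaring several layers that share one name.
        layer = ET.SubElement(union, "OGRVRTLayer", name=f"part{index}")
        ET.SubElement(layer, "SrcDataSource").text = path
        ET.SubElement(layer, "SrcLayer").text = src_layer
    # Assumption: FirstLayer builds the schema from the first source only.
    ET.SubElement(union, "FieldStrategy").text = field_strategy
    return ET.tostring(root, encoding="unicode")


vrt_text = build_union_vrt(["tmp1.json", "tmp2.json"])
```
</pre>
The resulting text can be written to a .vrt file and passed to ogr2ogr as the<br>
source dataset.<br>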
<br>
><br>
><br>
> I noticed that there exists at least one C library for parsing JSON streams:<br>
> <a href="https://github.com/lloyd/yajl" rel="noreferrer" target="_blank">https://github.com/lloyd/yajl</a>, but I do not know enough C++ (or C for that<br>
> matter) to integrate it.<br>
><br>
><br>
> Has anyone considered this approach before? Any advice on how to implement<br>
> it?<br>
<br>
One tricky point is to establish the layer definition (i.e. identifying the<br>
fields/properties). Currently the driver does a first pass to build the schema<br>
by examining the properties of each Feature object and unioning them, and then<br>
a second one to build the OGRFeature objects.<br>
<br>
With a JSON streaming parsing library, when operating on a file in which you<br>
can seek arbitrarily, a similar strategy could be applied. From the point of<br>
view of the user, nothing would change except that there would no longer be<br>
any limit to the size of the files that can be processed.<br>
But when operating on an input stream that you cannot rewind, this two-pass<br>
strategy becomes a problem. A potential solution would be to buffer, let's say,<br>
the first MB of features and build the layer definition from it, assuming that<br>
the next features will follow the same schema (and if not, ignore the extra<br>
attributes). Or introduce the concept of a non-fixed schema (i.e. the schema would<br>
evolve as you iterate over the features) in OGR, but this would have broader<br>
implications.<br>
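<br>
The buffering idea can be sketched in Python as follows. This is only an<br>
illustration under stated assumptions: it buffers a fixed number of features<br>
rather than the first MB, and the function name and the sample_size parameter<br>
are made up for the example. It is not how the OGR driver works.<br>
<pre>
```python
from itertools import chain, islice


def infer_schema_then_stream(features, sample_size=5000):
    """Return (field_names, iterator); the schema comes from the first features only."""
    features = iter(features)
    # Buffer the head of the stream; the rest is never held in memory at once.
    sample = list(islice(features, sample_size))
    fields = []
    for feature in sample:
        for name in (feature.get("properties") or {}):
            if name not in fields:
                fields.append(name)

    def conform(feature):
        props = feature.get("properties") or {}
        # Missing attributes become None; attributes outside the schema are dropped.
        return dict(feature, properties={name: props.get(name) for name in fields})

    # Replay the buffered features first, then continue with the live stream.
    return fields, (conform(f) for f in chain(sample, features))
```
</pre>
Features that arrive after the sample and carry unknown attributes lose those<br>
attributes, which is the trade-off described above.<br>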
<br>
Even<br>
<br>
(*) It looks like you manage to separate JSON Feature objects with just string<br>
splitting on the ',{' pattern? That looks extremely fragile with respect to<br>
additional space characters or complex properties inside a Feature object, such as<br>
<br>
{ "type": "Feature", "properties": { "prop": [ {"foo":"bar"},{"bar":"baz"} ]<br>
}, "geometry": null }<br>
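<br>
A short Python illustration of the difference, using the standard library's<br>
json.JSONDecoder.raw_decode to find object boundaries. The helper name<br>
iter_features is hypothetical, and the sketch works on a string that is already<br>
in memory, so it shows parser-based boundary detection rather than a true<br>
streaming parser such as yajl.<br>
<pre>
```python
import json


def iter_features(text):
    """Yield each object from a comma-separated run of JSON objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip whitespace and the commas that separate Feature objects.
        while pos != len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos == len(text):
            return
        # raw_decode parses one complete value and reports where it ended.
        feature, pos = decoder.raw_decode(text, pos)
        yield feature


sample = ('{"type": "Feature", "properties": {"prop": [{"foo":"bar"},{"bar":"baz"}]}, '
          '"geometry": null}, {"type": "Feature", "properties": {}, "geometry": null}')

# Naive splitting cuts inside the nested "prop" array and misses the real
# boundary, which has a space between the comma and the brace.
naive_parts = sample.split(",{")
features = list(iter_features(sample))
```
</pre>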
<br>
<br>
--<br>
Spatialys - Geospatial professional services<br>
<a href="http://www.spatialys.com" rel="noreferrer" target="_blank">http://www.spatialys.com</a><br>
</blockquote></div></div></div></div>