[gdal-dev] Ogr2ogr CSV driver not handling correctly line breaks inside columns
Robert Hewlett
rob.hewy at gmail.com
Fri May 5 06:12:47 PDT 2023
Can something such as
tail -n +3
be part of the pipeline, to skip the first two lines?
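
A rough Python sketch of the same idea, in case a shell tool is not available in the pipeline. The function name drop_header_lines and the sample values are mine, and it assumes the .prj and .csvt content each occupy exactly one line at the top of the stream, as described in this thread:

```python
import io


def drop_header_lines(src, dst, count=2):
    """Copy text from src to dst, skipping the first `count` lines."""
    for _ in range(count):
        src.readline()  # discard one leading line (.prj, then .csvt)
    while True:
        chunk = src.read(65536)
        if not chunk:
            break
        dst.write(chunk)  # pass the remaining CSV through unchanged


# In-memory stand-in for the combined stream (sample values are invented).
combined = io.StringIO('EPSG:26910\nWKT,Integer\ngeom,id\n"POINT (1 2)",1\n')
csv_only = io.StringIO()
drop_header_lines(combined, csv_only)
print(csv_only.getvalue(), end="")
```

In a real pipeline the same function could presumably be called with sys.stdin and sys.stdout instead of the in-memory streams.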
The 3 text files are being combined into 1 stream.
- Line 1 CRS/SRID from the .prj
- Line 2 Types from the .csvt
- Line 3 to the end from the .csv
Which is great in some ways as the SRID does not go missing and header info
is at the head.
In my tests, line 3 to the end was well formed, with the renamed geometry
column, but I am testing on Windows 10 with GDAL 3.6.
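
One way to check "well formed" on the receiving side, sketched with Python's csv module. The helper names and sample rows are invented for illustration. A quoted field with an embedded line break parses as one record, while an unquoted line break shows up as a row with the wrong number of fields:

```python
import csv
import io


def parse_records(csv_text):
    """Parse CSV text; a quoted field may span several physical lines."""
    # newline="" leaves line endings alone so the csv module can interpret them.
    return list(csv.reader(io.StringIO(csv_text, newline="")))


def rows_with_wrong_width(records):
    """Return indexes of records whose field count differs from the header."""
    width = len(records[0])
    return [i for i, row in enumerate(records) if len(row) != width]


# Same data twice: once with the line break quoted, once without.
quoted = 'id,descriptio\n"1",first row\n"2","this is\nmy string"\n'
unquoted = 'id,descriptio\n"1",first row\n"2",this is\nmy string\n'

good = parse_records(quoted)
bad = parse_records(unquoted)
print(len(good), rows_with_wrong_width(good))
print(len(bad), rows_with_wrong_width(bad))
```

The width check is only a heuristic: it flags the broken record here because the stray line has fewer fields than the header.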
I do not know if /vsizip/ as output is allowed or works, i.e. writing all
three text files as one streamed zip file and then extracting just the CSV
file later in the process.
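
If a zip of the three files could be produced somehow (I have not verified that GDAL can write one to stdout), pulling out just the CSV afterwards is simple with Python's zipfile module. This is a sketch only: the archive is built in memory here, and the member names and contents are made up:

```python
import io
import zipfile


def read_csv_member(zip_bytes):
    """Return the text of the first .csv member found in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # ".csvt" does not end with ".csv", so the types file is skipped.
            if name.lower().endswith(".csv"):
                return archive.read(name).decode("utf-8")
    raise ValueError("no .csv member found in archive")


# Build a small stand-in archive holding the three GeoCSV files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("poi_out.prj", "EPSG:26910\n")
    archive.writestr("poi_out.csvt", "WKT,Integer\n")
    archive.writestr("poi_out.csv", 'geom,id\n"POINT (1 2)",1\n')

csv_text = read_csv_member(buffer.getvalue())
print(csv_text, end="")
```

One caveat: a zip archive keeps its central directory at the end of the file, so the receiving process needs the complete archive before it can extract a member this way.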
Moving to a one file spatial format as mentioned above might help. It is
just that a GeoCSV dataset is a combination of three files.
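
To illustrate why a JSON-based one-file format sidesteps the line break problem, here is a small sketch with Python's json module. The feature content is invented, and it assumes newline-delimited output with one feature per line. JSON escapes a newline inside a string, so the serialized feature stays on a single physical line:

```python
import json

# Invented feature whose attribute value contains embedded line breaks.
feature = {
    "type": "Feature",
    "properties": {"id": "2", "descriptio": "this is\nmy string\n"},
    "geometry": {"type": "Point", "coordinates": [1.0, 2.0]},
}

line = json.dumps(feature)   # the newline is written as the two characters \n
decoded = json.loads(line)   # and restored to a real newline when parsed

print("\n" in line)
print(decoded["properties"]["descriptio"] == feature["properties"]["descriptio"])
```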
Maybe a many-to-one-back-to-many-scenario might help.
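
The "back to many" step could look roughly like this in Python. It is a sketch under the same assumption as before (line 1 is the .prj content, line 2 the .csvt content, and the rest is the CSV); split_combined_stream and the sample data are hypothetical:

```python
import io
import tempfile
from pathlib import Path


def split_combined_stream(stream, out_dir, basename):
    """Write <basename>.prj, .csvt and .csv from one combined text stream."""
    prj_line = stream.readline()    # assumed: line 1 is the CRS/SRID
    csvt_line = stream.readline()   # assumed: line 2 is the column types
    csv_body = stream.read()        # everything else is the CSV itself
    out_dir = Path(out_dir)
    written = {}
    for ext, content in ((".prj", prj_line), (".csvt", csvt_line), (".csv", csv_body)):
        path = out_dir / (basename + ext)
        # newline="" keeps the original line endings, including embedded ones.
        with open(path, "w", encoding="utf-8", newline="") as handle:
            handle.write(content)
        written[ext] = path
    return written


combined = io.StringIO('EPSG:26910\nWKT,Integer\ngeom,id\n"POINT (1 2)",1\n')
with tempfile.TemporaryDirectory() as tmp:
    paths = split_combined_stream(combined, tmp, "poi_out")
    names = sorted(p.name for p in paths.values())
    csv_text = paths[".csv"].read_text(encoding="utf-8")
print(names)
```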
There are several multi-file spatial formats that would need to be zipped
so that you could stream just one thing.
I hope that makes sense.
On Fri, May 5, 2023 at 2:58 AM Rahkonen Jukka <
jukka.rahkonen at maanmittauslaitos.fi> wrote:
> Hi,
>
>
>
> Have you considered outputting GeoJSONSeq
> https://gdal.org/drivers/vector/geojsonseq.html instead of CSV, which to
> my mind is a workaround as a geodata format? JSON might handle your
> newlines as well.
>
>
>
> -Jukka Rahkonen-
>
>
>
> *From:* gdal-dev <gdal-dev-bounces at lists.osgeo.org> *On behalf of* Moises
> Calzado via gdal-dev
> *Sent:* Friday, 5 May 2023 12:32
> *To:* gdal-dev at lists.osgeo.org
> *Subject:* Re: [gdal-dev] Ogr2ogr CSV driver not handling correctly line
> breaks inside columns
>
>
>
> Hi Even!
>
>
>
> I've just created the two issues:
>
> - https://github.com/OSGeo/gdal/issues/7699
>
> - https://github.com/OSGeo/gdal/issues/7700
>
>
>
> Robert, as I explained before, we need the `/vsistdout/` driver as we're
> processing the file in streaming mode, so we can't save the result to the
> storage.
>
> Unfortunately, the problem arises when using that driver.
>
>
>
> On Thu, 4 May 2023 at 15:39, Even Rouault (<even.rouault at spatialys.com>)
> wrote:
>
> Moises,
>
> please file 2 issues in the GitHub issue tracker:
>
> - one about /vsistdout/ where .csvt and .prj content shouldn't be emitted
>
> - one about decoupling the layer GEOMETRY_NAME creation option with
> CREATE_CSVT=YES
>
> Even
>
> On 04/05/2023 at 13:58, Moises Calzado via gdal-dev wrote:
>
> Hi Robert!
>
>
>
> I think we're losing sight of the main issue that we reported: the
> problem is related to line breaks in the output generated when using
> /vsistdout/ together with the CREATE_CSVT=YES option.
>
>
>
> Even pointed out that it works as expected without that flag, but when
> it's used the generated output is not okay, as the "Fields with embedded
> line breaks must be quoted" rule is not followed.
>
> IMHO, although the generated output is not a CSV in itself, we should be
> able to delete the first two lines (projection info and types) and treat
> the rest of the content as a CSV.
>
>
>
> What we're doing is streaming the /vsistdout/ output to another process
> that performs some steps on the resulting CSV. In all other cases it works
> correctly, as the output of the ogr2ogr execution is a valid CSV once the
> first two lines are deleted, but in the case reported in my first email
> it is not.
>
> The CREATE_CSVT=YES option is mandatory for us because, for the moment,
> it is required in order to use the GEOMETRY_NAME=geom one, so we don't
> have any workaround.
>
>
>
> I just wanted to confirm whether that's expected (generating output
> that is not a valid CSV in the end)!
>
>
>
> On Wed, 3 May 2023 at 21:05, Robert Hewlett (<rob.hewy at gmail.com>)
> wrote:
>
> Hi,
>
>
>
> I just tested with : GDAL 3.6.4, released 2023/04/17
>
>
>
> Using the ogr2ogr as follows:
>
> ogr2ogr -f CSV poi_out.csv poi.shp -lco CREATE_CSVT=YES
>
> I get three files but no geometry
>
>
>
> ogr2ogr -f CSV poi_out.csv poi.shp -lco CREATE_CSVT=YES -lco
> GEOMETRY=AS_WKT
>
> I get three files with the geometry as WKT with the column name WKT
>
>
>
> WKT,id,poi_name,poi_types
>
> "POINT (508878.602179846 5433913.2763688)","1",crescent,"4"
> "POINT (517836.918121302 5447702.01715829)","2",Tynehead Regional Park,"1"
>
>
>
> ogr2ogr -f CSV poi_out.csv poi.shp -lco CREATE_CSVT=YES -lco
> GEOMETRY=AS_WKT -lco GEOMETRY_NAME=geom
>
> I get three files with the geometry as WKT but the column called geom
>
> geom,id,poi_name,poi_types
> "POINT (508878.602179846 5433913.2763688)","1",crescent,"4"
> "POINT (517836.918121302 5447702.01715829)","2",Tynehead Regional Park,"1"
>
>
>
> What does
>
> ogr2ogr --version
>
> report back?
>
>
>
>
>
>
>
> On Wed, May 3, 2023 at 9:38 AM Robert Hewlett <rob.hewy at gmail.com> wrote:
>
> Hi,
>
>
>
> Not to start a controversy but it feels like the standard hints at three
> files. Did the standard change?
>
>
>
> If it is three files, which works for me in QGIS and geopandas (i.e. data
> lands where it is supposed to), then more layer creation options are needed
> to handle the SRID/CRS
>
>
>
> CREATE_PRJ=YES/NO
>
> or -t_srs and/or -s_srs triggers the dot-prj file being created.
>
>
>
> Just saying 😊.
>
>
>
> In the meantime would a short python script help parse the one file into
> three?
>
>
>
>
>
> On Wed, May 3, 2023 at 9:16 AM Moises Calzado via gdal-dev <
> gdal-dev at lists.osgeo.org> wrote:
>
> Hi Robert,
>
>
>
> Yes, we're getting one with all the info!
>
>
>
> On Wed, 3 May 2023 at 18:14, Robert Hewlett (<rob.hewy at gmail.com>)
> wrote:
>
> Just to clarify, instead of getting three files you are getting one with
> all the info: types, projection, data?
>
> https://giswiki.hsr.ch/GeoCSV
>
>
>
> On Wed, May 3, 2023 at 8:57 AM Moises Calzado via gdal-dev <
> gdal-dev at lists.osgeo.org> wrote:
>
> We're also specifying the GEOM_POSSIBLE_NAMES, so it would be great if
> with that option we could use the GEOMETRY_NAME without using the
> CREATE_CSVT=YES option.
>
>
>
> Regarding emitting the .prj and .csvt in /vsistdout/ mode, that's why I'm
> saying that there is an issue when generating the resulting CSV.
>
> The way we see it, when using /vsistdout/ the result is a CSV file with
> the .prj information in the first line and the .csvt in the second line.
> We handle the result by deleting the first two lines and using the rest
> of the content as a CSV, which should be equal to the result obtained
> when using ogr2ogr without the CREATE_CSVT=YES option.
>
> We're probably missing something, but as we see it, the generated CSV
> should be a valid one. Does that make sense?
>
>
>
> Thanks so much for your help!
>
>
>
> On Wed, 3 May 2023 at 15:10, Robert Hewlett (<rob.hewy at gmail.com>)
> wrote:
>
> The .CSVT and .PRJ help to make a proper GeoCSV dataset, which helps with
> QGIS and geopandas. The column name that I use in the CSV is usually geom,
> and WKT shows up in the CSVT file, which seems to be a one-line file that
> hints at the data types in the CSV file.
>
>
>
> I hope that makes sense.
>
>
>
> CSVT
>
> Integer, Integer,WKT
>
>
>
> CSV
>
> line_id,point_id,geom
>
> 1,1,"POINT(1000 1000)"
>
>
>
> PRJ
>
> EPSG:26910
>
>
>
>
>
>
>
>
>
> On Wed, May 3, 2023, 05:23 Moises Calzado via gdal-dev <
> gdal-dev at lists.osgeo.org> wrote:
>
> Hi Even,
>
>
>
> Thanks so much for taking a look into that one!
>
>
>
> I have one question regarding the CSVT content: we're not really using it,
> but it's required when using the GEOMETRY_NAME layer creation option, as
> can be checked in the CSV driver documentation:
>
>
>
> · *GEOMETRY_NAME*=name (Starting with GDAL 2.1): Name of geometry
> column. Only used if GEOMETRY=AS_WKT and CREATE_CSVT=YES. Defaults to WKT
>
> We really need this flag as we are processing files that contain
> geometries with different column names, and we always want the same
> geometry name in the generated output. Are we missing something that would
> let us avoid this problem while still using that flag?
>
> In my humble opinion, generating an invalid CSV when using -lco
> CREATE_CSVT=YES looks like a bug to me, as I can't see any reason why
> strings containing line breaks can't be quoted.
>
>
>
> Could you please shed some light on this?
>
>
>
> Looking forward to your reply,
>
> Regards.
>
>
>
> On Wed, 3 May 2023 at 14:00, Even Rouault (<even.rouault at spatialys.com>)
> wrote:
>
> you didn't post to the list
>
> On 03/05/2023 at 13:49, Moises Calzado wrote:
>
> On Sat, 29 Apr 2023 at 15:44, Even Rouault (<even.rouault at spatialys.com>)
> wrote:
>
> Moises,
>
> as far as I can see with your example, the CSV driver behaves "properly"
> in reading and writing of field values with line breaks.
>
> It follows the "Fields with embedded line breaks must be quoted" rule of
> https://en.wikipedia.org/wiki/Comma-separated_values
>
> $ ogr2ogr out.csv /vsizip/dataframe.zip
>
> $ cat out.csv
> id,descriptio
> "1",This is my third row
> "2","this is
> my string
> "
> "3",This is my third row
>
> $ ogrinfo out.csv -al
> INFO: Open of `out.csv'
> using driver `CSV' successful.
>
> Layer name: out
> Geometry: None
> Feature Count: 3
> Layer SRS WKT:
> (unknown)
> id: String (0.0)
> descriptio: String (0.0)
> OGRFeature(out):1
> id (String) = 1
> descriptio (String) = This is my third row
>
> OGRFeature(out):2
> id (String) = 2
> descriptio (String) = this is
> my string
>
>
> OGRFeature(out):3
> id (String) = 3
> descriptio (String) = This is my third row
>
> But in your example, using /vsistdout/ with -lco CREATE_CSVT=YES is going
> to result in an invalid CSV file that mixes the .csvt and .csv content
>
> Even
>
> On 24/04/2023 at 13:34, Moises Calzado via gdal-dev wrote:
>
> Hello!
>
>
>
> We're trying to convert a Shapefile into a CSV using ogr2ogr and we're
> having some issues with columns that contain line breaks inside their
> values. If we have a value such as the following string, ogr2ogr treats
> the line break as the end of the line and returns two lines.
>
>
>
> "this is my \n value"
>
>
>
> That's the command that we're executing:
>
>
>
> ogr2ogr -f CSV -skipfailures -makevalid /vsistdout/ /vsizip/shapefile.zip
> -simplify 0.00001 -dim XY -t_srs EPSG:4326 -lco GEOMETRY=AS_WKT -lco
> GEOMETRY_NAME=geom -lco CREATE_CSVT=YES > result.csv
>
>
>
> Is this an expected behaviour, or is there any way to avoid this?
>
> Sharing an example Shapefile so that you can try to reproduce that
> behaviour:
> https://drive.google.com/file/d/1gFqfTP02KTFoavJyyO-Ix05YwZB2tS24/view?usp=sharing
>
>
>
> Thanks so much in advance,
>
> Regards.
>
>
>
> --
>
> *Moises Calzado*
>
> Support Engineer
>
> +34671264286 | mcalzado at carto.com | CARTO <https://www.carto.com/>
>
> <https://spatial-data-science-conference.com/2023/london/>
>
>
>
> _______________________________________________
>
> gdal-dev mailing list
>
> gdal-dev at lists.osgeo.org
>
> https://lists.osgeo.org/mailman/listinfo/gdal-dev
>
> --
>
> http://www.spatialys.com
>
> My software is free, but my time generally not.