[gdal-dev] Ogr2ogr CSV driver not handling correctly line breaks inside columns
Even Rouault
even.rouault at spatialys.com
Thu May 4 06:39:53 PDT 2023
Moises,
please file 2 issues in the GitHub issue tracker:
- one about /vsistdout/, where .csvt and .prj content shouldn't be emitted
- one about decoupling the GEOMETRY_NAME layer creation option from
CREATE_CSVT=YES
Even
On 04/05/2023 at 13:58, Moises Calzado via gdal-dev wrote:
> Hi Robert!
>
> I think we're losing sight of the main issue we reported: the
> problem is with line breaks in the output generated when using
> /vsistdout/ together with the CREATE_CSVT=YES option.
>
> Even pointed out that it works as expected without that flag, but
> when the flag is used the generated output is not okay, because the
> "Fields with embedded line breaks must be quoted" rule is not followed.
> IMHO, although the generated output is not itself a CSV, we should be
> able to delete the first two lines (projection info and types) and
> treat the rest of the content as a CSV.
>
> What we're doing is streaming the /vsistdout/ output to another
> process that performs some steps on the resulting CSV. In all other
> cases it works correctly, as the output of the ogr2ogr execution is a
> valid CSV once the first two lines are deleted, but in the case
> reported in my first email it's not.
> The CREATE_CSVT=YES option is mandatory for us because, for the moment,
> it's required in order to use GEOMETRY_NAME=geom, so we don't have any
> workaround.
>
> Just wanted to confirm whether that's expected for you (generating an
> output that is not a valid CSV in the end)!
>
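The "delete the first two lines, then treat the rest as CSV" step described above can be sketched as follows. This is only an illustration: the sample text is made up (it is not real ogr2ogr output), the function name is hypothetical, and it assumes the projection and type information each occupy exactly one physical line. It shows that a standard CSV parser keeps a field with an embedded line break intact only when that field is quoted.

```python
import csv
import io

def read_csv_after_header_lines(stream, skip=2):
    """Skip `skip` physical lines, then parse the remainder as CSV."""
    for _ in range(skip):
        stream.readline()
    # csv.reader keeps a quoted field with an embedded line break as one value
    return list(csv.reader(stream))

# Hypothetical sample standing in for the streamed output (assumption:
# line 1 is projection info, line 2 is the column types).
sample = (
    "PROJECTION-PLACEHOLDER\n"
    "WKT,String,String\n"
    "geom,id,descriptio\n"
    '"POINT (1 2)","1",first row\n'
    '"POINT (3 4)","2","this is\nmy string\n"\n'
)

rows = read_csv_after_header_lines(io.StringIO(sample))
for row in rows:
    print(row)
```

If the field with the line break were not quoted, the same parser would return extra, shorter rows, which is the symptom reported in this thread.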
> On Wed, 3 May 2023 at 21:05, Robert Hewlett (<rob.hewy at gmail.com>)
> wrote:
>
> Hi,
>
> I just tested with: GDAL 3.6.4, released 2023/04/17
>
> Using ogr2ogr as follows:
> ogr2ogr -f CSV poi_out.csv poi.shp -lco CREATE_CSVT=YES
> I get three files but no geometry.
>
> ogr2ogr -f CSV poi_out.csv poi.shp -lco CREATE_CSVT=YES -lco GEOMETRY=AS_WKT
> I get three files with the geometry as WKT, in a column named WKT:
>
> WKT,id,poi_name,poi_types
> "POINT (508878.602179846 5433913.2763688)","1",crescent,"4"
> "POINT (517836.918121302 5447702.01715829)","2",Tynehead Regional Park,"1"
>
> ogr2ogr -f CSV poi_out.csv poi.shp -lco CREATE_CSVT=YES -lco GEOMETRY=AS_WKT -lco GEOMETRY_NAME=geom
> I get three files with the geometry as WKT, but in a column named geom:
>
> geom,id,poi_name,poi_types
> "POINT (508878.602179846 5433913.2763688)","1",crescent,"4"
> "POINT (517836.918121302 5447702.01715829)","2",Tynehead Regional Park,"1"
>
> What does
> ogr2ogr --version
> report back?
>
>
>
> On Wed, May 3, 2023 at 9:38 AM Robert Hewlett <rob.hewy at gmail.com>
> wrote:
>
> Hi,
>
> Not to start a controversy, but it feels like the standard
> hints at three files. Did the standard change?
>
> If it is three files, which works for me in QGIS and geopandas
> (i.e. data lands where it is supposed to), then more layer
> creation options are needed to handle the SRID/CRS:
>
> CREATE_PRJ=YES/NO
> or -t_srs and/or -s_srs triggers the .prj file being created.
>
> Just saying 😊.
>
> In the meantime, would a short Python script help parse the one
> file into three?
>
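A script of the kind suggested above might look like the sketch below. It relies on the layout described elsewhere in this thread (projection info on the first line, types on the second, CSV content after that), which is an assumption rather than documented behaviour; the function name is hypothetical and the sample values are taken from the GeoCSV example quoted further down.

```python
import io

def split_combined_output(stream):
    """Return (prj_text, csvt_text, csv_text) from a combined stream.

    Assumption: line 1 is the .prj content, line 2 is the .csvt content,
    and everything after that is the CSV body.
    """
    prj_text = stream.readline().rstrip("\r\n")
    csvt_text = stream.readline().rstrip("\r\n")
    csv_text = stream.read()
    return prj_text, csvt_text, csv_text

combined = io.StringIO(
    "EPSG:26910\n"
    "Integer,Integer,WKT\n"
    "line_id,point_id,geom\n"
    '1,1,"POINT(1000 1000)"\n'
)

prj_text, csvt_text, csv_text = split_combined_output(combined)
print(prj_text)
print(csvt_text)
print(csv_text, end="")
```

Each of the three strings could then be written to its own .prj, .csvt and .csv file. The split would break if the projection information spanned more than one line.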
>
> On Wed, May 3, 2023 at 9:16 AM Moises Calzado via gdal-dev
> <gdal-dev at lists.osgeo.org> wrote:
>
> Hi Robert,
>
> Yes, we're getting one with all the info!
>
> On Wed, 3 May 2023 at 18:14, Robert Hewlett
> (<rob.hewy at gmail.com>) wrote:
>
> Just to clarify, instead of getting three files you
> are getting one with all the info: types, projection,
> data?
>
> https://giswiki.hsr.ch/GeoCSV
>
> On Wed, May 3, 2023 at 8:57 AM Moises Calzado via
> gdal-dev <gdal-dev at lists.osgeo.org> wrote:
>
> We're also specifying the GEOM_POSSIBLE_NAMES, so
> it would be great if with that option we could use
> the GEOMETRY_NAME without using the
> CREATE_CSVT=YES option.
>
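Until GEOMETRY_NAME can be used without CREATE_CSVT=YES, one possible stopgap is to rename the geometry column in the header row of the streamed CSV. The sketch below assumes the column arrives under the default name WKT; the function name and sample rows are hypothetical, and the rewritten output may use a different (but equivalent) quoting style than ogr2ogr does.

```python
import csv
import io

def rename_header_column(in_stream, out_stream, old="WKT", new="geom"):
    """Copy CSV rows, renaming one column in the header row only."""
    reader = csv.reader(in_stream)
    writer = csv.writer(out_stream, lineterminator="\n")
    header = next(reader)
    writer.writerow([new if name == old else name for name in header])
    # Remaining rows are re-emitted; fields with line breaks get re-quoted
    writer.writerows(reader)

source = io.StringIO(
    "WKT,id,poi_name\n"
    '"POINT (1 2)","1","two\nlines"\n'
)
target = io.StringIO()
rename_header_column(source, target)
result_rows = list(csv.reader(io.StringIO(target.getvalue())))
print(result_rows)
```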
> Regarding emitting the .prj and .csvt in /vsistdout/ mode, that's
> why I'm saying that there is an issue when generating the
> resulting CSV.
> The way we see it, when using /vsistdout/ the result is a CSV
> file with the .prj information in the first line and the .csvt
> in the second line. We handle the result by deleting the first
> two lines and using the rest of the content as a CSV, which
> should be equal to the result obtained when running ogr2ogr
> without the CREATE_CSVT=YES option.
> We're probably missing something, but as we see it, the
> generated CSV should be a valid one. Does that make sense?
>
> Thanks so much for your help!
>
> On Wed, 3 May 2023 at 15:10, Robert Hewlett
> (<rob.hewy at gmail.com>) wrote:
>
> The .csvt and .prj files help to make a proper GeoCSV
> dataset, which helps with QGIS and geopandas.
> The column name that I use in the CSV is usually geom,
> and WKT shows up in the .csvt file, which seems to be a
> one-line file that hints at the data types in the CSV file.
>
> I hope that makes sense.
>
> CSVT
> Integer, Integer,WKT
>
> CSV
> line_id,point_id,geom
> 1,1,"POINT(1000 1000)"
>
> PRJ
> EPSG:26910
>
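To illustrate how the one-line .csvt hints at the column types, here is a small sketch that applies it to the CSV content from the example above. The converter mapping is an assumption made for illustration: only Integer and Real are converted, everything else (including WKT) stays a string, and type names carrying a width such as String(50) are not handled.

```python
import csv
import io

# Assumed mapping from CSVT type names to Python converters; anything not
# listed here (including WKT) is left as a string.
CONVERTERS = {"Integer": int, "Real": float}

def read_typed_rows(csv_stream, csvt_line):
    """Parse CSV rows into dicts, converting values per the CSVT type line."""
    type_names = [name.strip() for name in next(csv.reader([csvt_line]))]
    reader = csv.reader(csv_stream)
    header = next(reader)
    rows = []
    for record in reader:
        rows.append({
            column: CONVERTERS.get(type_name, str)(value)
            for column, type_name, value in zip(header, type_names, record)
        })
    return rows

typed = read_typed_rows(
    io.StringIO('line_id,point_id,geom\n1,1,"POINT(1000 1000)"\n'),
    "Integer, Integer,WKT",
)
print(typed)
```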
>
>
>
> On Wed, May 3, 2023, 05:23 Moises Calzado via
> gdal-dev <gdal-dev at lists.osgeo.org> wrote:
>
> Hi Even,
>
> Thanks so much for taking a look at this one!
>
> I have one question regarding the CSVT content: we're not
> really using it, but it's required when using the
> GEOMETRY_NAME layer creation option, as can be checked
> in the CSV driver documentation:
>
> * GEOMETRY_NAME=name (Starting with GDAL 2.1): Name of
>   geometry column. Only used if GEOMETRY=AS_WKT and
>   CREATE_CSVT=YES. Defaults to WKT
>
> We really need this flag, as we are processing files that
> contain geometries with different column names, and we
> always want the same geometry name in the generated
> output. Are we missing something when using that flag to
> avoid this problem?
> In my humble opinion, generating an invalid CSV when
> using -lco CREATE_CSVT=YES looks like a bug to me, as I
> can't see why strings containing line breaks can't be
> quoted.
>
> Could you please shed some light on this?
>
> Looking forward to your reply,
> Regards.
>
> On Wed, 3 May 2023 at 14:00, Even Rouault
> (<even.rouault at spatialys.com>) wrote:
>
> you didn't post to the list
>
>> On Sat, 29 Apr 2023 at 15:44, Even Rouault
>> (<even.rouault at spatialys.com>) wrote:
>>
>> Moises,
>>
>> as far as I can see with your
>> example, the CSV driver behaves
>> "properly" in reading and writing
>> of field values with line breaks.
>>
>> It follows the "Fields with
>> embedded line breaks must be
>> quoted" rule of
>> https://en.wikipedia.org/wiki/Comma-separated_values
>>
>> $ ogr2ogr out.csv
>> /vsizip/dataframe.zip
>>
>> $ cat out.csv
>> id,descriptio
>> "1",This is my third row
>> "2","this is
>> my string
>> "
>> "3",This is my third row
>>
>> $ ogrinfo out.csv -al
>> INFO: Open of `out.csv'
>> using driver `CSV' successful.
>>
>> Layer name: out
>> Geometry: None
>> Feature Count: 3
>> Layer SRS WKT:
>> (unknown)
>> id: String (0.0)
>> descriptio: String (0.0)
>> OGRFeature(out):1
>> id (String) = 1
>> descriptio (String) = This is
>> my third row
>>
>> OGRFeature(out):2
>> id (String) = 2
>> descriptio (String) = this is
>> my string
>>
>>
>> OGRFeature(out):3
>> id (String) = 3
>> descriptio (String) = This is
>> my third row
>>
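The quoting rule cited above can also be checked outside GDAL. The sketch below uses Python's standard csv module (not the GDAL CSV driver) to write a value with embedded line breaks and read it back; the field names and values mirror the out.csv example, and the line terminator is set to "\n" only to keep the output easy to read.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "descriptio"])
# The embedded line breaks force the writer to quote this field
writer.writerow(["2", "this is\nmy string\n"])

text = buffer.getvalue()
print(text)

# Reading it back yields one record with the line breaks preserved
parsed = list(csv.reader(io.StringIO(text)))
print(parsed)
```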
>> But in your example, using /vsistdout/ together with
>> -lco CREATE_CSVT=YES is going to result in an
>> invalid CSV file that mixes the .csvt and
>> .csv content.
>>
>> Even
>>
>> On 24/04/2023 at 13:34, Moises
>> Calzado via gdal-dev wrote:
>>> Hello!
>>>
>>> We're trying to convert a
>>> Shapefile into a CSV using
>>> ogr2ogr, and we're having some
>>> issues with columns that
>>> contain line breaks inside
>>> their values. If a value
>>> contains the following string,
>>> ogr2ogr treats the line break
>>> as the start of a new line and
>>> returns two lines.
>>>
>>> "this is my \n value"
>>>
>>>
>>> This is the command that we're
>>> executing:
>>>
>>> ogr2ogr -f CSV -skipfailures
>>> -makevalid /vsistdout/
>>> /vsizip/shapefile.zip
>>> -simplify 0.00001 -dim XY
>>> -t_srs EPSG:4326 -lco
>>> GEOMETRY=AS_WKT -lco
>>> GEOMETRY_NAME=geom -lco
>>> CREATE_CSVT=YES > result.csv
>>>
>>>
>>> Is this an expected behaviour,
>>> or is there any way to avoid this?
>>> Sharing an example Shapefile so
>>> that you can try to reproduce
>>> that behaviour:
>>> https://drive.google.com/file/d/1gFqfTP02KTFoavJyyO-Ix05YwZB2tS24/view?usp=sharing
>>>
>>> Thanks so much in advance,
>>> Regards.
>>>
>>> --
>>> *Moises Calzado*
>>>
>>> Support Engineer
>>>
>>> +34671264286 |
>>> mcalzado at carto.com | CARTO
>>> <https://www.carto.com/>
>>>
>>> <https://spatial-data-science-conference.com/2023/london/>
>>>
>>>
>>> _______________________________________________
>>> gdal-dev mailing list
>>> gdal-dev at lists.osgeo.org
>>> https://lists.osgeo.org/mailman/listinfo/gdal-dev
>>
>> --
>> http://www.spatialys.com
>> My software is free, but my time generally not.
--
http://www.spatialys.com
My software is free, but my time generally not.