[gdal-dev] Ogr2ogr CSV driver not handling correctly line breaks inside columns

Even Rouault even.rouault at spatialys.com
Thu May 4 06:39:53 PDT 2023


Moises,

please file 2 issues in the GitHub issue tracker:

- one about /vsistdout/ where .csvt and .prj content shouldn't be emitted

- one about decoupling the layer GEOMETRY_NAME creation option from
CREATE_CSVT=YES

Even

On 04/05/2023 at 13:58, Moises Calzado via gdal-dev wrote:
> Hi Robert!
>
> I think we're losing sight of the main issue we reported: the problem
> is related to line breaks in the output generated when using
> /vsistdout/ together with the CREATE_CSVT=YES option.
>
> Even pointed out that it works as expected without that flag, but
> when the flag is used the generated output is not okay, because the
> "Fields with embedded line breaks must be quoted" rule is not followed.
> IMHO, although the generated output is not itself a CSV, we should be
> able to delete the first two lines (projection info and types) and
> treat the rest of the content as a CSV.
>
> What we're doing is streaming the /vsistdout/ output to another
> process that performs some steps on the resulting CSV. In all other
> cases it works correctly, as the output of the ogr2ogr execution is a
> valid CSV once the first two lines are deleted, but in the case
> reported in my first email it is not.
> The CREATE_CSVT=YES option is mandatory for us because, for the
> moment, it is required in order to use the GEOMETRY_NAME=geom option,
> so we don't have any workaround.
>
> Just wanted to confirm whether that's expected on your side
> (generating output that is not a valid CSV in the end)!
>
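A minimal sketch of the quoting rule discussed above, using Python's standard csv module rather than GDAL. The sample rows are hypothetical and are not taken from the shared Shapefile. It shows why a consumer that counts physical lines sees more "rows" than a CSV-aware reader when a quoted field contains line breaks.

```python
import csv
import io

# Hypothetical rows; the second data row holds a value with embedded line breaks.
rows = [
    ["id", "descriptio"],
    ["1", "first value"],
    ["2", "this is\nmy string\n"],
    ["3", "third value"],
]

# Write the rows. csv.writer quotes the field that contains line breaks,
# which is what the "must be quoted" rule asks for.
buf = io.StringIO(newline="")
csv.writer(buf, lineterminator="\n").writerows(rows)
text = buf.getvalue()

# A line-based consumer sees every physical line, including the ones that
# belong to the quoted field.
physical_lines = text.splitlines()

# A CSV-aware reader reassembles the quoted field into a single record.
records = list(csv.reader(io.StringIO(text, newline="")))

print(len(physical_lines), "physical lines")
print(len(records), "CSV records")
```

If the field were written without quotes, the CSV-aware reader would have no way to tell the embedded line breaks from record separators, which matches the symptom reported in this thread.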
> On Wed, 3 May 2023 at 21:05, Robert Hewlett (<rob.hewy at gmail.com>)
> wrote:
>
>     Hi,
>
>     I just tested with : GDAL 3.6.4, released 2023/04/17
>
>     Using ogr2ogr as follows:
>     ogr2ogr -f CSV poi_out.csv poi.shp -lco CREATE_CSVT=YES
>     I get three files but no geometry.
>
>     ogr2ogr -f CSV poi_out.csv poi.shp -lco CREATE_CSVT=YES -lco GEOMETRY=AS_WKT
>     I get three files with the geometry as WKT, in a column named WKT:
>
>     WKT,id,poi_name,poi_types
>     "POINT (508878.602179846 5433913.2763688)","1",crescent,"4"
>     "POINT (517836.918121302 5447702.01715829)","2",Tynehead Regional Park,"1"
>
>     ogr2ogr -f CSV poi_out.csv poi.shp -lco CREATE_CSVT=YES -lco GEOMETRY=AS_WKT -lco GEOMETRY_NAME=geom
>     I get three files with the geometry as WKT, but in a column named geom:
>
>     geom,id,poi_name,poi_types
>     "POINT (508878.602179846 5433913.2763688)","1",crescent,"4"
>     "POINT (517836.918121302 5447702.01715829)","2",Tynehead Regional Park,"1"
>
>     What does
>     ogr2ogr --version
>     report back?
>
>
>
>     On Wed, May 3, 2023 at 9:38 AM Robert Hewlett <rob.hewy at gmail.com>
>     wrote:
>
>         Hi,
>
>         Not to start a controversy, but it feels like the standard
>         hints at three files. Did the standard change?
>
>         If it is three files, which works for me in QGIS and geopandas
>         (i.e. data lands where it is supposed to), then more layer
>         creation options are needed to handle the SRID/CRS:
>
>         CREATE_PRJ=YES/NO
>         or -t_srs and/or -s_srs triggering the .prj file being created.
>
>         Just saying 😊.
>
>         In the meantime, would a short Python script help parse the
>         one file into three?
>
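One possible shape for such a script, as a sketch only. It assumes the layout described elsewhere in this thread (the .prj content on the first line of the stream, the .csvt content on the second, CSV data after that) and has not been checked against actual ogr2ogr output. The function name and the sample text are hypothetical, and the approach only helps if the data part is itself correctly quoted.

```python
import csv
import io


def split_combined_stream(text):
    """Split combined output into (prj, csvt, csv_text).

    Assumption: line 1 is the .prj content and line 2 the .csvt content,
    and neither of them contains an embedded line break.
    """
    parts = text.split("\n", 2)
    if len(parts) < 3:
        raise ValueError("expected at least two header lines before the CSV data")
    prj, csvt, csv_text = parts
    return prj.rstrip("\r"), csvt.rstrip("\r"), csv_text


# Hypothetical, shortened sample of a combined stream.
sample = (
    'GEOGCS["WGS 84"]\n'
    "WKT,String,String\n"
    "geom,id,descriptio\n"
    '"POINT (1 2)","1","this is\nmy string"\n'
)

prj, csvt, csv_text = split_combined_stream(sample)

# The remaining text is parsed with a CSV-aware reader so that quoted
# line breaks stay inside their field.
records = list(csv.reader(io.StringIO(csv_text, newline="")))

print(prj)
print(csvt)
print(records)
```

Writing the three parts to separate .prj, .csvt and .csv files is left to the caller.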
>
>         On Wed, May 3, 2023 at 9:16 AM Moises Calzado via gdal-dev
>         <gdal-dev at lists.osgeo.org> wrote:
>
>             Hi Robert,
>
>             Yes, we're getting one with all the info!
>
>             On Wed, 3 May 2023 at 18:14, Robert Hewlett
>             (<rob.hewy at gmail.com>) wrote:
>
>                 Just to clarify, instead of getting three files you
>                 are getting one with all the info: types, projection,
>                 data?
>
>                 https://giswiki.hsr.ch/GeoCSV
>
>                 On Wed, May 3, 2023 at 8:57 AM Moises Calzado via
>                 gdal-dev <gdal-dev at lists.osgeo.org> wrote:
>
>                     We're also specifying GEOM_POSSIBLE_NAMES, so
>                     it would be great if, with that option, we could
>                     use GEOMETRY_NAME without the CREATE_CSVT=YES
>                     option.
>
>                     Regarding emitting the .prj and .csvt in
>                     /vsistdout/ mode, that's why I'm saying that there
>                     is an issue in generating the resulting CSV.
>                     The way we see it, when using /vsistdout/ the
>                     result is a CSV file with the .prj information in
>                     the first line and the .csvt in the second line.
>                     We handle the result by deleting the first two
>                     lines and using the rest of the content as a CSV,
>                     which should be equal to the result obtained when
>                     running ogr2ogr without the CREATE_CSVT=YES
>                     option.
>                     We're probably missing something, but as we see
>                     it, the generated CSV should be a valid one. Does
>                     that make sense?
>
>                     Thanks so much for your help!
>
>                     On Wed, 3 May 2023 at 15:10, Robert Hewlett
>                     (<rob.hewy at gmail.com>) wrote:
>
>                         The .csvt and .prj files help to make a proper
>                         GeoCSV dataset, which helps with QGIS and
>                         geopandas. The column name that I use in the
>                         CSV is usually geom, and WKT shows up in the
>                         CSVT file, which seems to be a one-line file
>                         that hints at the data types in the CSV file.
>
>                         I hope that makes sense.
>
>                         CSVT
>                         Integer, Integer,WKT
>
>                         CSV
>                         line_id,point_id,geom
>                         1,1,"POINT(1000 1000)"
>
>                         PRJ
>                         EPSG:26910
>
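To illustrate how a one-line .csvt can be applied to the matching CSV, here is a small sketch built on the example just quoted. The converter table and the function name are hypothetical and cover only a simplified subset of type names; width or precision suffixes are ignored, empty values are not handled, and the geometry is kept as WKT text.

```python
import csv
import io

# Simplified mapping from CSVT type names to Python converters (assumption:
# only these few names are handled; anything else is left as a string).
CONVERTERS = {
    "integer": int,
    "integer64": int,
    "real": float,
    "string": str,
    "wkt": str,  # geometry kept as WKT text
}


def read_typed(csv_text, csvt_line):
    """Yield one dict per CSV record, with values cast per the CSVT line."""
    type_names = next(csv.reader([csvt_line]))
    # Normalise names such as " Integer" or "Real(10.2)" to "integer", "real".
    types = [name.strip().split("(")[0].lower() for name in type_names]

    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader)
    for row in reader:
        yield {
            column: CONVERTERS.get(type_name, str)(value)
            for column, type_name, value in zip(header, types, row)
        }


# The example from the message above.
csvt_line = "Integer, Integer,WKT"
csv_text = 'line_id,point_id,geom\n1,1,"POINT(1000 1000)"\n'

records = list(read_typed(csv_text, csvt_line))
print(records)
```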
>
>
>
>                         On Wed, May 3, 2023, 05:23 Moises Calzado via
>                         gdal-dev <gdal-dev at lists.osgeo.org> wrote:
>
>                             Hi Even,
>
>                             Thanks so much for taking a look at that
>                             one!
>
>                             I have one question regarding the CSVT
>                             content: we're not really using it, but
>                             it's required when using the GEOMETRY_NAME
>                             layer creation option, as can be checked
>                             in the CSV driver documentation:
>
>                                  * GEOMETRY_NAME=name (Starting with
>                                    GDAL 2.1): Name of geometry
>                                    column. Only used if
>                                    GEOMETRY=AS_WKT and
>                                    CREATE_CSVT=YES. Defaults to WKT
>
>                             We really need this flag, as we are
>                             processing files that contain geometries
>                             with different column names, and we always
>                             want the same geometry name in the
>                             generated output. Are we missing something
>                             when using that flag to avoid this problem?
>                             In my humble opinion, generating an
>                             invalid CSV when using -lco
>                             CREATE_CSVT=YES looks like a bug to me,
>                             as I can't see any reason why strings
>                             containing line breaks can't be quoted.
>
>                             Could you please shed some light on this?
>
>                             Looking forward to your reply,
>                             Regards.
>
>                             On Wed, 3 May 2023 at 14:00, Even
>                             Rouault (<even.rouault at spatialys.com>)
>                             wrote:
>
>                                 you didn't post to the list
>
>>                                 On Sat, 29 Apr 2023 at 15:44, Even
>>                                 Rouault
>>                                 (<even.rouault at spatialys.com>) wrote:
>>
>>                                     Moises,
>>
>>                                     as far as I can see with your
>>                                     example, the CSV driver behaves
>>                                     "properly" in reading and writing
>>                                     of field values with line breaks.
>>
>>                                     It follows the "Fields with
>>                                     embedded line breaks must be
>>                                     quoted" rule of
>>                                     https://en.wikipedia.org/wiki/Comma-separated_values
>>
>>                                     $ ogr2ogr out.csv
>>                                     /vsizip/dataframe.zip
>>
>>                                     $ cat out.csv
>>                                     id,descriptio
>>                                     "1",This is my third row
>>                                     "2","this is
>>                                     my string
>>                                     "
>>                                     "3",This is my third row
>>
>>                                     $ ogrinfo out.csv -al
>>                                     INFO: Open of `out.csv'
>>                                           using driver `CSV' successful.
>>
>>                                     Layer name: out
>>                                     Geometry: None
>>                                     Feature Count: 3
>>                                     Layer SRS WKT:
>>                                     (unknown)
>>                                     id: String (0.0)
>>                                     descriptio: String (0.0)
>>                                     OGRFeature(out):1
>>                                       id (String) = 1
>>                                       descriptio (String) = This is
>>                                     my third row
>>
>>                                     OGRFeature(out):2
>>                                       id (String) = 2
>>                                       descriptio (String) = this is
>>                                     my string
>>
>>
>>                                     OGRFeature(out):3
>>                                       id (String) = 3
>>                                       descriptio (String) = This is
>>                                     my third row
>>
>>                                     But in your example, using
>>                                     /vsistdout/ together with -lco
>>                                     CREATE_CSVT=YES is going to
>>                                     result in an invalid CSV file
>>                                     that mixes the .csvt and .csv
>>                                     content
>>
>>                                     Even
>>
>>                                     On 24/04/2023 at 13:34, Moises
>>                                     Calzado via gdal-dev wrote:
>>>                                     Hello!
>>>
>>>                                     We're trying to convert a
>>>                                     Shapefile into a CSV using
>>>                                     ogr2ogr, and we're having
>>>                                     issues with columns whose values
>>>                                     contain line breaks. If a value
>>>                                     is the following string, ogr2ogr
>>>                                     treats the line break as the
>>>                                     start of a new line and returns
>>>                                     two lines.
>>>
>>>                                         "this is my \n value"
>>>
>>>
>>>                                     That's the command that we're
>>>                                     executing:
>>>
>>>                                         ogr2ogr -f CSV -skipfailures
>>>                                         -makevalid /vsistdout/
>>>                                         /vsizip/shapefile.zip
>>>                                         -simplify 0.00001 -dim XY
>>>                                         -t_srs EPSG:4326 -lco
>>>                                         GEOMETRY=AS_WKT -lco
>>>                                         GEOMETRY_NAME=geom -lco
>>>                                         CREATE_CSVT=YES > result.csv
>>>
>>>
>>>                                     Is this expected behaviour, or
>>>                                     is there a way to avoid it?
>>>                                     I'm sharing an example Shapefile
>>>                                     so that you can try to reproduce
>>>                                     the behaviour:
>>>                                     https://drive.google.com/file/d/1gFqfTP02KTFoavJyyO-Ix05YwZB2tS24/view?usp=sharing
>>>
>>>                                     Thanks so much in advance,
>>>                                     Regards.
>>>
>>>                                     -- 
>>>                                     *Moises Calzado*
>>>
>>>                                     Support Engineer
>>>
>>>                                     +34671264286 |
>>>                                     mcalzado at carto.com | CARTO
>>>                                     <https://www.carto.com/>
>>>
>>>                                     <https://spatial-data-science-conference.com/2023/london/>
>>>
>>>
>>>                                     _______________________________________________
>>>                                     gdal-dev mailing list
>>>                                     gdal-dev at lists.osgeo.org
>>>                                     https://lists.osgeo.org/mailman/listinfo/gdal-dev
>>
>>                                     -- 
>>                                     http://www.spatialys.com
>>                                     My software is free, but my time generally not.

-- 
http://www.spatialys.com
My software is free, but my time generally not.