[postgis-users] Import CSV (was: Noob question with shp2pgsql)

MJ mj at sci.fi
Sun Apr 14 04:10:28 PDT 2013


Parsing CSV files is one of the nastiest computing problems around. Very frequently, CSV files contain unparsable lines, and ogr2ogr is not going to solve this problem for you - it only consumes correctly formatted CSV files.

Before you can even get around to handling unparsable lines, a character set conversion is often required. Unfortunately, far too many people publish CSV files, shapefiles, and SQL dumps that contain UTF-8 multibyte sequences saved in ISO-8859-1 file encoding. I need to run iconv on roughly 90% of the shapefiles I load with shp2pgsql - generally any shapefile (or CSV or SQL dump) containing international data that was produced by someone in North America or Western Europe. These publishers seem to believe that because ISO-8859-1 or ISO-8859-15 works for their own character set, it works for the entire world. In 2013, there is absolutely no reason for anyone on this planet to encode in anything other than UTF-8 - disk space and bandwidth are cheap enough now, and in the areas of the world where they are not yet cheap enough, UTF-8 is the only choice anyway.
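
As a rough illustration of the simplest form of that conversion step, here is a minimal Perl sketch. It assumes the input is either already valid UTF-8 or plain ISO-8859-1, which is an assumption of the sketch and not something the data guarantees. The sub name to_utf8 and the sample strings are made up for illustration. Text whose UTF-8 has been encoded a second time is still valid UTF-8, so it passes through unchanged and needs a different repair.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return the UTF-8 encoding of $bytes.
# Valid UTF-8 passes through unchanged; anything else is ASSUMED to be
# ISO-8859-1 (an assumption of this sketch, not a detected fact).
sub to_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8; LEAVE_SRC keeps
    # $bytes intact so that it can be decoded again in the fallback.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    $text = decode('ISO-8859-1', $bytes) unless defined $text;
    return encode('UTF-8', $text);
}

# Hypothetical samples: the same word as ISO-8859-1 bytes and as UTF-8 bytes.
for my $sample ("Espa\xF1a", "Espa\xC3\xB1a") {
    my $utf8 = to_utf8($sample);
    print join(' ', map { sprintf '%02X', ord } split //, $utf8), "\n";
}
```

Both samples print the same byte sequence, 45 73 70 61 C3 B1 61, so the output is UTF-8 either way.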

What causes unparsable lines in CSV? Quotes where there aren't supposed to be any, missing quotes, missing fields, ambiguously used and unescaped delimiter characters, etc. Manual correction is difficult when you are handling, for example, an 80,000-line file.
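
To find such lines before fixing them, something like the following Perl sketch can help. It flags lines with an unbalanced double quote or an unexpected number of fields. The sub names count_fields and find_bad_lines and the sample data are hypothetical, and the sketch assumes one record per line, so a quoted field containing a line break would be reported as an unbalanced quote.

```perl
use strict;
use warnings;

# Count comma-separated fields, ignoring commas inside double quotes.
# Returns (field count, true if a quote is still open at end of line).
sub count_fields {
    my ($line) = @_;
    my $in_quotes = 0;
    my $fields    = 1;
    for my $ch (split //, $line) {
        if ($ch eq '"') {
            $in_quotes = !$in_quotes;   # a doubled "" toggles twice
        }
        elsif ($ch eq ',' && !$in_quotes) {
            $fields++;
        }
    }
    return ($fields, $in_quotes);
}

# Return one message per line that looks unparsable.
sub find_bad_lines {
    my ($expected, @lines) = @_;
    my @problems;
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        my ($fields, $open_quote) = count_fields($line);
        if ($open_quote) {
            push @problems, "line $line_no: unbalanced quote";
        }
        elsif ($fields != $expected) {
            push @problems, "line $line_no: $fields fields, expected $expected";
        }
    }
    return @problems;
}

# Hypothetical sample: line 3 is missing a closing quote, line 4 a field.
my @sample = (
    '"id","name","city"',
    '"1","Smith, John","Oslo"',
    '"2","Jones","Turku',
    '"3","Lee"',
);
print "$_\n" for find_bad_lines(3, @sample);
```

For the sample above this reports line 3 (unbalanced quote) and line 4 (2 fields, expected 3); the comma inside "Smith, John" is not counted as a delimiter.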

Here is a tool I wrote to fix CSV files from one particularly nasty source. It changes a file delimited by commas into a file delimited by tabs, as well as correcting a whole host of other common problems. I have found that it works quite well, in general, for multiple sources of nastily encoded CSV files.


Use it like this:

fix-csv.pl nasty.csv > fixed.csv



#!/usr/bin/perl
use strict;
use warnings;

# Reads comma-delimited input (files named on the command line, or STDIN)
# and writes tab-delimited output to STDOUT.
while (my $line = <>)
{
  # 1. remove ^M (carriage returns)
  $line =~ s/\r//g;

  # 2. change a comma at the beginning of the line to a tab
  $line =~ s/^,/\t/;

  # 3. change "," to "\t" (quote, tab, quote)
  $line =~ s/","/"\t"/g;

  # 4. change ", to \t (tab)
  $line =~ s/",/\t/g;

  # 5. change ," to \t (tab)
  $line =~ s/,"/\t/g;

  # 6. change \t, to \t\t (double tab); repeat until nothing matches,
  #    because one /g pass converts only the first comma in a run of
  #    consecutive empty fields
  1 while $line =~ s/\t,/\t\t/g;

  # 7. remove quotes
  $line =~ s/"//g;

  print $line;
}
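
To make the effect concrete, here is a small worked example. It wraps the same substitutions in a hypothetical sub, fix_line, with the repeated \t, step written as a loop, and runs them on two made-up input lines. Tabs are shown as <TAB> in the output so they are visible.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitutions in the script above.
sub fix_line {
    my ($line) = @_;
    $line =~ s/\r//g;                 # drop carriage returns
    $line =~ s/^,/\t/;                # leading comma: empty first field
    $line =~ s/","/"\t"/g;            # quoted field, comma, quoted field
    $line =~ s/",/\t/g;               # quoted field followed by a comma
    $line =~ s/,"/\t/g;               # comma followed by a quoted field
    1 while $line =~ s/\t,/\t\t/g;    # runs of empty fields
    $line =~ s/"//g;                  # drop the remaining quotes
    return $line;
}

# Made-up sample lines, not taken from any real data source.
for my $raw (qq{"1","Helsinki",,,"FI"\r\n}, qq{"a",1,2,"b"\n}) {
    (my $shown = fix_line($raw)) =~ s/\t/<TAB>/g;
    print $shown;
}
```

The first sample prints 1<TAB>Helsinki<TAB><TAB><TAB>FI, keeping the two empty fields. The second prints a<TAB>1,2<TAB>b: a comma between two bare, unquoted values is left alone, so the approach suits sources that quote every non-empty field.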


-mike

On Apr 13, 2013, at 10:41 PM, Nathan Hemenway <nhemenway at kksbolash.com> wrote:

> As Richard Greenwood noted, ogr2ogr works great for importing CSV files into Postgres tables.
> In fact, your CSV file does not necessarily even need to have any geometry related columns for this to work.
> 
> It is all documented here very nicely:
> 
> http://www.gdal.org/ogr/drv_csv.html
> 
> 
> 
> On 4/13/2013 5:54 AM, Margie Roswell wrote:
>> I figured out that COPY is used to import a file into a table.
>> 
>> (Actually, even though I don't speak a word of Portuguese, a Portuguese video did a great job of showing copying first into a temp table: https://www.youtube.com/watch?v=CwsnPPub9v4 )
>> 
>> But the shp2pgsql thread yesterday got me thinking: to import a shapefile, they've created a utility so that we don't have to set up the structure of the table in advance
>> 
>> Is there something similar on the CSV side?
>> 
>> My guess is that http://www.safe.com/solutions/for-databases/postgis/
>> might have something, but I can't quite put my finger on it.
>> 
>> Details on that? 
>> 
>> Also, I'm sure there's a fee for that. Are there any other strategies for making the table creation more efficient, when importing a file to a table?
>> 
>> I suppose I could copy and paste the field names from the top row in the original Excel spreadsheet, and then manually reformat them into a CREATE NEW TABLE statement by adding all the field types. What strategies (like the shp2pgsql utility?) reduce the pain of importing a text file?
>> 
>> Margie
>> 
>> --
>> http://FarmBillPrimer.org
>> http://www.BaltimoreUrbanAg.org (Please send events; This site is hungry.)
>> http://www.ExcellentNutrition.org
>> http://www.packtpub.com/drupal-5-views-recipes/book
>> 
>> 
>> On Fri, Apr 12, 2013 at 6:14 PM, David Rush <david at rushtone.com> wrote:
>> Total noob to PostgreSQL and PostGIS here.  Trying to follow examples from the Obe+Hsu book (1st Ed) in using shp2pgsql from the command line to import some tiger county data.
>> 
>> I ran this:
>> 
>> shp2pgsql -s 4269 -g geom_4269 -W LATIN1 c:/users/david/downloads/tl_2012_us_county/tl_2012_us_county.shp public.us_counties psql -h localhost -U postgres -p 5432 -d mygisdb 
>> 
>> Thanks to an archive of this list that led me to add the "-W LATIN1" param (it was failing with an error w/out it).
>> 
>> Now the command runs for several minutes, spitting out mostly zillions of hex digits, with no overt errors.  Last line it spits out is "COMMIT;".
>> 
>> But when I go into psql, I can't find the public.us_counties table that I thought I just added created:
>> 
>> mygisdb=# select * from public.us_counties;
>> ERROR:  relation "public.us_counties" does not exist
>> LINE 1: select * from public.us_counties;
>>                       ^
>> mygisdb=# select table_schema, table_name,table_type from information_schema.tables where
>> table_schema not in ('pg_catalog','information_schema');
>>  table_schema |    table_name     | table_type
>> --------------+-------------------+------------
>>  public       | geography_columns | VIEW
>>  public       | geometry_columns  | VIEW
>>  public       | spatial_ref_sys   | BASE TABLE
>>  ch01         | lu_franchises     | BASE TABLE
>>  ch01         | fastfoods         | BASE TABLE
>> (5 rows)
>> 
>> Poking around with pgAdmin III I can't find in anywhere, either.
>> 
>> Is the new table us_counties hiding somewhere?  Or did it quietly fail?  Or what?
>> 
>> David
>> 
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at lists.osgeo.org
>> http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
> 
> 
> -- 
> .nathan.
