[postgis-users] "Linux" geocoder script ?

Don harterc1 at comcast.net
Thu Apr 14 01:21:08 PDT 2011


     I was running out of hard disk space on my root partition.  I have, 
for now, the geocoder database on a separate partition and hard drive.
I was debugging and capturing the output of every query and command.  So 
it generated a huge file in the gigabyte range especially when each 
query was aborted because it was not UTF-8.  The iconv command appears 
to be working, albeit though twice.  I will be posting a bug report with 
an attached patch shortly.  I think I deleted the windows section, so if 
you patch it with that, you will have to add that back.  The iconv is 
not in the patch.  I think it will probably be a while before most 
distributions have the new version of shp2pgsql.   I thought that I had 
the latest but it seems it is only available from subversion at this time.
     It is really nice having my own geocoder and tiger2010 database.  
Thanks for everyones help.  I can also use the data in my favorite GIS 
program qgis to draw maps, and even my own mapserver.

On 04/13/2011 07:15 PM, Sylvain Racine wrote:
> Hello,
>
> I saw a little error in your script. If you "pipe" your output to 
> iconv, you have to remove <-W "latin1"> from shp2pgsql command. You 
> should have
> $PGBIN/shp2pgsql -c -s 4269 -g the_geom tl_2010_27_county10.dbf 
> tiger_staging.mn_county10 | iconv -f latin1 -t UTF8 | $PGBIN/psql -d 
> $PGDATABASE
>
> If you don't remove the -W flag, your data will be converted twice!
>
> About your running out of disk space, are you sure that you don't lack 
> of RAM memory instead? I am surprise you ran out of disk space with 
> this command. I think that the output from shp2pgsql is put in RAM 
> memory and then, send to iconv in one shot. If your RAM memory is low, 
> it is possible that the extra data are put in a temporary file on your 
> disk drive (in the swap partition I think). But this temp file 
> indicated that you first lack of RAM memory. I am not complete sure of 
> my theory. But in my mind, I think Linux works like that.
>
> If you want to know the amount of memory needed by this script, type this:
>
> $PGBIN/shp2pgsql -c -s 4269 -g the_geom tl_2010_27_county10.dbf 
> tiger_staging.mn_county10 >> temp.sql
> temp.sql >> iconv -f latin1 -t UTF8 >> anothertemp.sql
> anothertemp.sql >> $PGBIN/psql -d $PGDATABASE
> The size of the max file between temp.sql and anothertemp.sql is the 
> minimun amount of RAM you need to run this command.
>
> Note: The above script is slower than the first one because you write 
> you data on your disk drive between each operation. SATA or IDE disk 
> access is always slower than RAM access, except if you use SSD disk.
>
> Hope it will be useful.
>
> Sylvain Racine
>
> On 13/04/2011 05:40, Don wrote:
>> Here is the version:
>> RELEASE: 2.0 USE_GEOS=1 USE_PROJ=1 USE_STATS=1 (r$Id: 
>> shp2pgsql-core.h 6358 2010-12-13 20:09:26Z pramsey $)
>> USAGE: shp2pgsql [<options>] <shapefile> [<schema>.]<table>
>>
>> I had tried the iconv fix which is a great idea. It seemed to work, 
>> but I ran out of disk space.
>> $PGBIN/shp2pgsql -c -s 4269 -g the_geom   -W "latin1" 
>> tl_2010_27_county10.dbf tiger_staging.mn_county10 | iconv -f latin1 
>> -t UTF8 | $PGBIN/psql -d $PGDATABASE
>> So I have been spending time on creating more disk space instead of 
>> fixing this.
>> I would like to share the patches that I have so that others can try 
>> to improve it especially those who are more familiar with the 
>> tiger2010 file formats.
>> It would be nice to have a working tiger2010 decoder for linux in the 
>> next postgis release.
>>
>> I don't remember offhand which state had the utf problem.  Several of 
>> them did.
>> I would run ./tiger_load >& del with a "set -x" in the file for 
>> debugging.  (You can then search that file for "aborted" and 
>> backtrack and find out which file it was processing.)  It got rather 
>> large and so did my log files.
>> I was trying to get more info on this when I ran out of disk space.  
>> I could not even vaccuum any databases.  I see many large files in my 
>> postgres directories and wonder what they are all for.  I have been 
>> using 2 different database clusters one on a new larger drive for the 
>> geocoder.  It seems that my cluster on the small drive has a lot of 
>> files on it thought that are taking up a lot of space.
>>
>> On 04/13/2011 01:32 AM, Paragon Corporation wrote:
>>> Don,
>>> Which state were you processing?  I can check it out and see if I 
>>> get similar errors on my shp2pgsql.  You could be right and the file 
>>> just isn't Latin1.
>>> The regress test did seem to pass for me once that ticket was fixed.
>>> Also to confirm you are running the latest version of shp2pgsql
>>> If you run
>>> shp2pgsql from commandline, it should output the version.  Mine for 
>>> example reads
>>> RELEASE: 2.0 USE_GEOS=1 USE_PROJ=1 USE_STATS=1 (r$Id: 
>>> shp2pgsql-core.h 6925 2011-03-18 16:24:33Z pramsey $)
>>> The version unfortunately isn't quite accurate since its evidentally 
>>> looking at the .h file instead of .c file.  So though my version 
>>> says 6925, its really
>>> 6932 or later.
>>> http://trac.osgeo.org/postgis/changeset/6932
>>> Hope that helps,
>>> Regina
>>> http://www.postgis.us
>>> ------------------------------------------------------------------------
>>> *From:* postgis-users-bounces at postgis.refractions.net 
>>> [mailto:postgis-users-bounces at postgis.refractions.net] *On Behalf Of 
>>> *Don
>>> *Sent:* Tuesday, April 12, 2011 3:08 AM
>>> *To:* PostGIS Users Discussion
>>> *Subject:* Re: [postgis-users] "Linux" geocoder script ?
>>>
>>> My database is encoded as
>>>  geocoder  | drh      | UTF8     | C         | en_US.UTF-8 | .
>>> All my shp2pgsql statements have the -W option like this.
>>> ${loader}  -a -s 4269 -g the_geom -W "latin1" $z 
>>> ${staging_schema}.${state_abbrev}_${table_name} | $PGBIN/psql -d 
>>> $PGDATABASE;
>>>
>>> Here is the bug that I was referring to.
>>> http://trac.osgeo.org/postgis/ticket/808
>>> In one case I had a very large number of inserts processed for the 
>>> shape file and then got that error.
>>>
>>> From your link it says:
>>> "To enable automatic character set conversion, you have to tell 
>>> PostgreSQL the character set (encoding) you would like to use in the 
>>> client. There are several ways to accomplish this: "
>>> Perhaps I need to use
>>> SET CLIENT_ENCODING TO '/value/'; in psql or is shp2pgsql supposed to do that when I use the -W option?
>>> postgis is expecting utf-8 when it should be expecting  latin1 and converting it to utf-8.
>>> Could data type for a column have some effect on this?
>>>
>>>
>>> On 04/11/2011 08:52 PM, Sylvain Racine wrote:
>>>> Hello,
>>>>
>>>> This is not a shp2pgsql bug. You get this error when you try to 
>>>> insert string data in PostgreSQL from another encoding that the one 
>>>> of your database Ex: Your data is formatted in Latin1 (ISO-8859-1) 
>>>> and you insert them in a UTF-8 database. To fix the error message, 
>>>> you need to convert your data.
>>>>
>>>> PostgreSQL have a internal converter. shp2pgsql have it too. Try 
>>>> shp2pgsql -W <encoding> where <encoding> is the format of you DBase 
>>>> file .dbf. This is called the "client encoding" in PostgreSQL. See 
>>>> list of valid encoding type:
>>>> http://www.postgresql.org/docs/9.0/static/multibyte.html
>>>>
>>>> Don't mix it with the database encoding. It is the one you us to 
>>>> create your databse. There is also a default database charset, 
>>>> depending of your OS. It is the one you use to create template1 
>>>> database in init-db.  Mine is "UTF8" on Ubuntu.
>>>>
>>>> Hope that this information will help you
>>>>
>>>> Regards
>>>>
>>>> Sylvain Racine
>>>>
>>>> On 2011-04-11 21:22, Don wrote:
>>>>> I have got the tiger2010 geodecoder to work on my Opensuse system.
>>>>> geocoder=#
>>>>> geocoder=# SELECT g.rating,
>>>>> geocoder-#         ST_X(geomout) As lon,
>>>>> geocoder-#         ST_Y(geomout) As lat, (addy).*
>>>>> geocoder-# FROM geocode('1731 New Hampshire Avenue Northwest, 
>>>>> Washington, DC 20010') As g;
>>>>>  rating |        lon        |       lat        | address | 
>>>>> predirabbrev |  streetname   | streettypeabbrev | postdirabbrev | 
>>>>> internal |  location  | stateabbrev |  zip  | parsed
>>>>> --------+-------------------+------------------+---------+--------------+---------------+------------------+---------------+----------+------------+-------------+-------+-------- 
>>>>>
>>>>>       0 | -77.0399013800607 | 38.9134181361424 |    1731 
>>>>> |              | New Hampshire | Ave              | NW            
>>>>> |          | Washington | DC          | 20009 | t
>>>>> (1 row)
>>>>> There are a few glitches.  I noticed that I am getting this 
>>>>> message sometimes.
>>>>> INSERT 0 1
>>>>> INSERT 0 1
>>>>> INSERT 0 1
>>>>> INSERT 0 1
>>>>> ERROR:  invalid byte sequence for encoding "UTF8": 0xed6f20
>>>>> HINT:  This error can also happen if the byte sequence does not 
>>>>> match the encoding expected by the server, which is controlled by 
>>>>> "client_encoding".
>>>>> ERROR:  current transaction is aborted, commands ignored until end 
>>>>> of transaction block
>>>>> ERROR:  current transaction is aborted, commands ignored until end 
>>>>> of transaction block
>>>>> ERROR:  current transaction is aborted, commands ignored until end 
>>>>> of transaction block
>>>>> I researched this some and it appears to be a  shp2pgsql bug.
>>>>> But I am using postgis-utils-2.0.0SVN-1.2.x86_64
>>>>> postgis-2.0.0SVN-1.2.x86_64  where this has supposedly been 
>>>>> fixed.  Or could the census data be corrupted?
>>>>> So I have "lost" some of the data due to this error.
>>>>> I had problems with psql generating ctrl-m instead of \n which 
>>>>> would really mess up the script when it ran.
>>>>> So after I generated my load tiger script I ran this command
>>>>> tr "\r" "\n" < load_tiger > load_tiger2
>>>>>
>>>>> _______________________________________________
>>>>> postgis-users mailing list
>>>>> postgis-users at postgis.refractions.net
>>>>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>>>>>
>>>>>
>>>> _______________________________________________
>>>> postgis-users mailing list
>>>> postgis-users at postgis.refractions.net
>>>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>>>>
>>>
>>> _______________________________________________
>>> postgis-users mailing list
>>> postgis-users at postgis.refractions.net
>>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>>
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20110414/bfaa23b7/attachment.html>


More information about the postgis-users mailing list