[postgis-users] "Linux" geocoder script ?
Don
harterc1 at comcast.net
Thu Apr 14 01:21:08 PDT 2011
I was running out of hard disk space on my root partition. I have,
for now, the geocoder database on a separate partition and hard drive.
I was debugging and capturing the output of every query and command. So
it generated a huge file in the gigabyte range especially when each
query was aborted because it was not UTF-8. The iconv command appears
to be working, albeit though twice. I will be posting a bug report with
an attached patch shortly. I think I deleted the windows section, so if
you patch it with that, you will have to add that back. The iconv is
not in the patch. I think it will probably be a while before most
distributions have the new version of shp2pgsql. I thought that I had
the latest but it seems it is only available from subversion at this time.
It is really nice having my own geocoder and tiger2010 database.
Thanks for everyones help. I can also use the data in my favorite GIS
program qgis to draw maps, and even my own mapserver.
On 04/13/2011 07:15 PM, Sylvain Racine wrote:
> Hello,
>
> I saw a little error in your script. If you "pipe" your output to
> iconv, you have to remove <-W "latin1"> from shp2pgsql command. You
> should have
> $PGBIN/shp2pgsql -c -s 4269 -g the_geom tl_2010_27_county10.dbf
> tiger_staging.mn_county10 | iconv -f latin1 -t UTF8 | $PGBIN/psql -d
> $PGDATABASE
>
> If you don't remove the -W flag, your data will be converted twice!
>
> About your running out of disk space, are you sure that you don't lack
> of RAM memory instead? I am surprise you ran out of disk space with
> this command. I think that the output from shp2pgsql is put in RAM
> memory and then, send to iconv in one shot. If your RAM memory is low,
> it is possible that the extra data are put in a temporary file on your
> disk drive (in the swap partition I think). But this temp file
> indicated that you first lack of RAM memory. I am not complete sure of
> my theory. But in my mind, I think Linux works like that.
>
> If you want to know the amount of memory needed by this script, type this:
>
> $PGBIN/shp2pgsql -c -s 4269 -g the_geom tl_2010_27_county10.dbf
> tiger_staging.mn_county10 >> temp.sql
> temp.sql >> iconv -f latin1 -t UTF8 >> anothertemp.sql
> anothertemp.sql >> $PGBIN/psql -d $PGDATABASE
> The size of the max file between temp.sql and anothertemp.sql is the
> minimun amount of RAM you need to run this command.
>
> Note: The above script is slower than the first one because you write
> you data on your disk drive between each operation. SATA or IDE disk
> access is always slower than RAM access, except if you use SSD disk.
>
> Hope it will be useful.
>
> Sylvain Racine
>
> On 13/04/2011 05:40, Don wrote:
>> Here is the version:
>> RELEASE: 2.0 USE_GEOS=1 USE_PROJ=1 USE_STATS=1 (r$Id:
>> shp2pgsql-core.h 6358 2010-12-13 20:09:26Z pramsey $)
>> USAGE: shp2pgsql [<options>] <shapefile> [<schema>.]<table>
>>
>> I had tried the iconv fix which is a great idea. It seemed to work,
>> but I ran out of disk space.
>> $PGBIN/shp2pgsql -c -s 4269 -g the_geom -W "latin1"
>> tl_2010_27_county10.dbf tiger_staging.mn_county10 | iconv -f latin1
>> -t UTF8 | $PGBIN/psql -d $PGDATABASE
>> So I have been spending time on creating more disk space instead of
>> fixing this.
>> I would like to share the patches that I have so that others can try
>> to improve it especially those who are more familiar with the
>> tiger2010 file formats.
>> It would be nice to have a working tiger2010 decoder for linux in the
>> next postgis release.
>>
>> I don't remember offhand which state had the utf problem. Several of
>> them did.
>> I would run ./tiger_load >& del with a "set -x" in the file for
>> debugging. (You can then search that file for "aborted" and
>> backtrack and find out which file it was processing.) It got rather
>> large and so did my log files.
>> I was trying to get more info on this when I ran out of disk space.
>> I could not even vaccuum any databases. I see many large files in my
>> postgres directories and wonder what they are all for. I have been
>> using 2 different database clusters one on a new larger drive for the
>> geocoder. It seems that my cluster on the small drive has a lot of
>> files on it thought that are taking up a lot of space.
>>
>> On 04/13/2011 01:32 AM, Paragon Corporation wrote:
>>> Don,
>>> Which state were you processing? I can check it out and see if I
>>> get similar errors on my shp2pgsql. You could be right and the file
>>> just isn't Latin1.
>>> The regress test did seem to pass for me once that ticket was fixed.
>>> Also to confirm you are running the latest version of shp2pgsql
>>> If you run
>>> shp2pgsql from commandline, it should output the version. Mine for
>>> example reads
>>> RELEASE: 2.0 USE_GEOS=1 USE_PROJ=1 USE_STATS=1 (r$Id:
>>> shp2pgsql-core.h 6925 2011-03-18 16:24:33Z pramsey $)
>>> The version unfortunately isn't quite accurate since its evidentally
>>> looking at the .h file instead of .c file. So though my version
>>> says 6925, its really
>>> 6932 or later.
>>> http://trac.osgeo.org/postgis/changeset/6932
>>> Hope that helps,
>>> Regina
>>> http://www.postgis.us
>>> ------------------------------------------------------------------------
>>> *From:* postgis-users-bounces at postgis.refractions.net
>>> [mailto:postgis-users-bounces at postgis.refractions.net] *On Behalf Of
>>> *Don
>>> *Sent:* Tuesday, April 12, 2011 3:08 AM
>>> *To:* PostGIS Users Discussion
>>> *Subject:* Re: [postgis-users] "Linux" geocoder script ?
>>>
>>> My database is encoded as
>>> geocoder | drh | UTF8 | C | en_US.UTF-8 | .
>>> All my shp2pgsql statements have the -W option like this.
>>> ${loader} -a -s 4269 -g the_geom -W "latin1" $z
>>> ${staging_schema}.${state_abbrev}_${table_name} | $PGBIN/psql -d
>>> $PGDATABASE;
>>>
>>> Here is the bug that I was referring to.
>>> http://trac.osgeo.org/postgis/ticket/808
>>> In one case I had a very large number of inserts processed for the
>>> shape file and then got that error.
>>>
>>> From your link it says:
>>> "To enable automatic character set conversion, you have to tell
>>> PostgreSQL the character set (encoding) you would like to use in the
>>> client. There are several ways to accomplish this: "
>>> Perhaps I need to use
>>> SET CLIENT_ENCODING TO '/value/'; in psql or is shp2pgsql supposed to do that when I use the -W option?
>>> postgis is expecting utf-8 when it should be expecting latin1 and converting it to utf-8.
>>> Could data type for a column have some effect on this?
>>>
>>>
>>> On 04/11/2011 08:52 PM, Sylvain Racine wrote:
>>>> Hello,
>>>>
>>>> This is not a shp2pgsql bug. You get this error when you try to
>>>> insert string data in PostgreSQL from another encoding that the one
>>>> of your database Ex: Your data is formatted in Latin1 (ISO-8859-1)
>>>> and you insert them in a UTF-8 database. To fix the error message,
>>>> you need to convert your data.
>>>>
>>>> PostgreSQL have a internal converter. shp2pgsql have it too. Try
>>>> shp2pgsql -W <encoding> where <encoding> is the format of you DBase
>>>> file .dbf. This is called the "client encoding" in PostgreSQL. See
>>>> list of valid encoding type:
>>>> http://www.postgresql.org/docs/9.0/static/multibyte.html
>>>>
>>>> Don't mix it with the database encoding. It is the one you us to
>>>> create your databse. There is also a default database charset,
>>>> depending of your OS. It is the one you use to create template1
>>>> database in init-db. Mine is "UTF8" on Ubuntu.
>>>>
>>>> Hope that this information will help you
>>>>
>>>> Regards
>>>>
>>>> Sylvain Racine
>>>>
>>>> On 2011-04-11 21:22, Don wrote:
>>>>> I have got the tiger2010 geodecoder to work on my Opensuse system.
>>>>> geocoder=#
>>>>> geocoder=# SELECT g.rating,
>>>>> geocoder-# ST_X(geomout) As lon,
>>>>> geocoder-# ST_Y(geomout) As lat, (addy).*
>>>>> geocoder-# FROM geocode('1731 New Hampshire Avenue Northwest,
>>>>> Washington, DC 20010') As g;
>>>>> rating | lon | lat | address |
>>>>> predirabbrev | streetname | streettypeabbrev | postdirabbrev |
>>>>> internal | location | stateabbrev | zip | parsed
>>>>> --------+-------------------+------------------+---------+--------------+---------------+------------------+---------------+----------+------------+-------------+-------+--------
>>>>>
>>>>> 0 | -77.0399013800607 | 38.9134181361424 | 1731
>>>>> | | New Hampshire | Ave | NW
>>>>> | | Washington | DC | 20009 | t
>>>>> (1 row)
>>>>> There are a few glitches. I noticed that I am getting this
>>>>> message sometimes.
>>>>> INSERT 0 1
>>>>> INSERT 0 1
>>>>> INSERT 0 1
>>>>> INSERT 0 1
>>>>> ERROR: invalid byte sequence for encoding "UTF8": 0xed6f20
>>>>> HINT: This error can also happen if the byte sequence does not
>>>>> match the encoding expected by the server, which is controlled by
>>>>> "client_encoding".
>>>>> ERROR: current transaction is aborted, commands ignored until end
>>>>> of transaction block
>>>>> ERROR: current transaction is aborted, commands ignored until end
>>>>> of transaction block
>>>>> ERROR: current transaction is aborted, commands ignored until end
>>>>> of transaction block
>>>>> I researched this some and it appears to be a shp2pgsql bug.
>>>>> But I am using postgis-utils-2.0.0SVN-1.2.x86_64
>>>>> postgis-2.0.0SVN-1.2.x86_64 where this has supposedly been
>>>>> fixed. Or could the census data be corrupted?
>>>>> So I have "lost" some of the data due to this error.
>>>>> I had problems with psql generating ctrl-m instead of \n which
>>>>> would really mess up the script when it ran.
>>>>> So after I generated my load tiger script I ran this command
>>>>> tr "\r" "\n" < load_tiger > load_tiger2
>>>>>
>>>>> _______________________________________________
>>>>> postgis-users mailing list
>>>>> postgis-users at postgis.refractions.net
>>>>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>>>>>
>>>>>
>>>> _______________________________________________
>>>> postgis-users mailing list
>>>> postgis-users at postgis.refractions.net
>>>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>>>>
>>>
>>> _______________________________________________
>>> postgis-users mailing list
>>> postgis-users at postgis.refractions.net
>>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>>
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users at postgis.refractions.net
>> http://postgis.refractions.net/mailman/listinfo/postgis-users
>
>
> _______________________________________________
> postgis-users mailing list
> postgis-users at postgis.refractions.net
> http://postgis.refractions.net/mailman/listinfo/postgis-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20110414/bfaa23b7/attachment.html>
More information about the postgis-users
mailing list