[postgis-users] Batch geocoding (~2 mil addresses)

Ravi ada raviada at dobeyond.com
Mon Dec 12 21:08:35 PST 2011


Hello Regina,

 

I was able to load the data for all states, now I have to gocode about 2 mil
addresses . I followed the link below and used the pgScript option to batch
100 addresses. I am having a terrible performance, query return time  is 10
to 100 times more than the bench marks. 

 

This query is supposed to take only (61ms) but on my machine is was (734ms).

 

SELECT g.rating, ST_X(g.geomout) As lon, ST_Y(g.geomout) As lat, 

                (addy).address As stno, (addy).streetname As street, 

                (addy).streettypeabbrev As styp, (addy).location As city,
(addy).stateabbrev As st,(addy).zip 

                FROM geocode('75 State Street, Boston MA 02109') As g;

 

 

I am running this on a machine which has 16GB RAM and 6 cores AMD processor.
Something is missing to cause this hang up. please direct me to right
direction. I am looking for help.

 

Thanks

Ravi Ada

 

From: postgis-users-bounces at postgis.refractions.net
[mailto:postgis-users-bounces at postgis.refractions.net] On Behalf Of Paragon
Corporation
Sent: Wednesday, November 30, 2011 8:39 AM
To: 'PostGIS Users Discussion'
Subject: Re: [postgis-users] Batch geocoding (~2 mil addresses)

 

Ravi,

 

If you have pgAdmin, you might want to try something like pgScript.  Unlike
plpgsql it commits on each update so doesn't suffer from the same issue as a
stored proc.

 

http://www.postgresonline.com/journal/archives/181-pgscript_intro.html

 

Hope that helps,

Regina

http://www.postgis.us

 

 

  _____  

From: postgis-users-bounces at postgis.refractions.net
[mailto:postgis-users-bounces at postgis.refractions.net] On Behalf Of Ravi ada
Sent: Wednesday, November 30, 2011 7:06 AM
To: 'PostGIS Users Discussion'
Subject: Re: [postgis-users] Batch geocoding (~2 mil addresses)

Thanks Jonathan for the quick reply.

 

I am doing the similar thing as you did except doing it locally. I got the
address divided into 4 columns, address, city, state and zip. I was able to
query for each state and run the geocode. I was never successful running the
complete state. The query takes forever to finish even though the query plan
shows it is using index scan etc.  I even wrote a psql function to pass one
address at a time to get the geocode, this one also could not complete
because I could not commit each address as I get the gecode and ended up
running out of resources.

How you were able to geocode by state, did you write any scripts or psql
functions? If yes, do you mind sharing? I am having hard time geocoding
these addresses.

Appreciate any help is extended.

 

Thanks,

Ravi Ada

 

From: postgis-users-bounces at postgis.refractions.net
[mailto:postgis-users-bounces at postgis.refractions.net] On Behalf Of
Johnathan Leppert
Sent: Tuesday, November 29, 2011 11:38 PM
To: PostGIS Users Discussion
Subject: Re: [postgis-users] Batch geocoding (~2 mil addresses)

 

Hi Ravi,

 

There is a thread about the tiger geocoder performance from awhile back you
might want to look into. I had to do this before with about 1M addresses and
ended up creating an ec2 instance for each state and sending addresses with
that state to that specific host (using a simple regex to match state in an
address line). Somewhat expensive, but got the job done quickly.

 

Johnathan

 

On Tue, Nov 29, 2011 at 9:18 PM, Ravi ada <raviada at dobeyond.com> wrote:

Hello All,

 

I have been fighting with PostGIS for the last week or two to geocode about
2 million addresses. Here are the challenges that I am facing.

1)      I sorted the addresses by zip, address, city to group the similar
addresses together.

2)      So far I have seen only 50% of rating zero, that means perfect
match.

3)      Some addresses have Suite# or Apt# in address line 1. Geocode
function takes longer time for such addresses

4)      Addresses that are on the interstate, may be on the service road.
These are taking longer to code.

5)      Getting syntax errors on line_segment function when processing some
addresses that have additional data in parenthesis and a period. For example
"5727 FM 3097(HORIZON RD.),ROCKWALL,TX,750327786" 

6)      I am using batch updates selecting the addresses per state to reduce
the size of the batch. I used straight SQL update as mentioned in the
documentation, and wrote a function to call geo code for each address. In
both cases the query ran for more than 6 hours and never been able finish
successfully. If I kill the query, nothing gets updated to the address table
because the whole transaction is killed and not committed.

7)      I am running this on Ubuntu 11.10 and postgresql 9.1 on a virtual
machine on 8GB i5 machine. I know the resources are very less for such a
task.

 

My questions are

1)      How other people are geocoding so many addresses, what is the best
way to do it, knowing that we may not have fully normalized addresses.

2)      How long it should take to geocode about 2mil addresses.

3)      What does it mean if the rating is non-zero, and more than 50. Do we
get the distance wring for such addresses?

4)      What is the recommended hardware configuration? We need to geocode
these addresses only once, and periodically to the delta that we get every
month. 

 

Thank you so much for all the efforts that you guys are contributing to the
project. Any help with my questions would greatly be appreciated.

 

Thanks

Ravi Ada

 

 


_______________________________________________
postgis-users mailing list
postgis-users at postgis.refractions.net
http://postgis.refractions.net/mailman/listinfo/postgis-users





 

-- 
Johnathan

Software Engineer

San Francisco, California

Follow me on Twitter: @iamleppert <http://twitter.com/#!/iamleppert> 

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.osgeo.org/pipermail/postgis-users/attachments/20111212/344edef3/attachment.html>


More information about the postgis-users mailing list