[postgis-users] Multi-server, multi-core reverse_geocode processing

Barry McCall barry.mccall+postgis at gmail.com
Wed Dec 31 07:16:01 PST 2014


File locking and management are handled by bash scripts. A pseudo-random
delay is built into the scripts so the systems don't all fetch at the exact
same second. The scripts handle cleanup as well.
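In rough outline, the jitter-plus-lock pattern looks something like the
sketch below. The directory, file names, and the S3 call are placeholders
for illustration, not our actual scripts:

```shell
#!/usr/bin/env bash
# Sketch of the claim step: a random delay so the boxes don't hit S3 at
# the same second, then an atomic lock so a work item is only claimed once
# per box. All paths and names here are invented for illustration.

WORK_DIR=$(mktemp -d)   # stand-in for the real working directory

# Pseudo-random delay (0-2 s here; longer in practice) to stagger fetches.
sleep "$(( RANDOM % 3 ))"

claim_file() {
    # mkdir is atomic on a local filesystem, so only one process on this
    # box can create the lock directory and thereby own the work item.
    local key="$1"
    mkdir "$WORK_DIR/$key.lock" 2>/dev/null
}

if claim_file "batch-0001.csv"; then
    echo "claimed batch-0001.csv"
    # aws s3 cp "s3://BUCKET/incoming/batch-0001.csv" "$WORK_DIR/"  # fetch elided
else
    echo "batch-0001.csv already claimed"
fi
```

Note that an mkdir-style lock only serializes processes on a single box;
avoiding duplicate fetches across boxes needs something extra, such as
moving or renaming the object in S3 once it has been claimed.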

Network calls are not an issue, as the EC2 boxes have 10 Gb interfaces. Upon
completion, each system writes its results out via Postgres COPY in CSV
format, and the script places the completed files back into S3.
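The export step amounts to a COPY-to-CSV followed by an s3 cp. A dry-run
sketch follows; the table name, bucket, and connection details are made up
for illustration and are not our actual script:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the export step: dump geocoded results as CSV, then
# push the file back to S3. Names here are invented for illustration.

DRY_RUN=${DRY_RUN:-1}
OUT="/tmp/geocoded-$$.csv"

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# \copy streams through the client, so the CSV lands on the box running
# the script rather than on the database server (unlike server-side COPY).
run psql -c "\\copy (SELECT * FROM geocoded_results) TO '$OUT' WITH (FORMAT csv, HEADER)"
run aws s3 cp "$OUT" "s3://example-bucket/completed/"
```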

The location of the completed files is checked hourly by an ETL process that
ingests the geocoded data. For our needs we've actually had to expand to
six boxes and will likely need more in the future.
On Dec 30, 2014 12:46 PM, "David Jantzen" <djantzen at avenace.com> wrote:

> Hi Barry,
>
> Seems like a good approach from a scaling perspective. One risk is
> multiple processes fetching the same file from S3 and doing unnecessary
> work, or worse, creating duplicate records. Do you have a locking mechanism
> of some kind?
>
> Also:
> 1) Have you instrumented the system so that you can identify bottlenecks?
> For example, how much time is spent in network calls to S3 versus the
> database query? How fast is the round trip to the database and should
> requests be batched somehow? This will be helpful information when you’re
> debating whether you need to add more nodes to your cluster or optimize
> your process.
> 2) Can you recover from failure? You want to be able to restart a
> long-running batch process without re-doing a ton of work. Do failed
> reverse lookups need to be recorded and reported to someone?
> 3) Can you monitor the system? Will you have any warning about exceeding
> the system’s capacity before it tanks? What happens when someone asks what
> percentage of the points you’re processing result in valid reverse lookups?
> Or how many come back with multiple possibilities?
> 4) Do you have automated tests around your components? While it seems
> pretty simple on the surface, when you combine multiple machines, possible
> locking issues, network requests and database queries, there’s actually a
> lot that can go wrong. It will be much easier to debug and maintain if you
> have clearly delineated responsibilities in the codebase and tests around
> each part.
>
> Hope that helps,
> David
>
> On Dec 22, 2014, at 9:49 AM, Barry McCall <barry.mccall+postgis at gmail.com>
> wrote:
>
> > I was just wondering if anyone was utilizing multiple servers with a
> large number of cores to harness multi-threading for batch processing?
> >
> > I'm currently running a farm of 4 systems with 32 cores per system. A
> number of bash scripts handle grabbing files from an s3 bucket when files
> exist. The design is able to scale up as needed by simply spinning up a new
> image. After doing a POC this weekend I've obtained some benchmark numbers.
> I was able to successfully reverse geocode 46,826,673 lon/lats in an
> average of 33 hours. (Approx 353,834 records per hour, per-machine,
> 1,415,337 per hour across the farm)
> >
> > Was just curious if anyone else had designed something like this and
> what your design methodologies were?!
> >
> > I've only posted a few questions but wanted to contribute what I had
> going on :-)
> >
> > Thanks again for all the previous help!
> >
> > -Barry
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users at lists.osgeo.org
> > http://lists.osgeo.org/cgi-bin/mailman/listinfo/postgis-users
>
>