[postgis-tickets] [PostGIS] #2260: Benchmarking speed between built-in tiger normalizer and pagc_address_parser

Tue Apr 30 19:20:37 PDT 2013

#2260: Benchmarking speed between built-in tiger normalizer and
pagc_address_parser
----------------------------------+-----------------------------------------
  Reporter:  robe                 |       Owner:  robe         
      Type:  task                 |      Status:  reopened     
  Priority:  medium               |   Milestone:  PostGIS 2.1.0
 Component:  pagc_address_parser  |     Version:  trunk        
Resolution:                       |    Keywords:               
----------------------------------+-----------------------------------------
Changes (by robe):

  * status:  closed => reopened
  * resolution:  fixed =>

Comment:

 gosh darn it.  Switching it to pure sql did not fix the issue. Still
 considers each call a separate query and creates a std object for each.

 Normally this would be fine except to completely swap out I have to merge
 parse_address and standardize_address into a single query and that seems
 like a lot to ask of people.

 {{{ -- if I write the equivalent query like this:
 WITH p AS (
  SELECT parse_address(address) As var_parse_rec
  FROM test_parse)
 SELECT (to_number(substring( (var_rec).house_num, '[0-9]+'),
 '99999999999')
    ,trim( (var_rec).predir)
         , trim((var_rec).name)
                 ,trim(COALESCE((var_rec).suftype, (var_rec).pretype))
         , trim((var_rec).sufdir)
         , trim((var_rec).unit)
         ,trim((var_rec).city)
         , trim((var_rec).state)
         , (var_rec).postcode
         , true)::norm_addy
         FROM (
  SELECT standardize_address('pagc_lex'
        , 'pagc_gaz'
        , 'pagc_rules'
 , COALESCE((var_parse_rec).address1,''),
    COALESCE((var_parse_rec).city || ', ','') ||
 COALESCE((var_parse_rec).state || ' ', '') ||
 COALESCE((var_parse_rec).zip,'') ) As var_rec
    FROM p) AS s;

 }}}

 Takes 46 ms vs.

 {{{
 CREATE OR REPLACE FUNCTION tiger.pagc_normalize_address2(in_rawinput
 character varying)
   RETURNS norm_addy AS
 $$

    SELECT (to_number(substring( (var_rec).house_num, '[0-9]+'),
 '99999999999')
    ,trim( (var_rec).predir)
         , trim((var_rec).name)
                 ,trim(COALESCE((var_rec).suftype, (var_rec).pretype))
         , trim((var_rec).sufdir)
         , trim((var_rec).unit)
         ,trim((var_rec).city)
         , trim((var_rec).state)
         , (var_rec).postcode
         , true)::norm_addy
         FROM (
  SELECT standardize_address('pagc_lex'
        , 'pagc_gaz'
        , 'pagc_rules'
 , COALESCE(var_parse_rec.address1,''),
    COALESCE(var_parse_rec.city || ', ','') || COALESCE(var_parse_rec.state
 || ' ', '') || COALESCE(var_parse_rec.zip,'') ) As var_rec
    FROM parse_address($1) As var_parse_rec ) AS s

 $$
   LANGUAGE sql
   COST 100;

 }}}

 {{{ -- takes 156 ms (6 times as long)
 SELECT address, pagc_normalize_address2(address)
 FROM test_parse;
 }}}

 The main issue , to be able to do a drop in replace, people just pass the
 address as a single field, so you I  need a version of the standardizer
 that does both a parse and standardize in one call so people don't have to
 struggle with a complicated query.  This might be a good thing to have
 anyway as I imagine a lot of people will have the addresss as a single
 field.

 Then all my function would need to do is define a cast between std and
 norm_addy or I'd just write a geocode function that takes std instead of
 norm_addy.

 I'm reopening this since I think its hard for me to fix without changes to
 pagc_address_parser, and will be an issue for others wanting to wrap the
 function to wrap it efficiently.

-- 
Ticket URL: <http://trac.osgeo.org/postgis/ticket/2260#comment:36>
PostGIS <http://trac.osgeo.org/postgis/>
The PostGIS Trac is used for bug, enhancement & task tracking, a user and developer wiki, and a view into the subversion code repository of PostGIS project.