[postgis-devel] Query Performance Question

Thu Oct 13 13:27:42 PDT 2016

Hi all,

I'm working on improving the performance of the new address standardizer 
and I'm trying to figure out what is the "secret sauce" to make the 
queries fastest.

My function is declared below for reference. Here are the timing on 
various queries:

Query 1:
as_test=# select * from as_standardize(
     '123 oak ln, st marie ny usa',
     (select grammar from as_config where countrycode='us'),
     (select lexicon from as_config where countrycode='us'),
     'en_US',
     (select filter from as_config where countrycode='us')
);
Time: 593.482 ms

Query 2:
as_test=# select (as_standardize(
     '123 oak ln, st marie ny usa',
     (select grammar from as_config where countrycode='us'),
     (select lexicon from as_config where countrycode='us'),
     'en_US',
     (select filter from as_config where countrycode='us'))
).*;
Time: 9610.006 ms

The function return 16 columns and 16 * .5935 sec ~= 9.610 sec

Query 3:
as_test=# with cfg as (
    select '123 oak ln, st marie ny usa'::text as addr, *
      from as_config where countrycode='us'
)
select addr, (as_standardize(addr, grammar, clexicon, 'en_US', 
filter)).* from cfg;
Time: 2533.146 ms

The real problem is when I want to standardize a whole table like:

Query 4:
as_test=# with cfg as (
    select * from as_config where countrycode='us'
)
select t.id, t.addr, (as_standardize(t.addr, cfg.grammar,
                                 cfg.clexicon, 'en_US', cfg.filter)).*
   from tmp_addresses t, cfg
   order by t.id;
Time: 46537.589 ms

The tmp_addresses table has 53 records so its taking on average 0.878 
sec / record.
The as_config table has 25 records in it, but we only use 1.

I've already implemented a query level cache for the grammar and lexicon 
objects which are expensive to construct and that improved performance.

I'm having trouble reconciling the various timings. It is possible that 
the sum of the timings for the 53 records really is about 46.5 sec but 
that does not explain the differences between Query 1, 2, and 3. The 
difference between Query 1 and 2 seems obvious.

Is there a better way to standardize a whole table of addresses? These 
issues aside, I'm still looking at ways to improve the performance of 
the underlying C++ code.

The code can be found here if your interested:
https://github.com/woodbri/address-standardizer
http://imaptools.com:8080/dl/address-standardizer2/ (windows 64-bit)

Thoughts appreciated,
   -Steve

CREATE OR REPLACE FUNCTION as_standardize(
         address text,
         grammar text,
         lexicon text,
         locale text,
         filter text,
         OUT building text,
         OUT house_num text,
         OUT predir text,
         OUT qual text,
         OUT pretype text,
         OUT name text,
         OUT suftype text,
         OUT sufdir text,
         OUT ruralroute text,
         OUT extra text,
         OUT city text,
         OUT prov text,
         OUT country text,
         OUT postcode text,
         OUT box text,
         OUT unit text
         )
     RETURNS RECORD
     AS '$libdir/address_standardizer2-2.0', 'as_standardize'
     LANGUAGE 'c' STABLE STRICT;

I have also tried IMMUTABLE instead of STABLE but I get the same timing 
results.

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus